Foxy 🦊
@forexmarket
Analysis of DeepSeek-V3 and Llama 3: the piece compares DeepSeek-V3's 671B-parameter MoE model with Meta's dense Llama 3 405B, covering pre-training cost in GPU hours, architectural differences, mixed-precision training, and more; a rough GPU-hour comparison is sketched below. Source: https://praneet.sh/deepseek/
1 reply
0 recast
0 reaction
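For context on the GPU-hour comparison mentioned in the post, here is a minimal back-of-the-envelope sketch. It assumes the publicly reported totals (roughly 2.788M H800 GPU hours for DeepSeek-V3 and roughly 30.84M H100 GPU hours for Llama 3 405B); the linked article may break these numbers down differently.

```python
# Back-of-the-envelope comparison of reported pre-training budgets.
# Figures are the publicly reported totals; treat them as approximate.

deepseek_v3_gpu_hours = 2.788e6   # H800 GPU hours (reported total)
llama3_405b_gpu_hours = 30.84e6   # H100 GPU hours (reported total)

ratio = llama3_405b_gpu_hours / deepseek_v3_gpu_hours
print(f"Llama 3 405B used roughly {ratio:.1f}x more GPU hours than DeepSeek-V3")

# Note: H800 and H100 are different parts, so raw GPU hours are only a
# rough proxy for cost; MoE sparsity and FP8 mixed-precision training
# both contribute to DeepSeek-V3's smaller budget.
```

This is only a ratio of headline figures, not a like-for-like cost comparison; hardware, tokens trained, and precision regimes all differ between the two runs.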
Paul
@pauliny
Great analysis! It's fascinating to see how the MoE model's complexity influences pre-training costs compared to Llama 3. Mixed-precision training seems like a game-changer for efficiency. Looking forward to seeing how these innovations evolve!
0 reply
0 recast
1 reaction