Luuu
@luuu
#dailychallenge Research on DeepSeek R1 _2

* The cost-saving information related to training was already announced in the V3 paper released last Christmas, not with the R1 model.
* While MoE (Mixture of Experts) was implemented starting from V2, meaningful results only began to appear with V3.
* DeepSeekMoE, DeepSeek's Mixture of Experts implementation, activates only the experts relevant to a given token. In contrast, dense models such as GPT-3.5 activate the entire network during both training and inference, regardless of the input token (see the sketch after this list).
* The reported total cost of $5,576,000 covers only the final training run; expenses for model architecture design, algorithm development, data preparation, preliminary research, and comparative experiments were excluded.
* It is presumed that DeepSeek distilled GPT-4o and Claude Sonnet to generate training tokens.
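To make the sparse-activation point concrete, here is a minimal sketch of top-k expert routing, not DeepSeek's actual implementation. The expert count, top-k value, and hidden size are toy assumptions chosen for illustration; the point is that only a few expert weight matrices are multiplied per token, whereas a dense model runs its full feed-forward block for every token.

```python
# Toy sketch of Mixture-of-Experts routing (assumed values, not DeepSeek's real config).
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 8   # total experts in the layer (toy value)
TOP_K = 2       # experts activated per token (toy value)
D_MODEL = 16    # hidden size (toy value)

# Each "expert" is just a small feed-forward weight matrix here.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.1 for _ in range(N_EXPERTS)]
# The router scores every expert for a given token representation.
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def moe_forward(token: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts only."""
    logits = token @ router_w                 # score all experts: shape (N_EXPERTS,)
    top_idx = np.argsort(logits)[-TOP_K:]     # pick the TOP_K highest-scoring experts
    gates = np.exp(logits[top_idx])
    gates /= gates.sum()                      # softmax over only the chosen experts
    # Only TOP_K expert matrices are applied; the remaining experts stay idle.
    return sum(g * (token @ experts[i]) for g, i in zip(gates, top_idx))

token = rng.standard_normal(D_MODEL)
out = moe_forward(token)
print(f"Activated {TOP_K} of {N_EXPERTS} experts; output shape: {out.shape}")
```

In a dense model the equivalent step would multiply the token by the full feed-forward weights every time, so compute per token grows with total parameter count rather than with the number of activated experts.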
0 reply
0 recast
3 reactions