Luuu
@luuu
#dailychallenge Research on DeepSeek R1 _2

* The cost-saving information related to training was already announced in the V3 paper released last Christmas, not with the R1 model.
* While MoE (Mixture of Experts) was implemented starting from V2, meaningful results only began to appear with V3.
* DeepSeekMoE, DeepSeek's Mixture of Experts implementation, activates only the experts relevant to a given token. In contrast, dense models such as GPT-3.5 activate the entire network during both training and inference, regardless of the input token (see the sketch after this list).
* The reported total cost of $5,576,000 covers only the final training run; expenses for model architecture design, algorithm development, data preparation, preliminary research, and comparative experiments were excluded.
* It is presumed that DeepSeek distilled GPT-4o and Claude Sonnet to generate training tokens.
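To make the sparse-activation point concrete, here is a minimal sketch of top-k expert routing, not DeepSeek's actual implementation. The expert count, top-k value, and hidden size are toy assumptions chosen for illustration; the point is that only a few expert weight matrices are multiplied per token, whereas a dense model runs its full feed-forward block for every token.

```python
# Toy sketch of Mixture-of-Experts routing (assumed values, not DeepSeek's real config).
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 8   # total experts in the layer (toy value)
TOP_K = 2       # experts activated per token (toy value)
D_MODEL = 16    # hidden size (toy value)

# Each "expert" is just a small feed-forward weight matrix here.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.1 for _ in range(N_EXPERTS)]
# The router scores every expert for a given token representation.
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def moe_forward(token: np.ndarray) -> np.ndarray:
    """Route one token through its top-k experts only."""
    logits = token @ router_w                 # score all experts: shape (N_EXPERTS,)
    top_idx = np.argsort(logits)[-TOP_K:]     # pick the TOP_K highest-scoring experts
    gates = np.exp(logits[top_idx])
    gates /= gates.sum()                      # softmax over only the chosen experts
    # Only TOP_K expert matrices are applied; the remaining experts stay idle.
    return sum(g * (token @ experts[i]) for g, i in zip(gates, top_idx))

token = rng.standard_normal(D_MODEL)
out = moe_forward(token)
print(f"Activated {TOP_K} of {N_EXPERTS} experts; output shape: {out.shape}")
```

In a dense model the equivalent step would multiply the token by the full feed-forward weights every time, so compute per token grows with total parameter count rather than with the number of activated experts.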
0 reply
0 recast
3 reactions