Abbeyunique 🎩 🎭Ⓜ️ pfp
Abbeyunique 🎩 🎭Ⓜ️
@abbeyunique
DeepSeek-R1 just released. Fully open source and transparent, MIT licensed, and developed with reinforcement learning applied directly to the base model. The API is 20-30x cheaper than OpenAI's o1 at comparable performance (USD):
• $0.14 / million input tokens (cache hit)
• $0.55 / million input tokens (cache miss)
• $2.19 / million output tokens
(A worked cost example follows after this cast.)
To simplify, R1 is like R1-Zero but with multi-stage training. Its pipeline:
• Fine-tune the base model with CoT data points.
• An RL stage similar to R1-Zero.
• SFT on ~600k data points from rejection sampling and supervised datasets (e.g., writing, self-cognition).
• An RL stage to optimise objectives such as helpfulness and harm reduction.
Emergent behaviours such as longer responses, reflection, and exploration of alternatives arise naturally during training, without explicit programming.
The RL rewards focus on:
• Accuracy (e.g., unit-test-based scoring for code).
• Format (e.g., tags separating the reasoning, and language consistency).
No outcome or process reward models are used, which keeps things simpler. (A sketch of these rule-based rewards follows after this cast.)
1 reply
0 recast
0 reaction
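As a quick sanity check on the pricing quoted in the cast above, here is a minimal cost estimator in Python. The rates are the ones listed there; the `cost_usd` helper and the example request sizes are illustrative assumptions, and actual billing may differ.

```python
# Estimate R1 API cost from the per-million-token prices quoted above (USD).
PRICE_IN_HIT = 0.14    # input tokens, cache hit
PRICE_IN_MISS = 0.55   # input tokens, cache miss
PRICE_OUT = 2.19       # output tokens

def cost_usd(input_tokens: int, output_tokens: int, cache_hit: bool = False) -> float:
    """Rough cost of one request in USD (illustrative helper, not an official SDK)."""
    in_rate = PRICE_IN_HIT if cache_hit else PRICE_IN_MISS
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * PRICE_OUT

# 1M input tokens (cache miss) plus 0.5M output tokens:
# 0.55 + 0.5 * 2.19 = 0.55 + 1.095 ≈ $1.65
print(f"${cost_usd(1_000_000, 500_000):.2f}")
```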
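The reward design also lends itself to a short sketch. DeepSeek has not published its reward code, so everything below is an assumption: the `<think>`/`<answer>` tag convention, the `run_unit_tests` callable, and the weights are all hypothetical, meant only to illustrate rule-based accuracy and format rewards with no learned outcome or process reward model.

```python
import re

# Assumed tag convention: reasoning in <think>...</think>, then <answer>...</answer>.
TAG_RE = re.compile(r"^<think>.+</think>\s*<answer>.+</answer>$", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if TAG_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, run_unit_tests) -> float:
    """Fraction of unit tests passed for a code completion.
    `run_unit_tests` is a hypothetical callable returning (passed, total)."""
    passed, total = run_unit_tests(completion)
    return passed / total if total else 0.0

def total_reward(completion: str, run_unit_tests,
                 w_acc: float = 1.0, w_fmt: float = 0.5) -> float:
    # Simple weighted sum of rule-based signals; no reward model involved.
    return (w_acc * accuracy_reward(completion, run_unit_tests)
            + w_fmt * format_reward(completion))

demo = "<think>2 + 2 = 4</think> <answer>4</answer>"
print(format_reward(demo))  # 1.0
```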

Abu🎩 pfp
Abu🎩
@abushaymau
2500 $degen
1 reply
0 recast
0 reaction