Chinese AI researchers have achieved what many thought was years away: a free, open-source AI model that can match or exceed the performance of OpenAI's most advanced reasoning systems. What makes this even more remarkable is how they did it: by letting the AI teach itself through trial and error, similar to how humans learn.

“DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities,” the research paper reads.

“Reinforcement learning” is a method in which a model is rewarded for making good decisions and penalized for making bad ones, without being told in advance which is which. After many such decisions, it learns to favor the paths that those rewards reinforced.
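
To make the idea concrete, here is a minimal sketch in Python of trial-and-error learning on a toy problem. It is not DeepSeek's training method, which operates on a large language model; it is a simple "multi-armed bandit" chosen for illustration. The function name `train_bandit`, the reward probabilities, and the settings are all assumptions made up for this example. The agent is never told which choice is best. It only sees a reward or a penalty after each try, and gradually settles on the choice that pays off most often.

```python
import random


def train_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn by trial and error which action pays off most often.

    reward_probs is hidden from the learner: it only observes the
    reward (+1) or penalty (-1) that follows each action it takes.
    """
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    n = len(reward_probs)
    values = [0.0] * n                 # running estimate of each action's payoff
    counts = [0] * n                   # how many times each action was tried

    for _ in range(steps):
        if rng.random() < epsilon:
            # Occasionally explore a random action.
            action = rng.randrange(n)
        else:
            # Otherwise exploit the action that currently looks best.
            action = max(range(n), key=lambda a: values[a])

        # The environment rewards or penalizes the choice.
        reward = 1.0 if rng.random() < reward_probs[action] else -1.0

        # Nudge the estimate toward the observed reward (incremental mean).
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]

    return values, counts


# Three hypothetical actions that succeed 20%, 50%, and 80% of the time.
values, counts = train_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(values)), key=lambda a: values[a])
print("estimated payoffs:", [round(v, 2) for v in values])
print("times each action was tried:", counts)
print("action the agent prefers:", best_action)
```

The small `epsilon` value controls how often the agent tries something at random instead of repeating what has worked so far. Without that exploration, it could lock onto a mediocre choice early and never discover a better one.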

Initially, during the supervised fine-tuning phase, a group of humans shows the model the output they want, giving it the context to know what’s good and what isn’t. This leads to the next phase,