Sleuth
@rz8bless
The model uses a <think> phase to explore multiple reasoning paths, much like a Monte Carlo simulation. Trained with RL rewards, it learns to simulate and evaluate these paths internally before committing to an answer.
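To make the Monte Carlo analogy concrete, here is a toy sketch of "sample several paths, score them, pick the best." It is only an illustration of the idea, not how any real model is implemented: the names `sample_path`, `reward`, and `best_of_n` are hypothetical, the "paths" are just random number sequences, the reward is hand-written rather than learned, and no RL training happens here.

```python
import random

def sample_path(rng, steps=5):
    """One hypothetical 'reasoning path': a short sequence of random step values."""
    return [rng.choice([-1, 0, 1, 2]) for _ in range(steps)]

def reward(path, target=6):
    """Toy reward (an assumption): higher when the path's total is closer to the target."""
    return -abs(sum(path) - target)

def best_of_n(n=200, seed=0):
    """Monte Carlo style search: sample n paths, score each one, keep the best."""
    rng = random.Random(seed)
    paths = [sample_path(rng) for _ in range(n)]
    return max(paths, key=reward)

if __name__ == "__main__":
    chosen = best_of_n()
    print(chosen, reward(chosen))
```

Sampling more paths can only match or improve the best reward found, which is the intuition behind spending more effort in the thinking phase before deciding.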