My take on this:
- 6-8x H100s were required for me (A100s did not cut it)
- FP8 and FlashAttention (in torch) are pretty cool! I need to read more about them (rough sketch after this list)
- a context length of ~30K is what I think I got to
- tried ollama and vLLM (sketches of both after this list)
- Q4_K_M was good; Q5_K_M and Q6_K_M didn't work
- got to about ~20 tok/s
- Lambda Labs was the smoothest setup for me
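
For the FlashAttention bit, here's a minimal torch sketch of forcing the flash kernel through `scaled_dot_product_attention` (needs PyTorch 2.3+ and a CUDA GPU). The shapes and dtypes are placeholders, not what R1 actually runs with, and FP8 itself (`torch.float8_e4m3fn` and friends) is a separate story I haven't dug into yet:

```python
# Minimal sketch: forcing the FlashAttention backend in PyTorch (2.3+).
# Shapes/dtypes here are placeholders, not R1's actual config.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq_len, head_dim) -- flash wants fp16/bf16 on CUDA
q = torch.randn(1, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 4096, 128, device="cuda", dtype=torch.bfloat16)

# restrict SDPA to the FlashAttention kernel; errors out if unsupported
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 4096, 128])
```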
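
And a vLLM sketch of roughly how I'd spread it across the 8 GPUs at a ~30K context. The model ID and flags are assumptions from my setup, not a verified recipe - check the vLLM docs before copying:

```python
# Rough vLLM sketch -- model ID, TP degree, and context length are
# assumptions based on my setup, not a verified recipe.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # HF repo id (assumption)
    tensor_parallel_size=8,           # shard across the 8x H100s
    max_model_len=32768,              # the ~30K context I got to
)
params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Why is the sky blue?"], params)
print(outputs[0].outputs[0].text)
```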
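
The ollama route was basically a one-liner by comparison, with the quant baked into the model tag. The tag below is an assumption (check `ollama list` / the library page for the exact Q4_K_M name), and the tok/s math is a crude estimate from the counters ollama reports:

```python
# ollama sketch -- the model tag is an assumption; the Q4_K_M quant
# is normally part of the tag itself. Crude tok/s from eval counters.
import ollama

resp = ollama.chat(
    model="deepseek-r1:671b",  # assumption: default tag is the Q4_K_M quant
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(resp["message"]["content"])

# ollama reports eval_count (tokens) and eval_duration (nanoseconds)
tok_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"~{tok_s:.1f} tok/s")
```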
On the model itself
- full R1-671B is night-and-day different from the distills (14B, 32B, 70B-llama, etc.) for me
- it also feels... very Asian. The model feels like an Asian parent sometimes; it says stuff (even in English) that people I know would say
- Claude is a lot more empathetic (a sycophant, even); GPT-4 is a bit of a nice, happy, but rational and emotionally absent-minded techbro
- the internal <think> with R1 feels like a teacher that sort of looks down on you (again, I think it's the Confucian teacher/student relationship) - but it responds with llama/ChatGPT-style "hey! thanks for asking!" SF Bay Area tech vibes
- overall great - really sharp at thinking through ideas