July
@july
My take on this:
- 6~8 x H100 was required for me (A100 did not cut it)
- FP8 + FlashAttention is pretty cool! (torch) I need to read more about it
- Context length ~30K-ish is what I think I got to
- tried ollama, vLLM
- Q4_K_M was good; Q5_K_M, Q6_K_M didn't work
- got to about ~20 tok/s
- lambda labs was the smoothest setup for me

On the model itself:
- full R1-671B is night and day different from the distills (14B, 32B, 70B-llama etc) for me
- It also feels... very asian. the model feels like an asian parent sometimes. it says stuff (even in english) that people I know would say
- claude is a lot more empathetic (a sycophant, even); gpt4 is a bit of a nice, happy but rational and emotionally absent-minded techbro
- the internal <think> with R1 feels like a teacher that sort of looks down on you (again, I think it's the confucian teacher/student relationship)
- but it responds in a llama/chatgpt-like "hey! thanks for asking!" sort of sf bay area tech vibes
- overall great - really sharp at thinking through ideas
3 replies
1 recast
38 reactions
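For reference, a minimal sketch of a multi-GPU vLLM launch along the lines described above (the model id, GPU count, and context length are illustrative assumptions, not the exact config from the post):

```python
# Hedged sketch: serving DeepSeek-R1 with vLLM across one 8-GPU node.
# Model id, parallelism, and context length are assumptions based on the
# post above, not a confirmed reproduction of the exact setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # full 671B MoE checkpoint
    tensor_parallel_size=8,           # shard the weights across 8 GPUs
    max_model_len=32768,              # ~30K context, as reported above
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=1024)
outputs = llm.generate(["Explain FlashAttention in two sentences."], params)
print(outputs[0].outputs[0].text)
```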

agusti
@bleu.eth
u think it might be worth it to go for 2 mac m2 ultras to run it locally w exo?
1 reply
0 recast
1 reaction

July
@july
no - not even close. from the back-of-the-napkin calcs i did: you'd have to run a lower quantization than Q4_K_M on 4~6 x Mac Studio (M4), and you'd still get poorer performance than the A100 setup
2 replies
0 recast
5 reactions
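For reference, a rough sketch of that kind of back-of-the-napkin estimate (every constant here - quant size, MoE active parameters, memory bandwidth - is an approximate assumption, and the outputs are loose upper bounds, not benchmarks):

```python
# Napkin math: can the weights fit, and what does memory bandwidth allow?
# All figures are approximate assumptions, not measurements.

TOTAL_PARAMS_B   = 671    # DeepSeek-R1 total parameters (billions)
ACTIVE_PARAMS_B  = 37     # approx. active params per token (MoE)
BYTES_PER_WEIGHT = 0.56   # roughly Q4_K_M-level quantization

weights_gb = TOTAL_PARAMS_B * BYTES_PER_WEIGHT   # must fit in memory
active_gb  = ACTIVE_PARAMS_B * BYTES_PER_WEIGHT  # read per decoded token

def upper_bound_toks(bandwidth_gbps: float) -> float:
    # Decode is roughly memory-bandwidth-bound; this ignores compute,
    # interconnect, KV-cache reads, and scheduling overhead, so real
    # throughput will be well below these numbers.
    return bandwidth_gbps / active_gb

print(f"weights at ~Q4 quant: ~{weights_gb:.0f} GB")
print(f"2 x M2 Ultra (~800 GB/s each): <= {upper_bound_toks(2 * 800):.0f} tok/s")
print(f"8 x A100 (~2000 GB/s each)   : <= {upper_bound_toks(8 * 2000):.0f} tok/s")
```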

Sdam Amith
@sdamamith
the cpu can’t compete, but mac studio ram is thought to be competitive - are you saying that’s not true? https://x.com/alexocheema/status/1884017521985995178?s=46
1 reply
0 recast
1 reaction

neon
@neonrover
also curious
0 reply
0 recast
0 reaction