July
@july
My take on this:
- 6~8 x H100 was required for me (A100 did not cut it)
- FP8 + FlashAttention is pretty cool! (torch) I need to read more about it
- Context length ~30K-ish is what I think I got to
- tried ollama, vLLM
- Q4_K_M was good; Q5_K_M, Q6_K_M didn't work
- got to about ~20 tok/s
- lambda labs was the smoothest setup for me

On the model itself:
- full R1-671B is night and day different from the distills (14B, 32B, 70B-llama etc) for me
- It also feels... very asian. the model feels like an asian parent sometimes. it says stuff (even in english) that people I know would say
- claude is a lot more empathetic (a sycophant, even); gpt4 is a bit of a nice, happy but rational and emotionally absent-minded techbro
- the internal <think> with R1 feels like a teacher that sort of looks down on you (again, I think it's the confucian teacher/student relationship)
- but it responds in a llama/chatgpt-like "hey! thanks for asking!" sort of sf bay area tech vibes
- overall great - really sharp at thinking through ideas
3 replies
1 recast
38 reactions
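For reference, a minimal sketch of a multi-GPU vLLM launch along the lines described above (the model id, GPU count, and context length are illustrative assumptions, not the exact config from the post):

```python
# Hedged sketch: serving DeepSeek-R1 with vLLM across one 8-GPU node.
# Model id, parallelism, and context length are assumptions based on the
# post above, not a confirmed reproduction of the exact setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # full 671B MoE checkpoint
    tensor_parallel_size=8,           # shard the weights across 8 GPUs
    max_model_len=32768,              # ~30K context, as reported above
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=1024)
outputs = llm.generate(["Explain FlashAttention in two sentences."], params)
print(outputs[0].outputs[0].text)
```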

agusti
@bleu.eth
u think it might be worth it to go for 2 mac m2 ultras to run it locally w exo?
1 reply
0 recast
1 reaction

July
@july
no - not even close. from the back-of-the-napkin calcs i did: you'd have to run a lower quantization than Q4_K_M on 4~6 x Mac Studio (M4), and you'd still get poorer performance than the A100 setup
2 replies
0 recast
5 reactions
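For reference, a rough sketch of that kind of back-of-the-napkin estimate (every constant here - quant size, MoE active parameters, memory bandwidth - is an approximate assumption, and the outputs are loose upper bounds, not benchmarks):

```python
# Napkin math: can the weights fit, and what does memory bandwidth allow?
# All figures are approximate assumptions, not measurements.

TOTAL_PARAMS_B   = 671    # DeepSeek-R1 total parameters (billions)
ACTIVE_PARAMS_B  = 37     # approx. active params per token (MoE)
BYTES_PER_WEIGHT = 0.56   # roughly Q4_K_M-level quantization

weights_gb = TOTAL_PARAMS_B * BYTES_PER_WEIGHT   # must fit in memory
active_gb  = ACTIVE_PARAMS_B * BYTES_PER_WEIGHT  # read per decoded token

def upper_bound_toks(bandwidth_gbps: float) -> float:
    # Decode is roughly memory-bandwidth-bound; this ignores compute,
    # interconnect, KV-cache reads, and scheduling overhead, so real
    # throughput will be well below these numbers.
    return bandwidth_gbps / active_gb

print(f"weights at ~Q4 quant: ~{weights_gb:.0f} GB")
print(f"2 x M2 Ultra (~800 GB/s each): <= {upper_bound_toks(2 * 800):.0f} tok/s")
print(f"8 x A100 (~2000 GB/s each)   : <= {upper_bound_toks(8 * 2000):.0f} tok/s")
```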

Sdam Amith
@sdamamith
the cpu can’t compete, but mac studio ram is thought to be competitive - are you saying that’s not true? https://x.com/alexocheema/status/1884017521985995178?s=46
1 reply
0 recast
1 reaction

neon
@neonrover
also curious
0 reply
0 recast
0 reaction