July pfp
July
@july
My take on this:
- 6~8 x H100 was required for me (A100 did not cut it)
- FP8 + FlashAttention is pretty cool! (torch) I need to read more about it
- context length of ~30K-ish is what I think I got to
- tried ollama, vLLM
- Q4_K_M was good; Q5_K_M, Q6_K didn't work
- got to about ~20 tok/s
- lambda labs was the smoothest setup for me

On the model itself:
- full R1-671B is night-and-day different from the distills (14B, 32B, 70B-llama etc) for me
- it also feels... very Asian. the model feels like an Asian parent sometimes. it says stuff (even in English) that people I know would say
- claude is a lot more empathetic (a sycophant, even); gpt4 is a nice, happy, but rational and emotionally absent-minded techbro
- the internal <think> with R1 feels like a teacher that sort of looks down on you (again, I think it's the Confucian teacher/student relationship)
- but it responds in a llama/chatgpt-like "hey! thanks for asking!" sort of SF bay area tech vibes
- overall great - really sharp at thinking through ideas
3 replies
1 recast
38 reactions
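
For reference, a minimal sketch of how a multi-GPU vLLM run along the lines July describes might look from Python. The model id, parallelism degree, context length, and sampling values below are illustrative assumptions, not July's exact config.

```python
# Hypothetical offline vLLM setup for a tensor-parallel, ~30K-context run.
# Model id and all parameter values are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # assumed model id
    tensor_parallel_size=8,           # shard the model across 8 GPUs (e.g. 8x H100)
    max_model_len=30_000,             # roughly the ~30K context mentioned above
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain FlashAttention in two sentences."], params)
print(outputs[0].outputs[0].text)
```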

agusti pfp
agusti
@bleu.eth
u think it might be worth it to go for 2x Mac M2 Ultra to run it locally w/ exo?
1 reply
0 recast
1 reaction

July pfp
July
@july
no - not even close. from the back-of-the-napkin calcs i did: you'd have to run lower than Q4_K_M (some lower quantization) on 4~6 x Mac M4 Studio, and you'd still get poorer performance than the A100 setup
2 replies
0 recast
5 reactions
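
A rough sketch of that napkin math, just on memory footprint: does a 671B model at a given quantization even fit in two Macs' unified memory? The bits-per-weight and memory figures below are approximate public numbers, assumptions rather than anything measured in this thread.

```python
# Napkin math behind "you'd have to run lower than Q4_K_M": does the model even fit?
# All figures are rough assumptions (approximate public specs), not measurements.
TOTAL_PARAMS_B = 671  # DeepSeek-R1 total parameters, in billions
QUANTS = {"Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.0}  # approx effective bits/weight

def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the quantized weights, in GB."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in QUANTS.items():
    print(f"{name}: ~{weights_gb(TOTAL_PARAMS_B, bits):.0f} GB of weights (before KV cache/overhead)")

# Two 192 GB M2 Ultras give ~384 GB of unified memory, but only part of it is
# GPU-addressable by default, and the KV cache for a ~30K context still has to fit,
# so Q4_K_M is marginal at best; a smaller quant or more machines would be needed.
print(f"2x M2 Ultra unified memory: ~{2 * 192} GB total")
```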

July pfp
July
@july
MPS/Metal isn't at the same level as CUDA, inter-M4 bandwidth is slow, and they're expensive - might as well rent from the cloud or just purchase your own H100s, etc
1 reply
0 recast
2 reactions
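
For scale on the "inter-Mac bandwidth is slow" point, a small sketch comparing the link a sharded exo setup would cross against intra-node GPU interconnects and on-device memory. The numbers are rough public specs used as assumptions, not benchmarks from this thread.

```python
# Approximate bandwidths of the links a sharded model would cross, in GB/s.
# All values are rough public specs (assumptions), not measured results.
LINKS_GB_S = {
    "Thunderbolt 4 between Macs (exo)": 5,     # ~40 Gb/s
    "10 GbE between Macs":              1.25,
    "NVLink 4 between H100s":           900,   # per-GPU aggregate
    "M2 Ultra unified memory":          800,
}

baseline = LINKS_GB_S["NVLink 4 between H100s"]
for name, gbs in LINKS_GB_S.items():
    print(f"{name:34s} ~{gbs:7.2f} GB/s  ({gbs / baseline:.2%} of NVLink)")
```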

Sdam Amith pfp
Sdam Amith
@sdamamith
the cpu can't compete, but mac studio ram is thought to be competitive - are you saying that's not true? https://x.com/alexocheema/status/1884017521985995178?s=46
1 reply
0 recast
1 reaction