July pfp
July
@july
My take on this:
- 6-8x H100 was required for me (A100 did not cut it)
- FP8 and FlashAttention (torch) are pretty cool! I need to read more about them
- Context length: ~30K-ish is what I think I got to
- tried ollama and vLLM
- Q4_K_M was good; Q5_K_M and Q6_K_M didn't work
- got to about ~20 tok/s
- Lambda Labs was the smoothest setup for me

On the model itself:
- full R1-671B is night-and-day different from the distills (14B, 32B, 70B-llama, etc.) for me
- It also feels... very Asian. The model feels like an Asian parent sometimes; it says stuff (even in English) that people I know would say
- Claude is a lot more empathetic (a sycophant, even); GPT-4 is a bit of a nice, happy, but rational and emotionally absent-minded techbro
- the internal <think> with R1 feels like a teacher that sort of looks down on you (again, I think it's the Confucian teacher/student relationship)
- but it responds in a llama/ChatGPT-like "hey! thanks for asking!" SF Bay Area tech vibe
- overall great; really sharp at thinking through ideas
3 replies
1 recast
38 reactions
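[Editor's note: a minimal sketch of the kind of multi-GPU vLLM setup described in the post above. The checkpoint id, sampling values, and prompt are illustrative assumptions, not July's actual configuration.]

```python
# Sketch: serving the full DeepSeek-R1 checkpoint with vLLM, sharded across
# 8 GPUs via tensor parallelism, with a ~30K context window as in the post.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # assumed checkpoint id for the full 671B model
    tensor_parallel_size=8,           # spread weights across 8x H100
    max_model_len=30000,              # roughly the ~30K context mentioned above
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Walk me through the tradeoffs of Q4_K_M vs Q5_K_M."], params)
print(outputs[0].outputs[0].text)
```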

July pfp
July
@july
correction:
- I meant to say "FlexAttention" (not FlashAttention)
- I'm particularly curious to potentially use it to implement custom attention mechanisms that optimize memory and compute usage, which would drive down costs
https://pytorch.org/blog/flexattention/
0 reply
0 recast
10 reactions
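[Editor's note: a minimal FlexAttention sketch (needs PyTorch >= 2.5), illustrating the "custom attention mechanism" idea from the correction above. The shapes and the toy causal-plus-distance-bias score_mod are illustrative assumptions.]

```python
# score_mod rewrites each attention score on the fly, so the full S x S
# score matrix never has to be materialized; that is where the memory
# savings come from.
import torch
from torch.nn.attention.flex_attention import flex_attention

device = "cuda" if torch.cuda.is_available() else "cpu"
B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device=device) for _ in range(3))

def causal_with_bias(score, b, h, q_idx, kv_idx):
    # Causal mask plus a small distance penalty that grows with head index.
    bias = -(q_idx - kv_idx) * (h + 1) * 0.05
    return torch.where(q_idx >= kv_idx, score + bias, -float("inf"))

out = flex_attention(q, k, v, score_mod=causal_with_bias)  # (B, H, S, D)
```

Wrapping the call with torch.compile is what fuses the score_mod into a FlashAttention-style kernel, which is where the compute and cost savings the post alludes to come from.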

agusti pfp
agusti
@bleu.eth
u think it might be worth it to go for 2x Mac M2 Ultra to run it locally w/ exo?
1 reply
0 recast
1 reaction

ܙܟܪܝܐ’s🧠 pfp
ܙܟܪܝܐ’s🧠
@zef
Spot on re full vs distilled models. Together AI has a pretty good implementation (but at $7/1M tok last I checked…)
1 reply
0 recast
1 reaction