shoni.eth pfp
shoni.eth
@alexpaden
Here are the concise training-data allocations for the specified models (most don't share this):

**Llama 3.1:**
- 50% General Knowledge
- 25% Mathematical and Reasoning
- 17% Code
- 8% Multilingual

**BLOOM's ROOTS Corpus:**
- 30% English
- 0.00002% Chi Tumbuka (the least-represented language)
- 38% from the OSCAR corpus
- 62% from manually selected sources

https://www.sequoiacap.com/podcast/training-data-joe-spisak/

BLOOM is a 176B-parameter multilingual model from BigScience (hosted on Hugging Face), trained on ROOTS (~1.6 TB, ~350B tokens, 46 natural languages, 13 programming languages). (Quick sanity check on these percentages below.)
1 reply
0 recast
4 reactions
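
A quick way to sanity-check the reported mixes is to drop them into dictionaries and confirm each sums to ~100%. This is only an illustrative sketch; the names and structure are mine, and the percentages are just the ones quoted in the cast above.

```python
# Sketch only: percentages copied from the cast above; names/structure are illustrative.
llama_3_1_mix = {
    "general_knowledge": 0.50,
    "math_and_reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}

# ROOTS split by how the data was gathered (OSCAR crawl vs. curated sources).
roots_by_source = {
    "oscar_corpus": 0.38,
    "manually_selected": 0.62,
}

for name, mix in {"llama_3_1": llama_3_1_mix, "roots_by_source": roots_by_source}.items():
    total = sum(mix.values())
    assert abs(total - 1.0) < 1e-6, f"{name} mix sums to {total}, expected ~1.0"
    print(f"{name}: OK ({total:.2f})")
```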

shoni.eth pfp
shoni.eth
@alexpaden
Llama 3.1, trained on 15T tokens (cutoff Dec 2023), uses 50% general knowledge, 25% math/reasoning, 17% code, and 8% multilingual data. Fine-tuning added ~25M synthetic examples (e.g., 2.7M coding dialogues), generated partly by the 405B model itself for skills like coding, multilingual support, and long-context tasks (128K-token window). The vocabulary is 128,256 tokens; reported throughput is ~457 tokens/s for the 70B and ~129 tokens/s for the 405B. Preprocessing filtered out NSFW content and duplicates; training ran on 16K H100 GPUs at ~400 TFLOPS each. Compared to Llama 3 (roughly 95% English), the mix is more diverse, but heavy reliance on synthetic data may introduce biases. (Rough per-category token math below.)
0 reply
0 recast
0 reaction
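
Rough arithmetic from those figures, as a back-of-envelope sketch only: the 15T total, the mix percentages, and the throughput numbers are the ones quoted above; everything else (variable names, the 1M-token comparison) is my own illustration.

```python
# Back-of-envelope only: figures are those quoted in the cast above.
TOTAL_TRAINING_TOKENS = 15e12  # 15T tokens, cutoff Dec 2023

mix = {
    "general_knowledge": 0.50,
    "math_and_reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}

# Approximate token count per category.
for category, share in mix.items():
    print(f"{category:>20}: ~{share * TOTAL_TRAINING_TOKENS / 1e12:.2f}T tokens")

# At the quoted serving throughputs, time to generate 1M output tokens.
for model, tok_per_s in {"70B": 457, "405B": 129}.items():
    hours = 1_000_000 / tok_per_s / 3600
    print(f"Llama 3.1 {model}: ~{hours:.1f} h per 1M tokens at {tok_per_s} tok/s")
```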