Sigrlami
@sigrlami
People wonder why Llama has a knowledge cutoff. The model was trained on 15T tokens and required 1.3M H100-hours of compute. On a 16k-H100 cluster, that works out to roughly 4 days of wall-clock time to train from scratch. Later this year, training should be about twice as fast on the same H100s, compared to the current ~400 TFLOPS/GPU.
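The "roughly 4 days" figure follows directly from the quoted numbers; a minimal sketch of the arithmetic (using the post's figures, with all constants assumed from the post itself):

```python
# Back-of-the-envelope check of the training-time claim in the post.
GPU_HOURS = 1.3e6       # total H100-hours quoted for training
CLUSTER_GPUS = 16_000   # H100s in the assumed cluster
HOURS_PER_DAY = 24

# Wall-clock days if the full cluster runs the job end to end
wall_clock_days = GPU_HOURS / CLUSTER_GPUS / HOURS_PER_DAY
print(f"{wall_clock_days:.1f} days")  # ≈ 3.4 days, i.e. "roughly 4 days"
```

Doubling per-GPU throughput (e.g. from ~400 to ~800 TFLOPS effective) would halve that figure on the same hardware.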
1 reply
0 recast
2 reactions
Sigrlami
@sigrlami
However, making training continuous is still very difficult. So until we reach adequate hardware availability, there will always be a knowledge gap, especially with the amount of information generated daily rising all the time.
0 reply
0 recast
0 reaction