Dan Romero
@dwr.eth
Let's say you have a corpus of text — 10 million words — about a specific topic. 1. What's the best way to "train a model" on that text? 2. Is that even the right term? Or is it using an existing foundational model and then augmenting it? Fine-tuning it? Something else?
18 replies
2 recasts
47 reactions
ashesfall.eth
@ashesfall
That’s simply not a large enough corpus for (present-day) ML systems to derive useful comprehension of the language that the text is written in. So your only choice, if you want useful language generation, is to fine-tune an existing model to take advantage of language capabilities derived from a much larger dataset.
0 reply
0 recast
0 reaction
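To put the scale argument in the reply above into rough numbers: the token counts here are order-of-magnitude assumptions for illustration, not figures from the thread.

```python
# Rough scale comparison: the question's 10M-word corpus vs. a typical
# LLM pretraining corpus. All figures are ballpark assumptions.

corpus_words = 10_000_000        # the 10-million-word corpus from the question
tokens_per_word = 1.3            # common English subword-tokenization ratio (assumption)
corpus_tokens = int(corpus_words * tokens_per_word)

# Modern foundation models are pretrained on on the order of ~1 trillion
# tokens (assumption; actual counts vary by model).
pretraining_tokens = 1_000_000_000_000

ratio = pretraining_tokens / corpus_tokens
print(f"corpus ≈ {corpus_tokens:,} tokens")
print(f"pretraining corpus is roughly {ratio:,.0f}x larger")
```

At roughly five orders of magnitude smaller than a pretraining corpus, the 10M-word dataset can steer an existing model's style and domain knowledge via fine-tuning, but it cannot teach a model language from scratch, which is the point the reply makes.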