Let's say you have a corpus of text — 10 million words — about a specific topic.

1. What's the best way to "train a model" on that text?

2. Is that even the right term? Or is it a matter of taking an existing foundation model and then augmenting it? Fine-tuning it? Something else?

I'm not sure what use cases you are trying to solve, and I don't know whether it scales to 10 million words, but I suspect NotebookLM from Google (https://notebooklm.google/) does much of what you are asking, unless you are looking to do something more automated and ongoing!