Let's say you have a corpus of text — 10 million words — about a specific topic.

1. What's the best way to "train a model" on that text?

2. Is that even the right term? Or would you take an existing foundation model and augment it, fine-tune it, or do something else?

Those 10M words should be in natural language. Then:
* Use https://www.trychroma.com/ to store the data and connect it to an LLM.
* Create a set of QA pairs to verify that the answers are satisfactory.
* Generate more data from the existing corpus to reduce hallucination.
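
For a rough idea of what the first two steps look like, here is a minimal sketch in plain Python. It is not Chroma's API: `TinyIndex` is a hypothetical stand-in for a vector store, and it ranks chunks by bag-of-words cosine similarity instead of learned embeddings, so treat it as an illustration of the flow only. The function names, chunk sizes, and sample texts are all assumptions, and the actual LLM call is left out.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping chunks of at most `size` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def vectorize(text):
    """Bag-of-words term counts (a crude substitute for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyIndex:
    """Hypothetical in-memory stand-in for a vector store such as Chroma."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        self.vectors = [vectorize(c) for c in self.chunks]

    def query(self, question, k=2):
        """Return the k chunks most similar to the question."""
        q = vectorize(question)
        ranked = sorted(
            range(len(self.chunks)),
            key=lambda i: cosine(q, self.vectors[i]),
            reverse=True,
        )
        return [self.chunks[i] for i in ranked[:k]]


def build_prompt(question, passages):
    """Assemble the text that would be sent to the LLM (call not shown)."""
    context = "\n\n".join(passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def retrieval_hit_rate(index, qa_pairs, k=2):
    """Fraction of questions whose expected answer text appears in the top-k chunks."""
    hits = sum(
        any(answer.lower() in p.lower() for p in index.query(question, k))
        for question, answer in qa_pairs
    )
    return hits / len(qa_pairs)


# Made-up sample chunks; a real run would use chunk_words() on the corpus.
docs = [
    "Chroma is a vector database used to store embeddings.",
    "Fine-tuning updates model weights on new examples.",
    "Retrieval augmented generation adds retrieved passages to the prompt.",
]
index = TinyIndex(docs)
qa_pairs = [
    ("What is a vector database?", "Chroma"),
    ("How does fine-tuning change weights?", "model weights"),
]
prompt = build_prompt(qa_pairs[0][0], index.query(qa_pairs[0][0], k=1))
score = retrieval_hit_rate(index, qa_pairs, k=1)
```

The QA set here only checks whether retrieval surfaces the right passage. Checking the LLM's final answers would need a separate comparison step on top of this.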
I'm interested in a PoC for this, LMK.