Let's say you have a corpus of text — 10 million words — about a specific topic.

1. What's the best way to "train a model" on that text?

2. Is that even the right term? Or is it using an existing foundational model and then augmenting it? Fine-tuning it? Something else?

Working on Farcaster and Warpcast. Longer thoughts at https://dwr.email

you want to lean heavily on retrieval-augmented generation (RAG). let me follow up with some resources

working on something similar for a muni, albeit with a smaller corpus

in any case, you'd rely on fine-tuning more for shaping how the model handles prompts, or for driving a specific use case on top of whatever you're asking the RAG setup to dig up

you'll definitely need a vector db; Chroma is solid. would be happy to connect you with the team i'm working with @dwr.eth
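
to make the RAG + vector db suggestion concrete, here's a rough sketch of the retrieval half in Python. it's an illustration only, under some loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, `TinyVectorStore` is a hypothetical in-memory stand-in for a vector db (it is not Chroma's API), and the sample docs are made up. a real setup would swap in an actual embedding model and vector db, then send the assembled prompt to an LLM.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def embed(text):
    """Toy 'embedding': lowercase word counts (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Hypothetical in-memory store: keeps (chunk, vector) pairs and ranks by similarity."""

    def __init__(self):
        self.items = []

    def add(self, chunks):
        for chunk in chunks:
            self.items.append((chunk, embed(chunk)))

    def query(self, question, k=2):
        q = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, store, k=2):
    """Retrieve the top-k chunks and place them in the prompt as context."""
    context = "\n---\n".join(store.query(question, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# made-up sample documents, purely for illustration
docs = [
    "The city council meets on the first Tuesday of every month.",
    "Parking permits can be renewed online or at city hall.",
    "Trash and recycling are collected weekly on Thursdays.",
]

store = TinyVectorStore()
store.add(docs)
prompt = build_prompt("When is trash collected?", store, k=1)
print(prompt)
```

the point of the design: the model itself is never retrained on the 10 million words. the corpus gets chunked and indexed once (`chunk_words` would handle longer documents), and at question time only the few most relevant chunks are pulled into the prompt.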

not 1:1 relevant but you might find it helpful in reframing and refining the objective

https://github.com/daveshap/SparsePrimingRepresentations

https://medium.com/@dave-shap/beyond-vector-search-knowledge-management-with-generative-ai-6c2d10b481a0