Let's say you have a corpus of text — 10 million words — about a specific topic.

1. What's the best way to "train a model" on that text?

2. Is that even the right term? Or is it taking an existing foundation model and then augmenting it? Fine-tuning it? Something else?

What should the model do? Would this be an instruction-based model (one that answers questions, similar to ChatGPT)?

Hi! My name is Daniel. I am building bountycaster.xyz

Ex Phantom, 0x Project engineer.

From Italy originally, currently live in NYC

Yeah, the ability to give answers based on what is in the corpus, but nothing else

NOTE: not an expert in any way

What I'd do:
- Index the documents in a vector database, since 10M words won't fit in one prompt
- Use a standard RAG (retrieval-augmented generation) setup to 1) retrieve the relevant documents for each query and 2) generate the answer with those documents included in the prompt
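
Rough sketch of those two steps, in case it helps to see the shape of it. To keep it runnable with no dependencies I've swapped in plain TF-IDF with cosine similarity instead of real embeddings and a vector database, so treat it as a stand-in and not what you'd ship. Every name here (`chunk_words`, `TinyIndex`, `build_prompt`) is made up for illustration, and the chunk sizes are arbitrary guesses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_words(text, size=200, overlap=50):
    """Split text into overlapping chunks of at most `size` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


class TinyIndex:
    """Toy in-memory index: TF-IDF vectors standing in for embeddings."""

    def __init__(self, chunks):
        self.chunks = chunks
        counts = [Counter(tokenize(c)) for c in chunks]
        n = len(chunks)
        # Document frequency: number of chunks containing each term.
        df = Counter(term for c in counts for term in c)
        # Smoothed inverse document frequency.
        self.idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts):
        """Turn term counts into a unit-length TF-IDF vector (as a dict)."""
        vec = {t: c * self.idf.get(t, 0.0) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {t: w / norm for t, w in vec.items()}

    def search(self, query, k=3):
        """Return up to k chunks with the highest cosine similarity to the query."""
        q = self._weigh(Counter(tokenize(query)))
        scored = [
            (sum(w * vec.get(t, 0.0) for t, w in q.items()), i)
            for i, vec in enumerate(self.vectors)
        ]
        scored.sort(reverse=True)
        return [self.chunks[i] for score, i in scored[:k] if score > 0]


def build_prompt(question, passages):
    """Put the retrieved passages in the prompt and restrict the model to them."""
    context = "\n\n".join(f"[{n}] {p}" for n, p in enumerate(passages, 1))
    return (
        "Answer using only the passages below. "
        "If the answer is not in them, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

You'd run `chunk_words` over the corpus once, build the index from the chunks, then for each question call `search` and send the output of `build_prompt` to whatever LLM you're using. The "only the passages below" line in the prompt is what nudges the model toward corpus-only answers, though it's not a hard guarantee.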

this is the most popular approach to achieve what you want to do @dwr.eth

if your corpus is PDF documents, iirc you can easily do this with a custom GPT on ChatGPT

doing it manually will take a bit more effort and some command line tinkering - resources in next cast ->