Dan Romero
@dwr.eth
Let's say you have a corpus of text (10 million words) about a specific topic.
1. What's the best way to "train a model" on that text?
2. Is that even the right term? Or is it using an existing foundational model and then augmenting it? Fine-tuning it? Something else?
18 replies
2 recasts
118 reactions

Daniel - Bountycaster
@pirosb3
What should the model do? Would this be an instruction-based model (one that answers questions, similar to ChatGPT)?
1 reply
0 recast
0 reaction

Dan Romero
@dwr.eth
Yeah, the ability to give you answers based on what is in the corpus and nothing else
4 replies
0 recast
0 reaction

Daniel - Bountycaster
@pirosb3
NOTE: not an expert in any way.
What I'd do:
- Index the documents in a vector database
- 10M words won't fit in 1 prompt
- Use popular RAG methods to 1) retrieve relevant documents upon query and 2) generate the answers by including those documents in the prompt
1 reply
0 recast
2 reactions
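
The index-retrieve-generate flow described above can be sketched in a few lines. This is only a toy illustration under loud assumptions: `ToyVectorIndex`, `embed`, and `build_prompt` are hypothetical names, the bag-of-words vector stands in for a real embedding model, the in-memory list stands in for a real vector database, and the sample sentences are made up. The final prompt would be sent to whatever language model you use; that call is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy 'embedding': sparse word counts (assumption: a real system uses an embedding model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorIndex:
    """In-memory stand-in for a vector database (hypothetical, for illustration only)."""

    def __init__(self):
        self.entries = []  # list of (vector, chunk) pairs

    def add(self, chunk):
        self.entries.append((embed(chunk), chunk))

    def search(self, query, k=2):
        """Return up to k chunks most similar to the query, dropping zero-score chunks."""
        q = embed(query)
        scored = [(cosine(q, vec), chunk) for vec, chunk in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for score, chunk in scored[:k] if score > 0]


def build_prompt(question, chunks):
    """Put the retrieved chunks in the prompt and tell the model to use nothing else."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Made-up sample chunks; a real corpus would be split into many thousands of these.
index = ToyVectorIndex()
for chunk in [
    "The library opens at nine in the morning.",
    "Overdue books incur a small daily fine.",
    "The cafe next door sells coffee and pastries.",
]:
    index.add(chunk)

question = "When does the library open?"
top_chunks = index.search(question, k=2)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Note that the instruction in the prompt only asks the model to stay within the retrieved context; it does not guarantee it. Word-count similarity also ranks on filler words like "the", which is one reason real setups use learned embeddings.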

ash
@aes
I would:
> create an endpoint to an S3 bucket with the text / resources you want to interact with
> create a GPT action that uses the endpoint to access the text
> use the ChatGPT (GPT-4o) interface to "talk" with the documents
OR
use Brev.dev to fine-tune an open-source model like Mistral 7B on your text
0 reply
0 recast
0 reaction
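
For the fine-tuning route, the corpus usually has to be split into training-sized pieces first. A rough sketch of that preparation step, with assumptions stated plainly: the 500-word chunk size and the `{"text": ...}` JSONL record shape are illustrative choices, not a format required by Brev.dev, Mistral, or any particular trainer, and `chunk_words` and `to_jsonl` are hypothetical helper names.

```python
import json


def chunk_words(text, words_per_chunk=500):
    """Split text into consecutive chunks of at most words_per_chunk words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def to_jsonl(chunks):
    """Serialize chunks as JSON Lines, one {"text": ...} record per line (assumed format)."""
    return "\n".join(json.dumps({"text": c}) for c in chunks)


# Placeholder corpus of 1,200 identical words, just to exercise the code.
sample_corpus = "word " * 1200
chunks = chunk_words(sample_corpus, words_per_chunk=500)
jsonl = to_jsonl(chunks)
print(len(chunks))  # 3 chunks: 500 + 500 + 200 words

# Scale check for the 10-million-word corpus from the question.
corpus_words = 10_000_000
estimated_chunks = -(-corpus_words // 500)  # ceiling division
print(estimated_chunks)  # 20000
```

At 500 words per chunk, the 10-million-word corpus comes out to about 20,000 records. Check the documentation of whichever fine-tuning tool you pick for the record format it actually expects.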

Marwan ♋️
@marwan1337
There's a tool called PrivateGPT that does exactly this. It won't answer anything not based on the text provided. https://docs.privategpt.dev/overview/welcome/introduction
0 reply
0 recast
1 reaction