Dan Romero on Warpcast

Dan Romero

Let's say you have a corpus of text — 10 million words — about a specific topic. 1. What's the best way to "train a model" on that text? 2. Is that even the right term? Or is it using an existing foundational model and then augmenting it? Fine-tuning it? Something else?

Daniel - Bountycaster

What should the model do? Would this be an instruction-based model (one that answers questions, similar to ChatGPT)?

Dan Romero

Yeah, the ability to give you answers based on what is in the corpus, but nothing else.

Daniel - Bountycaster

NOTE: not an expert in any way.

What I'd do:
- Index the documents in a vector database
- 10M words won't fit in one prompt
- Use popular RAG (retrieval-augmented generation) methods to 1) retrieve the relevant documents for each query and 2) generate the answer by including those documents in the prompt
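The retrieve-then-generate flow described above can be sketched roughly as follows. This is only a minimal illustration, not a production setup: a small in-memory TF-IDF index stands in for a real embedding model and vector database, and every name here (`TinyVectorIndex`, `build_prompt`, the sample documents) is hypothetical. The call to the language model itself is left out; the sketch stops at the prompt that would be sent to it.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyVectorIndex:
    """Toy stand-in for a vector database (assumption: TF-IDF, not learned embeddings)."""

    def __init__(self, docs):
        self.docs = list(docs)
        tokenized = [tokenize(d) for d in self.docs]
        n = len(self.docs)
        # Document frequency: in how many documents each token appears.
        df = Counter(t for toks in tokenized for t in set(toks))
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.vectors = [self._embed(toks) for toks in tokenized]

    def _embed(self, tokens):
        """Turn tokens into a unit-length sparse TF-IDF vector."""
        counts = Counter(tokens)
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def search(self, query, k=2):
        """Return up to k documents ranked by cosine similarity to the query."""
        q = self._embed(tokenize(query))
        scored = [
            (sum(w * v.get(t, 0.0) for t, w in q.items()), d)
            for v, d in zip(self.vectors, self.docs)
        ]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [d for score, d in scored[:k] if score > 0]


def build_prompt(question, passages):
    """Assemble a prompt that tells the model to answer only from the passages."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    corpus = [
        "Farcaster is a decentralized social protocol.",
        "Warpcast is a client for Farcaster.",
        "Mistral 7B is an open source language model.",
    ]
    index = TinyVectorIndex(corpus)
    question = "What is Mistral?"
    print(build_prompt(question, index.search(question, k=1)))
```

In a real system the corpus would first be split into chunks, the TF-IDF vectors would be replaced by embeddings from a model, and the prompt would be passed to an LLM. The instruction to answer only from the context is what approximates the "nothing else" requirement, though it does not strictly guarantee it.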

ash

I would:
> create an endpoint to an S3 bucket with the text / resources you want to interact with
> create a GPT action that uses the endpoint to access the text
> use the ChatGPT (GPT-4o) interface to "talk" with the documents

OR use Brev.dev to fine-tune an open source model like Mistral 7B on your text
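For the fine-tuning route, the raw corpus usually has to be split into training-sized pieces first. Here is a rough sketch of that preparation step, assuming the fine-tuning tool accepts JSONL records with a single `text` field. That format is an assumption, not something Brev.dev or Mistral is confirmed to require, and the helper names `chunk_words` and `to_jsonl` are hypothetical.

```python
import json


def chunk_words(text, size=512, overlap=64):
    """Split text into overlapping chunks of at most `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop once the remaining words are already covered by the previous chunk.
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def to_jsonl(chunks):
    """Serialize chunks as JSONL, one {"text": ...} record per line (assumed format)."""
    return "\n".join(json.dumps({"text": c}, ensure_ascii=False) for c in chunks)


if __name__ == "__main__":
    sample = "a b c d e f g"
    print(to_jsonl(chunk_words(sample, size=4, overlap=1)))
```

The chunk size and overlap are arbitrary here; check what record format and sequence length the chosen fine-tuning script expects before generating the full dataset.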

Marwan ♋️

There's a tool called PrivateGPT that does exactly this. It won't answer anything not based on the text provided. https://docs.privategpt.dev/overview/welcome/introduction
