Dan Romero pfp
Dan Romero
@dwr.eth
Let's say you have a corpus of text — 10 million words — about a specific topic. 1. What's the best way to "train a model" on that text? 2. Is that even the right term? Or is it using an existing foundational model and then augmenting it? Fine-tuning it? Something else?
18 replies
2 recasts
117 reactions

Daniel - Bountycaster pfp
Daniel - Bountycaster
@pirosb3
What should the model do? Would this be an instruction-based model (answering questions, similar to ChatGPT)?
1 reply
0 recast
0 reaction

Dan Romero pfp
Dan Romero
@dwr.eth
Yeah, the ability to give you answers based on what is in the corpus, but nothing else.
4 replies
0 recast
0 reaction

Daniel - Bountycaster pfp
Daniel - Bountycaster
@pirosb3
NOTE: not an expert in any way. What I'd do:
- Index the documents in a vector database (10M words won't fit in one prompt)
- Use popular RAG methods to 1) retrieve relevant documents at query time and 2) generate the answer by including those documents in the prompt
1 reply
0 recast
2 reactions
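
For concreteness, here is a minimal sketch of the retrieve-then-generate loop described in the cast above. It is not from the thread: it assumes the OpenAI Python client (1.x), numpy, and an `OPENAI_API_KEY` in the environment; the chunk size, model names, and the `answer` helper are illustrative choices.

```python
# Minimal RAG sketch: embed chunks, retrieve by cosine similarity, answer from context only.
# Assumes the OpenAI Python client (>= 1.0); model names and chunk size are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk(text: str, words_per_chunk: int = 300) -> list[str]:
    """Split the corpus into fixed-size word chunks (a real pipeline would split smarter)."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk]) for i in range(0, len(words), words_per_chunk)]

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts into unit-length vectors."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def answer(corpus: str, question: str, top_k: int = 5) -> str:
    chunks = chunk(corpus)
    # "Index the documents": for a real 10M-word corpus you would batch these calls
    # and store the vectors in a vector database instead of an in-memory array.
    chunk_vecs = embed(chunks)
    query_vec = embed([question])[0]
    scores = chunk_vecs @ query_vec  # cosine similarity (vectors are normalized)
    context = "\n\n".join(chunks[i] for i in np.argsort(scores)[-top_k:])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the answer is not in the context, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

The system prompt is what enforces the "answers from the corpus but nothing else" requirement; the dedicated tools mentioned in the following casts handle the chunking, batching, and vector storage that this toy version glosses over.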

Stephan pfp
Stephan
@stephancill
This is the most popular approach to achieve what you want to do, @dwr.eth. If your corpus is PDF documents, iirc you can easily do this with a custom GPT on ChatGPT. Doing it manually will take a bit more effort and some command-line tinkering; resources in the next cast ->
1 reply
0 recast
1 reaction

Stephan pfp
Stephan
@stephancill
Did a quick search for 'easy RAG LLM' and found this: https://github.com/weaviate/Verba. Haven't used it before, but it looks like you just need to supply it with API keys and it will provide you a chat interface to the vector data you've indexed on Weaviate.
1 reply
0 recast
0 reaction
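
For context on what a tool like Verba is wiring up, here is a sketch of indexing and retrieving text directly against Weaviate. This is an assumption-laden illustration, not Verba's actual code: it assumes a local Weaviate instance with the text2vec-openai module enabled and the v3 Python client (the v4 client uses a different API); the `Document` class name and query text are placeholders.

```python
# Index and query text in Weaviate directly (v3 Python client assumed).
# Requires a running Weaviate instance with the text2vec-openai vectorizer enabled;
# the "Document" class name and the query are illustrative placeholders.
import weaviate

client = weaviate.Client("http://localhost:8080")

# One-time schema setup: a class whose objects are vectorized on insert.
client.schema.create_class({
    "class": "Document",
    "vectorizer": "text2vec-openai",
    "properties": [{"name": "text", "dataType": ["text"]}],
})

# Index the corpus chunk by chunk (batch imports would be used for 10M words).
for chunk_text in ["first chunk of the corpus...", "second chunk..."]:
    client.data_object.create({"text": chunk_text}, "Document")

# Retrieve the chunks most similar to a question; a chat UI like Verba then
# feeds these into the prompt of a generation call.
result = (
    client.query
    .get("Document", ["text"])
    .with_near_text({"concepts": ["your question here"]})
    .with_limit(3)
    .do()
)
print(result["data"]["Get"]["Document"])
```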