Dan Romero pfp
Dan Romero
@dwr.eth
Let's say you have a corpus of text — 10 million words — about a specific topic. 1. What's the best way to "train a model" on that text? 2. Is that even the right term? Or is it using an existing foundational model and then augmenting it? Fine-tuning it? Something else?
18 replies
2 recasts
47 reactions

osama pfp
osama
@osama
depends on use case. you don’t need to train/finetune for an mvp. just rag and prompt engineering. if hallucinations are a problem, e.g. health care, try deterministic quoting. happy to answer q’s as i have deployed these for clients across real estate and one underway w/ smart contracts (on base most prolly)
1 reply
0 recast
5 reactions
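A minimal sketch of the RAG-plus-prompt-engineering flow described above, in Python with the openai client. The retrieve helper is hypothetical (a stand-in for whatever vector store you use), and the verbatim-quote check is just one illustrative way to approximate "deterministic quoting", not a reference implementation.

```python
import re
from openai import OpenAI

client = OpenAI()

def answer(question, retrieve):
    # `retrieve` is a hypothetical helper standing in for your vector store
    chunks = retrieve(question, k=5)   # top-k passages from the 10M-word corpus
    context = "\n\n".join(chunks)
    resp = client.chat.completions.create(
        model="gpt-4o",  # any chat-capable model
        messages=[
            {"role": "system", "content":
                "Answer ONLY from the context below. Support every claim "
                "with a verbatim quote in double quotes. If the answer is "
                "not in the context, say you don't know.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    text = resp.choices[0].message.content
    # crude stand-in for 'deterministic quoting': every quoted span must
    # appear verbatim in the retrieved context, else flag for human review
    for quote in re.findall(r'"([^"]{20,})"', text):
        if quote not in context:
            return "[needs review] " + text
    return text
```

Flagging rather than silently answering keeps hallucination handling auditable, which is the point in high-stakes domains like health care.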

Stephan pfp
Stephan
@stephancill
GPTs on ChatGPT should do the job with the least hassle if you are just looking to ask questions about the corpus https://warpcast.com/stephancill/0xc2366d42
0 reply
0 recast
0 reaction

Daniel - Bountycaster pfp
Daniel - Bountycaster
@pirosb3
What should the model do? Would this be an instruction-based model (answering questions, similar to ChatGPT)?
1 reply
0 recast
1 reaction

beeboop pfp
beeboop
@beeboop.eth
Training is correct; it's an umbrella term. Fine-tuning refers to the "voice" of the LLM, i.e. its linguistics. RAG, or "Retrieval-Augmented Generation", refers to the addition of a new corpus of data on top of the foundation model's data. You likely want RAG + fine-tuning to achieve your goal.
0 reply
0 recast
3 reactions

Nick pfp
Nick
@nickporter
you want to lean heavily on retrieval-augmented generation (RAG). let me follow up with some resources; working on something similar, albeit with a smaller corpus, for a muni
1 reply
0 recast
1 reaction

Tmophoto pfp
Tmophoto
@tmophoto
It depends on how you want to use it. asking an AI questions about it vs. coming up with new ideas from it will require different approaches. asking questions just needs a general AI with a huge memory (context window) that you can load it all into.
0 reply
0 recast
1 reaction
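A back-of-envelope check on the "load it all in" idea, assuming the common rule of thumb of roughly 1.3 tokens per English word (an assumption, not a measurement of this corpus):

```python
# rough context-window arithmetic (rule-of-thumb numbers)
words = 10_000_000
tokens = int(words * 1.3)  # ~1.3 tokens per English word, assumed
print(f"{tokens:,}")       # ~13,000,000 tokens
# chat-model context windows are on the order of 128k-2M tokens, so the
# full corpus won't fit in one prompt; chunking or retrieval is needed
```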

Inceptionally 🎩 pfp
Inceptionally 🎩
@inceptionally
Not sure what use cases you are trying to solve, and I don't know if it scales to 10 million words, but NotebookLM https://notebooklm.google/ from Google does much of what you are asking, I suspect... unless you are looking to do something more automated and ongoing!
0 reply
0 recast
1 reaction

Paul Dowman 🔴✨ pfp
Paul Dowman 🔴✨
@pauldowman.eth
RAG. https://github.blog/2024-04-04-what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/
0 reply
0 recast
0 reaction

Rani pfp
Rani
@4484
u can just do a custom GPT on openai. takes 5 minutes.
2 replies
0 recast
0 reaction

Tom Jeans pfp
Tom Jeans
@thomasjeans
you are describing fine-tuning. OpenAI custom GPTs make this very easy, but I’m not sure about the max upload size or what type of interface you’re looking to use for your specialized model. fine-tuning a Llama 3 is probably the best bet if you want more than a quick hacky solution
0 reply
0 recast
0 reaction
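A minimal sketch of what LoRA fine-tuning a Llama 3 base model looks like with Hugging Face transformers and peft; the model name, rank, and target modules are illustrative choices, not the only ones.

```python
# Minimal LoRA fine-tuning sketch for a Llama 3 base model. Model name,
# rank, and target modules are illustrative, not prescriptive.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3-8B"  # gated; requires access approval
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a small fraction of the 8B weights
# from here, train on the tokenized corpus with transformers.Trainer
```

LoRA trains small adapter matrices instead of all 8B weights, which is why it is the usual route for fine-tuning on a single GPU.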

Giuliano Giacaglia pfp
Giuliano Giacaglia
@giu
Fine-tune a pre-existing model
0 reply
0 recast
0 reaction

wizard not parzival pfp
wizard not parzival
@shoni.eth
training is good for things like making it reply with sarcasm or with prepared and filtered responses. you're talking about embedding new data for unique results, i think. i am starting with openai gpt but then others, hence the name unbias
0 reply
0 recast
0 reaction

Maxi pfp
Maxi
@maxast.eth
You should talk to @seref.eth
0 reply
0 recast
0 reaction

K pfp
K
@kijijij
Those 10M words should be in natural language, then:
* use https://www.trychroma.com/ to push data and connect with the LLM
* create a set of QA pairs to verify satisfaction
* create more data from the existing corpus to reduce hallucination
Interested in a PoC for this, LMK
0 reply
0 recast
0 reaction

ashesfall.eth pfp
ashesfall.eth
@ashesfall
That’s simply not a large enough corpus for (present-day) ML systems to derive useful comprehension of the language that the text is written in. So your only choice, if you want useful language generation, is to fine-tune an existing model to take advantage of language capabilities derived from a much larger dataset.
0 reply
0 recast
0 reaction

Lucas Lejeune pfp
Lucas Lejeune
@lucaslejeune
Fine-tuning a pre-existing model would be the best way, I believe. Probably with Python, or there's a web UI called oobabooga which lets you do just that
0 reply
0 recast
0 reaction