Dan Romero
@dwr.eth
Let's say you have a corpus of text — 10 million words — about a specific topic.
1. What's the best way to "train a model" on that text?
2. Is that even the right term? Or is it using an existing foundational model and then augmenting it? Fine-tuning it? Something else?
22 replies
11 recasts
100 reactions

Daniel - Bountycaster
@pirosb3
What should the model do? Would this be an instruction-based model (answering questions, similar to ChatGPT)?
1 reply
0 recast
2 reactions

osama
@osama
depends on use case. you don’t need to train/finetune for an mvp, just rag and prompt engineering. if hallucination is a problem, e.g. in health care, try deterministic quoting. happy to answer q’s as i have deployed these for clients across real estate, with one underway w/ smart contracts (on base most prolly)
1 reply
0 recast
8 reactions
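A minimal sketch of the RAG-plus-prompt-engineering MVP described above, using the OpenAI Python client with cosine similarity over embedded chunks. The model names, chunk size, and the quote-verbatim instruction (a crude stand-in for deterministic quoting) are illustrative assumptions, not anything from the post:

```python
# Minimal RAG sketch: embed corpus chunks once, retrieve the closest
# chunks per question, and instruct the model to quote them verbatim.
# Model names and chunk size are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk(text: str, words_per_chunk: int = 300) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

corpus_chunks = chunk(open("corpus.txt").read())
chunk_vecs = embed(corpus_chunks)  # for 10M words, batch and cache these

def answer(question: str, k: int = 5) -> str:
    q = embed([question])[0]
    # cosine similarity between the question and every chunk
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    context = "\n---\n".join(corpus_chunks[i] for i in np.argsort(-sims)[:k])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
             "Answer ONLY from the provided excerpts and quote them verbatim; "
             "if the answer is not in them, say so."},
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

In practice a vector database replaces the in-memory numpy matrix, but the retrieve-then-prompt shape stays the same.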

beeboop
@beeboop.eth
Training is correct; it's an umbrella term. Fine-tuning refers to the "voice" of the LLM, i.e. its linguistic style. RAG, or "Retrieval-Augmented Generation", retrieves from a new corpus of data and supplies it to the foundation model alongside the prompt. You likely want RAG + fine-tuning to achieve your goal.
0 reply
0 recast
3 reactions
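If you do layer fine-tuning on top of RAG for the "voice" half, here is a hedged sketch of what that looks like with OpenAI's fine-tuning endpoint: chat-format JSONL examples, uploaded and used to start a job. The example pair, file name, and base-model string are placeholders to check against current docs:

```python
# Sketch of OpenAI fine-tuning for tone/voice: chat-format JSONL examples,
# uploaded with purpose="fine-tune", then a job against a base model.
# The example content and base-model name are placeholders.
import json
from openai import OpenAI

client = OpenAI()

examples = [
    {"messages": [
        {"role": "system", "content": "You answer in the project's house style."},
        {"role": "user", "content": "What is the protocol's fee model?"},
        {"role": "assistant", "content": "Short answer first, then the caveats: ..."},
    ]},
    # ...a few hundred such pairs, drawn from the corpus
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

upload = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-4o-mini-2024-07-18",  # assumed fine-tunable base; verify in the docs
)
print(job.id)
```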

Nick
@nickporter
you want to lean heavily on retrieval-augmented generation (RAG). let me follow up with some resources; i'm working on something similar, albeit with a smaller corpus, for a muni
1 reply
0 recast
1 reaction

Tmophoto 🟡🎩
@tmophoto
It depends on how you want to use it. Asking an AI questions about it and coming up with new ideas from it require different approaches. Asking questions just needs a general AI with a memory (context window) huge enough to load it all into.
0 reply
0 recast
1 reaction
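For scale, the load-it-all-in approach only works if the corpus fits in the model's context window. A back-of-the-envelope check, assuming roughly 1.3 tokens per English word (a common heuristic, not an exact figure), suggests 10 million words overflows today's windows:

```python
# Rough context-window check; the tokens-per-word ratio is a heuristic.
words = 10_000_000
tokens = int(words * 1.3)          # ~13M tokens
for window in (128_000, 200_000, 1_000_000):
    print(f"{window:>9,}-token window holds {window / tokens:.1%} of the corpus")
```

Even a 1M-token window holds under 8% of the corpus at once, which is why most replies in this thread reach for retrieval instead.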

Tom Jeans
@thomasjeans
you are describing fine-tuning. OpenAI custom GPTs make this very easy, but I’m not sure about the max upload size or what type of interface you’re looking to use for your specialized model. fine-tuning a Llama 3 is probably the best bet if you want more than a quick hacky solution
0 reply
0 recast
1 reaction
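A hedged skeleton of the Llama 3 route, using Hugging Face transformers with peft LoRA adapters. The model ID, target modules, and rank are common choices assumed here, and the actual training loop is elided:

```python
# LoRA fine-tuning skeleton for a Llama 3 base model with peft.
# Assumes access to the gated meta-llama weights on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common picks for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapters train, not the full base

# From here: tokenize the corpus into training examples and run a standard
# Trainer / SFT loop; the small adapter weights are what you save and ship.
```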

not parzival
@shoni.eth
training is good for things like making it reply with sarcasm, or reply with prepared and filtered responses. you're talking about embedding new data for unique results, i think. i am starting with openai gpt but then others, hence the name unbias
0 reply
0 recast
1 reaction

Paul Dowman
@pauldowman.eth
RAG. https://github.blog/2024-04-04-what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/
0 reply
0 recast
0 reaction

Rani
@4484
u can just do a custom GPT on openai. takes 5 minutes.
1 reply
0 recast
0 reaction

Stephan
@stephancill
GPTs on chatgpt should do the job with the least hassle if you are just looking to ask questions about the corpus https://warpcast.com/stephancill/0xc2366d42
0 reply
0 recast
0 reaction

Giuliano Giacaglia
@giu
Fine-tune a pre-existing model
0 reply
0 recast
0 reaction

Inceptionally 🎩
@inceptionally
Not sure what use cases you are trying to solve, and I don't know if it scales to 10 million words, but NotebookLM https://notebooklm.google/ from Google does much of what you are asking, I suspect... unless you are looking to do something more automated and ongoing!
0 reply
0 recast
1 reaction

K 🎩🔆
@kijijij
Those 10M words should be in natural language, then:
* use https://www.trychroma.com/ to push the data and connect it with an LLM
* create a set of Q&A to verify satisfaction
* create more data from the existing corpus to reduce hallucination

Interested in a PoC for this, LMK
0 reply
0 recast
0 reaction
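A minimal sketch of the Chroma step in the list above, using the chromadb client's default embedding function; the collection name, chunk contents, and query are assumptions:

```python
# Push corpus chunks into Chroma and query it; Chroma embeds the documents
# with its default embedding function unless you supply one.
import chromadb

client = chromadb.PersistentClient(path="./corpus_db")
collection = client.get_or_create_collection("corpus")

chunks = ["first ~300-word chunk...", "second chunk..."]  # your real chunks
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

results = collection.query(query_texts=["What does the protocol charge?"], n_results=3)
print(results["documents"][0])  # top-3 chunks to paste into the LLM prompt
```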

Ashish
@iashish.eth
@mk say something
0 reply
0 recast
0 reaction

Lucas Lejeune
@lucaslejeune
Fine-tuning a pre-existing model would be the best way, I believe. Probably with Python, or there's a web UI called oobabooga which lets you do just that.
0 reply
0 recast
1 reaction

Gengar368
@gen0x
1. Train model: use AI techniques like deep learning.
2. Fine-tune: adapt an existing model for specifics.
0 reply
0 recast
0 reaction

Maxi
@maxast
You should talk to @seref.eth
0 reply
0 recast
0 reaction

ashesfall.eth
@ashesfall
That’s simply not a large enough corpus for (present-day) ML systems to derive useful comprehension of the language that the text is written in. So your only choice, if you want useful language generation, is to fine-tune an existing model to take advantage of language capabilities derived from a much larger dataset.
0 reply
0 recast
0 reaction
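A back-of-the-envelope version of the size argument above, using the Chinchilla-style heuristic of roughly 20 training tokens per parameter; the heuristic and the tokens-per-word ratio are assumptions, not from the post:

```python
# Why 10M words can't train a useful LM from scratch: under a ~20
# tokens-per-parameter compute-optimal heuristic, the corpus only
# "supports" a sub-million-parameter model.
corpus_tokens = int(10_000_000 * 1.3)    # ~13M tokens
supported_params = corpus_tokens / 20    # ~650k parameters
print(f"~{supported_params:,.0f} params")  # vs billions in modern LLMs
```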

Ruby🎩🔵🐹
@ruby1998
The best way to train a model on a collection of 10 million words is to use techniques like fine-tuning a pre-existing language model or training a new model from scratch, depending on the specific task and data available.
0 reply
0 recast
0 reaction