Dan Romero
@dwr.eth
Let's say you have a corpus of text — 10 million words — about a specific topic.
1. What's the best way to "train a model" on that text?
2. Is that even the right term? Or is it using an existing foundational model and then augmenting it? Fine-tuning it? Something else?
22 replies
11 recasts
100 reactions

Daniel - Bountycaster
@pirosb3
What should the model do? Would this be an instruction-based model (answering questions, similar to ChatGPT)?
1 reply
0 recast
2 reactions

osama
@osama
depends on use case. you don’t need to train/finetune for an mvp, just rag and prompt engineering. if hallucination is a problem, e.g. in health care, try deterministic quoting. happy to answer q’s as i have deployed these for clients across real estate, with one underway w/ smart contracts (on base most prolly)
1 reply
0 recast
8 reactions
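A minimal sketch of the RAG-plus-prompt-engineering MVP described above, using the OpenAI Python client with cosine similarity over embedded chunks. The model names, chunk size, and the quote-verbatim instruction (a crude stand-in for deterministic quoting) are illustrative assumptions, not anything from the post:

```python
# Minimal RAG sketch: embed corpus chunks once, retrieve the closest
# chunks per question, and instruct the model to quote them verbatim.
# Model names and chunk size are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk(text: str, words_per_chunk: int = 300) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

corpus_chunks = chunk(open("corpus.txt").read())
chunk_vecs = embed(corpus_chunks)  # for 10M words, batch and cache these

def answer(question: str, k: int = 5) -> str:
    q = embed([question])[0]
    # cosine similarity between the question and every chunk
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    context = "\n---\n".join(corpus_chunks[i] for i in np.argsort(-sims)[:k])
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
             "Answer ONLY from the provided excerpts and quote them verbatim; "
             "if the answer is not in them, say so."},
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

In practice a vector database replaces the in-memory numpy matrix, but the retrieve-then-prompt shape stays the same.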

beeboop
@beeboop.eth
Training is correct; it's an umbrella term. Fine-tuning refers to the "voice" of the LLM, i.e. its linguistic style. RAG, or "Retrieval-Augmented Generation", retrieves from a new corpus of data and supplies it to the foundation model alongside the prompt. You likely want RAG + fine-tuning to achieve your goal.
0 reply
0 recast
3 reactions
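If you do layer fine-tuning on top of RAG for the "voice" half, here is a hedged sketch of what that looks like with OpenAI's fine-tuning endpoint: chat-format JSONL examples, uploaded and used to start a job. The example pair, file name, and base-model string are placeholders to check against current docs:

```python
# Sketch of OpenAI fine-tuning for tone/voice: chat-format JSONL examples,
# uploaded with purpose="fine-tune", then a job against a base model.
# The example content and base-model name are placeholders.
import json
from openai import OpenAI

client = OpenAI()

examples = [
    {"messages": [
        {"role": "system", "content": "You answer in the project's house style."},
        {"role": "user", "content": "What is the protocol's fee model?"},
        {"role": "assistant", "content": "Short answer first, then the caveats: ..."},
    ]},
    # ...a few hundred such pairs, drawn from the corpus
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

upload = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-4o-mini-2024-07-18",  # assumed fine-tunable base; verify in the docs
)
print(job.id)
```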

Nick
@nickporter
you want to lean heavily on retrieval-augmented generation (RAG). let me follow up with some resources; i'm working on something similar, albeit with a smaller corpus, for a muni
1 reply
0 recast
1 reaction

Tmophoto 🟡🎩
@tmophoto
It depends on how you want to use it. Asking an AI questions about it and coming up with new ideas from it require different approaches. Asking questions just needs a general AI with a memory (context window) huge enough to load it all into.
0 reply
0 recast
1 reaction
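For scale, the load-it-all-in approach only works if the corpus fits in the model's context window. A back-of-the-envelope check, assuming roughly 1.3 tokens per English word (a common heuristic, not an exact figure), suggests 10 million words overflows today's windows:

```python
# Rough context-window check; the tokens-per-word ratio is a heuristic.
words = 10_000_000
tokens = int(words * 1.3)          # ~13M tokens
for window in (128_000, 200_000, 1_000_000):
    print(f"{window:>9,}-token window holds {window / tokens:.1%} of the corpus")
```

Even a 1M-token window holds under 8% of the corpus at once, which is why most replies in this thread reach for retrieval instead.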

Tom Jeans
@thomasjeans
you are describing fine-tuning. OpenAI custom GPTs make this very easy, but I’m not sure about the max upload size or what type of interface you’re looking to use for your specialized model. fine-tuning a Llama 3 is probably the best bet if you want more than a quick hacky solution
0 reply
0 recast
1 reaction
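A hedged skeleton of the Llama 3 route, using Hugging Face transformers with peft LoRA adapters. The model ID, target modules, and rank are common choices assumed here, and the actual training loop is elided:

```python
# LoRA fine-tuning skeleton for a Llama 3 base model with peft.
# Assumes access to the gated meta-llama weights on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common picks for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapters train, not the full base

# From here: tokenize the corpus into training examples and run a standard
# Trainer / SFT loop; the small adapter weights are what you save and ship.
```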

not parzival
@shoni.eth
training is good for things like making it reply with sarcasm, or reply with prepared and filtered responses. you're talking about embedding new data for unique results, i think. i am starting with openai gpt but then others, hence the name unbias
0 reply
0 recast
1 reaction

Paul Dowman
@pauldowman.eth
RAG. https://github.blog/2024-04-04-what-is-retrieval-augmented-generation-and-what-does-it-do-for-generative-ai/
0 reply
0 recast
0 reaction

Rani
@4484
u can just do a custom GPT on openai. takes 5 minutes.
1 reply
0 recast
0 reaction

Stephan
@stephancill
GPTs on chatgpt should do the job with the least hassle if you are just looking to ask questions about the corpus https://warpcast.com/stephancill/0xc2366d42
0 reply
0 recast
0 reaction

Giuliano Giacaglia
@giu
Fine-tune a pre-existing model
0 reply
0 recast
0 reaction

Inceptionally 🎩
@inceptionally
Not sure what use cases you are trying to solve, and I don't know if it scales to 10 million words, but NotebookLM https://notebooklm.google/ from Google does much of what you are asking, I suspect... unless you are looking to do something more automated and ongoing!
0 reply
0 recast
1 reaction

K 🎩🔆
@kijijij
Those 10M words should be in natural language, then:
* use https://www.trychroma.com/ to push the data and connect it with an LLM
* create a set of Q&A to verify satisfaction
* create more data from the existing corpus to reduce hallucination

Interested in a PoC for this, LMK
0 reply
0 recast
0 reaction
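A minimal sketch of the Chroma step in the list above, using the chromadb client's default embedding function; the collection name, chunk contents, and query are assumptions:

```python
# Push corpus chunks into Chroma and query it; Chroma embeds the documents
# with its default embedding function unless you supply one.
import chromadb

client = chromadb.PersistentClient(path="./corpus_db")
collection = client.get_or_create_collection("corpus")

chunks = ["first ~300-word chunk...", "second chunk..."]  # your real chunks
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

results = collection.query(query_texts=["What does the protocol charge?"], n_results=3)
print(results["documents"][0])  # top-3 chunks to paste into the LLM prompt
```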

Ashish
@iashish.eth
@mk say something
0 reply
0 recast
0 reaction

Lucas Lejeune
@lucaslejeune
Fine-tuning a pre-existing model would be the best way, I believe. Probably with Python, or there's a web UI called oobabooga which lets you do just that.
0 reply
0 recast
1 reaction

Gengar368
@gen0x
1. Train model: use AI techniques like deep learning.
2. Fine-tune: adapt an existing model for specifics.
0 reply
0 recast
0 reaction

Maxi
@maxast
You should talk to @seref.eth
0 reply
0 recast
0 reaction

ashesfall.eth
@ashesfall
That’s simply not a large enough corpus for (present-day) ML systems to derive useful comprehension of the language that the text is written in. So your only choice, if you want useful language generation, is to fine-tune an existing model to take advantage of language capabilities derived from a much larger dataset.
0 reply
0 recast
0 reaction
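A back-of-the-envelope version of the size argument above, using the Chinchilla-style heuristic of roughly 20 training tokens per parameter; the heuristic and the tokens-per-word ratio are assumptions, not from the post:

```python
# Why 10M words can't train a useful LM from scratch: under a ~20
# tokens-per-parameter compute-optimal heuristic, the corpus only
# "supports" a sub-million-parameter model.
corpus_tokens = int(10_000_000 * 1.3)    # ~13M tokens
supported_params = corpus_tokens / 20    # ~650k parameters
print(f"~{supported_params:,.0f} params")  # vs billions in modern LLMs
```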

Ruby🎩🔵🐹
@ruby1998
The best way to train a model on a collection of 10 million words is to use techniques like fine-tuning a pre-existing language model or training a new model from scratch, depending on the specific task and data available.
0 reply
0 recast
0 reaction