Data

Has anyone been doing large scale thematic analysis of free text using LLMs?

I'm about to start doing this at my startup and wondered if anyone had ideas about best practice and prompting 

Please recast to other relevant channels! 🙏

Be specific.    
Building productscore.org, @germanify // leonasskau.co.uk // Hosting /geopolitics, /strategy, /leo
d33m:mun

lemme know if you want to hop on a call to discuss this, i do it a ton (eg for a project with a training hospital we’re taking reviews of the residents and classifying/extracting concepts) // i’ve done it a ton for curating LLM training data too

tl;dr most reliable for me has been:
- grab embeddings of all unique texts (one per item, e.g. each tweet if you’re doing tweets)
- k-means clustering with whatever N you think makes sense
- sampling from the center of each cluster
- asking an LLM to classify the topics in each cluster
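
rough sketch of those four steps, stdlib only so it runs anywhere. big assumptions: `embed` is a hashed bag-of-words stand-in (swap in a real embedding model), the k-means is a tiny hand-rolled one (a library version is the usual choice), and `build_prompt` only builds the text you'd send to whatever LLM you use. all function names here are made up for illustration.

```python
import math
import random
import zlib


def embed(text, dim=64):
    """Stand-in embedding (assumption): hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        labels = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(v, centroids) for v in vectors]
    return labels, centroids


def central_samples(texts, vectors, labels, centroids, per_cluster=3):
    """For each cluster, the texts closest to that cluster's centroid."""
    out = {}
    for c, centroid in enumerate(centroids):
        members = [(sq_dist(v, centroid), t)
                   for t, v, lab in zip(texts, vectors, labels) if lab == c]
        out[c] = [t for _, t in sorted(members)[:per_cluster]]
    return out


def build_prompt(samples):
    """Text to hand to an LLM; the actual API call is left out on purpose."""
    lines = ["Name the main topic of each cluster of texts below."]
    for c, items in samples.items():
        lines.append(f"Cluster {c}:")
        lines.extend(f"- {t}" for t in items)
    return "\n".join(lines)


raw = [
    "the app keeps crashing on login",
    "login screen crashes every time",
    "love the new dark mode design",
    "dark mode looks great, love it",
    "the app keeps crashing on login",  # duplicate, dropped below
]
texts = list(dict.fromkeys(raw))  # unique texts, original order kept
vectors = [embed(t) for t in texts]
labels, centroids = kmeans(vectors, k=2)
samples = central_samples(texts, vectors, labels, centroids, per_cluster=2)
prompt = build_prompt(samples)
print(prompt)
```

sampling from the center keeps the prompt small and shows the LLM the most typical texts in each cluster rather than the edge cases.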

oh, and if you're doing hardcoded categories just make a classifier (RoBERTa etc)

check out this gist: https://gist.github.com/jc4p/3c76bdb5f85df8f52d8f0b0256097cc3

first file: what you're describing, having the LLM pick one of the categories. i gave it like 1000 examples and saved the data to separate JSON files
second file: use a BERT-like model to learn how to look at any text and classify it (for me it was personal vs objective comments)
third file: use that BERT-like model to run through the entire dataset
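
to show the shape of that three-step flow (this is not the code in the gist): LLM-labelled examples in JSON, train a small classifier on them, then run it over everything. assumption flagged loudly: a tiny naive Bayes model is swapped in for the BERT-like model so the sketch runs with no dependencies, and the example texts, labels and function names are all invented.

```python
import json
import math
from collections import Counter, defaultdict

# Step 1 (assumed format): labels an LLM produced, saved as JSON.
labelled_json = json.dumps([
    {"text": "i felt ignored and upset during my visit", "label": "personal"},
    {"text": "i really loved how kind she was to me", "label": "personal"},
    {"text": "my experience was stressful and i was scared", "label": "personal"},
    {"text": "the resident arrived at 9am and ordered labs", "label": "objective"},
    {"text": "the chart was updated and the dose was recorded", "label": "objective"},
    {"text": "the procedure took 40 minutes and was completed", "label": "objective"},
])


def train(examples):
    """Step 2: fit a naive Bayes stand-in for the BERT-like classifier."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for ex in examples:
        label_counts[ex["label"]] += 1
        word_counts[ex["label"]].update(ex["text"].lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"word_counts": word_counts, "label_counts": label_counts, "vocab": vocab}


def classify(model, text):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    total = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["label_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n / total)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(json.loads(labelled_json))

# Step 3: run the trained classifier over the rest of the dataset.
dataset = ["i was upset and scared", "the dose was recorded at 9am"]
predictions = {text: classify(model, text) for text in dataset}
print(predictions)
```

the point of the split is cost: the LLM only labels the ~1000 training examples, and the small model handles the full dataset.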

When you say ‘make a classifier’ what do you mean?
With hardcoded values I’ve just asked ChatGPT to pick one of my categories but I assume you might mean something a bit more low level