Leo pfp
Leo
@lsn
Has anyone been doing large-scale thematic analysis of free text using LLMs? I'm about to start doing this at my startup and wondered if anyone had ideas about best practice and prompting. Please recast to other relevant channels! 🙏
4 replies
2 recasts
11 reactions

Kasra Rahjerdi pfp
Kasra Rahjerdi
@jc4p
lemme know if you want to hop on a call to discuss this, i do it a ton (eg for a project with a training hospital we’re taking reviews of the residents and classifying/extracting concepts) // i’ve done it a ton for curating LLM training data too
1 reply
0 recast
1 reaction

Kasra Rahjerdi pfp
Kasra Rahjerdi
@jc4p
tl;dr most reliable approach for me has been:
- grab embeddings of all unique texts (e.g. each tweet, if you're doing tweets)
- k-means clustering with whatever N you think makes sense
- sampling from the center of each cluster
- asking an LLM to classify the topics in each cluster
2 replies
0 recast
1 reaction
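The embed → cluster → sample-from-center steps above can be sketched roughly like this. A minimal sketch with assumptions: TF-IDF vectors stand in for real embeddings (in practice you'd use an LLM embedding API or a sentence-embedding model), and the toy texts and N=3 are made up for illustration.

```python
# Sketch of the embed -> cluster -> sample pipeline.
# Assumption: TF-IDF stands in for LLM embeddings; swap in a real
# embedding model for production use.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

texts = [
    "the app keeps crashing on login",
    "login crashes every time I open it",
    "love the new dark mode theme",
    "dark mode looks great, thanks",
    "please add export to CSV",
    "CSV export would be really useful",
]

# 1. Embed all unique texts (TF-IDF here as a placeholder for real embeddings).
unique_texts = list(dict.fromkeys(texts))
embeddings = TfidfVectorizer().fit_transform(unique_texts)

# 2. K-means with whatever N seems sensible for the corpus.
n_clusters = 3
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)

# 3. Take the text closest to each cluster center -- these representative
#    samples are what you'd hand to an LLM to name each cluster's topic.
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, embeddings)
representatives = [unique_texts[i] for i in closest]
for cluster_id, text in enumerate(representatives):
    print(f"cluster {cluster_id}: {text}")
```

The LLM only ever sees a handful of representative texts per cluster, which is what keeps this cheap at scale.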

Kasra Rahjerdi pfp
Kasra Rahjerdi
@jc4p
oh, and if you're doing hardcoded categories, just make a classifier (RoBERTa or similar)
1 reply
0 recast
1 reaction

Leo pfp
Leo
@lsn
When you say 'make a classifier', what do you mean? With hardcoded categories I've just asked ChatGPT to pick one of my categories, but I assume you mean something a bit more low-level
1 reply
0 recast
1 reaction

Kasra Rahjerdi pfp
Kasra Rahjerdi
@jc4p
check out this gist: https://gist.github.com/jc4p/3c76bdb5f85df8f52d8f0b0256097cc3
first file: what you're describing, having the LLM pick one of the categories; I gave it ~1000 examples and saved the data to separate JSON files
second file: use a BERT-like model to learn to look at any text and classify it (for me it was personal vs. objective comments)
third file: use that BERT-like model to run through the entire dataset
1 reply
0 recast
1 reaction

Kasra Rahjerdi pfp
Kasra Rahjerdi
@jc4p
oops, they uploaded in the wrong order:
classify_full_data_fast.py --> running the final classifier
claude_quantify_data_set.py --> the initial data gathering for the classifications
train_classifier.ipynb --> the training code for the classifier
1 reply
0 recast
0 reaction
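The initial data-gathering step (the role claude_quantify_data_set.py plays in the gist) can be sketched as: ask an LLM to pick exactly one of the fixed categories per text, then save the labeled data to per-category JSON files for classifier training. A hedged sketch: `llm_pick_category` is a hypothetical stub standing in for a real LLM API call, and the "personal"/"objective" categories borrow the thread's example, not the gist's actual code.

```python
# Sketch of the LLM-labeling -> JSON step.
# `llm_pick_category` is a HYPOTHETICAL stub; a real run would call an LLM
# with a prompt like "Pick exactly one of: personal, objective" and parse
# the reply.
import json
from pathlib import Path

CATEGORIES = ["personal", "objective"]  # illustrative categories

def llm_pick_category(text: str) -> str:
    # Stub heuristic in place of an actual LLM call.
    return "personal" if " i " in f" {text.lower()} " else "objective"

def label_and_save(texts, out_dir="labeled"):
    Path(out_dir).mkdir(exist_ok=True)
    buckets = {c: [] for c in CATEGORIES}
    for text in texts:
        label = llm_pick_category(text)
        if label in buckets:  # drop malformed LLM replies
            buckets[label].append(text)
    # One JSON file per category, ready for classifier training.
    for category, items in buckets.items():
        with open(Path(out_dir) / f"{category}.json", "w") as f:
            json.dump(items, f, indent=2)
    return buckets

buckets = label_and_save([
    "I think the resident was very kind",
    "The procedure took 45 minutes",
])
```

Guarding against labels outside the allowed set matters in practice, since LLMs occasionally answer with free text instead of one of the given categories.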

Leo pfp
Leo
@lsn
So you get an LLM to do just a sample, maybe a 10% random sample? Presumably it's better to have a human in the loop to quality-check that 10%. Then, when you're happy with it, train a BERT-type model on it to do the remainder
1 reply
0 recast
1 reaction