might've underestimated how long it'd take me to process 157 million casts, but in 6 hours i'll have authoritative rankings on who said which words first!!

ooh nice i was thinking of building this into our agent but decided it was too much work

is it exact text/ngram or embedding based?

basically i wanted to be able to answer if someone asked "i saw this cast by ___ recently, are there other people who discussed similar things before?"
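a rough sketch of what that query could look like, assuming every cast already has a precomputed embedding vector. the `Cast` record, the `earlier_similar` helper and the `min_score` threshold are all made up for illustration and not from any existing tool:

```python
import math
from typing import NamedTuple


class Cast(NamedTuple):
    # Hypothetical record; field names are assumptions for this sketch.
    author: str
    timestamp: int        # unix seconds
    text: str
    vector: list[float]   # precomputed embedding (assumed to exist already)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def earlier_similar(query, casts, min_score=0.8, top_k=5):
    """Return (score, cast) pairs for casts by other authors that were
    posted before `query` and are at least `min_score` similar to it."""
    hits = []
    for cast in casts:
        if cast.timestamp >= query.timestamp or cast.author == query.author:
            continue
        score = cosine(query.vector, cast.vector)
        if score >= min_score:
            hits.append((score, cast))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:top_k]
```

this brute-force scan is only reasonable for small samples; at 157 million casts you'd want an approximate nearest-neighbor index instead.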

right now this is lemmatization + tokenizing, but for your case you'd definitely need embeddings. you can generate those pretty cheaply on a local GPU using the hf dataset and https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 though!
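for anyone curious, the "who said it first" part boils down to something like the sketch below. this is not the actual pipeline: the regex tokenizer and the `first_uses` helper are assumptions, and the `lemmatize` argument is a placeholder where a real lemmatizer would be plugged in (the default just passes tokens through unchanged):

```python
import re

# Simplistic tokenizer (assumption): lowercase runs of letters, digits, apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def first_uses(casts, lemmatize=lambda token: token):
    """Map each normalized word to the (timestamp, author) of its earliest use.

    casts: iterable of (timestamp, author, text) tuples, in any order.
    lemmatize: hypothetical hook for a real lemmatizer; identity by default.
    """
    first = {}
    for timestamp, author, text in casts:
        for token in set(tokenize(text)):
            word = lemmatize(token)
            # Keep the earliest timestamp seen so far for this word.
            if word not in first or timestamp < first[word][0]:
                first[word] = (timestamp, author)
    return first
```

since it only keeps the minimum timestamp per word, the casts don't need to be sorted first, which matters when streaming 157 million of them.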


i won an Optimism grant to build something like this and open-source it, but never started because they never gave me the money lmao

something like this would be very useful so that all clients have access to a solid baseline search engine

lmaoooo yeah if you have a local RTX GPU it’s very doable for cheap, but it'll take a day or two to run. if you have $100-$200 to waste, it can be done on a remote machine much quicker

the hard part is keeping it constantly updated, because the hub api doesn’t have any concept of “get me casts from the last 24 hours”
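one possible workaround, sketched under the assumption that you can at least list a single user's casts newest-first: keep a per-user checkpoint and stop paging once you reach something already indexed. `fetch_newest_first` is a hypothetical callable you'd have to write yourself, not a real hub api method, and `sync_new_casts` is just an illustrative name:

```python
def sync_new_casts(fids, fetch_newest_first, last_seen):
    """Collect casts newer than each user's checkpoint.

    fids: user ids to refresh.
    fetch_newest_first: hypothetical callable, fid -> iterable of
        (timestamp, cast_hash, text) tuples ordered newest first.
    last_seen: dict of fid -> newest timestamp already indexed (updated in place).
    """
    new_casts = []
    for fid in fids:
        cutoff = last_seen.get(fid, -1)
        newest = cutoff
        for timestamp, cast_hash, text in fetch_newest_first(fid):
            if timestamp <= cutoff:
                break  # everything from here on is already indexed
            new_casts.append((fid, timestamp, cast_hash, text))
            newest = max(newest, timestamp)
        last_seen[fid] = newest
    return new_casts
```

this still means one request per user per refresh, and two casts sharing the exact checkpoint timestamp could be skipped, so deduplicating by cast hash would be a safer cutoff in practice.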