Kasra Rahjerdi
@jc4p
(screams internally)
8 replies
4 recasts
49 reactions

vanishingideal
@vanishingideal
Have you considered int8 embeddings? https://huggingface.co/blog/embedding-quantization
1 reply
0 recasts
4 reactions
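
For reference, a minimal sketch of the int8 quantization being suggested, following the linked Hugging Face blog post. The model name is an assumption for illustration, not what jc4p is actually running:

```python
# Sketch of int8 embedding quantization per the HF embedding-quantization post.
# Model choice is an assumption; any SentenceTransformer embedding model works.
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

texts = ["gm", "what is everyone building this weekend?"]
fp32 = model.encode(texts)  # float32 ndarray, 4 bytes per dimension

# Without calibration_embeddings, the per-dimension min/max ranges are
# estimated from this batch; pass a larger calibration set for real use.
int8 = quantize_embeddings(fp32, precision="int8")

print(fp32.nbytes, "->", int8.nbytes)  # ~4x smaller, with a small recall hit
```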

Kasra Rahjerdi
@jc4p
no i didn't even know that was a thing, thank you!!!!!! if i can't get the FP16 working i'll switch to this
2 replies
0 recasts
1 reaction

Kasra Rahjerdi
@jc4p
right now i'm working on uploading all the chunks to: https://huggingface.co/datasets/jc4p/farcaster-casts-embeddings/tree/main then was planning on nuking the HDs and resuming from the last chunk on each instance (it's 4xGH200)
1 reply
0 recasts
0 reactions
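
A hedged sketch of that upload-then-resume flow with huggingface_hub. The chunk filename pattern and the helper names are assumptions, not the repo's actual layout:

```python
# Resume-from-last-chunk sketch: list what the dataset repo already has,
# then continue from the next chunk index. Filename scheme is hypothetical.
import re
from huggingface_hub import HfApi

api = HfApi()
REPO = "jc4p/farcaster-casts-embeddings"

def last_uploaded_chunk() -> int:
    """Highest chunk index already present in the dataset repo, or -1."""
    files = api.list_repo_files(REPO, repo_type="dataset")
    found = [int(m.group(1)) for f in files
             if (m := re.match(r"chunk_(\d+)\.parquet", f))]
    return max(found, default=-1)

def upload_chunk(path: str, index: int) -> None:
    api.upload_file(
        path_or_fileobj=path,
        path_in_repo=f"chunk_{index:05d}.parquet",
        repo_id=REPO,
        repo_type="dataset",
    )

start = last_uploaded_chunk() + 1  # safe to nuke the local disks once uploads land
```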

vanishingideal
@vanishingideal
How are you cleaning casts? How are you approaching chunks? Only seeing a hash/embedding pair here.
1 reply
0 recasts
1 reaction

Kasra Rahjerdi
@jc4p
i'm running this script on 4xGH200 against the dataset from jc4p/farcaster-casts: https://gist.github.com/jc4p/93c2887453f1852fc716dd364577162f
1 reply
0 recasts
0 reactions
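
The gist itself isn't reproduced here, so the following is only a guess at the shape of such a script: one GPU worker encoding the casts dataset in chunks and writing hash/embedding parquet files. Column names, model, and chunk size are all assumptions:

```python
# Hypothetical single-worker embedding pass; NOT the linked gist.
# Assumes a parquet input with "hash" and "text" columns.
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")
CHUNK = 100_000

df = pd.read_parquet("casts.parquet")
for i in range(0, len(df), CHUNK):
    batch = df.iloc[i : i + CHUNK]
    emb = model.encode(batch["text"].tolist(), batch_size=512,
                       show_progress_bar=True)
    out = pd.DataFrame({"hash": batch["hash"].to_numpy(),
                        "embedding": emb.tolist()})
    out.to_parquet(f"chunk_{i // CHUNK:05d}.parquet")  # matches the upload sketch above
```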

vanishingideal
@vanishingideal
Chunking by FID might have the benefit of isolating failures. Hope it doesn't cost an arm and a leg on lambdalabs. Cleaning data and removing stop-words might lead to better semantic results.
0 replies
0 recasts
0 reactions
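
A sketch of both suggestions together: bucket casts by FID so a failed chunk only invalidates a known set of users, and strip noise before embedding. The stop-word list is a toy placeholder and the column names are assumptions (whether stop-word removal actually helps a transformer embedding model is itself an open question):

```python
# FID-bucketed chunking plus basic cast cleaning. Column names ("fid", "text")
# and the stop-word list are assumptions for illustration.
import re
import pandas as pd

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it"}

def clean_cast(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+", "", text)          # drop mentions
    return " ".join(t for t in text.lower().split() if t not in STOP_WORDS)

df = pd.read_parquet("casts.parquet")
df["text"] = df["text"].map(clean_cast)

# One file per FID bucket: a crash mid-run only costs that bucket.
for bucket, group in df.groupby(df["fid"] % 64):
    group.to_parquet(f"fid_bucket_{bucket:02d}.parquet")
```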