Kasra Rahjerdi
@jc4p
(screams internally)
8 replies
4 recasts
49 reactions

vanishingideal
@vanishingideal
Have you considered int8 embeddings? https://huggingface.co/blog/embedding-quantization
1 reply
0 recasts
4 reactions
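
For reference, a minimal sketch of the int8 quantization being suggested, following the linked Hugging Face blog post. The model name is an assumption for illustration, not what jc4p is actually running:

```python
# Sketch of int8 embedding quantization per the HF embedding-quantization post.
# Model choice is an assumption; any SentenceTransformer embedding model works.
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

texts = ["gm", "what is everyone building this weekend?"]
fp32 = model.encode(texts)  # float32 ndarray, 4 bytes per dimension

# Without calibration_embeddings, the per-dimension min/max ranges are
# estimated from this batch; pass a larger calibration set for real use.
int8 = quantize_embeddings(fp32, precision="int8")

print(fp32.nbytes, "->", int8.nbytes)  # ~4x smaller, with a small recall hit
```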

Kasra Rahjerdi
@jc4p
no i didn't even know that was a thing, thank you!!!!!! if i can't get the FP16 working i'll switch to this
2 replies
0 recasts
1 reaction

Kasra Rahjerdi
@jc4p
right now i'm working on uploading all the chunks to: https://huggingface.co/datasets/jc4p/farcaster-casts-embeddings/tree/main then was planning on nuking the HDs and resuming from the last chunk on each instance (it's 4xGH200)
1 reply
0 recasts
0 reactions
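
A hedged sketch of that upload-then-resume flow with huggingface_hub. The chunk filename pattern and the helper names are assumptions, not the repo's actual layout:

```python
# Resume-from-last-chunk sketch: list what the dataset repo already has,
# then continue from the next chunk index. Filename scheme is hypothetical.
import re
from huggingface_hub import HfApi

api = HfApi()
REPO = "jc4p/farcaster-casts-embeddings"

def last_uploaded_chunk() -> int:
    """Highest chunk index already present in the dataset repo, or -1."""
    files = api.list_repo_files(REPO, repo_type="dataset")
    found = [int(m.group(1)) for f in files
             if (m := re.match(r"chunk_(\d+)\.parquet", f))]
    return max(found, default=-1)

def upload_chunk(path: str, index: int) -> None:
    api.upload_file(
        path_or_fileobj=path,
        path_in_repo=f"chunk_{index:05d}.parquet",
        repo_id=REPO,
        repo_type="dataset",
    )

start = last_uploaded_chunk() + 1  # safe to nuke the local disks once uploads land
```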

vanishingideal
@vanishingideal
How are you cleaning casts? How are you approaching chunks? Only seeing a hash/embedding pair here.
1 reply
0 recasts
1 reaction

Kasra Rahjerdi
@jc4p
i'm running this script on 4xGH200 against the dataset from jc4p/farcaster-casts: https://gist.github.com/jc4p/93c2887453f1852fc716dd364577162f
1 reply
0 recasts
0 reactions
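
The gist itself isn't reproduced here, so the following is only a guess at the shape of such a script: one GPU worker encoding the casts dataset in chunks and writing hash/embedding parquet files. Column names, model, and chunk size are all assumptions:

```python
# Hypothetical single-worker embedding pass; NOT the linked gist.
# Assumes a parquet input with "hash" and "text" columns.
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")
CHUNK = 100_000

df = pd.read_parquet("casts.parquet")
for i in range(0, len(df), CHUNK):
    batch = df.iloc[i : i + CHUNK]
    emb = model.encode(batch["text"].tolist(), batch_size=512,
                       show_progress_bar=True)
    out = pd.DataFrame({"hash": batch["hash"].to_numpy(),
                        "embedding": emb.tolist()})
    out.to_parquet(f"chunk_{i // CHUNK:05d}.parquet")  # matches the upload sketch above
```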

vanishingideal
@vanishingideal
Chunking by FID might have the benefit of isolating failures. Hope it doesn't cost an arm and a leg on lambdalabs. Cleaning data and removing stop-words might lead to better semantic results.
0 replies
0 recasts
0 reactions
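
A sketch of both suggestions together: bucket casts by FID so a failed chunk only invalidates a known set of users, and strip noise before embedding. The stop-word list is a toy placeholder and the column names are assumptions (whether stop-word removal actually helps a transformer embedding model is itself an open question):

```python
# FID-bucketed chunking plus basic cast cleaning. Column names ("fid", "text")
# and the stop-word list are assumptions for illustration.
import re
import pandas as pd

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it"}

def clean_cast(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+", "", text)          # drop mentions
    return " ".join(t for t in text.lower().split() if t not in STOP_WORDS)

df = pd.read_parquet("casts.parquet")
df["text"] = df["text"].map(clean_cast)

# One file per FID bucket: a crash mid-run only costs that bucket.
for bucket, group in df.groupby(df["fid"] % 64):
    group.to_parquet(f"fid_bucket_{bucket:02d}.parquet")
```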