shoni.eth pfp
shoni.eth
@alexpaden
finally cracked the optimization and started pushing 400,000 casts (texts) per second to inference on the mac studio
5 replies
2 recasts
17 reactions

rish pfp
rish
@rish
what was the solve? and what is the end outcome you're going for?
1 reply
0 recast
0 reaction

shoni.eth pfp
shoni.eth
@alexpaden
really a few simple things: preallocated buffers, torchscript/quantize/mps on the mac. the thing i hadn't caught was some delays in keeping the models fed with a steady stream of data, which was just query/index based. this pipeline is specifically for cast text embeddings, probably similar to the type you guys will be providing (384 dims, int8 quantized for storage, float16 model precision). rough shape of it sketched below.
0 reply
0 recast
1 reaction
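A minimal sketch of the shape of the pipeline described above, not @alexpaden's actual code: it assumes a MiniLM-class sentence-transformers model (which outputs 384-dim vectors), PyTorch's MPS backend on the Mac Studio, and a hypothetical fixed batch size. The TorchScript step and the query/index feeder that keeps the stream steady are elided; `quantize_int8` is an illustrative name, not a library call.

```python
# Hedged sketch of the described pipeline: float16 model precision on MPS,
# a preallocated output buffer, and int8 quantization of 384-dim embeddings
# for storage. Model name and batch size are assumptions, not from the thread.
import numpy as np
import torch
from sentence_transformers import SentenceTransformer

DEVICE = "mps" if torch.backends.mps.is_available() else "cpu"
DIM, BATCH = 384, 1024  # 384-dim embeddings per the thread; batch size assumed

# float16 model precision on the Apple GPU via the MPS backend
# (TorchScript compilation of the transformer is omitted here for brevity)
model = SentenceTransformer("all-MiniLM-L6-v2", device=DEVICE)
model.half()

# preallocated int8 buffer reused across batches so the hot loop never allocates
out = np.empty((BATCH, DIM), dtype=np.int8)


def quantize_int8(emb: np.ndarray) -> np.ndarray:
    """Symmetric per-vector quantization: float embeddings -> int8 for storage."""
    scale = np.abs(emb).max(axis=1, keepdims=True) / 127.0 + 1e-12
    return np.clip(np.round(emb / scale), -127, 127).astype(np.int8)


def embed_batch(texts: list[str]) -> np.ndarray:
    """Encode one batch of cast texts into 384-dim int8 vectors."""
    with torch.inference_mode():
        emb = model.encode(texts, batch_size=len(texts), convert_to_numpy=True)
    out[: len(texts)] = quantize_int8(emb.astype(np.float32))
    return out[: len(texts)]
```

The int8 step trades a small amount of recall for a 4x storage reduction versus float32; keeping the model itself in float16 is a separate choice that mainly buys inference throughput on the GPU.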