gm8xx8
@gm8xx8
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

SpecExec is a speculative decoding method that accelerates inference for offloaded LLMs. It uses a draft model to create a tree of likely token continuations, and a target model to cache the continuation probabilities for all prefixes in this tree. SpecExec achieves speedups of up to 18.7× compared to autoregressive decoding with offloading.

→ https://www.together.ai/blog/specexec
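For intuition, here is a minimal, self-contained sketch of the draft-tree-then-verify pattern the post describes. Everything in it is assumed for illustration: the 5-token vocabulary, the seeded pseudo-random distributions standing in for the draft and target models, and the simple greedy acceptance rule. The actual SpecExec method drafts with a small LLM, scores every tree prefix with the offloaded target model in one batched pass, and uses the cached probabilities for acceptance.

```python
# Toy sketch of tree-based speculative decoding, NOT the authors' implementation.
# A cheap "draft" distribution proposes a small tree of likely continuations;
# the "target" distribution then verifies, accepting the longest branch it agrees with.
import random

VOCAB = list("abcde")

def _probs(model_name, prefix):
    # Hypothetical stand-in for a model's next-token distribution:
    # deterministic within a run, derived from (model, prefix).
    rng = random.Random(hash((model_name, prefix)))
    weights = [rng.random() for _ in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def build_draft_tree(prefix, budget=16, top_k=2, max_depth=3):
    """Expand the top_k most likely draft tokens per node, up to `budget` nodes."""
    tree = {prefix}        # set of prefixes; each child extends its parent by one token
    frontier = [prefix]
    while frontier and len(tree) < budget:
        node = frontier.pop(0)
        if len(node) - len(prefix) >= max_depth:
            continue
        ranked = sorted(_probs("draft", node).items(), key=lambda kv: -kv[1])
        for tok, _ in ranked[:top_k]:
            child = node + tok
            tree.add(child)
            frontier.append(child)
    return tree

def verify(prefix, tree):
    """Follow the target model's argmax while it stays inside the drafted tree,
    accepting several tokens from one (conceptual) target-model pass."""
    out = prefix
    while True:
        p = _probs("target", out)
        best = max(p, key=p.get)
        if out + best in tree:
            out += best        # target agrees with a drafted branch: accept for free
        else:
            return out + best  # plus one guaranteed fresh token from the target

tree = build_draft_tree("ab")
print(verify("ab", tree))      # 'ab' extended by the accepted tokens
```

Because verification walks a tree rather than a single chain, one expensive pass of the offloaded target model can confirm several drafted tokens at once, which is where the speedup over token-by-token decoding comes from.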