gm8xx8
@gm8xx8
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

SpecExec is a speculative decoding method that accelerates inference for offloaded LLMs. It uses a draft model to create a tree of likely token continuations, and a target model to cache the continuation probabilities for all prefixes in this tree. SpecExec achieves speedups of up to 18.7× compared to autoregressive decoding with offloading.

→ https://www.together.ai/blog/specexec
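For intuition, here is a minimal, self-contained sketch of the draft-tree-then-verify pattern the post describes. Everything in it is assumed for illustration: the 5-token vocabulary, the seeded pseudo-random distributions standing in for the draft and target models, and the simple greedy acceptance rule. The actual SpecExec method drafts with a small LLM, scores every tree prefix with the offloaded target model in one batched pass, and uses the cached probabilities for acceptance.

```python
# Toy sketch of tree-based speculative decoding, NOT the authors' implementation.
# A cheap "draft" distribution proposes a small tree of likely continuations;
# the "target" distribution then verifies, accepting the longest branch it agrees with.
import random

VOCAB = list("abcde")

def _probs(model_name, prefix):
    # Hypothetical stand-in for a model's next-token distribution:
    # deterministic within a run, derived from (model, prefix).
    rng = random.Random(hash((model_name, prefix)))
    weights = [rng.random() for _ in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def build_draft_tree(prefix, budget=16, top_k=2, max_depth=3):
    """Expand the top_k most likely draft tokens per node, up to `budget` nodes."""
    tree = {prefix}        # set of prefixes; each child extends its parent by one token
    frontier = [prefix]
    while frontier and len(tree) < budget:
        node = frontier.pop(0)
        if len(node) - len(prefix) >= max_depth:
            continue
        ranked = sorted(_probs("draft", node).items(), key=lambda kv: -kv[1])
        for tok, _ in ranked[:top_k]:
            child = node + tok
            tree.add(child)
            frontier.append(child)
    return tree

def verify(prefix, tree):
    """Follow the target model's argmax while it stays inside the drafted tree,
    accepting several tokens from one (conceptual) target-model pass."""
    out = prefix
    while True:
        p = _probs("target", out)
        best = max(p, key=p.get)
        if out + best in tree:
            out += best        # target agrees with a drafted branch: accept for free
        else:
            return out + best  # plus one guaranteed fresh token from the target

tree = build_draft_tree("ab")
print(verify("ab", tree))      # 'ab' extended by the accepted tokens
```

Because verification walks a tree rather than a single chain, one expensive pass of the offloaded target model can confirm several drafted tokens at once, which is where the speedup over token-by-token decoding comes from.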