UCSD Lab's JetSpec Method Speeds Up LLM Inference by Up to 9.64x
Its developers say JetSpec eliminates wasted work in both autoregressive and block-diffusion speculative decoding, reaching over 1000 tokens per second on an NVIDIA B200 with Qwen3-8B.
Reporting from 1 source: GIGAZINE.
Hao AI Lab at UC San Diego has developed JetSpec, a speculative decoding method that accelerates large language models. On the MATH-500 benchmark, JetSpec achieved a 9.64x speedup over standard inference when running Qwen3-8B on an NVIDIA H100. The lab released draft models, code, and a paper.
JetSpec uses parallel tree drafting to avoid the wasted work that arises in both autoregressive and block-diffusion speculative decoding. On the MATH-500 math reasoning benchmark, it achieved a 9.64x speedup over standard inference when running Qwen3-8B on an NVIDIA H100, and on the MT-Bench chat benchmark it achieved a 4.58x speedup. The lab also ran Qwen3-8B on an NVIDIA B200 using a modified version of the vLLM inference engine, reaching over 1000 tokens per second.
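For readers unfamiliar with speculative decoding, the general draft-then-verify idea can be sketched in a few lines of Python. This is a minimal illustration of generic greedy speculative decoding, not JetSpec's parallel tree drafting, which the reporting does not describe in detail. The function names, the toy "models", and the draft length `k` are all hypothetical and chosen only for the example.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token prefix to the next token.
Model = Callable[[List[int]], int]


def greedy_decode(target: Model, prompt: List[int], max_new: int) -> List[int]:
    """Standard inference: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], max_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Generic greedy speculative decoding (illustrative only).

    Returns the generated sequence and the number of verification rounds.
    In a real system each round is one batched target-model forward pass;
    this toy calls the target per position for clarity.
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens in a row.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target model checks the proposal position by position.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                # First mismatch: keep the target's token, discard the rest.
                # The discarded draft tokens are the "waste".
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target adds one extra token.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    return out[: len(prompt) + max_new], rounds


# Toy models (hypothetical): the target counts upward modulo 10, and the
# draft agrees except that it guesses wrong right after a 7.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 7 else (prefix[-1] + 1) % 10


baseline = greedy_decode(toy_target, [0], max_new=12)
tokens, rounds = speculative_decode(toy_target, toy_draft, [0], max_new=12, k=4)
print(tokens == baseline, rounds)
```

In this toy run the speculative output matches plain greedy decoding token for token while needing 3 verification rounds instead of 12 sequential target calls. The tokens drafted after the single wrong guess are thrown away, which is the kind of waste the article refers to.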
Hao AI Lab released JetSpec draft models for Qwen3-8B, Qwen3 30B A3B, Qwen3.6 35B A3B, gpt-oss-20b, Gemma 4 26B A4B IT, and Step 3.7 Flash. The paper and code are available on arXiv and GitHub.
Synthesized by Yomimono from the 1 cited source below, including Japanese-language reporting where cited, then editorially reviewed before publishing.