Overview
Speculative Decoding is one method for accelerating LLM inference. This article uses OpenAI’s open-weight LLM gpt-oss-120b as a case study to measure the changes in execution time when using EAGLE-3, a speculative decoding technique, on an NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition. We then discuss under what conditions EAGLE-3 is most appropriate.
Methodology
First, we will briefly explain the methodology used in this article.
Speculative Decoding
https://arxiv.org/abs/2302.01318
Speculative Decoding is a technique for accelerating LLM decoding. It exploits the fact that a large model can verify several candidate tokens in a single forward pass for roughly the cost of generating one. Before the large model processes a given input, a smaller model (called the draft model) generates candidate tokens; the original large model then verifies the draft model's output.

(Figure quoted from https://github.com/kssteven418/BigLittleDecoder/blob/main/README.md)
In the decoding phase, where tokens are generated one by one based on previous input, memory bandwidth is often the bottleneck, preventing full utilization of computational units such as Tensor Cores. By having the draft model predict multiple tokens and validating them all at once, we can expect acceleration by reducing memory access.
Since the algorithm involves verification by the original model, using speculative decoding does not change the inference results at all. This is similar to cache reuse and contrasts with methods like quantization or pruning, which sacrifice accuracy for speed.
A drawback of this method is that it requires additional computation for the draft model, thus increasing the total computational load. If the draft model’s accuracy is low, or if computation becomes a bottleneck due to handling a large number of requests, speculative decoding might actually slow down processing.
Furthermore, GPU memory usage increases by the amount required for the draft model. The increased complexity of deployment due to using multiple models is also a disadvantage.
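The draft-then-verify loop described above can be sketched as follows. Both models are stand-in stubs (the toy next-token rule and the 80% agreement rate are invented for illustration); the point is that the output is identical to plain greedy decoding with the target model, while needing fewer target passes.

```python
import random

random.seed(0)
VOCAB = list(range(8))

def target_next(prefix):
    # Stand-in for one forward pass of the large (target) model, greedy:
    # a deterministic toy rule so the example is self-contained.
    return (sum(prefix) * 7 + 3) % len(VOCAB)

def draft_next(prefix):
    # Stand-in for the small draft model; it agrees with the target
    # most of the time and occasionally diverges.
    guess = target_next(prefix)
    return guess if random.random() < 0.8 else (guess + 1) % len(VOCAB)

def speculative_decode(prefix, n_tokens, k=3):
    """Greedy speculative decoding: draft k tokens, verify them at once."""
    out = list(prefix)
    target_passes = 0
    while len(out) - len(prefix) < n_tokens:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        ctx, proposal = list(out), []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target model scores all k positions; on a GPU this is a
        #    single batched forward pass (here: k toy calls).
        target_passes += 1
        accepted = []
        for i, t in enumerate(proposal):
            expect = target_next(out + proposal[:i])
            if t != expect:
                accepted.append(expect)  # first mismatch: emit target's token
                break
            accepted.append(t)
        out.extend(accepted)             # always >= 1 token per pass
    return out[:len(prefix) + n_tokens], target_passes

tokens, passes = speculative_decode([1, 2], 32)
print(len(tokens) - 2, "tokens in", passes, "target passes")
```

Because every emitted token is either confirmed by the target or produced by it at the first mismatch, the result is token-for-token identical to greedy decoding with the target alone.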
EAGLE-3
https://arxiv.org/abs/2503.01840
Conventionally, smaller models from the same series were used as draft models in speculative decoding (e.g., Llama3.2-1B as a draft model for Llama3.3-70B). This is because models with the same tokenizer and similar training data are expected to accurately predict the distribution.
In contrast, several papers have proposed the idea of training a specialized small model to predict the distribution of a large model (e.g., Medusa https://arxiv.org/abs/2401.10774, Hydra https://arxiv.org/abs/2402.05109, EAGLE https://arxiv.org/abs/2401.15077).
EAGLE-3 is an improved version of EAGLE that enhances accuracy by obtaining input features for the draft model also from intermediate layers and by training the draft model to align with its inference-time behavior (using the draft model’s output tokens to predict the next token).
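A deliberately toy sketch of the two ingredients just described: the draft head consumes features tapped from several target layers, and it drafts autoregressively by feeding back its own outputs. The stub feature extractor, the single linear head, and all dimensions are invented for illustration and do not reflect EAGLE-3's actual architecture.

```python
import random

random.seed(0)
DIM, VOCAB = 4, 6

def target_hidden_states(tokens):
    # Stand-in for the target model's low/mid/high layer features at the
    # last position (EAGLE-3 taps intermediate layers, not just the top).
    h = [(sum(tokens) % 5 + i) / 5.0 for i in range(DIM)]
    return [h, [x * 0.5 for x in h], [x * 0.25 for x in h]]

# Tiny random linear head mapping concatenated multi-layer features plus
# the previously drafted token to draft logits (purely illustrative).
W = [[random.uniform(-1, 1) for _ in range(3 * DIM + 1)] for _ in range(VOCAB)]

def draft_k_tokens(tokens, k=3):
    low, mid, high = target_hidden_states(tokens)  # one target pass
    feats = low + mid + high
    drafted, prev = [], tokens[-1]
    for _ in range(k):
        x = feats + [prev / VOCAB]                 # feed back own output
        logits = [sum(w * v for w, v in zip(row, x)) for row in W]
        prev = max(range(VOCAB), key=lambda t: logits[t])
        drafted.append(prev)
    return drafted

print(draft_k_tokens([1, 2, 3]))
```

Training the head on its own fed-back outputs (rather than teacher-forced target tokens) is what aligns it with this inference-time behavior.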
gpt-oss-120b
https://arxiv.org/abs/2508.10925
This model, released by OpenAI, has been verified as the highest-performing model of its size (150B or less) as of February 2026 (https://artificialanalysis.ai/models/gpt-oss-120b). It is licensed under Apache 2.0, allowing for a wide range of applications.
Since its parameters are quantized in MXFP4 format, it can run on a single RTX PRO 6000 Blackwell Max-Q GPU. It also uses a Mixture-of-Experts (MoE) architecture, so only about 5% of its total parameters are active for each token, allowing faster inference than dense models of the same size.
However, since the model is already optimized through quantization and MoE, the room for further speedup is considered limited.
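A back-of-the-envelope roofline illustrates why decoding is memory-bound and how much headroom remains. The active-parameter count (~5.1B) and the GPU's memory bandwidth (~1.8 TB/s) below are approximate assumed figures, not measurements.

```python
# Back-of-the-envelope decode ceiling: every generated token must stream
# the active weights from GPU memory at least once.
active_params = 5.1e9    # ~5.1B active parameters per token (MoE), assumed
bytes_per_param = 0.5    # MXFP4 ~ 4 bits per weight
bandwidth = 1.8e12       # ~1.8 TB/s GDDR7 bandwidth, assumed for this GPU

bytes_per_token = active_params * bytes_per_param
ceiling_tok_s = bandwidth / bytes_per_token
print(f"~{ceiling_tok_s:.0f} tok/s bandwidth-bound ceiling per request")
```

Measured single-stream throughput (Experiment 1 below) sits well under this ceiling because KV-cache reads, activations, and kernel overheads also consume bandwidth; speculative decoding helps precisely by amortizing each weight read over several candidate tokens.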
Prior Cases
For this experiment, we reviewed several similar experimental cases. None of them had our exact setup: running gpt-oss-120b on a single RTX PRO 6000 Blackwell Max-Q with EAGLE-3 applied.
- This article introduces the effects of speculative decoding using a standard model.
- This article shows cases where EAGLE-3 accelerates Llama 3.1 7B and Llama 3.3 70B inference by up to 2.5 times.
- This article demonstrates a case where EAGLE-3 accelerated gpt-oss-120b inference by 60%, achieving 650 tokens per second, on 8 B200 GPUs.
- This work verifies the effects of speculative decoding across multiple models and datasets, using H200 GPUs.
- This work focuses on optimizing gpt-oss-120b on H100 GPUs, showing that over 30% acceleration can be achieved by tuning vLLM parameters.
Experiments: Overview
We conducted the following three types of experiments:
- Experiment 1: Which library is suitable for gpt-oss-120b inference?
- Experiment 2: Which EAGLE-3 model is compatible with gpt-oss-120b?
- Experiment 3: Under what conditions is gpt-oss-120b accelerated by EAGLE-3?
All experiments were conducted in the following environment:
- CPU: Intel(R) Xeon(R) w5-2555X (14C28T)
- GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (x1)
- CUDA 12.8 (due to PyTorch and vLLM constraints)
- Python 3.12
uv venv -p 3.12
Experiment 1: Libraries
First, we measured the baseline performance without EAGLE-3 for three libraries that support speculative decoding with EAGLE-3 for gpt-oss-120b.
- vLLM 0.14.1
uv pip install vllm
- TensorRT-LLM 1.3.0rc1
docker run --rm -it --ipc host --gpus all --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc1
- SGLang 0.5.8
uv pip install sglang
We referred to the following for executing each library:
https://github.com/vllm-project/recipes/blob/main/OpenAI/GPT-OSS.md#launch-the-vllm-server
vllm serve openai/gpt-oss-120b --config GPT-OSS_Blackwell.yaml
kv-cache-dtype: fp8
compilation-config: '{"pass_config":{"fuse_allreduce_rms":true,"eliminate_noops":true}}'
async-scheduling: true
no-enable-prefix-caching: true
max-cudagraph-capture-size: 2048
max-num-batched-tokens: 8192
stream-interval: 20
https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.html
trtllm-serve "openai/gpt-oss-120b" --host 0.0.0.0 --config low_latency.yaml
enable_attention_dp: false
cuda_graph_config:
max_batch_size: 10
enable_padding: true
https://lmsys.org/blog/2025-08-27-gpt-oss
python3 -m sglang.launch_server --model-path openai/gpt-oss-120b
For measurement, we used the vllm bench script. Since the measurement is done via the OpenAI-compatible API, the same measurement script is used regardless of the library.
vllm bench serve --model openai/gpt-oss-120b --dataset-name random --random-input-len 1024 --random-output-len 1024 --ignore-eos --max-concurrency 1 --num-prompts 32
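The TTFT and TPOT figures reported below are derived from per-request timestamps. A minimal sketch of the usual definitions (to our understanding, the TPOT formula matches how vllm bench computes its mean); the timestamps used here are illustrative.

```python
def request_metrics(t_sent, t_first_token, t_done, n_output_tokens):
    """Per-request latency metrics as commonly defined by serving benchmarks."""
    ttft = t_first_token - t_sent  # time to first token (prefill + queueing)
    # TPOT: average gap between the 2nd..nth output tokens (decode speed)
    tpot = (t_done - t_first_token) / (n_output_tokens - 1)
    return ttft, tpot

# Illustrative request: sent at t=0, first token at 0.098 s,
# finished at 5.77 s, 1024 output tokens.
ttft, tpot = request_metrics(0.0, 0.098, 5.77, 1024)
print(f"TTFT={ttft * 1000:.1f} ms, TPOT={tpot * 1000:.2f} ms")
```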
Results
vLLM and TensorRT-LLM successfully deployed with the above commands, and the measurements yielded the following results (bold indicates the best result in each row). Although TTFT (time to first token) was slightly faster with TensorRT-LLM, vLLM was significantly better in all other aspects, making it more suitable as the baseline library.
| Metric | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|
| Output token throughput (tok/s) | **177.71** | 116.55 | – |
| Total token throughput (tok/s) | **355.43** | 233.09 | – |
| Mean TTFT (ms) | 98.43 | **82.80** | – |
| Mean TPOT (ms) | **5.54** | 8.51 | – |
Note that SGLang could not run gpt-oss-120b. As of version 0.5.8, there is a bug preventing the build of Blackwell-specific kernels for mxfp4 MoE (https://github.com/sgl-project/sglang/issues/13061). Although a fix for this bug is described at https://github.com/sgl-project/sglang/issues/13342#issuecomment-3712592985, applying it led to other errors.
Incidentally, NVIDIA DGX Spark Benchmarks already contain a measurement report, suggesting that a fix is possible (and that this is a temporary regression). With 1 GPU and no parallel inference, a 2048-token decode achieved 207.96 tokens/sec, indicating that SGLang might be faster than vLLM if it could be made to work. However, we did not pursue this further in this study.
Experiment 2: Models
The following four EAGLE-3 models trained for gpt-oss-120b are publicly available:
- nvidia/gpt-oss-120b-Eagle3-short-context (short_context)
- nvidia/gpt-oss-120b-Eagle3-long-context (long_context)
- nvidia/gpt-oss-120b-Eagle3-throughput (throughput)
- lmsys/EAGLE3-gpt-oss-120b-bf16 (lmsys)
We measured these models by swapping them in vLLM to determine which one performed best (the model name can be changed by modifying speculative_config.model). For the throughput model, we also measured with num_speculative_tokens set to 1 and 3.
kv-cache-dtype: fp8
compilation-config: '{"pass_config":{"fuse_allreduce_rms":true,"eliminate_noops":true}}'
async-scheduling: true
no-enable-prefix-caching: true
max-cudagraph-capture-size: 2048
max-num-batched-tokens: 8192
stream-interval: 20
speculative-config: '{"model":"nvidia/gpt-oss-120b-Eagle3-throughput","num_speculative_tokens":1,"method":"eagle3","draft_tensor_parallel_size":1}'
Benchmarks were conducted with the --max-concurrency option set to 1 or 4.
vllm bench serve --model openai/gpt-oss-120b --dataset-name random --random-input-len 1024 --random-output-len 1024 --ignore-eos --max-concurrency 4 --num-prompts 64
Results
The results of this experiment are shown below (bold indicates the best value for each item). Among the four models, the throughput model slightly outperformed the baseline in all metrics, demonstrating an instance where EAGLE-3 improved speed.
Conversely, both the short_context and lmsys models performed worse than the baseline across all metrics. For the throughput model, performance was below baseline when num_speculative_tokens was 3, improving only when set to 1. This indicates a need for deeper investigation into configuration settings.
Furthermore, the long_context model failed to deploy and resulted in errors even after configuration changes. Since we observed performance improvements with the throughput model, we did not pursue this further.
| Metric | baseline | short_context | long_context | baseline | throughput (num=1) | throughput (num=3) | lmsys |
|---|---|---|---|---|---|---|---|
| Maximum request concurrency | 4 | 4 | – | 1 | 1 | 1 | 1 |
| Output token throughput (tok/s) | **390.66** | 209.32 | – | 177.71 | **180.67** | 124.73 | 125.09 |
| Total token throughput (tok/s) | **781.33** | 418.64 | – | 355.43 | **361.35** | 258.95 | 250.17 |
| Mean TTFT (ms) | **255.92** | 7166.02 | – | 98.43 | 95.12 | **78.23** | 453.62 |
| Mean TPOT (ms) | **10.00** | 11.84 | – | 5.54 | **5.45** | 8.43 | 7.56 |
| Acceptance rate (%) | – | 27.46 | – | – | 70.58 | 19.79 | 50.85 |
| Acceptance length | – | 1.82 | – | – | 1.71 | 1.59 | 1.51 |
| Acceptance rate @ Position 0 (%) | – | 43.64 | – | – | 70.58 | 54.96 | 50.85 |
| Acceptance rate @ Position 1 (%) | – | 26.75 | – | – | – | 4.26 | – |
| Acceptance rate @ Position 2 (%) | – | 11.97 | – | – | – | 0.16 | – |
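The acceptance-length figures are consistent with a simple identity: each verification pass always yields one token from the target model, plus one extra token for every accepted draft position. Checking against the per-position rates in the table:

```python
def acceptance_length(position_rates):
    # Every verification pass emits one token from the target model;
    # each accepted draft position contributes one more on average.
    return 1.0 + sum(position_rates)

# Per-position acceptance rates taken from the table above
print(round(acceptance_length([0.4364, 0.2675, 0.1197]), 2))  # short_context -> 1.82
print(round(acceptance_length([0.5496, 0.0426, 0.0016]), 2))  # throughput (num=3) -> 1.59
```

Both values reproduce the reported acceptance lengths, which also shows why throughput (num=3) gains little from its extra draft positions: positions 1 and 2 are almost never accepted.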
Experiment 3: Datasets and Access Patterns
Building on the previous experiments, we conducted measurements using the throughput model with vLLM under various datasets and access patterns.
Considering real-world use cases, we used the following seven dataset patterns:
- Random input
  - random_s: Single-turn chat (--random-input-len 1024 --random-output-len 256)
  - random_m: Long chat / sub-agent (--random-input-len 10240 --random-output-len 256)
  - random_l: Reasoning (--random-input-len 2048 --random-output-len 2048)
  - random_x: Agent (--random-input-len 10240 --random-output-len 2048)
- sharegpt
- spec_bench
- burstgpt
We experimented with the following five access patterns:
- Throughput
  - throughput_low (sequential access): --request-rate=inf --max-concurrency 1 --num-prompts 16
  - throughput_high (constant 32 concurrent accesses): --request-rate=inf --max-concurrency 32 --num-prompts 256
- Burstiness (high instantaneous load)
  - burstiness_high (sharper-than-usual access spikes): --request-rate=64 --burstiness 0.2 --max-concurrency 32 --num-prompts 256
  - burstiness_mid (close to typical usage): --request-rate=8 --burstiness 1.0 --max-concurrency 32 --num-prompts 256
  - burstiness_low (more uniform access than usual): --request-rate=2 --burstiness 4.0 --max-concurrency 32 --num-prompts 256
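The --burstiness flag controls the shape of inter-arrival gaps: to our understanding, vllm bench draws gamma-distributed gaps with shape equal to the burstiness value and mean 1/request-rate, so burstiness 1.0 reduces to Poisson arrivals and smaller values produce spikier traffic. A sketch of that sampling:

```python
import random
import statistics

random.seed(0)

def interarrival_gaps(request_rate, burstiness, n):
    """Gamma-distributed gaps: shape=burstiness, mean=1/request_rate."""
    scale = 1.0 / (request_rate * burstiness)
    return [random.gammavariate(burstiness, scale) for _ in range(n)]

for b in (0.2, 1.0, 4.0):
    gaps = interarrival_gaps(request_rate=8, burstiness=b, n=20000)
    print(f"burstiness={b}: mean={statistics.mean(gaps):.3f}s, "
          f"stdev={statistics.stdev(gaps):.3f}s")
```

All three settings have the same mean gap; only the variance changes, which is what makes burstiness_high spiky and burstiness_low near-uniform.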
Results
We summarized the relative values of each metric when using EAGLE-3 compared to the baseline without EAGLE-3.
Several trends emerge from the inference speed table. First, the throughput_low setting consistently degraded, while every other access pattern showed speedups. This is a distinctive result for EAGLE-3, since conventional speculative decoding typically helps most in exactly such low-throughput regimes. Under the best conditions, a 27% speedup is achieved with no change in output accuracy.
Furthermore, as token lengths increased from random_s to random_m, random_l, and random_x, the speed improvement from EAGLE-3 became more significant. This suggests that for longer tokens, the computational load on the larger model increases, making the use of a draft model relatively more advantageous.
| Output token throughput | random_x | random_l | random_m | random_s | sharegpt | spec_bench | burstgpt |
|---|---|---|---|---|---|---|---|
| burstiness_high | 125.4% | 108.5% | 97.1% | 92.1% | 100.8% | 102.6% | 108.3% |
| burstiness_low | 126.2% | 114.1% | 97.7% | 99.0% | 99.3% | 99.7% | 101.8% |
| burstiness_mid | 127.0% | 112.4% | 97.1% | 100.4% | 102.0% | 106.4% | 107.7% |
| throughput_high | 124.5% | 109.2% | 92.7% | 90.3% | 108.5% | 107.7% | 107.8% |
| throughput_low | 68.3% | 83.3% | 63.3% | 79.3% | 83.8% | 91.9% | 81.5% |
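A toy cost model helps interpret these relative numbers: one speculative cycle runs k draft steps plus one verification pass and yields the acceptance length in tokens, versus one token per pass at baseline. The draft cost and verification overhead below are assumed values for illustration, not measurements.

```python
def speedup(accept_len, k, draft_cost, verify_overhead=1.0):
    """Toy cost model (all costs in units of one baseline decode pass):
    one cycle = k draft steps + one verification pass, yielding
    accept_len tokens; the baseline yields 1 token per pass."""
    return accept_len / (k * draft_cost + verify_overhead)

# Assumed: draft step ~10% of a target pass, verification ~10% pricier
# than a plain decode step because it scores k+1 positions at once.
print(speedup(1.71, k=1, draft_cost=0.10, verify_overhead=1.10))
# A lower acceptance length with a pricier draft flips the sign:
print(speedup(1.30, k=1, draft_cost=0.30, verify_overhead=1.10))
```

When the cycle cost exceeds the acceptance length the model predicts a net slowdown, which is the regime the degraded cells in the table correspond to.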
Also, the acceptance rate of the draft model tended to improve with longer token lengths, which is likely contributing to the speed improvement.
| Accept ratio | random_x | random_l | random_m | random_s | sharegpt | spec_bench | burstgpt |
|---|---|---|---|---|---|---|---|
| burstiness_high | 71.1% | 70.9% | 36.9% | 46.6% | 53.0% | 70.6% | 58.5% |
| burstiness_low | 71.4% | 72.0% | 36.8% | 45.7% | 53.6% | 70.4% | 58.5% |
| burstiness_mid | 72.0% | 71.8% | 36.5% | 46.7% | 53.9% | 70.4% | 57.9% |
| throughput_high | 70.3% | 71.8% | 36.9% | 45.1% | 53.8% | 70.4% | 58.2% |
| throughput_low | 70.1% | 65.2% | 37.9% | 49.6% | 53.6% | 70.4% | 51.0% |
Furthermore, TTFT (time to first token) decreases significantly for the longer sequences. This is a secondary effect of the higher decode throughput: requests finish sooner, so new queries spend less time waiting in the queue.
| TTFT | random_x | random_l | random_m | random_s | sharegpt | spec_bench | burstgpt |
|---|---|---|---|---|---|---|---|
| burstiness_high | 34.1% | 44.9% | 86.3% | 50.9% | 186.1% | 234.7% | 130.7% |
| burstiness_low | 24.9% | 114.9% | 74.1% | 129.8% | 214.1% | 226.7% | 123.3% |
| burstiness_mid | 32.0% | 69.2% | 81.0% | 110.0% | 163.2% | 180.4% | 128.5% |
| throughput_high | 35.0% | 49.7% | 86.7% | 52.1% | 144.3% | 63.3% | 180.5% |
| throughput_low | 109.6% | 109.1% | 109.9% | 113.5% | 117.2% | 150.7% | 111.0% |
Summary
To summarize the experimental results, the following conclusions can be drawn:
- Experiment 1: vLLM is the most suitable library for gpt-oss-120b inference.
- Experiment 2: The nvidia/gpt-oss-120b-Eagle3-throughput model is the most compatible EAGLE-3 model for gpt-oss-120b.
- Experiment 3: gpt-oss-120b is accelerated by EAGLE-3 under conditions of moderate access frequency and long input texts.
Speculative decoding, including EAGLE-3, is a complex technique where performance can either improve or degrade depending on the model, settings, and usage scenario. We intend to continue monitoring new developments in this technology.