Overview
Speculative Decoding is one method for accelerating LLM inference. This article uses OpenAI’s open-weight LLM gpt-oss-120b as a case study to measure the changes in execution time when using EAGLE-3, a speculative decoding technique, on an NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition. We then discuss under what conditions EAGLE-3 is most appropriate.
Methodology
First, we will briefly explain the methodology used in this article.
Speculative Decoding
https://arxiv.org/abs/2302.01318
Speculative Decoding is a technique for accelerating LLM decoding. It exploits the fact that a large model can verify several candidate tokens in a single forward pass for roughly the cost of generating one. Before the large model processes a given input, a smaller model (called the draft model) generates candidate tokens; the original large model then verifies the draft model's output.

(Figure quoted from https://github.com/kssteven418/BigLittleDecoder/blob/main/README.md)
In the decoding phase, where tokens are generated one by one based on previous input, memory bandwidth is often the bottleneck, preventing full utilization of computational units such as Tensor Cores. By having the draft model predict multiple tokens and validating them all at once, we can expect acceleration by reducing memory access.
Since the algorithm involves verification by the original model, using speculative decoding does not change the inference results at all. This is similar to cache reuse and contrasts with methods like quantization or pruning, which sacrifice accuracy for speed.
A drawback of this method is that it requires additional computation for the draft model, thus increasing the total computational load. If the draft model’s accuracy is low, or if computation becomes a bottleneck due to handling a large number of requests, speculative decoding might actually slow down processing.
Furthermore, GPU memory usage increases by the amount required for the draft model. The increased complexity of deployment due to using multiple models is also a disadvantage.
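The draft-then-verify loop described above can be sketched as follows. Both models are stand-in stubs (the toy next-token rule and the 80% agreement rate are invented for illustration); the point is that the output is identical to plain greedy decoding with the target model, while needing fewer target passes.

```python
import random

random.seed(0)
VOCAB = list(range(8))

def target_next(prefix):
    # Stand-in for one forward pass of the large (target) model, greedy:
    # a deterministic toy rule so the example is self-contained.
    return (sum(prefix) * 7 + 3) % len(VOCAB)

def draft_next(prefix):
    # Stand-in for the small draft model; it agrees with the target
    # most of the time and occasionally diverges.
    guess = target_next(prefix)
    return guess if random.random() < 0.8 else (guess + 1) % len(VOCAB)

def speculative_decode(prefix, n_tokens, k=3):
    """Greedy speculative decoding: draft k tokens, verify them at once."""
    out = list(prefix)
    target_passes = 0
    while len(out) - len(prefix) < n_tokens:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        ctx, proposal = list(out), []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target model scores all k positions; on a GPU this is a
        #    single batched forward pass (here: k toy calls).
        target_passes += 1
        accepted = []
        for i, t in enumerate(proposal):
            expect = target_next(out + proposal[:i])
            if t != expect:
                accepted.append(expect)  # first mismatch: emit target's token
                break
            accepted.append(t)
        out.extend(accepted)             # always >= 1 token per pass
    return out[:len(prefix) + n_tokens], target_passes

tokens, passes = speculative_decode([1, 2], 32)
print(len(tokens) - 2, "tokens in", passes, "target passes")
```

Because every emitted token is either confirmed by the target or produced by it at the first mismatch, the result is token-for-token identical to greedy decoding with the target alone.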
EAGLE-3
https://arxiv.org/abs/2503.01840
Conventionally, smaller models from the same series were used as draft models in speculative decoding (e.g., Llama3.2-1B as a draft model for Llama3.3-70B). This is because models with the same tokenizer and similar training data are expected to accurately predict the distribution.
In contrast, several papers have proposed the idea of training a specialized small model to predict the distribution of a large model (e.g., Medusa https://arxiv.org/abs/2401.10774, Hydra https://arxiv.org/abs/2402.05109, EAGLE https://arxiv.org/abs/2401.15077).
EAGLE-3 is an improved version of EAGLE that enhances accuracy by obtaining input features for the draft model also from intermediate layers and by training the draft model to align with its inference-time behavior (using the draft model’s output tokens to predict the next token).
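A deliberately toy sketch of the two ingredients just described: the draft head consumes features tapped from several target layers, and it drafts autoregressively by feeding back its own outputs. The stub feature extractor, the single linear head, and all dimensions are invented for illustration and do not reflect EAGLE-3's actual architecture.

```python
import random

random.seed(0)
DIM, VOCAB = 4, 6

def target_hidden_states(tokens):
    # Stand-in for the target model's low/mid/high layer features at the
    # last position (EAGLE-3 taps intermediate layers, not just the top).
    h = [(sum(tokens) % 5 + i) / 5.0 for i in range(DIM)]
    return [h, [x * 0.5 for x in h], [x * 0.25 for x in h]]

# Tiny random linear head mapping concatenated multi-layer features plus
# the previously drafted token to draft logits (purely illustrative).
W = [[random.uniform(-1, 1) for _ in range(3 * DIM + 1)] for _ in range(VOCAB)]

def draft_k_tokens(tokens, k=3):
    low, mid, high = target_hidden_states(tokens)  # one target pass
    feats = low + mid + high
    drafted, prev = [], tokens[-1]
    for _ in range(k):
        x = feats + [prev / VOCAB]                 # feed back own output
        logits = [sum(w * v for w, v in zip(row, x)) for row in W]
        prev = max(range(VOCAB), key=lambda t: logits[t])
        drafted.append(prev)
    return drafted

print(draft_k_tokens([1, 2, 3]))
```

Training the head on its own fed-back outputs (rather than teacher-forced target tokens) is what aligns it with this inference-time behavior.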
gpt-oss-120b
https://arxiv.org/abs/2508.10925
This model, released by OpenAI, has been verified as the highest-performing model of its size (150B or less) as of February 2026 (https://artificialanalysis.ai/models/gpt-oss-120b). It is licensed under Apache 2.0, allowing for a wide range of applications.
Since its parameters are quantized in MXFP4 format, it can run on a single RTX PRO 6000 Blackwell Max-Q GPU. It also uses a Mixture-of-Experts (MoE) architecture, so only about 5% of its total parameters are active for each token, allowing faster inference than dense models of the same size.
However, since the model is already optimized through quantization and MoE, the room for further speedup is considered limited.
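A back-of-the-envelope roofline illustrates why decoding is memory-bound and how much headroom remains. The active-parameter count (~5.1B) and the GPU's memory bandwidth (~1.8 TB/s) below are approximate assumed figures, not measurements.

```python
# Back-of-the-envelope decode ceiling: every generated token must stream
# the active weights from GPU memory at least once.
active_params = 5.1e9    # ~5.1B active parameters per token (MoE), assumed
bytes_per_param = 0.5    # MXFP4 ~ 4 bits per weight
bandwidth = 1.8e12       # ~1.8 TB/s GDDR7 bandwidth, assumed for this GPU

bytes_per_token = active_params * bytes_per_param
ceiling_tok_s = bandwidth / bytes_per_token
print(f"~{ceiling_tok_s:.0f} tok/s bandwidth-bound ceiling per request")
```

Measured single-stream throughput (Experiment 1 below) sits well under this ceiling because KV-cache reads, activations, and kernel overheads also consume bandwidth; speculative decoding helps precisely by amortizing each weight read over several candidate tokens.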
Prior Cases
For this experiment, we reviewed several similar experimental cases. None of them had our exact setup: running gpt-oss-120b on a single RTX PRO 6000 Blackwell Max-Q with EAGLE-3 applied.
- This article introduces the effects of speculative decoding using a standard model.
- This article shows cases where EAGLE-3 accelerates Llama 3.1 7B and Llama 3.3 70B inference by up to 2.5 times.
- This article demonstrates a case where EAGLE-3 accelerated gpt-oss-120b inference by 60%, achieving 650 tokens per second, on 8 B200 GPUs.
- This work verifies the effects of speculative decoding across multiple models and datasets, using H200 GPUs.
- This work focuses on optimizing gpt-oss-120b on H100 GPUs, showing that over 30% acceleration can be achieved by tuning vLLM parameters.
Experiments: Overview
We conducted the following three types of experiments:
- Experiment 1: Which library is suitable for gpt-oss-120b inference?
- Experiment 2: Which EAGLE-3 model is compatible with gpt-oss-120b?
- Experiment 3: Under what conditions is gpt-oss-120b accelerated by EAGLE-3?
All experiments were conducted in the following environment:
- CPU: Intel(R) Xeon(R) w5-2555X (14C28T)
- GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (x1)
- CUDA 12.8 (due to PyTorch and vLLM constraints)
- Python 3.12
uv venv -p 3.12
Experiment 1: Libraries
First, we measured the baseline performance without EAGLE-3 for three libraries that support speculative decoding with EAGLE-3 for gpt-oss-120b.
- vLLM 0.14.1
uv pip install vllm
- TensorRT-LLM 1.3.0rc1
docker run --rm -it --ipc host --gpus all --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc1
- SGLang 0.5.8
uv pip install sglang
We referred to the following for executing each library:
https://github.com/vllm-project/recipes/blob/main/OpenAI/GPT-OSS.md#launch-the-vllm-server
vllm serve openai/gpt-oss-120b --config GPT-OSS_Blackwell.yaml
kv-cache-dtype: fp8
compilation-config: '{"pass_config":{"fuse_allreduce_rms":true,"eliminate_noops":true}}'
async-scheduling: true
no-enable-prefix-caching: true
max-cudagraph-capture-size: 2048
max-num-batched-tokens: 8192
stream-interval: 20
https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.html
trtllm-serve "openai/gpt-oss-120b" --host 0.0.0.0 --config low_latency.yaml
enable_attention_dp: false
cuda_graph_config:
max_batch_size: 10
enable_padding: true
https://lmsys.org/blog/2025-08-27-gpt-oss
python3 -m sglang.launch_server --model-path openai/gpt-oss-120b
For measurement, we used the vllm bench script. Since the measurement is done via the OpenAI-compatible API, the same measurement script is used regardless of the library.
vllm bench serve --model openai/gpt-oss-120b --dataset-name random --random-input-len 1024 --random-output-len 1024 --ignore-eos --max-concurrency 1 --num-prompts 32
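The TTFT and TPOT figures reported below are derived from per-request timestamps. A minimal sketch of the usual definitions (to our understanding, the TPOT formula matches how vllm bench computes its mean); the timestamps used here are illustrative.

```python
def request_metrics(t_sent, t_first_token, t_done, n_output_tokens):
    """Per-request latency metrics as commonly defined by serving benchmarks."""
    ttft = t_first_token - t_sent  # time to first token (prefill + queueing)
    # TPOT: average gap between the 2nd..nth output tokens (decode speed)
    tpot = (t_done - t_first_token) / (n_output_tokens - 1)
    return ttft, tpot

# Illustrative request: sent at t=0, first token at 0.098 s,
# finished at 5.77 s, 1024 output tokens.
ttft, tpot = request_metrics(0.0, 0.098, 5.77, 1024)
print(f"TTFT={ttft * 1000:.1f} ms, TPOT={tpot * 1000:.2f} ms")
```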
Results
vLLM and TensorRT-LLM successfully deployed with the above commands, and the measurements yielded the following results (bold indicates the best result in each row). Although TTFT (time to first token) was slightly faster with TensorRT-LLM, vLLM was significantly better in all other aspects, making it more suitable as the baseline library.
| Metric | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|
| Output token throughput (tok/s) | **177.71** | 116.55 | – |
| Total token throughput (tok/s) | **355.43** | 233.09 | – |
| Mean TTFT (ms) | 98.43 | **82.80** | – |
| Mean TPOT (ms) | **5.54** | 8.51 | – |
Note that SGLang could not run gpt-oss-120b. As of version 0.5.8, there is a bug preventing the build of Blackwell-specific kernels for mxfp4 MoE (https://github.com/sgl-project/sglang/issues/13061). Although a fix for this bug is described at https://github.com/sgl-project/sglang/issues/13342#issuecomment-3712592985, applying it led to other errors.
Incidentally, NVIDIA DGX Spark Benchmarks already contain a measurement report, suggesting that a fix is possible (and that this is a temporary regression). With 1 GPU and no parallel inference, a 2048-token decode achieved 207.96 tokens/sec, indicating that SGLang might be faster than vLLM if it could be made to work. However, we did not pursue this further in this study.
Experiment 2: Models
The following four EAGLE-3 models trained for gpt-oss-120b are publicly available:
- nvidia/gpt-oss-120b-Eagle3-short-context (short_context)
- nvidia/gpt-oss-120b-Eagle3-long-context (long_context)
- nvidia/gpt-oss-120b-Eagle3-throughput (throughput)
- lmsys/EAGLE3-gpt-oss-120b-bf16 (lmsys)
We measured these models by swapping them in vLLM to determine which one performed best (the model name can be changed by modifying speculative_config.model). For the throughput model, we also measured with num_speculative_tokens set to 1 and 3.
kv-cache-dtype: fp8
compilation-config: '{"pass_config":{"fuse_allreduce_rms":true,"eliminate_noops":true}}'
async-scheduling: true
no-enable-prefix-caching: true
max-cudagraph-capture-size: 2048
max-num-batched-tokens: 8192
stream-interval: 20
speculative-config: '{"model":"nvidia/gpt-oss-120b-Eagle3-throughput","num_speculative_tokens":1,"method":"eagle3","draft_tensor_parallel_size":1}'
Benchmarks were conducted with the --max-concurrency option set to 1 or 4.
vllm bench serve --model openai/gpt-oss-120b --dataset-name random --random-input-len 1024 --random-output-len 1024 --ignore-eos --max-concurrency 4 --num-prompts 64
Results
The results of this experiment are shown below (bold indicates the best value for each item). Among the four models, the throughput model slightly outperformed the baseline in all metrics, demonstrating an instance where EAGLE-3 improved speed.
Conversely, both the short_context and lmsys models performed worse than the baseline across all metrics. For the throughput model, performance was below baseline when num_speculative_tokens was 3, improving only when set to 1. This indicates a need for deeper investigation into configuration settings.
Furthermore, the long_context model failed to deploy and resulted in errors even after configuration changes. Since we observed performance improvements with the throughput model, we did not pursue this further.
| Metric | baseline | short_context | long_context | baseline | throughput (num=1) | throughput (num=3) | lmsys |
|---|---|---|---|---|---|---|---|
| Maximum request concurrency | 4 | 4 | – | 1 | 1 | 1 | 1 |
| Output token throughput (tok/s) | **390.66** | 209.32 | – | 177.71 | **180.67** | 124.73 | 125.09 |
| Total token throughput (tok/s) | **781.33** | 418.64 | – | 355.43 | **361.35** | 258.95 | 250.17 |
| Mean TTFT (ms) | **255.92** | 7166.02 | – | 98.43 | 95.12 | **78.23** | 453.62 |
| Mean TPOT (ms) | **10.00** | 11.84 | – | 5.54 | **5.45** | 8.43 | 7.56 |
| Acceptance rate (%) | – | 27.46 | – | – | 70.58 | 19.79 | 50.85 |
| Acceptance length | – | 1.82 | – | – | 1.71 | 1.59 | 1.51 |
| Acceptance rate @ Position 0 (%) | – | 43.64 | – | – | 70.58 | 54.96 | 50.85 |
| Acceptance rate @ Position 1 (%) | – | 26.75 | – | – | – | 4.26 | – |
| Acceptance rate @ Position 2 (%) | – | 11.97 | – | – | – | 0.16 | – |
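The acceptance-length figures are consistent with a simple identity: each verification pass always yields one token from the target model, plus one extra token for every accepted draft position. Checking against the per-position rates in the table:

```python
def acceptance_length(position_rates):
    # Every verification pass emits one token from the target model;
    # each accepted draft position contributes one more on average.
    return 1.0 + sum(position_rates)

# Per-position acceptance rates taken from the table above
print(round(acceptance_length([0.4364, 0.2675, 0.1197]), 2))  # short_context -> 1.82
print(round(acceptance_length([0.5496, 0.0426, 0.0016]), 2))  # throughput (num=3) -> 1.59
```

Both values reproduce the reported acceptance lengths, which also shows why throughput (num=3) gains little from its extra draft positions: positions 1 and 2 are almost never accepted.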
Experiment 3: Datasets and Access Patterns
Building on the previous experiments, we conducted measurements using the throughput model with vLLM under various datasets and access patterns.
Considering real-world use cases, we used the following seven dataset patterns:
- Random input
  - random_s: Single-turn chat (--random-input-len 1024 --random-output-len 256)
  - random_m: Long chat / sub-agent (--random-input-len 10240 --random-output-len 256)
  - random_l: Reasoning (--random-input-len 2048 --random-output-len 2048)
  - random_x: Agent (--random-input-len 10240 --random-output-len 2048)
- sharegpt
- spec_bench
- burstgpt
We experimented with the following five access patterns:
- Throughput
  - throughput_low (sequential access): --request-rate=inf --max-concurrency 1 --num-prompts 16
  - throughput_high (constant 32 concurrent accesses): --request-rate=inf --max-concurrency 32 --num-prompts 256
- Burstiness (high instantaneous load)
  - burstiness_high (sharper-than-usual access spikes): --request-rate=64 --burstiness 0.2 --max-concurrency 32 --num-prompts 256
  - burstiness_mid (close to typical usage): --request-rate=8 --burstiness 1.0 --max-concurrency 32 --num-prompts 256
  - burstiness_low (more uniform access than usual): --request-rate=2 --burstiness 4.0 --max-concurrency 32 --num-prompts 256
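The --burstiness flag controls the shape of inter-arrival gaps: to our understanding, vllm bench draws gamma-distributed gaps with shape equal to the burstiness value and mean 1/request-rate, so burstiness 1.0 reduces to Poisson arrivals and smaller values produce spikier traffic. A sketch of that sampling:

```python
import random
import statistics

random.seed(0)

def interarrival_gaps(request_rate, burstiness, n):
    """Gamma-distributed gaps: shape=burstiness, mean=1/request_rate."""
    scale = 1.0 / (request_rate * burstiness)
    return [random.gammavariate(burstiness, scale) for _ in range(n)]

for b in (0.2, 1.0, 4.0):
    gaps = interarrival_gaps(request_rate=8, burstiness=b, n=20000)
    print(f"burstiness={b}: mean={statistics.mean(gaps):.3f}s, "
          f"stdev={statistics.stdev(gaps):.3f}s")
```

All three settings have the same mean gap; only the variance changes, which is what makes burstiness_high spiky and burstiness_low near-uniform.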
Results
We summarized the relative values of each metric when using EAGLE-3 compared to the baseline without EAGLE-3.
Several trends emerge from the inference speed table. First, the throughput_low setting consistently degraded, while every other access pattern showed speedups. This is a distinctive result for EAGLE-3, since conventional speculative decoding typically helps most in exactly such low-throughput regimes. Under the best conditions, a 27% speedup is achieved with no change in output accuracy.
Furthermore, as token lengths increased from random_s to random_m, random_l, and random_x, the speed improvement from EAGLE-3 became more significant. This suggests that for longer tokens, the computational load on the larger model increases, making the use of a draft model relatively more advantageous.
| Output token throughput | random_x | random_l | random_m | random_s | sharegpt | spec_bench | burstgpt |
|---|---|---|---|---|---|---|---|
| burstiness_high | 125.4% | 108.5% | 97.1% | 92.1% | 100.8% | 102.6% | 108.3% |
| burstiness_low | 126.2% | 114.1% | 97.7% | 99.0% | 99.3% | 99.7% | 101.8% |
| burstiness_mid | 127.0% | 112.4% | 97.1% | 100.4% | 102.0% | 106.4% | 107.7% |
| throughput_high | 124.5% | 109.2% | 92.7% | 90.3% | 108.5% | 107.7% | 107.8% |
| throughput_low | 68.3% | 83.3% | 63.3% | 79.3% | 83.8% | 91.9% | 81.5% |
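A toy cost model helps interpret these relative numbers: one speculative cycle runs k draft steps plus one verification pass and yields the acceptance length in tokens, versus one token per pass at baseline. The draft cost and verification overhead below are assumed values for illustration, not measurements.

```python
def speedup(accept_len, k, draft_cost, verify_overhead=1.0):
    """Toy cost model (all costs in units of one baseline decode pass):
    one cycle = k draft steps + one verification pass, yielding
    accept_len tokens; the baseline yields 1 token per pass."""
    return accept_len / (k * draft_cost + verify_overhead)

# Assumed: draft step ~10% of a target pass, verification ~10% pricier
# than a plain decode step because it scores k+1 positions at once.
print(speedup(1.71, k=1, draft_cost=0.10, verify_overhead=1.10))
# A lower acceptance length with a pricier draft flips the sign:
print(speedup(1.30, k=1, draft_cost=0.30, verify_overhead=1.10))
```

When the cycle cost exceeds the acceptance length the model predicts a net slowdown, which is the regime the degraded cells in the table correspond to.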
Also, the acceptance rate of the draft model tended to improve with longer token lengths, which is likely contributing to the speed improvement.
| Accept ratio | random_x | random_l | random_m | random_s | sharegpt | spec_bench | burstgpt |
|---|---|---|---|---|---|---|---|
| burstiness_high | 71.1% | 70.9% | 36.9% | 46.6% | 53.0% | 70.6% | 58.5% |
| burstiness_low | 71.4% | 72.0% | 36.8% | 45.7% | 53.6% | 70.4% | 58.5% |
| burstiness_mid | 72.0% | 71.8% | 36.5% | 46.7% | 53.9% | 70.4% | 57.9% |
| throughput_high | 70.3% | 71.8% | 36.9% | 45.1% | 53.8% | 70.4% | 58.2% |
| throughput_low | 70.1% | 65.2% | 37.9% | 49.6% | 53.6% | 70.4% | 51.0% |
Furthermore, TTFT (time to first token) decreases significantly for the longer sequences. This is a secondary effect of the higher decode throughput: requests finish sooner, so new queries spend less time waiting in the queue.
| TTFT | random_x | random_l | random_m | random_s | sharegpt | spec_bench | burstgpt |
|---|---|---|---|---|---|---|---|
| burstiness_high | 34.1% | 44.9% | 86.3% | 50.9% | 186.1% | 234.7% | 130.7% |
| burstiness_low | 24.9% | 114.9% | 74.1% | 129.8% | 214.1% | 226.7% | 123.3% |
| burstiness_mid | 32.0% | 69.2% | 81.0% | 110.0% | 163.2% | 180.4% | 128.5% |
| throughput_high | 35.0% | 49.7% | 86.7% | 52.1% | 144.3% | 63.3% | 180.5% |
| throughput_low | 109.6% | 109.1% | 109.9% | 113.5% | 117.2% | 150.7% | 111.0% |
Summary
To summarize the experimental results, the following conclusions can be drawn:
- Experiment 1: vLLM is the most suitable library for gpt-oss-120b inference.
- Experiment 2: The nvidia/gpt-oss-120b-Eagle3-throughput model is the most compatible EAGLE-3 model for gpt-oss-120b.
- Experiment 3: gpt-oss-120b is accelerated by EAGLE-3 under conditions of moderate access frequency and long input texts.
Speculative decoding, including EAGLE-3, is a complex technique where performance can either improve or degrade depending on the model, settings, and usage scenario. We intend to continue monitoring new developments in this technology.