
Jetson AGX Orin vs. AGX Thor: The Two Decisions You Must Lock Before BOM Freeze

Changgyu Choi | May 18, 2026 | AI Engineering Edge Robotics

This article was written with the assistance of AI.

Why BOM Freeze Is the Last Low-Cost Decision Point

During early development, your perception team’s VLM roadmap is taking shape — and the memory requirements are beginning to exceed what the Orin NX in your preliminary BOM can support. Upgrading to a higher-end platform could solve the issue, but switching at this stage is not a simple component change: it means redesigning the carrier board, thermal envelope, power delivery, software stack, and validation plan together. The hardware isn’t frozen yet. But the window to change it without a full redesign is closing.

That is why the platform choice must be evaluated against the 18–24-month production roadmap, not the current MVP scope. Deferring either decision converts a planning choice into a rework event.

Platform transition — Jetson lineup (2025–2026)

Jetson AGX Orin (Ampere architecture, 2022): CNN-centric edge AI platform, 15–60 W configurable, up to 275 sparse INT8 TOPS at MAXN [1].

Jetson AGX Thor (Blackwell architecture, 2025): Physical AI platform for VLM/VLA-enabled robotics, humanoid control, and on-device LLM inference. Dev kit GA: August 2025; T5000 production module GA: August 2025, available from worldwide distribution partners [2].

Scope note. This guide covers on-device inference for the Orin and AGX Thor families. Cloud vs. edge evaluation is out of scope.

TL;DR — Two decisions, one window:

  • Decision 1 (platform): Commit to AGX Thor if the 2-year roadmap includes VLM, VLA, or on-device LLM; otherwise Orin. Getting this wrong triggers carrier board redesign, JetPack migration, and re-certification; see Annotation D for the cost estimate derivation.
  • Decision 2 (optimization): Budget 4–8 engineer-weeks per platform generation for TensorRT pipeline work before hardware integration. Without it, production latency will exceed theoretical estimates by a margin that surfaces at the worst moment in the program.

Decision 1 — Platform: Jetson AGX Orin or AGX Thor

Physical AI workloads, as used throughout this article: on-device LLM, VLM, and VLA inference. VLM = Vision-Language Model; VLA = Vision-Language-Action; LLM = text-only Large Language Model.

| Decision Axis | Choose Orin | Choose AGX Thor |
| --- | --- | --- |
| Workload | CNN perception (detection, segmentation, tracking) [1] | Physical AI workloads (on-device LLM, VLM, or VLA), now or within the program roadmap [2] |
| Concurrency | Single or few models; Orin unified memory is sufficient | Multiple concurrent models (perception + language + planning); Orin memory is the binding constraint |
| Thermal | 15–60 W configurable; choose Orin when the system has a hard thermal cap below ~40 W [1] | 40–130 W configurable [2] |
| Cost | Production module from $1,599 (1,000-unit pricing; verify current pricing with distributor). Dominant constraint at 200+ unit volume | Production module from $2,999 (1,000-unit pricing) |
| Maturity | Mature platform; production-grade tooling available | Dev kit and T5000 module GA August 2025; JetPack 7.0 production release August 2025, JetPack 7.1 released January 2026; toolchain stabilizing. Verify TensorRT-LLM and Transformer Engine version compatibility for your target use case |
| Engine assets | Substantial TensorRT calibration investment to preserve | Starting fresh; no Orin engine debt |

Hybrid trigger. Choose a hybrid approach when some subsystems require Physical AI workloads, while others can be handled with CNN-only processing. This is especially appropriate when certification requirements or unit-cost considerations differ significantly across subsystems.

The transition risk. Run the platform decision against the 2-year workload roadmap, not the current MVP. A program selecting AGX Orin 64GB in 2026 for a workload requiring VLM inference by 2028 is scheduling its own hardware-refresh cycle.

Platform Selection Decision Tree

Three questions drive the recommendation:

  • Question 1: Does the 18–24-month program roadmap include VLM, VLA, or on-device LLM?
  • Question 2: Is the system thermal cap below 40 W?
  • Question 3: Are workload classes mixed across subsystems (CNN perception and physical AI co-deployed)?

Each recommendation carries a next action:

  • Jetson AGX Orin: Budget 4–8 engineer-weeks for TensorRT INT8 calibration before hardware integration.
  • Jetson AGX Thor: Validate JetPack 7 toolchain readiness; plan a parallel build pipeline alongside Orin.
  • Hybrid Architecture: Scope dual carrier-board design, inter-module communication overhead, and additive certification cost.

The Jetson AGX Orin Architecture: Ampere for Real-Time Perception

NVIDIA's Jetson AGX Orin product page specifies the following [1]:

| Module | GPU | DLA | Unified Memory | Memory BW | Max TDP |
| --- | --- | --- | --- | --- | --- |
| Jetson AGX Orin 64GB | 2048 CUDA + 64 Tensor Cores (Ampere) | 2× DLA 2.0 | 64 GB LPDDR5 | 204.8 GB/s | 60 W (MAXN) |
| Jetson AGX Orin 32GB | 1792 CUDA + 56 Tensor Cores (Ampere) | 2× DLA 2.0 | 32 GB LPDDR5 | 204.8 GB/s | 40 W (MAXN) |
| Jetson Orin NX 16GB | 1024 CUDA + 32 Tensor Cores (Ampere) | 1× DLA 2.0 | 16 GB LPDDR5 | 102.4 GB/s | 25 W (MAXN) |
| Jetson Orin NX 8GB | 1024 CUDA + 32 Tensor Cores (Ampere) | 1× DLA 2.0 | 8 GB LPDDR5 | 102.4 GB/s | 20 W (MAXN) |

All figures at MAXN (maximum performance) power mode. Orin supports lower configurable TDP modes; verify configurable envelope against the current Jetson AGX Orin Series datasheet before finalizing thermal design [1].

The DLA (Deep Learning Accelerator) handles convolutions, pooling, activations, and batch normalization at higher energy efficiency than the GPU. Running DLA for the backbone and GPU for detection heads concurrently maximizes INT8 throughput per watt. To identify which layers fall back from DLA to GPU:

trtexec --dumpLayerInfo

Profile on target hardware before committing to the DLA+GPU parallel execution pattern. [Annotation B]
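The trtexec flag above needs a model and a DLA target to be useful. As a minimal sketch, the helper below assembles a fuller invocation as an argument list. The function name and the `backbone.onnx` path are hypothetical, and the choice of INT8 with GPU fallback enabled is an assumption about a typical Orin perception build, not a setting taken from this article. Confirm the flags against the trtexec version shipped with your JetPack release.

```python
import shlex


def build_dla_probe_command(onnx_path: str, dla_core: int = 0) -> list[str]:
    """Assemble a trtexec call that targets one DLA core and prints layer info.

    Assumption: an ONNX model built in INT8, with layers the DLA cannot run
    allowed to fall back to the GPU so the build does not fail outright.
    """
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        "--int8",                         # DLA runs INT8 or FP16 only
        f"--useDLACore={dla_core}",       # place supported layers on this DLA core
        "--allowGPUFallback",             # unsupported layers go to the GPU
        "--profilingVerbosity=detailed",  # include per-layer detail in the dump
        "--dumpLayerInfo",                # print the engine's layer information
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command for the target device.
    print(shlex.join(build_dla_probe_command("backbone.onnx")))
```

Run the printed command on the Jetson itself; layer placement decided on a development workstation does not carry over to the target.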

The Jetson AGX Thor Architecture: Blackwell for Physical AI Workloads

AGX Thor's value proposition is not faster CNNs — it is physical AI workloads that previously did not fit on Orin: large VLMs, multi-billion-parameter LLMs, and VLA policies.

| Module | GPU | Unified Memory | Memory BW | TDP Range |
| --- | --- | --- | --- | --- |
| AGX Thor dev kit | Blackwell GPU, Transformer Engine, up to 2070 FP4 TFLOPS sparse [2] | 128 GB LPDDR5X | 273 GB/s | 40–130 W |
| T5000 (production module) | Blackwell GPU, same GPU class as dev kit [2] | 128 GB LPDDR5X | 273 GB/s | 40–130 W |

| Metric | AGX Orin 64GB | AGX Thor | Ratio |
| --- | --- | --- | --- |
| Peak compute (sparse) | 275 sparse INT8 TOPS (MAXN, 60 W) [1] | Up to 2070 sparse FP4 TFLOPS (130 W) [2] | ~7.5× (not like-for-like) |
| Dense-to-dense (planning estimate only) | ~138 INT8-equiv. TOPS | ~517 INT8-equiv. TOPS | ~2–4× (see Annotation C) |
| Memory bandwidth | 204.8 GB/s [1] | 273 GB/s [2] | ~1.33× vs. AGX Orin 64GB |

On the 7.5× headline: not like-for-like. NVIDIA's stated ~7.5× compares peak sparse FP4 on AGX Thor against peak sparse INT8 on AGX Orin [2], so it mixes two precisions. For dense-to-dense compute uplift, plan on ~2–4× (planning estimate only; see Annotation C). NVIDIA has not published a like-for-like dense figure for AGX Thor as of Q1 2026.

AGX Thor delivers five architectural advances over Orin 64 [2][3]:

  • Blackwell GPU with Transformer Engine. Narrows the CNN-vs-transformer latency gap relevant to VLM/VLA workloads.
  • 128 GB LPDDR5X unified memory. 2× capacity vs. AGX Orin 64. Enables model sizes and concurrent workloads infeasible on Orin NX 16GB.
  • Arm Neoverse-V3AE CPU. Generational step up from Cortex-A78AE on Orin.
  • Expanded power envelope (40–130 W configurable). 40 W keeps AGX Thor viable for mobile robotics; 130 W exposes full physical AI throughput [2].
  • Native FP4 via Transformer Engine. First Jetson to execute FP4 natively in hardware [2].

Model Architecture Choices: CNN vs. Transformer

On Orin. At standard robotics resolutions (640×480 and above), multi-head self-attention is bandwidth-bound on Ampere. Attention's QKV matmuls and softmax have data dependencies that limit layer fusion opportunities in TensorRT, and the memory-bound softmax and LayerNorm steps dominate latency without a dedicated Transformer Engine. CNN architectures — YOLOv8, EfficientDet, RT-DETR with CNN backbone — typically deliver lower latency at equivalent accuracy for real-time perception on Orin.

On AGX Thor. The Blackwell Transformer Engine substantially narrows the CNN-vs-transformer latency gap. ViT perception models, VLMs, and VLA policies are where AGX Thor's compute and bandwidth advantages show up most clearly [2][3]. If your roadmap includes replacing a CNN stack with a multimodal transformer or VLA policy within 18 months, AGX Thor is the right design target now — even if the first-generation deployment runs a CNN.

Model size requires full conditions, not a parameter count. "7B model" is incomplete without precision, context length, batch size, and concurrent workload count. At FP16, a 7B model requires approximately 14 GB for weights alone, before KV-cache and activations. At INT4 weight-only quantization, the same model compresses to approximately 3.5–4 GB for weights.
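The arithmetic behind those figures can be sketched in a few lines. The function names are illustrative, and the KV-cache shape in the usage note (32 layers, 32 KV heads, head dimension 128) is an assumed example of a 7B-class decoder, not a figure from this article. The weight estimate ignores quantization scales and zero-points, which is why real INT4 footprints land nearer the upper end of the 3.5–4 GB range.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights only, in decimal GB: parameters x bits per weight / 8."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in decimal GB.

    Two tensors (K and V) are stored per layer, each of shape
    [batch, kv_heads, context_len, head_dim].
    """
    elems = 2 * layers * kv_heads * head_dim * context_len * batch
    return elems * bytes_per_elem / 1e9


if __name__ == "__main__":
    print(f"7B @ FP16 weights: {weight_memory_gb(7, 16):.1f} GB")
    print(f"7B @ INT4 weights: {weight_memory_gb(7, 4):.1f} GB")
    # Assumed 7B-class shape, 4096-token context, batch 1, FP16 cache.
    print(f"KV cache:          {kv_cache_gb(32, 32, 128, 4096, 1):.2f} GB")
```

Under those assumptions the cache alone adds roughly 2 GB per concurrent sequence at a 4096-token context, which is why context length and batch size belong in the sizing statement alongside the parameter count.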


Decision 2 — Optimization Investment: Cost and Schedule Impact

On paper, the gap between theoretical and production latency looks like a tuning problem. In practice, it is a budget and schedule risk that surfaces at the worst moment in the program — hardware integration.

The business case is asymmetric: planned optimization cost is bounded. Deferred cost scales with when the problem is discovered.

The Production Latency Gap

End-to-end pipeline latency on Jetson-class hardware consistently exceeds GPU compute ceilings. Preprocessing, scheduling, memory movement, and kernel launch overhead are additive costs outside the inference engine — and they are not visible in GFLOPs-derived estimates.

One community-measured data point illustrates the magnitude: YOLOv8s on Jetson Orin NX 16GB measured 7.94 ms end-to-end (FP16, 640×640 input, batch 1, full pre/postprocessing pipeline included) against a GFLOPs-derived theoretical estimate of ~1.1 ms, a ratio of roughly 7× for this specific configuration [7]. NVIDIA's MLPerf figures for ResNet-50 (0.64 ms INT8) and RetinaNet (11.67 ms INT8) on AGX Orin represent single-model optimized floors under MAXN conditions, not production pipeline estimates [4].

Theoretical vs. Measured Latency Gap

| Model | Precision | Latency | Type | Conditions |
| --- | --- | --- | --- | --- |
| YOLOv8s (theoretical) | FP16 | ~1.1 ms | ⚠️ Do not use for planning | GFLOPs-derived only [7] |
| YOLOv8s | FP16 | 7.94 ms | Full pipeline, community-measured | Orin NX 16GB; pre/postprocessing included [7] |
| ResNet-50 | INT8 | 0.64 ms | MLPerf optimized floor | AGX Orin; MAXN; single-model [4] |
| RetinaNet | INT8 | 11.67 ms | MLPerf optimized floor | AGX Orin; MAXN; single-model [4] |

The gap between the optimized floor and full pipeline latency is your optimization budget.

Three factors consistently widen this gap on Jetson:

  • Bandwidth as binding constraint. Transformer-heavy and large-activation workloads are memory-bandwidth-bound. AGX Thor's 273 GB/s widens the envelope, but large VLMs will still saturate it at inference scale.
  • Unified memory contention. GPU and CPU share the same DRAM pool. CPU preprocessing draws from the same LPDDR5 as the inference engine — profile this contention with Nsight Systems before committing to a memory layout.
  • P99 latency, not mean. A single deadline miss that triggers an emergency stop is a safety event. Development-server mean latency consistently understates P99 exposure on Jetson.
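To make the mean-versus-P99 point concrete, here is a minimal sketch that summarizes per-frame latency samples against a deadline. The nearest-rank percentile definition, the function names, and the sample values in the usage note are assumptions chosen for illustration, not measurements from any Jetson.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def deadline_report(latencies_ms: list[float], deadline_ms: float) -> dict:
    """Summarize a latency trace against a hard per-frame deadline."""
    return {
        "mean_ms": sum(latencies_ms) / len(latencies_ms),
        "p99_ms": percentile(latencies_ms, 99),
        "max_ms": max(latencies_ms),
        "misses": sum(1 for x in latencies_ms if x > deadline_ms),
    }


if __name__ == "__main__":
    # Synthetic trace: 98 frames at 8 ms, 2 stalled frames at 30 ms.
    trace = [8.0] * 98 + [30.0] * 2
    print(deadline_report(trace, deadline_ms=16.0))
```

On that synthetic trace the mean sits well inside a 16 ms deadline while P99 is 30 ms with two misses, which is the pattern a mean-only report hides.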

Before optimizing anything: run a 60-second Nsight Systems capture. It reveals whether the bottleneck is bandwidth, compute, or pipeline stalls — and determines whether TensorRT calibration, memory layout changes, or CUDA stream restructuring is the highest-leverage fix.
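One way to script that capture is sketched below. The helper names, the output name `pipeline_capture`, and the `./perception_pipeline` executable are hypothetical. The trace selection (CUDA, NVTX, OS runtime) is an assumed starting point rather than a recommendation from this article, so check the options against the Nsight Systems version installed on your target.

```python
import subprocess


def build_nsys_capture(app_argv: list[str], output: str = "pipeline_capture",
                       seconds: int = 60) -> list[str]:
    """Return an argv list for a time-boxed `nsys profile` run."""
    return [
        "nsys", "profile",
        "--trace=cuda,nvtx,osrt",   # CUDA API/kernels, NVTX ranges, OS runtime calls
        f"--duration={seconds}",    # stop collecting after this many seconds
        f"--output={output}",       # report file name
        "--force-overwrite=true",   # replace a report left by an earlier run
        *app_argv,                  # the pipeline under test and its arguments
    ]


def run_capture(app_argv: list[str]) -> None:
    """Run the capture on the target; raises CalledProcessError on failure."""
    subprocess.run(build_nsys_capture(app_argv), check=True)


if __name__ == "__main__":
    print(" ".join(build_nsys_capture(["./perception_pipeline"])))
```

Adding NVTX ranges around the preprocessing, inference, and postprocessing stages in the application makes the resulting timeline much easier to attribute to pipeline stages.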

Precision Selection and Pipeline Reference

The gap is real. The next question is which precision to target to close it.

The decision comes down to workload:

  • CNN perception only → INT8, Orin or Thor
  • LLM/VLM on-device inference → FP4, AGX Thor only

CNN Perception (Detection / Segmentation / Tracking)

| Precision | Orin | Thor | Note |
| --- | --- | --- | --- |
| INT8 (✓ Recommended) | Supported [1] | Supported [2] | Production default for perception stacks |
| FP16 | Supported [1] | Supported [2] | Use when INT8 accuracy loss is unacceptable |

LLM / VLM Inference

| Precision | Orin | Thor | Note |
| --- | --- | --- | --- |
| FP16 | Supported [1] | Supported [2] | Orin: only for a 7B model at batch 1 with no concurrent workload |
| INT4 (weight-only) | Supported [1] | Supported [2] | Orin: validate on target hardware; dequantization overhead applies |
| FP8 | — | Supported [2] | Boundary layers; transformer middle ground |
| FP4 (✓ Recommended) | — | Supported [2] | Transformer body; AGX Thor only, via Blackwell Transformer Engine |

Concurrent stack note. Running CNN and LLM/VLM simultaneously on AGX Thor: use INT8 for the perception stack and FP4 for the transformer body. Profile unified memory contention with Nsight Systems — both workloads draw from the same LPDDR5X pool.


Act Before BOM Freeze

Two decisions share the same window — and both must be resolved before BOM Freeze.

Decision 1 — Platform: Commit to AGX Thor if the 2-year roadmap includes VLM, VLA, or on-device LLM. Otherwise Orin. Getting this wrong triggers carrier board redesign, JetPack migration, and re-certification — $120K–$320K and a schedule slip that compounds across every affected component.

Decision 2 — Optimization: Budget 4–8 engineer-weeks for TensorRT pipeline build, INT8/FP4 calibration, and CUDA stream profiling before hardware integration. At 200+ unit production volume, a single hardware tier reduction enabled by optimization typically covers the full $24K–$80K investment.

Before BOM Freeze, three actions are required:

  • Profile your 2-year workload roadmap against the CNN-only vs. LLM/VLM criterion — not the first-release scope.
  • Validate module selection against optimized INT8 benchmarks for Orin or FP4/FP8 benchmarks for AGX Thor — not GFLOPs-derived estimates.
  • Run a 60-second Nsight Systems capture on your current pipeline to identify the binding bottleneck.

Deferring any of these converts a planning choice into a rework event.

All cost figures are planning estimates. See Annotation D for full derivation.

Annotations

Annotation A — AGX Orin 32GB vs. 64GB: practical differences

The 32GB module is a reduced configuration relative to the 64GB: 1792 CUDA cores vs. 2048, max GPU frequency 930 MHz vs. 1.3 GHz, 8-core vs. 12-core CPU, 40 W vs. 60 W max TDP, and 200 vs. 275 INT8 TOPS at MAXN [1]. The 32GB module is appropriate for systems with a hard 40 W thermal ceiling and does not deliver the 64GB’s MAXN-equivalent throughput.

Annotation B — DLA layer identification

To identify DLA layers that fall back to GPU: trtexec --dumpLayerInfo. Profile on target hardware before committing to the DLA+GPU parallel execution pattern. AGX Thor does not include a DLA [2]; Orin DLA offload paths have no direct equivalent on AGX Thor — validate CNN throughput on AGX Thor target hardware independently [5].

Annotation C — Dense-to-dense derivation (planning estimate only)

2070 sparse FP4 TFLOPS ÷ 2 = 1035 dense FP4 TFLOPS. Treating Blackwell FP4 as ~2× INT8 per tensor core: 1035 ÷ 2 ≈ 517 INT8-equivalent TOPS. AGX Orin 64GB: 275 sparse INT8 TOPS ÷ 2 ≈ 138 INT8-equivalent TOPS. Ratio: 517 ÷ 138 ≈ 3.7× (about 3.76× before rounding); the "2–4×" range accommodates SM efficiency variation.

NVIDIA has not published a like-for-like dense figure for AGX Thor as of Q1 2026. Do not use this estimate as a throughput guarantee in production planning.
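The derivation can be reproduced with a short sketch. The function name and both conversion factors are the planning assumptions stated in this annotation (2:1 structured sparsity, FP4 treated as 2× INT8 throughput), not published NVIDIA figures.

```python
def dense_int8_equiv_tops(sparse_peak: float,
                          sparsity_factor: float = 2.0,
                          precision_factor: float = 1.0) -> float:
    """Convert a sparse peak figure to a dense INT8-equivalent estimate.

    sparsity_factor: assumed speed-up that the sparse figure includes.
    precision_factor: assumed throughput of the quoted precision relative
    to INT8 (2.0 for FP4 under this annotation's assumption).
    """
    return sparse_peak / sparsity_factor / precision_factor


thor = dense_int8_equiv_tops(2070, precision_factor=2.0)  # FP4 -> INT8-equivalent
orin = dense_int8_equiv_tops(275)                         # already INT8

if __name__ == "__main__":
    print(f"AGX Thor: {thor:.1f} INT8-equiv. TOPS")
    print(f"AGX Orin: {orin:.1f} INT8-equiv. TOPS")
    print(f"Ratio:    {thor / orin:.2f}x")
```

Changing `precision_factor` shows how sensitive the ratio is to the FP4-to-INT8 assumption, which is the main reason the article quotes a range rather than a single number.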

Annotation D — Cost estimate derivation (planning calculation, not market data)

No public standard rate card exists for this specific hardware redesign scenario. All figures are internal calculation-based estimates.

  • Optimization investment: 4–8 engineer-weeks × 40 hrs/week × $150–$250/hr = $24,000–$80,000.
  • Mid-program redesign (labor): 8–16 engineer-weeks × 40 hrs/week × $150–$250/hr = $48,000–$160,000.
  • Full redesign ($120,000–$320,000): Adds carrier-board redesign, re-certification, and integration testing overhead. Treat as order-of-magnitude planning figure only.
  • Rates are directional for embedded-GPU specialist labor in US/EU markets.
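The ranges above follow from one multiplication, sketched here so the inputs can be swapped for your own rates. The function names are illustrative. The break-even helper is an added assumption: it treats a "tier reduction" as the $1,400 difference between the $2,999 and $1,599 module prices quoted in the Decision 1 table, which may not match the tier step in your program.

```python
import math

HOURS_PER_WEEK = 40


def labor_cost_range(weeks_lo: int, weeks_hi: int,
                     rate_lo: int = 150, rate_hi: int = 250) -> tuple[int, int]:
    """Low and high labor cost in USD for a range of engineer-weeks."""
    return (weeks_lo * HOURS_PER_WEEK * rate_lo,
            weeks_hi * HOURS_PER_WEEK * rate_hi)


def break_even_units(investment: float, saving_per_unit: float) -> int:
    """Units needed before per-unit savings repay a one-off investment."""
    return math.ceil(investment / saving_per_unit)


if __name__ == "__main__":
    opt_lo, opt_hi = labor_cost_range(4, 8)        # optimization investment
    redesign_lo, redesign_hi = labor_cost_range(8, 16)  # mid-program redesign labor
    print(f"Optimization:   ${opt_lo:,} to ${opt_hi:,}")
    print(f"Redesign labor: ${redesign_lo:,} to ${redesign_hi:,}")
    # Assumed per-unit saving: difference between the two quoted module prices.
    saving = 2999 - 1599
    print(f"Break-even: {break_even_units(opt_lo, saving)} to "
          f"{break_even_units(opt_hi, saving)} units")
```

Under that assumed price difference the optimization investment is repaid within a few dozen units, comfortably below the 200-unit volume cited in the article.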

References

1. NVIDIA, Jetson AGX Orin Product Page and Technical Specifications. https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/
3.
4. NVIDIA, Jetson Benchmarks (MLPerf Inference). https://developer.nvidia.com/embedded/jetson-benchmarks
6.
7. NVIDIA Developer Forums, "Understanding Real-World Latency vs. Theoretical Estimates on Jetson Orin NX for YOLOv8s." https://forums.developer.nvidia.com/t/nderstanding-real-world-latency-vs-theoretical-estimates-on-jetson-orin-nx-for-yolov8s/308749
