This article was written with the assistance of AI.
Why BOM Freeze Is the Last Low-Cost Decision Point
During early development, your perception team’s VLM roadmap is taking shape — and the memory requirements are beginning to exceed what the Orin NX in your preliminary BOM can support. Upgrading to a higher-end platform could solve the issue, but switching at this stage is not a simple component change: it means redesigning the carrier board, thermal envelope, power delivery, software stack, and validation plan together. The hardware isn’t frozen yet. But the window to change it without a full redesign is closing.
That is why the platform choice must be evaluated against the 18–24-month production roadmap, not the current MVP scope. Deferring it converts a planning choice into a rework event.
Platform transition — Jetson lineup (2025–2026)
Jetson AGX Orin (Ampere architecture, 2022): CNN-centric edge AI platform, 15–60 W configurable, up to 275 sparse INT8 TOPS at MAXN[1].
Jetson AGX Thor (Blackwell architecture, 2025): Physical AI platform for VLM/VLA-enabled robotics, humanoid control, and on-device LLM inference. Dev kit GA: August 2025; T5000 production module GA August 2025, available from worldwide distribution partners[2].
Scope note. This guide covers on-device inference for the Orin and AGX Thor families. Cloud vs. edge evaluation is out of scope.
TL;DR — Two decisions, one window:
- Decision 1 (platform): Commit to AGX Thor if the 2-year roadmap includes VLM, VLA, or on-device LLM; otherwise Orin. Getting this wrong triggers carrier board redesign, JetPack migration, and re-certification; see Annotation D for the cost estimate derivation.
- Decision 2 (optimization): Budget 4–8 engineer-weeks per platform generation for TensorRT pipeline work before hardware integration. Without it, production latency will exceed theoretical estimates, and the shortfall tends to surface at hardware integration, when it is most expensive to fix.
Decision 1 — Platform: Jetson AGX Orin or AGX Thor
Physical AI workloads, as used throughout this article: on-device LLM, VLM, and VLA inference. VLM = Vision-Language Model; VLA = Vision-Language-Action; LLM = text-only Large Language Model.
| Decision Axis | Choose Orin | Choose AGX Thor |
|---|---|---|
| Workload | CNN perception (detection, segmentation, tracking) [1] | Physical AI workloads — on-device LLM, VLM, or VLA — now or within program roadmap [2] |
| Concurrency | Single or few models; Orin unified memory sufficient | Multiple concurrent models (perception + language + planning); Orin memory is the binding constraint |
| Thermal | System thermal cap below ~40 W; Orin is configurable from 15–60 W [1] | Thermal budget of 40 W or more; Thor is configurable from 40–130 W [2] |
| Cost | Production module from $1,599 (1,000-unit pricing; verify current pricing with distributor). Dominant constraint at 200+ unit volume | Production module from $2,999 (1,000-unit pricing). |
| Maturity | Mature platform; production-grade tooling available | Dev kit and T5000 module GA August 2025; JetPack 7.0 production release August 2025, JetPack 7.1 released January 2026; toolchain stabilizing — verify TensorRT-LLM and Transformer Engine version compatibility for your target use case. |
| Engine assets | Substantial TensorRT calibration investment to preserve | Starting fresh; no Orin engine debt |
Hybrid trigger. Choose a hybrid approach when some subsystems require Physical AI workloads, while others can be handled with CNN-only processing. This is especially appropriate when certification requirements or unit-cost considerations differ significantly across subsystems.
The transition risk. Run the platform decision against the 2-year workload roadmap, not the current MVP. A program selecting AGX Orin 64GB in 2026 for a workload requiring VLM inference by 2028 is scheduling its own hardware-refresh cycle.
Decision checklist (three questions):
- Question 1: Does the 18–24-month program roadmap include VLM, VLA, or on-device LLM?
- Question 2: Is the system thermal cap below 40 W?
- Question 3: Are workload classes mixed across subsystems (CNN perception and physical AI co-deployed)?
Possible outcomes: Jetson AGX Orin, Jetson AGX Thor, or Hybrid Architecture.
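The three screening questions above can be sketched as a small decision function. This is a minimal sketch, not the article's original interactive logic: the source lists only the questions and the three outcomes, so the branch order (Q1, then Q2, then Q3) and all names (`ProgramProfile`, `recommend_platform`) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgramProfile:
    """Answers to the three screening questions (hypothetical field names)."""
    roadmap_has_physical_ai: bool  # Q1: VLM, VLA, or on-device LLM within 18-24 months
    thermal_cap_below_40w: bool    # Q2: system thermal cap below 40 W
    mixed_workload_classes: bool   # Q3: CNN perception and physical AI co-deployed


def recommend_platform(profile: ProgramProfile) -> str:
    """Map the three answers to one of the three outcomes.

    ASSUMPTION: the questions are evaluated in order Q1 -> Q2 -> Q3.
    """
    if not profile.roadmap_has_physical_ai:
        return "Jetson AGX Orin"
    if profile.thermal_cap_below_40w:
        # AGX Thor is configurable from 40 W upward, so a cap below 40 W
        # points back to Orin. The roadmap and the thermal budget conflict
        # here and should be reviewed rather than accepted blindly.
        return "Jetson AGX Orin"
    if profile.mixed_workload_classes:
        return "Hybrid Architecture"
    return "Jetson AGX Thor"


if __name__ == "__main__":
    print(recommend_platform(ProgramProfile(True, False, False)))
```

Treat the result as a starting point for the platform review, not as a substitute for the decision table.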
The Jetson AGX Orin Architecture: Ampere for Real-Time Perception
NVIDIA's Jetson AGX Orin product page specifies the following [1]:
| Module | GPU | DLA | Unified Memory | Memory BW | Max TDP |
|---|---|---|---|---|---|
| Jetson AGX Orin 64GB | 2048 CUDA + 64 Tensor Cores (Ampere) | 2× DLA 2.0 | 64 GB LPDDR5 | 204.8 GB/s | 60 W (MAXN) |
| Jetson AGX Orin 32GB | 1792 CUDA + 56 Tensor Cores (Ampere) | 2× DLA 2.0 | 32 GB LPDDR5 | 204.8 GB/s | 40 W (MAXN) |
| Jetson Orin NX 16GB | 1024 CUDA + 32 Tensor Cores (Ampere) | 2× DLA 2.0 | 16 GB LPDDR5 | 102.4 GB/s | 25 W (MAXN) |
| Jetson Orin NX 8GB | 1024 CUDA + 32 Tensor Cores (Ampere) | 1× DLA 2.0 | 8 GB LPDDR5 | 102.4 GB/s | 20 W (MAXN) |
All figures at MAXN (maximum performance) power mode. Orin supports lower configurable TDP modes; verify configurable envelope against the current Jetson AGX Orin Series datasheet before finalizing thermal design [1].
The DLA (Deep Learning Accelerator) handles convolutions, pooling, activations, and batch normalization at higher energy efficiency than the GPU. Running DLA for the backbone and GPU for detection heads concurrently maximizes INT8 throughput per watt. To identify which layers fall back from DLA to GPU, run trtexec with --dumpLayerInfo.
Profile on target hardware before committing to the DLA+GPU parallel execution pattern. [Annotation B]
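As a minimal sketch of the DLA fallback check, the helper below assembles a trtexec command line that targets one DLA core, allows GPU fallback, and prints per-layer information. The function name and the `backbone.onnx` path are hypothetical. `--int8` without a calibration cache is suitable only for inspecting layer placement, not for accuracy evaluation. Verify the flags against the trtexec version shipped with your JetPack release.

```python
import shlex


def build_trtexec_dla_cmd(onnx_path: str, dla_core: int = 0) -> list[str]:
    """Return a trtexec argv that builds an engine on one DLA core
    and prints per-layer information for fallback inspection."""
    return [
        "trtexec",
        f"--onnx={onnx_path}",            # hypothetical model path supplied by the caller
        f"--useDLACore={dla_core}",       # target DLA core
        "--allowGPUFallback",             # unsupported layers run on the GPU
        "--int8",                         # placement check only; no calibration here
        "--profilingVerbosity=detailed",  # include detailed layer metadata
        "--dumpLayerInfo",                # print layer information of the built engine
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed and run on the target device.
    print(shlex.join(build_trtexec_dla_cmd("backbone.onnx")))
```

Run the printed command on the target module. Layers reported as running on the GPU rather than the DLA are the fallback candidates to review.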
The Jetson AGX Thor Architecture: Blackwell for Physical AI Workloads
AGX Thor's value proposition is not faster CNNs — it is physical AI workloads that previously did not fit on Orin: large VLMs, multi-billion-parameter LLMs, and VLA policies.
| Module | GPU | Unified Memory | Memory BW | TDP Range |
|---|---|---|---|---|
| AGX Thor dev kit | Blackwell GPU, Transformer Engine, up to 2070 FP4 TFLOPS sparse [2] | 128 GB LPDDR5X | 273 GB/s | 40–130 W |
| T5000 (production module) | Blackwell GPU, same GPU class as dev kit [2] | 128 GB LPDDR5X | 273 GB/s | 40–130 W |
| Metric | AGX Orin 64 | AGX Thor | Ratio |
|---|---|---|---|
| Peak compute (sparse) | 275 sparse INT8 TOPS (MAXN, 60 W) [1] | Up to 2070 sparse FP4 TFLOPS (130 W) [2] | ~7.5× (not like-for-like) |
| Dense-to-dense (planning estimate only) | ~138 INT8-equiv. TOPS | ~517 INT8-equiv. TOPS | ~2–4× (see Annotation C) |
| Memory bandwidth | 204.8 GB/s [1] | 273 GB/s [2] | ~1.33× vs. AGX Orin 64 |
On the 7.5× headline — not like-for-like. NVIDIA's stated ~7.5× compares peak sparse FP4 on AGX Thor against peak sparse INT8 on AGX Orin [2]. A planning estimate for dense-to-dense compute uplift is ~2–4× (planning estimate only — see Annotation C). NVIDIA has not published a like-for-like dense figure for AGX Thor as of Q1 2026.
AGX Thor delivers five architectural advances over Orin 64 [2][3]:
- Blackwell GPU with Transformer Engine. Narrows the CNN-vs-transformer latency gap relevant to VLM/VLA workloads.
- 128 GB LPDDR5X unified memory. 2× capacity vs. AGX Orin 64. Enables model sizes and concurrent workloads infeasible on Orin NX 16GB.
- Arm Neoverse-V3AE CPU. Generational step up from Cortex-A78AE on Orin.
- Expanded power envelope (40–130 W configurable). 40 W keeps AGX Thor viable for mobile robotics; 130 W exposes full physical AI throughput [2].
- Native FP4 via Transformer Engine. First Jetson to execute FP4 natively in hardware [2].
Model Architecture Choices: CNN vs. Transformer
On Orin. At standard robotics resolutions (640×480 and above), multi-head self-attention is bandwidth-bound on Ampere. Attention's QKV matmuls and softmax have data dependencies that limit layer fusion opportunities in TensorRT, and the memory-bound softmax and LayerNorm steps dominate latency without a dedicated Transformer Engine. CNN architectures — YOLOv8, EfficientDet, RT-DETR with CNN backbone — typically deliver lower latency at equivalent accuracy for real-time perception on Orin.
On AGX Thor. The Blackwell Transformer Engine substantially narrows the CNN-vs-transformer latency gap. ViT perception models, VLMs, and VLA policies are where AGX Thor's compute and bandwidth advantages show up most clearly [2][3]. If your roadmap includes replacing a CNN stack with a multimodal transformer or VLA policy within 18 months, AGX Thor is the right design target now — even if the first-generation deployment runs a CNN.
Model size requires full conditions, not a parameter count. "7B model" is incomplete without precision, context length, batch size, and concurrent workload count. At FP16, a 7B model requires approximately 14 GB for weights alone, before KV-cache and activations. At INT4 weight-only quantization, the same model compresses to approximately 3.5–4 GB for weights.
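The weight figures above follow directly from parameter count times bytes per weight. The sketch below reproduces them and adds a KV-cache estimate. The KV-cache shape (32 layers, 32 KV heads, head dimension 128, 4096-token context) is an assumed example of a typical 7B decoder, not a figure from this article; substitute your model's actual configuration.

```python
GB = 1e9  # decimal gigabytes, matching the figures in the text


def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Memory for weights alone, ignoring quantization scales and zero-points."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size; the leading 2 accounts for keys and values."""
    return 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per_elem


fp16_weights = weight_bytes(7e9, 16)
int4_weights = weight_bytes(7e9, 4)
# ASSUMED model shape for illustration only.
kv_fp16 = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                         context_len=4096, batch=1)

print(f"FP16 weights: {fp16_weights / GB:.1f} GB")
print(f"INT4 weights: {int4_weights / GB:.1f} GB (before scales and zero-points)")
print(f"FP16 KV-cache at 4096 tokens, batch 1: {kv_fp16 / GB:.2f} GB")
```

Scales and zero-points account for the gap between the raw 3.5 GB and the 3.5–4 GB range quoted for INT4. KV-cache grows linearly with both context length and batch size, which is why a parameter count alone does not determine fit.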
Decision 2 — Optimization Investment: Cost and Schedule Impact
On paper, the gap between theoretical and production latency looks like a tuning problem. In practice, it is a budget and schedule risk that surfaces at the worst moment in the program — hardware integration.
The business case is asymmetric: planned optimization cost is bounded. Deferred cost scales with when the problem is discovered.
The Production Latency Gap
End-to-end pipeline latency on Jetson-class hardware consistently exceeds GPU compute ceilings. Preprocessing, scheduling, memory movement, and kernel launch overhead are additive costs outside the inference engine — and they are not visible in GFLOPs-derived estimates.
One community-measured data point illustrates the magnitude: YOLOv8s on Jetson Orin NX 16GB measured 7.94 ms end-to-end (FP16, 640×640 input, batch 1, full pre/postprocessing pipeline included) against a GFLOPs-derived theoretical estimate of ~1.1 ms, approximately a 7× ratio for this specific configuration [7]. NVIDIA's MLPerf figures for ResNet-50 (0.64 ms INT8) and RetinaNet (11.67 ms INT8) on AGX Orin represent single-model optimized floors under MAXN conditions, not production pipeline estimates [4].
Gap between theoretical and measured latency = your optimization budget.
| Model | Precision | Latency | Type | Conditions |
|---|---|---|---|---|
| YOLOv8s (theoretical) | FP16 | ~1.1 ms | ⚠️ Do not use for planning | GFLOPs-derived only [7] |
| YOLOv8s | FP16 | 7.94 ms | Full pipeline, community-measured | Orin NX 16GB; pre/postprocessing included [7] |
| ResNet-50 | INT8 | 0.64 ms | MLPerf optimized floor | AGX Orin; MAXN; single-model [4] |
| RetinaNet | INT8 | 11.67 ms | MLPerf optimized floor | AGX Orin; MAXN; single-model [4] |
The gap between the optimized floor and full pipeline latency is your optimization budget.
Three factors consistently widen this gap on Jetson:
- Bandwidth as binding constraint. Transformer-heavy and large-activation workloads are memory-bandwidth-bound. AGX Thor's 273 GB/s widens the envelope, but large VLMs will still saturate it at inference scale.
- Unified memory contention. GPU and CPU share the same DRAM pool. CPU preprocessing draws from the same LPDDR5 as the inference engine — profile this contention with Nsight Systems before committing to a memory layout.
- P99 latency, not mean. A single deadline miss that triggers an emergency stop is a safety event. Development-server mean latency consistently understates P99 exposure on Jetson.
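To make the mean-versus-P99 point concrete, the sketch below summarizes a list of per-frame latencies with the nearest-rank percentile method. The sample values and the 33.3 ms deadline (a 30 FPS frame budget) are synthetic and chosen for illustration; they are not measurements from any Jetson module.

```python
import statistics


def latency_summary(samples_ms: list[float], deadline_ms: float) -> dict[str, float]:
    """Summarize per-frame latencies; P99 uses the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return {
        "mean_ms": statistics.fmean(ordered),
        "p99_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
        "deadline_misses": sum(1 for s in ordered if s > deadline_ms),
    }


# SYNTHETIC samples: 98 nominal frames and 2 stalled frames.
samples = [8.0] * 98 + [40.0] * 2
summary = latency_summary(samples, deadline_ms=33.3)
print(summary)
```

In this synthetic trace the mean is 8.64 ms, comfortably inside the deadline, while P99 is 40 ms and two frames miss it. Budgeting against the mean alone would hide both misses.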
Before optimizing anything: run a 60-second Nsight Systems capture. It reveals whether the bottleneck is bandwidth, compute, or pipeline stalls — and determines whether TensorRT calibration, memory layout changes, or CUDA stream restructuring is the highest-leverage fix.
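One way to script the 60-second capture is to assemble the Nsight Systems command line programmatically so it can be versioned with the pipeline. This is a sketch under assumptions: the application path `./perception_pipeline` and the output name are hypothetical, and the trace set (CUDA, NVTX, OS runtime) is a common starting point rather than a recommendation from the source. Check the options against your installed nsys version.

```python
import shlex


def build_nsys_capture_cmd(app_argv: list[str], seconds: int = 60,
                           output: str = "pipeline_capture") -> list[str]:
    """Return an `nsys profile` argv for a fixed-duration capture."""
    return [
        "nsys", "profile",
        f"--duration={seconds}",   # stop collection after this many seconds
        "--trace=cuda,nvtx,osrt",  # CUDA API, NVTX ranges, OS runtime calls
        f"--output={output}",      # report file name (hypothetical)
        *app_argv,                 # the application under test
    ]


if __name__ == "__main__":
    # Hypothetical pipeline binary; replace with your own launch command.
    print(shlex.join(build_nsys_capture_cmd(["./perception_pipeline"])))
```

Adding NVTX ranges around preprocessing, inference, and postprocessing makes the resulting timeline much easier to attribute to pipeline stages.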
Precision Selection and Pipeline Reference
The gap is real. The next question is which precision to target to close it.
The decision comes down to workload:
- CNN perception only → INT8, Orin or Thor
- LLM/VLM on-device inference → FP4, AGX Thor only
CNN Perception (Detection / Segmentation / Tracking)
| Precision | Orin | Thor | Note |
|---|---|---|---|
| INT8 | ✓ Recommended [1] | ✓ [2] | Production default for perception stacks |
| FP16 | ✓ [1] | ✓ [2] | Use when INT8 accuracy loss is unacceptable |
LLM / VLM Inference
| Precision | Orin | Thor | Note |
|---|---|---|---|
| FP16 | △ [1] | ✓ [2] | Orin: feasible only for a 7B model at batch 1 with no concurrent workload |
| INT4 (weight-only) | △ [1] | ✓ [2] | Orin: validate on target hardware; dequantization overhead applies |
| FP8 | ✗ | ✓ [2] | Boundary layers; middle ground between FP16 and FP4 for transformers |
| FP4 | ✗ | ✓ Recommended [2] | Transformer body; AGX Thor only via Blackwell Transformer Engine |
Concurrent stack note. Running CNN and LLM/VLM simultaneously on AGX Thor: use INT8 for the perception stack and FP4 for the transformer body. Profile unified memory contention with Nsight Systems — both workloads draw from the same LPDDR5X pool.
Act Before BOM Freeze
Two decisions share the same window — and both must be resolved before BOM Freeze.
Decision 1 — Platform: Commit to AGX Thor if the 2-year roadmap includes VLM, VLA, or on-device LLM. Otherwise Orin. Getting this wrong triggers carrier board redesign, JetPack migration, and re-certification — $120K–$320K and a schedule slip that compounds across every affected component.
Decision 2 — Optimization: Budget 4–8 engineer-weeks for TensorRT pipeline build, INT8/FP4 calibration, and CUDA stream profiling before hardware integration. At 200+ unit production volume, a single hardware tier reduction enabled by optimization typically covers the full $24K–$80K investment.
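The break-even claim can be checked with simple arithmetic. The sketch below uses the 1,000-unit list prices from the Cost row of the decision table ($2,999 and $1,599) as an illustrative per-unit saving. That pairing is an assumption: the tier reduction your optimization actually enables may be between different modules with a different price delta.

```python
import math


def breakeven_units(investment_usd: float, saving_per_unit_usd: float) -> int:
    """Smallest unit count at which per-unit savings cover the investment."""
    if saving_per_unit_usd <= 0:
        raise ValueError("saving per unit must be positive")
    return math.ceil(investment_usd / saving_per_unit_usd)


# ASSUMED per-unit saving: list-price delta from the Cost row.
saving_per_unit = 2999 - 1599

units_low = breakeven_units(24_000, saving_per_unit)   # low end of the investment
units_high = breakeven_units(80_000, saving_per_unit)  # high end of the investment
saving_at_200 = 200 * saving_per_unit

print(f"Break-even: {units_low} to {units_high} units")
print(f"Saving at 200 units: ${saving_at_200:,}")
```

Under these assumed prices the investment is recovered between 18 and 58 units, and the saving at 200 units is $280,000, which is consistent with the claim that the investment is covered at 200+ unit volume.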
Before BOM Freeze, three actions are required:
- Profile your 2-year workload roadmap against the CNN-only vs. LLM/VLM criterion — not the first-release scope.
- Validate module selection against optimized INT8 benchmarks for Orin or FP4/FP8 benchmarks for AGX Thor — not GFLOPs-derived estimates.
- Run a 60-second Nsight Systems capture on your current pipeline to identify the binding bottleneck.
Deferring any of these converts a planning choice into a rework event.
All cost figures are planning estimates. See Annotation D for full derivation.
Annotations
Annotation A. The 32GB module is a reduced configuration relative to the 64GB: 1792 CUDA cores vs. 2048, max GPU frequency 930 MHz vs. 1.3 GHz, 8-core vs. 12-core CPU, 40 W vs. 60 W max TDP, and 200 vs. 275 INT8 TOPS at MAXN [1]. The 32GB module is appropriate for systems with a hard 40 W thermal ceiling and does not deliver the 64GB’s MAXN-equivalent throughput.
Annotation B. To identify DLA layers that fall back to GPU: trtexec --dumpLayerInfo. Profile on target hardware before committing to the DLA+GPU parallel execution pattern. AGX Thor does not include a DLA [2]; Orin DLA offload paths have no direct equivalent on AGX Thor, so validate CNN throughput on AGX Thor target hardware independently [5].
Annotation C. 2070 sparse FP4 TFLOPS ÷ 2 = ~1035 dense FP4 TFLOPS. Treating Blackwell FP4 as ~2× INT8 per tensor core: 1035 ÷ 2 ≈ ~517 INT8-equivalent TOPS. AGX Orin 64: 275 sparse INT8 TOPS ÷ 2 = ~138 INT8-equivalent TOPS. Ratio: ~3.7×; "2–4×" range accommodates SM efficiency variation.
NVIDIA has not published a like-for-like dense figure for AGX Thor as of Q1 2026. Do not use this estimate as a throughput guarantee in production planning.
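The arithmetic behind the headline and dense-to-dense ratios can be reproduced in a few lines. The sketch below uses only the published peak figures quoted in this article plus the two stated planning assumptions (sparse is 2× dense, and FP4 is 2× INT8 per tensor core). The function name is hypothetical.

```python
def compute_ratios() -> dict[str, float]:
    """Recompute the Thor-vs-Orin ratios from the quoted peak figures."""
    thor_sparse_fp4 = 2070.0   # TFLOPS, AGX Thor at 130 W
    orin_sparse_int8 = 275.0   # TOPS, AGX Orin 64GB at MAXN

    # PLANNING ASSUMPTIONS: sparse = 2x dense; FP4 = 2x INT8 per tensor core.
    thor_dense_fp4 = thor_sparse_fp4 / 2
    thor_int8_equiv = thor_dense_fp4 / 2
    orin_dense_int8 = orin_sparse_int8 / 2

    return {
        "headline": thor_sparse_fp4 / orin_sparse_int8,
        "thor_int8_equiv_tops": thor_int8_equiv,
        "orin_dense_int8_tops": orin_dense_int8,
        "dense_ratio": thor_int8_equiv / orin_dense_int8,
        "bandwidth_ratio": 273.0 / 204.8,
    }


ratios = compute_ratios()
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

With unrounded inputs the dense-to-dense ratio comes out near 3.76, in line with the ~3.7× quoted above from rounded values, while memory bandwidth grows only about 1.33×. That mismatch is why bandwidth-bound workloads should not be planned against the compute ratio.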
Annotation D. No public standard rate card exists for this specific hardware redesign scenario. All figures are internal calculation-based estimates.
- Optimization investment: 4–8 engineer-weeks × 40 hrs/week × $150–$250/hr = $24,000–$80,000.
- Mid-program redesign (labor): 8–16 engineer-weeks × 40 hrs/week × $150–$250/hr = $48,000–$160,000.
- Full redesign ($120,000–$320,000): Adds carrier-board redesign, re-certification, and integration testing overhead. Treat as order-of-magnitude planning figure only.
- Rates are directional for embedded-GPU specialist labor in US/EU markets.
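The labor ranges above are the product of effort, hours, and rate at each end of the range. A minimal sketch of that calculation follows; the function name is hypothetical, and the inputs are the planning figures stated in this annotation.

```python
def labor_cost_range(weeks: tuple[int, int],
                     rate_usd_per_hr: tuple[int, int] = (150, 250),
                     hours_per_week: int = 40) -> tuple[int, int]:
    """Return (low, high) labor cost: low effort at low rate, high at high."""
    low = weeks[0] * hours_per_week * rate_usd_per_hr[0]
    high = weeks[1] * hours_per_week * rate_usd_per_hr[1]
    return low, high


optimization = labor_cost_range((4, 8))   # planned optimization work
redesign = labor_cost_range((8, 16))      # mid-program redesign, labor only

print(f"Optimization: ${optimization[0]:,} to ${optimization[1]:,}")
print(f"Redesign labor: ${redesign[0]:,} to ${redesign[1]:,}")
```

The full-redesign range of $120,000–$320,000 is not derived from this formula; it adds non-labor items and should be read as an order-of-magnitude figure, as stated above.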
References
[1] https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/
[2] https://nvidianews.nvidia.com/news/nvidia-blackwell-powered-jetson-thor-now-available-accelerating-the-age-of-general-robotics
[3] https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-thor/
[4] https://developer.nvidia.com/embedded/jetson-benchmarks
[5] https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/
[6] https://www.jetson-ai-lab.com/tutorials/genai-benchmarking/
[7] https://forums.developer.nvidia.com/t/nderstanding-real-world-latency-vs-theoretical-estimates-on-jetson-orin-nx-for-yolov8s/308749