Please use the template below to ask your question (tick the relevant options after creating the topic):
Jetson module
[*] Jetson AGX Orin
Jetson Orin NX
Jetson Orin Nano
Jetson AGX Xavier
Jetson Xavier NX
Jetson TX series
Jetson Nano
Jetson software
JetPack 5.1.3
JetPack 5.1.4
JetPack 6.0
JetPack 6.1
[*] JetPack 6.2
DeepStream SDK
NVIDIA Isaac
SDK Manager version
2.3.0
2.2.0
2.1.0
[*] Other
Problem description
Hello NVIDIA team,
We are working on deploying a local multimodal RAG pipeline on NVIDIA Jetson AGX Orin 64GB.
Current system environment:
- Device: NVIDIA Jetson AGX Orin 64GB
- Architecture: aarch64 / ARM64
- L4T: R36.4.7
- JetPack: 6.2.1
- CUDA Toolkit: 12.6
- TensorRT: 10.3.0.30
- cuDNN: 9.3
- Python: 3.10
- PyTorch: 2.10.0
- Transformers: 4.57.6
- Triton Server: 2.49.0, TensorRT backend works
- TensorRT-LLM: not installed
What already works:
- Qwen3-Embedding-0.6B
  - Exported to ONNX
  - Built TensorRT engine
  - Deployed on Triton TensorRT backend
  - Verified FP32 output, no NaN
- Qwen3-Reranker-0.6B
  - Exported to ONNX
  - Built TensorRT engine
  - Deployed on Triton TensorRT backend
  - Verified output shape [batch, 1]
- Qwen3-VL-2B-Instruct
  - Runs locally with PyTorch CUDA on Jetson
  - Used as image captioner/compliance checker
  - Works, but not accelerated by TensorRT-LLM yet
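For reference, the two output checks mentioned above (no NaN in the embedding output, shape [batch, 1] for the reranker) can be sketched roughly as follows. This is only an illustration, not the exact script we ran: it assumes the model outputs have already been converted to plain nested Python lists (for example via `.tolist()` on a NumPy array), and the function names and sample values are hypothetical.

```python
import math


def all_finite(rows):
    """Return True if every value in a 2-D nested list is neither NaN nor infinite."""
    return all(math.isfinite(value) for row in rows for value in row)


def is_batch_by_one(rows):
    """Return True if the nested list has shape [batch, 1] with batch >= 1."""
    return len(rows) > 0 and all(len(row) == 1 for row in rows)


# Hypothetical sample outputs standing in for real Triton responses.
embedding_output = [[0.12, -0.07, 0.33], [0.01, 0.25, -0.19]]
reranker_output = [[0.91], [0.08]]

print(all_finite(embedding_output))
print(is_batch_by_one(reranker_output))
```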
Target models we want to deploy with TensorRT-LLM:
- Qwen/Qwen3-VL-2B-Instruct
  - Architecture: Qwen3VLForConditionalGeneration
  - Multimodal model: Vision Encoder + LLM Decoder
  - model_type: qwen3_vl
- Qwen/Qwen3.5-2B
  - Architecture: Qwen3_5ForConditionalGeneration
  - Multimodal model: Vision Encoder + LLM Decoder
  - model_type: qwen3_5
  - Text part includes linear_attention / Gated DeltaNet style layers
According to the latest TensorRT-LLM source and documentation, Qwen3VLForConditionalGeneration appears to be supported in the newer TensorRT-LLM PyTorch backend.
However, the current Jetson-compatible TensorRT-LLM branch, such as v0.12.0-jetson, seems to match TensorRT 10.3 but does not support Qwen3-VL or Qwen3.5.
The newer TensorRT-LLM main branch seems to require a newer software stack, approximately:
- CUDA 13.x
- TensorRT 10.15.x
- cuda-python >= 13
- torch >= 2.10.0, <= 2.11.0a0
- transformers == 5.5.4
- nvidia-modelopt[torch] ~= 0.37.0
- triton == 3.6.0
- flashinfer-python == 0.6.11.post1
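To make the gap concrete, here is a minimal sketch that compares the versions installed on our device with the approximate requirements listed above. The version strings come from this post. Treating "13.x" as a minimum of 13.0, "10.15.x" as a minimum of 10.15, and the exact `transformers == 5.5.4` pin as a minimum are simplifying assumptions, and the helper names `parse` and `gaps` are hypothetical.

```python
import re

# Versions reported in this post for the Jetson AGX Orin (JetPack 6.2.1).
installed = {
    "cuda": "12.6",
    "tensorrt": "10.3.0.30",
    "torch": "2.10.0",
    "transformers": "4.57.6",
}

# Approximate minimums taken from the list above (assumption: "13.x" and
# "10.15.x" are read as 13.0 and 10.15, and exact pins are read as minimums).
required_min = {
    "cuda": "13.0",
    "tensorrt": "10.15",
    "torch": "2.10.0",
    "transformers": "5.5.4",
}


def parse(version, width=4):
    """Turn '10.3.0.30' into (10, 3, 0, 30), padding short versions with zeros."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (width - len(parts))
    return tuple(parts[:width])


def gaps(installed, required_min):
    """Return the sorted names of components older than the required minimum."""
    return sorted(
        name
        for name, minimum in required_min.items()
        if parse(installed[name]) < parse(minimum)
    )


for name in gaps(installed, required_min):
    print(f"{name}: installed {installed[name]}, needs >= {required_min[name]}")
```

With the values above, only `torch` already meets the listed requirement; CUDA, TensorRT, and Transformers are all older than what the main branch appears to expect.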
Questions:
- On Jetson AGX Orin with L4T R36.4.7 / JetPack 6.2.1, is it officially supported or technically possible to manually install CUDA 13.x and TensorRT 10.15.x?
- If yes, is there a recommended installation procedure for Jetson AGX Orin?
- Is there a Jetson-compatible TensorRT-LLM version or wheel that supports Qwen3VLForConditionalGeneration?
- Is Qwen/Qwen3-VL-2B-Instruct deployment with TensorRT-LLM officially supported on Jetson AGX Orin?
- Is Qwen/Qwen3.5-2B deployment with TensorRT-LLM supported on Jetson AGX Orin?
- If direct TensorRT-LLM deployment is not currently supported, what is the recommended deployment path for these multimodal models on Jetson Orin?
  - PyTorch backend?
  - vLLM/SGLang?
  - wait for future JetPack/TensorRT-LLM release?
  - Docker container?
  - build TensorRT-LLM from source?
- If manual mixing of CUDA/TensorRT versions is not recommended, please confirm the reason, especially regarding driver/L4T compatibility and TensorRT ABI compatibility.
We would appreciate any official guidance or reference workflow.