[Jetson AGX Orin] Can we manually install CUDA 13 / TensorRT 10.15 / TensorRT-LLM for Qwen3-VL?

Please use the template below when asking a question (tick the relevant options after creating the topic):
Jetson module
[*] Jetson AGX Orin
Jetson Orin NX
Jetson Orin Nano
Jetson AGX Xavier
Jetson Xavier NX
Jetson TX series
Jetson Nano

Jetson software
JetPack 5.1.3
JetPack 5.1.4
JetPack 6.0
JetPack 6.1
[*] JetPack 6.2
DeepStream SDK
NVIDIA Isaac

SDK Manager version
2.3.0
2.2.0
2.1.0
[*] Other

Problem description
Hello NVIDIA team,

We are working on deploying a local multimodal RAG pipeline on NVIDIA Jetson AGX Orin 64GB.

Current system environment:

  • Device: NVIDIA Jetson AGX Orin 64GB

  • Architecture: aarch64 / ARM64

  • L4T: R36.4.7

  • JetPack: 6.2.1

  • CUDA Toolkit: 12.6

  • TensorRT: 10.3.0.30

  • cuDNN: 9.3

  • Python: 3.10

  • PyTorch: 2.10.0

  • Transformers: 4.57.6

  • Triton Server: 2.49.0, TensorRT backend works

  • TensorRT-LLM: not installed
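
For reference, the Python-side entries in the list above can be collected with a short script. This is a minimal sketch using only the standard library; the distribution names in `PACKAGES` are assumptions and may differ from what pip reports on a given Jetson image. It does not cover L4T, CUDA Toolkit, or cuDNN, which are installed outside of pip.

```python
from importlib import metadata
import platform

# Distribution names are assumptions; adjust them to match `pip list`.
PACKAGES = ["torch", "transformers", "tensorrt", "tensorrt_llm"]


def installed_version(name: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_environment(packages=PACKAGES) -> dict:
    """Gather architecture, Python version, and package versions."""
    report = {
        "architecture": platform.machine(),
        "python": platform.python_version(),
    }
    for name in packages:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Missing packages are reported as "not installed" rather than raising, so the same script can be run before and after any attempted upgrade.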

What already works:

  1. Qwen3-Embedding-0.6B

    • Exported to ONNX
    • Built TensorRT engine
    • Deployed on Triton TensorRT backend
    • Verified FP32 output, no NaN
  2. Qwen3-Reranker-0.6B

    • Exported to ONNX
    • Built TensorRT engine
    • Deployed on Triton TensorRT backend
    • Verified output shape [batch, 1]
  3. Qwen3-VL-2B-Instruct

    • Runs locally with PyTorch CUDA on Jetson
    • Used as image captioner/compliance checker
    • Works, but not accelerated by TensorRT-LLM yet

Target models we want to deploy with TensorRT-LLM:

  1. Qwen/Qwen3-VL-2B-Instruct

    • Architecture: Qwen3VLForConditionalGeneration
    • Multimodal model: Vision Encoder + LLM Decoder
    • model_type: qwen3_vl
  2. Qwen/Qwen3.5-2B

    • Architecture: Qwen3_5ForConditionalGeneration
    • Multimodal model: Vision Encoder + LLM Decoder
    • model_type: qwen3_5
    • Text part includes linear_attention / Gated DeltaNet style layers
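
The architecture and model_type values above are read from each model's config.json. A minimal sketch of that check, assuming a Hugging Face style config with `architectures` and `model_type` fields; the `supported` set passed in is a hypothetical allow-list for illustration, not something taken from the TensorRT-LLM source:

```python
import json


def summarize_config(config_text: str, supported: set) -> dict:
    """Report model_type, architectures, and which ones are not in `supported`."""
    config = json.loads(config_text)
    architectures = config.get("architectures", [])
    return {
        "model_type": config.get("model_type"),
        "architectures": architectures,
        # Architectures missing from the (hypothetical) allow-list.
        "unsupported": [a for a in architectures if a not in supported],
    }


if __name__ == "__main__":
    # Inline stand-in for the relevant part of a config.json file.
    sample = '{"architectures": ["Qwen3_5ForConditionalGeneration"], "model_type": "qwen3_5"}'
    # Hypothetical allow-list; the real one depends on the TensorRT-LLM version.
    print(summarize_config(sample, {"Qwen3VLForConditionalGeneration"}))
```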

According to the latest TensorRT-LLM source and documentation, Qwen3VLForConditionalGeneration appears to be supported in the newer TensorRT-LLM PyTorch backend.

However, the current Jetson-compatible TensorRT-LLM branch (for example v0.12.0-jetson) appears to target TensorRT 10.3 and does not support Qwen3-VL or Qwen3.5.

The newer TensorRT-LLM main branch seems to require a newer software stack, approximately:

  • CUDA 13.x
  • TensorRT 10.15.x
  • cuda-python >= 13
  • torch >= 2.10.0, <= 2.11.0a0
  • transformers == 5.5.4
  • nvidia-modelopt[torch] ~= 0.37.0
  • triton == 3.6.0
  • flashinfer-python == 0.6.11.post1
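
To make the gap concrete, the installed CUDA and TensorRT versions can be compared against the approximate minimums listed above. This is a minimal sketch; the helper names are illustrative, and the minimums "13" and "10.15" are our reading of the requirements, not confirmed values. Versions are compared as integer tuples because a plain string comparison would rank "10.3" above "10.15".

```python
def parse_version(text: str) -> tuple:
    """Turn '10.3.0.30' into (10, 3, 0, 30), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, required: str) -> bool:
    """True if the installed version is at least the required version."""
    return parse_version(installed) >= parse_version(required)


def find_gaps(installed: dict, required: dict) -> list:
    """Return the component names whose installed version is below the minimum."""
    return [
        name
        for name, minimum in required.items()
        if not meets_minimum(installed.get(name, "0"), minimum)
    ]


# Installed versions from the environment list; minimums are approximate.
INSTALLED = {"CUDA": "12.6", "TensorRT": "10.3.0.30"}
REQUIRED = {"CUDA": "13", "TensorRT": "10.15"}

if __name__ == "__main__":
    print(find_gaps(INSTALLED, REQUIRED))
```

On our system both components fall below the approximate minimums, which is the reason for the questions that follow.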

Questions:

  1. On Jetson AGX Orin with L4T R36.4.7 / JetPack 6.2.1, is it officially supported or technically possible to manually install CUDA 13.x and TensorRT 10.15.x?

  2. If yes, is there a recommended installation procedure for Jetson AGX Orin?

  3. Is there a Jetson-compatible TensorRT-LLM version or wheel that supports Qwen3VLForConditionalGeneration?

  4. Is Qwen/Qwen3-VL-2B-Instruct deployment with TensorRT-LLM officially supported on Jetson AGX Orin?

  5. Is Qwen/Qwen3.5-2B deployment with TensorRT-LLM supported on Jetson AGX Orin?

  6. If direct TensorRT-LLM deployment is not currently supported, what is the recommended deployment path for these multimodal models on Jetson Orin?

    • PyTorch backend?
    • vLLM/SGLang?
    • Wait for a future JetPack/TensorRT-LLM release?
    • Use a Docker container?
    • Build TensorRT-LLM from source?
  7. If manually mixing CUDA/TensorRT versions is not recommended, please explain why, especially with regard to driver/L4T compatibility and TensorRT ABI compatibility.

We would appreciate any official guidance or reference workflow.