Jetson AGX Thor: extremely slow generation and very low GPU power when deploying Qwen3 / Qwen3.5 locally

Jetson Module
[:white_check_mark:] Jetson Thor

Jetson Software
JetPack 5.1.3
JetPack 5.1.4
JetPack 6.0
JetPack 6.1
[:white_check_mark:] JetPack 6.2
DeepStream SDK
NVIDIA Isaac

SDK Manager Version
2.3.0
2.2.0
2.1.0
[:white_check_mark:] Other (did not use SDK Manager; flashed directly)

Problem Description

Hardware: Jetson AGX Thor Developer Kit, 128 GB unified memory.
Power mode: MAXN (130 W) enabled and sudo jetson_clocks executed.
Software environment: official vLLM NGC container nvcr.io/nvidia/vllm:25.09-py3.
Model: Qwen3-8B-AWQ (FP8 and BF16 variants also tried), placed under the mounted directory /models.
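
Before benchmarking, it may be worth double-checking that the power state actually took effect. A minimal check using the stock JetPack tools (the expected output strings are assumptions based on typical nvpmodel/jetson_clocks output):

```bash
# Query the active power model; it should report the MAXN profile
sudo nvpmodel -q
# Show the current clock state; GPU/EMC frequencies should be pinned at max
sudo jetson_clocks --show
```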

Expected Behavior
According to NVIDIA's official MLPerf data and community benchmarks, an 8B quantized model on Thor should reach a single-request generation throughput of 10-15 tokens/s (BF16 baseline) or higher, with GPU power around 70 W.
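
For a single request, decode speed is roughly bounded by how fast the weights can be streamed from memory, which makes that baseline plausible. A back-of-the-envelope check (assuming AGX Thor's spec-sheet ~273 GB/s LPDDR5X bandwidth; real throughput will be lower due to KV-cache reads and overhead):

```bash
python3 - <<'EOF'
# Upper-bound tokens/s ~= memory bandwidth / bytes of weights read per token
bw = 273e9                     # bytes/s, assumed Thor LPDDR5X bandwidth
for name, bytes_per_param in [("BF16", 2.0), ("FP8", 1.0), ("AWQ int4", 0.5)]:
    weight_bytes = 8e9 * bytes_per_param   # 8B parameters
    print(f"{name}: <= {bw / weight_bytes:.0f} tokens/s upper bound")
EOF
```

By this estimate even BF16 (~17 tokens/s upper bound) should comfortably exceed the observed 4-5 tokens/s, so raw memory bandwidth is unlikely to be the bottleneck.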

Actual Behavior

  • The server starts normally; /v1/chat/completions returns 200 OK.
  • But generation is extremely slow: Avg generation throughput holds steady at 4-5 tokens/s.
  • nvidia-smi reports ~97% GPU utilization, yet GPU power is only 23 W (far below the 70-75 W expected at full load).
  • The vLLM logs show GPU KV cache usage: 0.0% and Prefix cache hit rate: 0.0%.
  • Adjusting --gpu-memory-utilization (0.5-0.9) and --max-model-len (2048-8192) made no difference.
  • Upgrading the transformers library inside the container had no effect.
  • Running sync && echo 3 | sudo tee /proc/sys/vm/drop_caches on the host to clear caches had no effect.
  • Confirmed torch.cuda.is_available() returns True (a raw GEMM power check is sketched after this list).
  • Container start command: docker run -it --runtime=nvidia --gpus all --network host --ipc=host -v /home/xxx/LLM:/models nvcr.io/nvidia/vllm:25.09-py3 bash
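
To separate a vLLM-level problem from a platform-level clock/power problem, a raw compute test inside the same container can be informative. This is only a sketch (matrix size and iteration count are arbitrary); on a healthy MAXN setup a sustained large GEMM should pull GPU power well above the 23 W observed here:

```bash
python3 - <<'EOF'
import time
import torch

assert torch.cuda.is_available()
# Large BF16 matmul to saturate the GPU; watch tegrastats/nvidia-smi meanwhile
a = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
torch.cuda.synchronize()
t0 = time.time()
for _ in range(50):
    a @ b
torch.cuda.synchronize()
dt = time.time() - t0
# 2*M*N*K FLOPs per matmul
print(f"{50 * 2 * 8192**3 / dt / 1e12:.1f} TFLOPS sustained")
EOF
```

If power stays low even here, the issue likely sits below vLLM (clocks, power rails, driver); if it spikes normally, the slowdown is specific to the inference path.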

Steps to Reproduce

  1. Start the container and enter bash.
  2. Run vllm serve /models/Qwen3-8B-AWQ --port 8000 --quantization awq --gpu-memory-utilization 0.8 --max-model-len 4096
  3. From a host terminal, send a test request: curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/models/Qwen3-8B-AWQ", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
  4. Watch Avg generation throughput in the vLLM logs and the power/utilization in nvidia-smi (a streaming variant is sketched below).
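
For step 4, a streaming request makes the per-token rate directly visible instead of waiting for the 10-second log aggregates, and running tegrastats alongside correlates token arrival with GPU power. Both commands use only standard tools:

```bash
# Terminal 1: sample the power rails once per second
sudo tegrastats --interval 1000

# Terminal 2: same request as step 3, but streamed token by token
curl -N -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/models/Qwen3-8B-AWQ", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100, "stream": true}'
```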

Attachments

  • nvidia-smi output screenshot (showing 23 W power, 97% utilization)
  • tegrastats output screenshot (showing VDD_GPU around 23 W)
  • vLLM log snippet: Avg generation throughput: 4.3 tokens/s, GPU KV cache usage: 0.0%

Error Code
No explicit error code; the problem is that performance does not meet expectations.

Error Log
(vLLM logs from server startup and request handling, for example:)
(APIServer pid=3291) INFO: Started server process [3291]
(APIServer pid=3291) INFO: Waiting for application startup.
(APIServer pid=3291) INFO: Application startup complete.
(APIServer pid=3291) INFO 04-10 19:29:54 [chat_utils.py:470] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=3291) INFO 04-10 19:29:54 [loggers.py:123] Engine 000: Avg prompt throughput: 2.1 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:30:04 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:30:14 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:30:24 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:30:34 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:30:44 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:30:54 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:31:04 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:31:14 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:31:24 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:31:34 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:31:44 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%

Additionally, when serving quantized models such as Qwen3-8B-FP8 or qwen-27B-FP8, single requests with a short context sent to vLLM frequently stall, with the reported generation speed stuck at 0 tokens/s (a debug-logging repro is sketched below).
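
For the stall case, it may help to re-run the server with vLLM's debug logging enabled to see where the engine blocks (VLLM_LOGGING_LEVEL is vLLM's documented logging knob; the FP8 model path here is an assumption mirroring the directory layout above):

```bash
# Reproduce the stall with verbose engine logs
VLLM_LOGGING_LEVEL=DEBUG vllm serve /models/Qwen3-8B-FP8 \
  --port 8000 --max-model-len 4096
```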
Honestly, this is really rough going for a beginner.