Jetson module
[x] Jetson Thor
Jetson software
JetPack 5.1.3
JetPack 5.1.4
JetPack 6.0
JetPack 6.1
[x] JetPack 6.2
DeepStream SDK
NVIDIA Isaac
SDK Manager version
2.3.0
2.2.0
2.1.0
[x] Other (flashed directly without SDK Manager)
Problem description
Hardware: Jetson AGX Thor Developer Kit, 128 GB unified memory.
Power mode: MAXN (130 W) enabled and sudo jetson_clocks executed.
Software environment: official vLLM NGC container nvcr.io/nvidia/vllm:25.09-py3.
Model: Qwen3-8B-AWQ (FP8 and BF16 variants were also tried), placed under the mounted directory /models.
Expected behavior:
Based on NVIDIA's official MLPerf data and community benchmarks, an 8B quantized model on Thor should sustain at least 10-15 tokens/s single-request generation throughput (the BF16 baseline), with GPU power around 70 W.
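As a sanity check on that expectation: single-request decode is memory-bandwidth-bound, so each generated token must stream the full weight set from memory once. A minimal sketch, assuming Thor's ~273 GB/s LPDDR5X bandwidth and ~16 GB of BF16 weights for an 8B model (both figures are my assumptions; adjust to your spec sheet):

```python
# Rough decode-throughput roofline: tokens/s <= bandwidth / weight_bytes.
# 273 GB/s and the weight sizes below are assumptions, not measured values.

def roofline_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-request decode throughput (tokens/s)."""
    return bandwidth_gb_s / weights_gb

print(roofline_tokens_per_s(273, 16))   # BF16 8B weights: ~17 tokens/s ceiling
print(roofline_tokens_per_s(273, 4.5))  # ~4-bit AWQ weights: ~60 tokens/s ceiling
```

The observed 4-5 tokens/s is well below even the BF16 ceiling, which together with 23 W at "97% utilization" suggests the GPU is stalled rather than bandwidth-limited.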
Actual behavior:
- The server starts normally; /v1/chat/completions returns 200 OK.
- Generation is extremely slow: Avg generation throughput holds at 4-5 tokens/s.
- nvidia-smi reports ~97% GPU utilization, yet GPU power is only 23 W (far below the 70-75 W expected at full load).
- The vLLM log shows GPU KV cache usage: 0.0% and Prefix cache hit rate: 0.0%.
- Tried adjusting --gpu-memory-utilization (0.5-0.9) and --max-model-len (2048-8192); no improvement.
- Upgraded the transformers library inside the container; no effect.
- Ran sync && echo 3 | sudo tee /proc/sys/vm/drop_caches on the host to clear the page cache; no effect.
- Confirmed torch.cuda.is_available() returns True.
- Container launch command:
  docker run --runtime=nvidia --gpus all --network host --ipc=host -v /home/xxx/LLM:/models nvcr.io/nvidia/vllm:25.09-py3 bash
Reproduction steps:
- Start the container and enter bash.
- Run:
  vllm serve /models/Qwen3-8B-AWQ --port 8000 --quantization awq --gpu-memory-utilization 0.8 --max-model-len 4096
- From a host terminal, send a test request:
  curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/models/Qwen3-8B-AWQ", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
- Watch Avg generation throughput in the vLLM log and the power/utilization figures in nvidia-smi.
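For anyone reproducing this, here is a small client-side sketch that times the same request end-to-end instead of relying on the 10-second averaged log line. It assumes the vllm serve instance from the steps above is listening on localhost:8000 and reads completion_tokens from the usage block of the OpenAI-compatible response:

```python
# Measure decode throughput from the client side against the vllm serve
# instance started in the repro steps (URL and model path assumed from above).
import json
import time
import urllib.request

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput given a token count and wall-clock seconds."""
    return n_tokens / elapsed_s

def measure(url: str = "http://localhost:8000/v1/chat/completions") -> None:
    payload = {
        "model": "/models/Qwen3-8B-AWQ",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 100,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.monotonic() - t0
    n = body["usage"]["completion_tokens"]
    print(f"{n} tokens in {elapsed:.1f}s -> "
          f"{tokens_per_second(n, elapsed):.1f} tokens/s")

if __name__ == "__main__":
    measure()
```

At the reported rate, 100 tokens should take roughly 20-25 s wall clock; if the client-side number disagrees sharply with the engine log, that would point at queuing or tokenizer overhead rather than decode speed.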
Attachments
- nvidia-smi output screenshot (23 W power, 97% utilization)
- tegrastats output screenshot (VDD_GPU ≈ 23 W)
- vLLM log excerpt:
  Avg generation throughput: 4.3 tokens/s, GPU KV cache usage: 0.0%
Error code
No explicit error code; performance is simply far below expectations.
Error log
(vLLM log from server startup and request handling, for example:)
(APIServer pid=3291) INFO: Started server process [3291]
(APIServer pid=3291) INFO: Waiting for application startup.
(APIServer pid=3291) INFO: Application startup complete.
(APIServer pid=3291) INFO 04-10 19:29:54 [chat_utils.py:470] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=3291) INFO 04-10 19:29:54 [loggers.py:123] Engine 000: Avg prompt throughput: 2.1 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:30:04 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:30:14 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:30:24 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:30:34 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:30:44 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:30:54 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:31:04 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:31:14 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:31:24 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:31:34 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3291) INFO 04-10 19:31:44 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
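To summarize excerpts like the above, a throwaway parser over the pasted engine lines (the regex matches the loggers.py format shown in this report; the wording of the line may differ across vLLM versions):

```python
# Extract the "Avg generation throughput" values from pasted vLLM engine
# log lines; prompt-throughput figures are deliberately not matched.
import re

LINE_RE = re.compile(r"Avg generation throughput: ([\d.]+) tokens/s")

def throughputs(log_text: str) -> list[float]:
    return [float(m) for m in LINE_RE.findall(log_text)]

sample = (
    "Engine 000: Avg prompt throughput: 0.0 tokens/s, "
    "Avg generation throughput: 4.2 tokens/s, Running: 1 reqs\n"
    "Engine 000: Avg prompt throughput: 0.0 tokens/s, "
    "Avg generation throughput: 4.0 tokens/s, Running: 1 reqs\n"
)
vals = throughputs(sample)
print(vals, round(sum(vals) / len(vals), 2))  # [4.2, 4.0] 4.1
```

Over the full log above this averages to roughly 4.1 tokens/s with essentially zero variance, i.e. a stable plateau rather than intermittent stalls.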