Jetson AGX Thor: vllm serve for Qwen3.5-9B fails with CUBLAS_STATUS_NOT_INITIALIZED during the multimodal profile run

(vllm35) nvidia@localhost:/data/lqy/vllm/qwne3.5$ vllm serve \
    "/data/lqy/qwen3.5/Qwen3.5-9B" \
    --async-scheduling \
    --port 8000 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --swap-space 16 \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.7 \
    --enable-prefix-caching
DEBUG 03-04 10:16:27 [plugins/__init__.py:36] No plugins for group vllm.platform_plugins found.
DEBUG 03-04 10:16:27 [platforms/__init__.py:36] Checking if TPU platform is available.
DEBUG 03-04 10:16:27 [platforms/__init__.py:55] TPU platform is not available because: No module named 'libtpu'
DEBUG 03-04 10:16:27 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 03-04 10:16:29 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 03-04 10:16:29 [platforms/__init__.py:112] Checking if ROCm platform is available.
DEBUG 03-04 10:16:29 [platforms/__init__.py:126] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 03-04 10:16:29 [platforms/__init__.py:133] Checking if XPU platform is available.
DEBUG 03-04 10:16:29 [platforms/__init__.py:155] Checking if CPU platform is available.
DEBUG 03-04 10:16:29 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 03-04 10:16:29 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 03-04 10:16:29 [platforms/__init__.py:220] Automatically detected platform cuda.
DEBUG 03-04 10:16:31 [utils/flashinfer.py:45] flashinfer-cubin package was not found
DEBUG 03-04 10:16:32 [utils/import_utils.py:74] Loading module triton_kernels from /data/lqy/vllm/qwne3.5/vllm/third_party/triton_kernels/__init__.py.
DEBUG 03-04 10:16:33 [entrypoints/utils.py:171] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
DEBUG 03-04 10:16:33 [plugins/__init__.py:44] Available plugins for group vllm.general_plugins:
DEBUG 03-04 10:16:33 [plugins/__init__.py:46] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 03-04 10:16:33 [plugins/__init__.py:46] - lora_hf_hub_resolver -> vllm.plugins.lora_resolvers.hf_hub_resolver:register_hf_hub_resolver
DEBUG 03-04 10:16:33 [plugins/__init__.py:49] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
(APIServer pid=3978607) INFO 03-04 10:16:33 [entrypoints/utils.py:293]
(APIServer pid=3978607) INFO 03-04 10:16:33 [entrypoints/utils.py:293] █ █ █▄ ▄█
(APIServer pid=3978607) INFO 03-04 10:16:33 [entrypoints/utils.py:293] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.16.1rc1.dev175+gb8401cde0.d20260303
(APIServer pid=3978607) INFO 03-04 10:16:33 [entrypoints/utils.py:293] █▄█▀ █ █ █ █ model /data/lqy/qwen3.5/Qwen3.5-9B
(APIServer pid=3978607) INFO 03-04 10:16:33 [entrypoints/utils.py:293] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=3978607) INFO 03-04 10:16:33 [entrypoints/utils.py:293]
(APIServer pid=3978607) INFO 03-04 10:16:33 [entrypoints/utils.py:229] non-default args: {'model_tag': '/data/lqy/qwen3.5/Qwen3.5-9B', 'host': '0.0.0.0', 'model': '/data/lqy/qwen3.5/Qwen3.5-9B', 'trust_remote_code': True, 'max_model_len': 32768, 'gpu_memory_utilization': 0.7, 'swap_space': 16.0, 'enable_prefix_caching': True, 'max_num_seqs': 256, 'async_scheduling': True}
(APIServer pid=3978607) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=3978607) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=3978607) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=3978607) DEBUG 03-04 10:16:33 [model_executor/models/registry.py:794] Loaded model info for class vllm.model_executor.models.qwen3_5.Qwen3_5ForConditionalGeneration from cache
(APIServer pid=3978607) DEBUG 03-04 10:16:33 [logging_utils/log_time.py:29] Registry inspect model class: Elapsed time 0.0013964 secs
(APIServer pid=3978607) INFO 03-04 10:16:33 [config/model.py:530] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=3978607) INFO 03-04 10:16:33 [config/model.py:1553] Using max model len 32768
(APIServer pid=3978607) DEBUG 03-04 10:16:33 [config/model.py:1618] Generative models support chunked prefill.
(APIServer pid=3978607) DEBUG 03-04 10:16:33 [config/model.py:1661] Hybrid models do not support prefix caching since the feature is still experimental.
(APIServer pid=3978607) DEBUG 03-04 10:16:33 [engine/arg_utils.py:2037] Enabling chunked prefill by default
(APIServer pid=3978607) DEBUG 03-04 10:16:33 [engine/arg_utils.py:2153] Defaulting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
(APIServer pid=3978607) INFO 03-04 10:16:33 [config/scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=3978607) WARNING 03-04 10:16:33 [model_executor/models/config.py:381] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=3978607) INFO 03-04 10:16:33 [model_executor/models/config.py:401] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=3978607) DEBUG 03-04 10:16:33 [compilation/decorators.py:202] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.qwen2_moe.Qwen2MoeModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(APIServer pid=3978607) DEBUG 03-04 10:16:34 [compilation/decorators.py:202] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.qwen3_next.Qwen3NextModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(APIServer pid=3978607) INFO 03-04 10:16:34 [model_executor/models/config.py:544] Setting attention block size to 528 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=3978607) INFO 03-04 10:16:34 [model_executor/models/config.py:575] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=3978607) INFO 03-04 10:16:34 [config/vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=3978607) DEBUG 03-04 10:16:34 [plugins/__init__.py:36] No plugins for group vllm.stat_logger_plugins found.
(APIServer pid=3978607) DEBUG 03-04 10:16:34 [renderers/registry.py:51] Loading HfRenderer for renderer_mode='hf'
(APIServer pid=3978607) DEBUG 03-04 10:16:35 [tokenizers/registry.py:66] Loading CachedHfTokenizer for tokenizer_mode='hf'
(APIServer pid=3978607) DEBUG 03-04 10:16:35 [utils/torch_utils.py:119] OMP_NUM_THREADS is not set; defaulting Torch threads to 1.
(APIServer pid=3978607) DEBUG 03-04 10:16:35 [plugins/io_processors/__init__.py:37] No IOProcessor plugins requested by the model
(APIServer pid=3978607) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=3978607) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
DEBUG 03-04 10:16:43 [plugins/__init__.py:36] No plugins for group vllm.platform_plugins found.
DEBUG 03-04 10:16:43 [platforms/__init__.py:36] Checking if TPU platform is available.
DEBUG 03-04 10:16:43 [platforms/__init__.py:55] TPU platform is not available because: No module named 'libtpu'
DEBUG 03-04 10:16:43 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 03-04 10:16:43 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 03-04 10:16:43 [platforms/__init__.py:112] Checking if ROCm platform is available.
DEBUG 03-04 10:16:43 [platforms/__init__.py:126] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 03-04 10:16:43 [platforms/__init__.py:133] Checking if XPU platform is available.
DEBUG 03-04 10:16:43 [platforms/__init__.py:155] Checking if CPU platform is available.
DEBUG 03-04 10:16:43 [platforms/__init__.py:61] Checking if CUDA platform is available.
DEBUG 03-04 10:16:43 [platforms/__init__.py:84] Confirmed CUDA platform is available.
DEBUG 03-04 10:16:43 [platforms/__init__.py:220] Automatically detected platform cuda.
DEBUG 03-04 10:16:45 [utils/flashinfer.py:45] flashinfer-cubin package was not found
DEBUG 03-04 10:16:46 [utils/import_utils.py:74] Loading module triton_kernels from /data/lqy/vllm/qwne3.5/vllm/third_party/triton_kernels/__init__.py.
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:46 [v1/engine/core.py:1004] Waiting for init message from front-end.
(APIServer pid=3978607) DEBUG 03-04 10:16:46 [v1/engine/utils.py:1111] HELLO from local core engine process 0.
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:46 [v1/engine/core.py:1015] Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=['ipc:///tmp/2b4a1902-4905-4ef0-8baa-4615bcaf3633'], outputs=['ipc:///tmp/716ae102-7ff0-49db-b43f-d9114b45d1bd'], coordinator_input=None, coordinator_output=None, frontend_stats_publish_address=None), parallel_config={})
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:46 [v1/engine/core.py:812] Has DP Coordinator: False, stats publish address: None
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:46 [plugins/__init__.py:44] Available plugins for group vllm.general_plugins:
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:46 [plugins/__init__.py:46] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:46 [plugins/__init__.py:46] - lora_hf_hub_resolver -> vllm.plugins.lora_resolvers.hf_hub_resolver:register_hf_hub_resolver
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:46 [plugins/__init__.py:49] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
(EngineCore_DP0 pid=3979203) INFO 03-04 10:16:46 [v1/engine/core.py:101] Initializing a V1 LLM engine (v0.16.1rc1.dev175+gb8401cde0.d20260303) with config: model='/data/lqy/qwen3.5/Qwen3.5-9B', speculative_config=None, tokenizer='/data/lqy/qwen3.5/Qwen3.5-9B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/lqy/qwen3.5/Qwen3.5-9B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer',
'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': , 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': }
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:47 [compilation/decorators.py:202] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.qwen2_moe.Qwen2MoeModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:47 [compilation/decorators.py:202] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.qwen3_next.Qwen3NextModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:47 [tokenizers/registry.py:66] Loading CachedHfTokenizer for tokenizer_mode='hf'
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:48 [distributed/parallel_state.py:1348] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.10.75:45877 backend=nccl
(EngineCore_DP0 pid=3979203) INFO 03-04 10:16:48 [distributed/parallel_state.py:1392] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.10.75:45877 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:48 [distributed/parallel_state.py:1451] Detected 1 nodes in the distributed environment
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=3979203) INFO 03-04 10:16:48 [distributed/parallel_state.py:1714] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:48 [v1/worker/gpu_worker.py:285] worker init memory snapshot: torch_peak=0.0GiB, free_memory=115.89GiB, total_memory=122.82GiB, cuda_memory=6.94GiB, torch_memory=0.0GiB, non_torch_memory=6.94GiB, timestamp=1772590608.3469307, auto_measure=True
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:48 [v1/worker/gpu_worker.py:286] worker requested memory: 85.98GiB
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:48 [compilation/decorators.py:202] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.deepseek_v2.DeepseekV2Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:48 [compilation/decorators.py:202] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:48 [v1/sample/ops/topk_topp_sampler.py:57] FlashInfer top-p/top-k sampling is available but disabled by default. Set VLLM_USE_FLASHINFER_SAMPLER=1 to opt in after verifying accuracy for your workloads.
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:48 [v1/sample/logits_processor/__init__.py:63] No logitsprocs plugins installed (group vllm.logits_processors).
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:48 [utils/torch_utils.py:119] OMP_NUM_THREADS is not set; defaulting Torch threads to 1.
(EngineCore_DP0 pid=3979203) INFO 03-04 10:16:54 [model_executor/offloader/base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=3979203) INFO 03-04 10:16:54 [v1/worker/gpu_model_runner.py:4250] Starting to load model /data/lqy/qwen3.5/Qwen3.5-9B...
(EngineCore_DP0 pid=3979203) INFO 03-04 10:16:54 [platforms/cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=3979203) INFO 03-04 10:16:54 [model_executor/…/attention/mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:55 [platforms/cuda.py:387] Some attention backends are not valid for cuda with AttentionSelectorConfig(head_size=256, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=False, has_sink=False, use_sparse=False, use_mm_prefix=False, use_per_head_quant_scales=False, attn_type=decoder). Reasons: {}.
(EngineCore_DP0 pid=3979203) INFO 03-04 10:16:55 [platforms/cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=3979203) INFO 03-04 10:16:55 [v1/attention/backends/flash_attn.py:587] Using FlashAttention version 2
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:56 [compilation/backends.py:100] Using InductorStandaloneAdaptor
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:56 [config/compilation.py:1105] enabled custom ops: Counter({'mm_encoder_attn': 27, 'apply_rotary_emb': 27})
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:56 [config/compilation.py:1106] disabled custom ops: Counter({'gemma_rms_norm': 81, 'silu_and_mul': 32, 'rms_norm_gated': 24, 'chunk_gated_delta_rule': 24, 'rotary_embedding': 2, 'apply_rotary_emb': 2, 'conv3d': 1, 'vocab_parallel_embedding': 1, 'parallel_lm_head': 1, 'logits_processor': 1})
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:56 [model_executor/model_loader/base_loader.py:60] Loading weights on cuda ...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
(APIServer pid=3978607) DEBUG 03-04 10:16:56 [v1/engine/utils.py:1000] Waiting for 1 local, 0 remote core engine proc(s) to start.
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:16:57 [model_executor/models/utils.py:219] Loaded weight lm_head.weight with shape torch.Size([248320, 4096])
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:03<00:11, 3.91s/it]
(APIServer pid=3978607) DEBUG 03-04 10:17:06 [v1/engine/utils.py:1000] Waiting for 1 local, 0 remote core engine proc(s) to start.
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:17<00:18, 9.44s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:20<00:06, 6.61s/it]
(APIServer pid=3978607) DEBUG 03-04 10:17:16 [v1/engine/utils.py:1000] Waiting for 1 local, 0 remote core engine proc(s) to start.
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:23<00:00, 5.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:23<00:00, 5.99s/it]
(EngineCore_DP0 pid=3979203)
(EngineCore_DP0 pid=3979203) INFO 03-04 10:17:20 [model_executor/model_loader/default_loader.py:293] Loading weights took 24.19 seconds
(EngineCore_DP0 pid=3979203) DEBUG 03-04 10:17:20 [model_executor/model_loader/base_loader.py:68] Peak GPU memory after loading weights: 17.68 GiB
(EngineCore_DP0 pid=3979203) INFO 03-04 10:17:20 [v1/worker/gpu_model_runner.py:4333] Model loading took 17.66 GiB memory and 25.462008 seconds
(EngineCore_DP0 pid=3979203) INFO 03-04 10:17:21 [v1/worker/gpu_model_runner.py:5259] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return func(*args, **kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] super().__init__(
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return func(*args, **kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/core.py", line 252, in _initialize_kv_caches
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/v1/executor/uniproc_executor.py", line 76, in collective_rpc
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return func(*args, **kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return func(*args, **kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/v1/worker/gpu_worker.py", line 389, in determine_available_memory
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] self.model_runner.profile_run()
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/v1/worker/gpu_model_runner.py", line 5275, in profile_run
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] dummy_encoder_outputs = self.model.embed_multimodal(
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/model_executor/models/qwen3_vl.py", line 1938, in embed_multimodal
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] image_embeddings = self._process_image_input(multimodal_input)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/model_executor/models/qwen3_vl.py", line 1456, in _process_image_input
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] image_embeds = self.visual(pixel_values, grid_thw=grid_thw)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/model_executor/models/qwen3_vl.py", line 533, in forward
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] hidden_states = self.patch_embed(hidden_states)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/model_executor/models/qwen3_vl.py", line 169, in forward
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] x = self.proj(x).view(L, self.hidden_size)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/model_executor/custom_op.py", line 129, in forward
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return self._forward_method(*args, **kwargs)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/model_executor/layers/conv.py", line 250, in forward_native
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] return self._forward_mulmat(x)
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] File "/data/lqy/vllm/qwne3.5/vllm/model_executor/layers/conv.py", line 226, in _forward_mulmat
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] x = F.linear(
(EngineCore_DP0 pid=3979203) ERROR 03-04 10:17:21 [v1/engine/core.py:1100] RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(), Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Cdesc.descriptor(), preference.descriptor(), 1, &heuristicResult, &returnedResult)
(EngineCore_DP0 pid=3979203) Process EngineCore_DP0:
(EngineCore_DP0 pid=3979203) Traceback (most recent call last):
(EngineCore_DP0 pid=3979203) File "/data/conda/envs/vllm35/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=3979203) self.run()
(EngineCore_DP0 pid=3979203) File "/data/conda/envs/vllm35/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=3979203) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=3979203) raise e
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=3979203) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=3979203) return func(*args, **kwargs)
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=3979203) super().__init__(
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=3979203) num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=3979203) return func(*args, **kwargs)
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/core.py", line 252, in _initialize_kv_caches
(EngineCore_DP0 pid=3979203) available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore_DP0 pid=3979203) return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/v1/executor/uniproc_executor.py", line 76, in collective_rpc
(EngineCore_DP0 pid=3979203) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=3979203) return func(*args, **kwargs)
(EngineCore_DP0 pid=3979203) File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=3979203) return func(*args, **kwargs)
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/v1/worker/gpu_worker.py", line 389, in determine_available_memory
(EngineCore_DP0 pid=3979203) self.model_runner.profile_run()
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/v1/worker/gpu_model_runner.py", line 5275, in profile_run
(EngineCore_DP0 pid=3979203) dummy_encoder_outputs = self.model.embed_multimodal(
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/model_executor/models/qwen3_vl.py", line 1938, in embed_multimodal
(EngineCore_DP0 pid=3979203) image_embeddings = self._process_image_input(multimodal_input)
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/model_executor/models/qwen3_vl.py", line 1456, in _process_image_input
(EngineCore_DP0 pid=3979203) image_embeds = self.visual(pixel_values, grid_thw=grid_thw)
(EngineCore_DP0 pid=3979203) File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=3979203) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=3979203) File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=3979203) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/model_executor/models/qwen3_vl.py", line 533, in forward
(EngineCore_DP0 pid=3979203) hidden_states = self.patch_embed(hidden_states)
(EngineCore_DP0 pid=3979203) File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=3979203) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=3979203) File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=3979203) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/model_executor/models/qwen3_vl.py", line 169, in forward
(EngineCore_DP0 pid=3979203) x = self.proj(x).view(L, self.hidden_size)
(EngineCore_DP0 pid=3979203) File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=3979203) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=3979203) File "/data/conda/envs/vllm35/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=3979203) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/model_executor/custom_op.py", line 129, in forward
(EngineCore_DP0 pid=3979203) return self._forward_method(*args, **kwargs)
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/model_executor/layers/conv.py", line 250, in forward_native
(EngineCore_DP0 pid=3979203) return self._forward_mulmat(x)
(EngineCore_DP0 pid=3979203) File "/data/lqy/vllm/qwne3.5/vllm/model_executor/layers/conv.py", line 226, in _forward_mulmat
(EngineCore_DP0 pid=3979203) x = F.linear(
(EngineCore_DP0 pid=3979203) RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(), Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Cdesc.descriptor(), preference.descriptor(), 1, &heuristicResult, &returnedResult)
[rank0]:[W304 10:17:22.193753464 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html (function operator())
(APIServer pid=3978607) Traceback (most recent call last):
(APIServer pid=3978607) File "/data/conda/envs/vllm35/bin/vllm", line 10, in <module>
(APIServer pid=3978607) sys.exit(main())
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=3978607) args.dispatch_function(args)
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=3978607) uvloop.run(run_server(args))
(APIServer pid=3978607) File "/data/conda/envs/vllm35/lib/python3.10/site-packages/uvloop/__init__.py", line 69, in run
(APIServer pid=3978607) return loop.run_until_complete(wrapper())
(APIServer pid=3978607) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=3978607) File "/data/conda/envs/vllm35/lib/python3.10/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=3978607) return await main
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=3978607) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=3978607) async with build_async_engine_client(
(APIServer pid=3978607) File "/data/conda/envs/vllm35/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=3978607) return await anext(self.gen)
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=3978607) async with build_async_engine_client_from_engine_args(
(APIServer pid=3978607) File "/data/conda/envs/vllm35/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=3978607) return await anext(self.gen)
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=3978607) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=3978607) return cls(
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=3978607) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=3978607) return func(*args, **kwargs)
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client
(APIServer pid=3978607) return AsyncMPClient(*client_args)
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=3978607) return func(*args, **kwargs)
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/core_client.py", line 906, in __init__
(APIServer pid=3978607) super().__init__(
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/core_client.py", line 569, in __init__
(APIServer pid=3978607) with launch_core_engines(
(APIServer pid=3978607) File "/data/conda/envs/vllm35/lib/python3.10/contextlib.py", line 142, in __exit__
(APIServer pid=3978607) next(self.gen)
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=3978607) wait_for_engine_startup(
(APIServer pid=3978607) File "/data/lqy/vllm/qwne3.5/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=3978607) raise RuntimeError(
(APIServer pid=3978607) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(vllm35) nvidia@localhost:/data/lqy/vllm/qwne3.5$
I'd like to ask how to solve this.

This is most likely a version mismatch or similar issue.
First, please share the following outputs:

  1. The output of python -c "import torch; print(torch.__version__, torch.version.cuda)".
  2. The output of pip show vllm (version number).
  3. A summary of nvidia-smi (driver version, CUDA version).
  4. How you installed vLLM (pip install vllm, or from source with pip install -e .).
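Beyond those outputs, it may also be worth checking whether cuBLAS initializes at all outside vLLM, since the failing call in the trace is a plain F.linear. A minimal sketch (run it in the same conda env; it falls back to CPU if no GPU is visible):

```python
import torch
import torch.nn.functional as F

def cublas_sanity_check():
    """Exercise the same cuBLASLt-backed F.linear path that fails in the trace."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    x = torch.randn(4, 8, device=device, dtype=dtype)
    w = torch.randn(16, 8, device=device, dtype=dtype)
    # On a broken CUDA/cuBLAS setup this raises CUBLAS_STATUS_NOT_INITIALIZED
    y = F.linear(x, w)
    return tuple(y.shape)

print(cublas_sanity_check())
```

If this tiny matmul also fails on the GPU, the problem is in the PyTorch/CUDA install rather than in vLLM itself.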

(vllm35) nvidia@localhost:/data/lqy/vllm$ python -c "import torch; print(torch.__version__, torch.version.cuda)"
2.10.0+cu130 13.0
(vllm35) nvidia@localhost:/data/lqy/vllm$
(vllm35) nvidia@localhost:/data/lqy/vllm$ pip show vllm
Name: vllm
Version: 0.16.1rc1.dev175+gb8401cde0.d20260303.cu130
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email:
License-Expression: Apache-2.0
Location: /data/conda/envs/vllm35/lib/python3.10/site-packages
Editable project location: /data/lqy/vllm/qwne3.5
Requires: aiohttp, anthropic, blake3, cachetools, cbor2, cloudpickle, compressed-tensors, depyf, diskcache, einops, fastapi, filelock, flashinfer-python, gguf, grpcio, grpcio-reflection, ijson, lark, llguidance, lm-format-enforcer, mcp, mistral_common, model-hosting-container-standards, msgspec, ninja, numba, numpy, nvidia-cutlass-dsl, openai, openai-harmony, opencv-python-headless, opentelemetry-api, opentelemetry-exporter-otlp, opentelemetry-sdk, opentelemetry-semantic-conventions-ai, outlines_core, partial-json-parser, pillow, prometheus-fastapi-instrumentator, prometheus_client, protobuf, psutil, py-cpuinfo, pybase64, pydantic, python-json-logger, pyyaml, pyzmq, quack-kernels, ray, regex, requests, sentencepiece, setproctitle, tiktoken, tokenizers, tqdm, transformers, typing_extensions, watchfiles, xgrammar
Required-by:
(vllm35) nvidia@localhost:/data/lqy/vllm$
(vllm35) nvidia@localhost:/data/lqy/vllm$ nvidia-smi
Wed Mar 4 12:00:02 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.00 Driver Version: 580.00 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA Thor Off | 00000000:01:00.0 Off | N/A |
| N/A N/A N/A N/A / N/A | Not Supported | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2569 G /usr/lib/xorg/Xorg 0MiB |
| 0 N/A N/A 2861 G /usr/bin/gnome-shell 0MiB |
+-----------------------------------------------------------------------------------------+
(vllm35) nvidia@localhost:/data/lqy/vllm$
Installed from source with pip install -e .
Could you advise how to fix this?

As a short-term test, try running only a text-only Qwen model (Qwen3.5-9B-Instruct or another) to first get vLLM working on this machine at all. If text-only models run stably and only Qwen3.5-VL crashes, that further suggests a known compatibility issue with this particular combination rather than a global problem with your environment.
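A minimal text-only launch could look like the following (the model path is a placeholder; point it at whatever text-only checkpoint you have locally):

```shell
# Minimal text-only launch to isolate the VL-specific failure;
# the model path is an assumption, adjust to your local checkpoint
vllm serve /data/lqy/qwen3.5/Qwen3.5-9B-Instruct \
  --port 8000 \
  --host 0.0.0.0 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.7
```

Keeping the flags minimal here is deliberate: it removes the multimodal profiling path that triggers the cuBLAS error in your trace.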

You could also try the official NVIDIA vLLM container (the CUDA 13-based 25.09/25.11 releases) and deploy the Qwen3.5 model inside the container following the examples in the forum/issue threads.
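A rough invocation for the container route might look like this (sketch only: the exact image tag must be checked on NGC, and the mount path is an assumption based on your setup):

```shell
# Sketch only: verify the exact image tag on NGC (nvcr.io) before use
docker run --rm -it --runtime nvidia --network host \
  -v /data/lqy/qwen3.5:/models \
  nvcr.io/nvidia/vllm:25.11-py3 \
  vllm serve /models/Qwen3.5-9B --host 0.0.0.0 --port 8000
```

The container ships a PyTorch/CUDA/vLLM combination validated for Jetson Thor, which sidesteps the version-matching problem entirely.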

Also see these related GitHub issues:

https://github.com/vllm-project/vllm/issues/31128

https://github.com/vllm-project/vllm/issues/26791

OK, thanks for the reply.
One of those GitHub issues was actually filed by me. I was previously able to run Qwen3-VL-8B-Instruct on Jetson without problems. Because that environment's vLLM version was too old and I wanted to run the Qwen3.5 model, I created a new environment and installed it the same way as before, which produced the error above. Now even Qwen3-VL-8B-Instruct fails in that environment, so I'm not sure what's going on. For now I'd still rather not run it in a container.

I've already solved it by other means.

Could you share your setup steps? :beers:

I worked around it with a different approach; the original problem described above is still unresolved. However, with that alternative approach, sending an image as input now returns a 500 error.