I am running a prefill test (max_tokens=1) that sends multiple requests of roughly 6000 tokens each, with max_num_batched_tokens=60000 and max_num_seqs=16. vllm:num_requests_running stays at 1 the whole time, even when vllm:num_requests_waiting reaches 30. The server runs on an A100.
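For reference, the test client looks roughly like the sketch below (a simplified reconstruction, not my exact script; the endpoint URL and the filler prompt are placeholders):

```python
# Simplified reconstruction of the prefill test client (placeholder URL and
# prompt, not the exact script used): send many ~6000-token prompts with
# max_tokens=1 so each request is essentially prefill-only.
import concurrent.futures

import requests

URL = "http://localhost:8000/v1/completions"  # placeholder host/port
MODEL = "mistralai/Mistral-7B-Instruct-v0.2"

# Filler prompt of roughly 6000 tokens; the real prompts differ.
PROMPT = "hello " * 6000


def send_one(i: int) -> int:
    resp = requests.post(
        URL,
        json={"model": MODEL, "prompt": PROMPT, "max_tokens": 1},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.status_code


# Fire enough concurrent requests that the waiting queue builds up.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    for status in pool.map(send_one, range(32)):
        print(status)
```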
The log for a request is:
————————————————————————————–
params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=, stop_token_ids=, bad_words=, include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), ………
……. prompt_embeds shape: None, lora_request: None.
—————————————————————————————
This is the server startup log. Any ideas?
—————————————————————————————-
INFO 08-29 09:22:20 [__init__.py:235] Automatically detected platform cuda.
INFO 08-29 09:22:24 [api_server.py:1755] vLLM API server version 0.10.1.dev1+gbcc0a3cbe
INFO 08-29 09:22:24 [cli_args.py:261] non-default args: {'model_tag': 'mistralai/Mistral-7B-Instruct-v0.2', 'host': '0.0.0.0', 'model': 'mistralai/Mistral-7B-Instruct-v0.2', 'trust_remote_code': True, 'max_model_len': 16384, 'block_size': 64, 'enable_prefix_caching': False, 'max_num_batched_tokens': 60000, 'max_num_seqs': 16, 'enable_chunked_prefill': False}
INFO 08-29 09:22:35 [config.py:1604] Using max model len 16384
INFO 08-29 09:22:35 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=60000.
/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/tokenizer_group.py:24: FutureWarning: It is strongly recommended to run mistral models with --tokenizer-mode "mistral"
to ensure correct encoding and decoding.
self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
INFO 08-29 09:22:44 [__init__.py:235] Automatically detected platform cuda.
INFO 08-29 09:22:48 [core.py:572] Waiting for init message from front-end.
INFO 08-29 09:22:48 [core.py:71] Initializing a V1 LLM engine (v0.10.1.dev1+gbcc0a3cbe) with config: model='mistralai/Mistral-7B-Instruct-v0.2', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=mistralai/Mistral-7B-Instruct-v0.2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":,"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":,"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":32,"local_cache_dir":null}
INFO 08-29 09:22:50 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 08-29 09:22:50 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 08-29 09:22:50 [gpu_model_runner.py:1843] Starting to load model mistralai/Mistral-7B-Instruct-v0.2…
INFO 08-29 09:22:50 [gpu_model_runner.py:1875] Loading model from scratch…
INFO 08-29 09:22:50 [cuda.py:290] Using Flash Attention backend on V1 engine.
INFO 08-29 09:22:50 [weight_utils.py:296] Using model weights format ['*.safetensors']
INFO 08-29 09:23:13 [weight_utils.py:312] Time spent downloading weights for mistralai/Mistral-7B-Instruct-v0.2: 23.020574 seconds
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:01<00:03, 1.89s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:03<00:01, 1.84s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.87s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.87s/it]
INFO 08-29 09:23:19 [default_loader.py:262] Loading weights took 5.67 seconds
INFO 08-29 09:23:20 [gpu_model_runner.py:1892] Model loading took 13.4967 GiB and 29.303647 seconds
INFO 08-29 09:23:27 [backends.py:530] Using cache directory: /tmp/.cache/vllm/torch_compile_cache/043061b246/rank_0_0/backbone for vLLM's torch.compile
INFO 08-29 09:23:27 [backends.py:541] Dynamo bytecode transform time: 6.85 s
INFO 08-29 09:23:29 [backends.py:194] Cache the graph for dynamic shape for later use
INFO 08-29 09:23:53 [backends.py:215] Compiling a graph for dynamic shape takes 25.93 s
INFO 08-29 09:24:13 [monitor.py:34] torch.compile takes 32.78 s in total
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
INFO 08-29 09:24:14 [gpu_worker.py:255] Available KV cache memory: 19.70 GiB
INFO 08-29 09:24:14 [kv_cache_utils.py:833] GPU KV cache size: 161,344 tokens
INFO 08-29 09:24:14 [kv_cache_utils.py:837] Maximum concurrency for 16,384 tokens per request: 9.85x
Capturing CUDA graph shapes: 0%| | 0/7 [00:00<?, ?it/s]
Capturing CUDA graph shapes: 43%|████▎ | 3/7 [00:00<00:00, 24.64it/s]
Capturing CUDA graph shapes: 86%|████████▌ | 6/7 [00:00<00:00, 26.17it/s]
Capturing CUDA graph shapes: 100%|██████████| 7/7 [00:00<00:00, 25.50it/s]
INFO 08-29 09:24:15 [gpu_model_runner.py:2485] Graph capturing finished in 1 secs, took 0.11 GiB
INFO 08-29 09:24:15 [core.py:193] init engine (profile, create kv cache, warmup model) took 54.94 seconds
/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/tokenizer_group.py:24: FutureWarning: It is strongly recommended to run mistral models with --tokenizer-mode "mistral"
to ensure correct encoding and decoding.
self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)
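For completeness, this is roughly how I watch the two gauges during the test (assuming the default /metrics endpoint on the same host/port as the API server):

```python
# Poll the Prometheus endpoint and print the two gauges quoted above
# (placeholder host/port; adjust to match the running server).
import time

import requests

METRICS_URL = "http://localhost:8000/metrics"  # placeholder host/port
GAUGES = ("vllm:num_requests_running", "vllm:num_requests_waiting")

for _ in range(60):
    body = requests.get(METRICS_URL, timeout=10).text
    for line in body.splitlines():
        # Gauge lines look like: vllm:num_requests_running{...} 1.0
        if line.startswith(GAUGES):
            print(line)
    time.sleep(1.0)
```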