Batched processing over the OpenAI API

I’ve created a chatbot where an HTML/JS client connects to a WebSocket server and holds conversations with a Qwen3 assistant over the OpenAI API to Ollama. It works well, and as a next step I am experimenting with vLLM to improve throughput. My current experiments are going well, but I’m curious about the best ways to fine-tune the engine’s batched processing. I’m hoping to limit how many concurrent prompts a single instance of the engine can handle, and to tune the size of the KV cache based on the expected size of the prompts coming into vLLM (around 2k tokens max) and the expected number of tokens generated (probably limited to around 200). What are the best ways to tune the engine on the command line for these sorts of parameters? I expect I’ll find more knobs as I go; any info on the subject is welcome.

To tune vLLM for your use case, set --max-num-seqs to cap the number of concurrent prompts (sequences) processed in a batch, and --max-num-batched-tokens to cap the total number of tokens scheduled per batch. For your expected prompt (~2k tokens) plus output (~200 tokens), set --max-model-len to at least 2200, and adjust --gpu-memory-utilization to leave enough memory for the KV cache. If you want to control the KV cache size explicitly, you can use --num-gpu-blocks-override, but you’ll need to calculate the required number of blocks from your model’s per-token KV size and your memory constraints.

For example:

vllm serve ... --max-num-seqs 16 --max-num-batched-tokens 4096 --max-model-len 2200 --gpu-memory-utilization 0.8

These parameters help balance throughput and memory usage. For more details, see Optimization and Tuning, Context length and batch size, and KV cache size discussion.
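If you want to sanity-check the KV cache math (for example before reaching for --num-gpu-blocks-override), a common back-of-envelope formula is bytes per token = 2 (K and V) * num_layers * num_kv_heads * head_dim * dtype_bytes. The sketch below is only an illustration: the layer/head/dim values are placeholder assumptions, so substitute the real ones from your model’s config.json (num_hidden_layers, num_key_value_heads, head_dim), and note that vLLM’s own accounting may differ slightly.

# Back-of-envelope KV cache sizing. The model dimensions below are
# placeholder assumptions; read the real values from your config.json.
num_layers   = 36      # assumed num_hidden_layers
num_kv_heads = 8       # assumed num_key_value_heads (GQA)
head_dim     = 128     # assumed head dimension
dtype_bytes  = 2       # fp16 / bf16 KV cache

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V

max_model_len = 2200   # ~2k prompt + ~200 generated tokens
max_num_seqs  = 16     # concurrent sequences you want to support
block_size    = 16     # block size (vLLM's default is typically 16)

bytes_per_seq = bytes_per_token * max_model_len
cache_bytes   = bytes_per_seq * max_num_seqs
blocks_needed = max_num_seqs * -(-max_model_len // block_size)  # ceil division

print(f"~{bytes_per_token / 1024:.0f} KiB per token, "
      f"~{bytes_per_seq / 2**20:.0f} MiB per sequence, "
      f"~{cache_bytes / 2**30:.2f} GiB for {max_num_seqs} full-length sequences "
      f"(~{blocks_needed} blocks of {block_size} tokens)")

With these assumed numbers that works out to roughly 5 GiB of KV cache to hold 16 full-length sequences at once, which you can compare against the KV cache capacity vLLM reports when it starts up.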

Would you like more detail on how to calculate the optimal KV cache size or batch parameters?


Is there a way to determine the approximate overhead that a model will require outside of the model’s weights and KV cache? And how does this overhead grow with each additional concurrent sequence/prompt the engine is able to process in parallel?

@RunLLM

Yes, vLLM measures and logs the overhead outside of model weights and KV cache during initialization. This overhead mainly includes peak activation tensors (from profiling runs), NCCL buffers, and other non-torch allocations. The overhead grows with the number of concurrent sequences, as more activation memory is needed during prefill and decode, but the exact growth depends on model architecture and batch size. vLLM profiles this by running a dummy batch with your configured max_model_len and max_num_seqs, then subtracts the measured non-KV memory from the available GPU memory to allocate the KV cache.

You can see a breakdown of these memory categories in the logs and in the memory profiling documentation. The overhead is not a fixed value and is best determined empirically for your configuration.
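As a rough empirical cross-check of what vLLM reports, you can back the overhead out of the quantities it already works with: the memory budget (total GPU memory times --gpu-memory-utilization) minus the weights and the KV cache it actually allocated is approximately the activation/NCCL/other overhead. The sketch below is purely illustrative; the GPU size, model size, and KV cache figure are assumptions to replace with your own numbers (e.g. from nvidia-smi and the startup logs).

# Hedged, back-of-envelope overhead estimate; every number here is an
# assumption standing in for values from your own hardware and logs.
GIB = 1024**3

total_gpu_mem      = 24 * GIB    # assumed 24 GB card
gpu_mem_util       = 0.8         # --gpu-memory-utilization
kv_cache_from_logs = 10 * GIB    # KV cache size reported at startup (assumed)
weights            = 4e9 * 2     # assumed ~4B params in bf16 (2 bytes each)

budget   = total_gpu_mem * gpu_mem_util           # what vLLM is allowed to use
overhead = budget - weights - kv_cache_from_logs  # activations, NCCL buffers, etc.

print(f"weights  ~ {weights / GIB:.1f} GiB")
print(f"KV cache ~ {kv_cache_from_logs / GIB:.1f} GiB")
print(f"overhead ~ {overhead / GIB:.1f} GiB")

With these assumed numbers the leftover overhead comes out a bit under 2 GiB; repeating the measurement at different --max-num-seqs values will show how quickly that component grows for your model.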

Would you like a step-by-step example of how to interpret these logs or calculate the overhead?
