Batched processing over the OpenAI API

I’ve created a chatbot where an HTML/JS client connects to a WebSocket server and holds conversations with a Qwen3 assistant over the OpenAI API to Ollama. It works well, and as a next step I am experimenting with vLLM to improve throughput. My current experiments are going well, but I’m curious about the best ways to fine-tune the engine’s batched processing. I’m hoping to limit how many concurrent prompts a single instance of the engine can handle, and to tune the size of the KV cache based on the expected size of the prompts coming into vLLM (around 2k tokens max) and the expected number of tokens generated (probably limited to around 200). What are the best ways to tune the engine on the command line for these sorts of parameters? I expect I’ll find more knobs as I go; any info on the subject is welcome.

To tune vLLM for your use case, set --max-num-seqs to cap the number of concurrent prompts (sequences) processed in a batch, and --max-num-batched-tokens to cap the total number of tokens scheduled per batch. For your expected prompt (~2k tokens) plus output (~200 tokens), set --max-model-len to at least 2200, and adjust --gpu-memory-utilization to leave enough memory for the KV cache. If you want to control the KV cache size explicitly, you can use --num-gpu-blocks-override, but you’ll need to calculate the required number of blocks from your model’s per-token KV size and your memory constraints.

For example:

vllm serve ... --max-num-seqs 16 --max-num-batched-tokens 4096 --max-model-len 2200 --gpu-memory-utilization 0.8

These parameters help balance throughput and memory usage. For more details, see Optimization and Tuning, Context length and batch size, and KV cache size discussion.
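If you want to sanity-check the KV cache math (for example before reaching for --num-gpu-blocks-override), a common back-of-envelope formula is bytes per token = 2 (K and V) * num_layers * num_kv_heads * head_dim * dtype_bytes. The sketch below is only an illustration: the layer/head/dim values are placeholder assumptions, so substitute the real ones from your model’s config.json (num_hidden_layers, num_key_value_heads, head_dim), and note that vLLM’s own accounting may differ slightly.

# Back-of-envelope KV cache sizing. The model dimensions below are
# placeholder assumptions; read the real values from your config.json.
num_layers   = 36      # assumed num_hidden_layers
num_kv_heads = 8       # assumed num_key_value_heads (GQA)
head_dim     = 128     # assumed head dimension
dtype_bytes  = 2       # fp16 / bf16 KV cache

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V

max_model_len = 2200   # ~2k prompt + ~200 generated tokens
max_num_seqs  = 16     # concurrent sequences you want to support
block_size    = 16     # block size (vLLM's default is typically 16)

bytes_per_seq = bytes_per_token * max_model_len
cache_bytes   = bytes_per_seq * max_num_seqs
blocks_needed = max_num_seqs * -(-max_model_len // block_size)  # ceil division

print(f"~{bytes_per_token / 1024:.0f} KiB per token, "
      f"~{bytes_per_seq / 2**20:.0f} MiB per sequence, "
      f"~{cache_bytes / 2**30:.2f} GiB for {max_num_seqs} full-length sequences "
      f"(~{blocks_needed} blocks of {block_size} tokens)")

With these assumed numbers that works out to roughly 5 GiB of KV cache to hold 16 full-length sequences at once, which you can compare against the KV cache capacity vLLM reports when it starts up.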

Would you like more detail on how to calculate the optimal KV cache size or batch parameters?


Is there a way to determine the approximate overhead that a model will require outside of the model’s weights and KV cache? And how does this overhead grow with each additional concurrent sequence/prompt the engine is able to process in parallel?

@RunLLM

Yes, vLLM measures and logs the overhead outside of model weights and KV cache during initialization. This overhead mainly includes peak activation tensors (from profiling runs), NCCL buffers, and other non-torch allocations. The overhead grows with the number of concurrent sequences, as more activation memory is needed during prefill and decode, but the exact growth depends on model architecture and batch size. vLLM profiles this by running a dummy batch with your configured max_model_len and max_num_seqs, then subtracts the measured non-KV memory from the available GPU memory to allocate the KV cache.

You can see a breakdown of these memory categories in the logs and in the memory profiling documentation. The overhead is not a fixed value and is best determined empirically for your configuration.
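As a rough empirical cross-check of what vLLM reports, you can back the overhead out of the quantities it already works with: the memory budget (total GPU memory times --gpu-memory-utilization) minus the weights and the KV cache it actually allocated is approximately the activation/NCCL/other overhead. The sketch below is purely illustrative; the GPU size, model size, and KV cache figure are assumptions to replace with your own numbers (e.g. from nvidia-smi and the startup logs).

# Hedged, back-of-envelope overhead estimate; every number here is an
# assumption standing in for values from your own hardware and logs.
GIB = 1024**3

total_gpu_mem      = 24 * GIB    # assumed 24 GB card
gpu_mem_util       = 0.8         # --gpu-memory-utilization
kv_cache_from_logs = 10 * GIB    # KV cache size reported at startup (assumed)
weights            = 4e9 * 2     # assumed ~4B params in bf16 (2 bytes each)

budget   = total_gpu_mem * gpu_mem_util           # what vLLM is allowed to use
overhead = budget - weights - kv_cache_from_logs  # activations, NCCL buffers, etc.

print(f"weights  ~ {weights / GIB:.1f} GiB")
print(f"KV cache ~ {kv_cache_from_logs / GIB:.1f} GiB")
print(f"overhead ~ {overhead / GIB:.1f} GiB")

With these assumed numbers the leftover overhead comes out a bit under 2 GiB; repeating the measurement at different --max-num-seqs values will show how quickly that component grows for your model.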

Would you like a step-by-step example of how to interpret these logs or calculate the overhead?
