To understand max-num-seqs better!

Does max-num-seqs correspond to user requests? Or, if the inference engine does mixed prefill-decode (i.e., prefill and decode are done together), does the number of tokens that are batched together (specified by max-num-batched-tokens) correspond to 1 request, with multiple such requests indicated by max-num-seqs?

Case 1 - 6 users send 1 request each and max-num-seqs is 6. The model can handle all 6 requests because max-num-seqs is 6; therefore, the number of concurrent requests is 6.

Case 2 - 1 user sends 6 back-to-back requests, with max-num-batched-tokens = 50000, max-model-len = 60000, and max-num-seqs = 6.

In Case 2, let's assume all the requests are 60000 tokens each. Because the user sends 6 subsequent requests, that is 6 * 60000 = 360000 tokens in total.

Because only 50000 tokens can be batched at a time, 360000 / 50000 = 7.2, so the actual number of requests the model sees would be more than 7.2 (I say more than 7.2 because this calculation only accounts for the context length, not the decode tokens).

So maybe 2 or more requests may be waiting, since max-num-seqs is 6.

max-num-seqs sets the maximum number of sequences (i.e., requests) that can be processed in a single batch, regardless of whether they come from different users or the same user sending multiple requests. Each request (sequence) counts toward this limit, so whether 6 users each send 1 request or 1 user sends 6 requests, both cases use up the max-num-seqs limit of 6, meaning up to 6 concurrent requests can be processed at once. Requests beyond this limit are queued and processed in subsequent batches.

This is independent of max-num-batched-tokens, which limits the total number of tokens (across all sequences) processed per batch; if the total tokens exceed this, batching is split accordingly, and some requests may wait for the next batch if either limit is reached first. Mixed prefill-decode does not change this accounting: each sequence is still a request, and both limits apply as described.
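As a rough illustration (a minimal sketch using the offline LLM entrypoint; the model name and values are just placeholders taken from your example), these limits are set when constructing the engine:

```python
from vllm import LLM, SamplingParams

# Sketch only: engine limits from the discussion above (placeholder values).
# - max_num_seqs: at most 6 sequences (requests) scheduled per batch
# - max_num_batched_tokens: at most 50,000 tokens processed per scheduler step
# - max_model_len: each request (prompt + generated tokens) capped at 60,000 tokens
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    max_num_seqs=6,
    max_num_batched_tokens=50_000,
    max_model_len=60_000,
    enable_chunked_prefill=True,  # allows prompts longer than the token budget to be split across steps
)

# Whether 6 users send 1 prompt each or 1 user sends 6 prompts,
# the scheduler sees 6 sequences and can keep them all in flight at once.
prompts = ["Hello, my name is"] * 6
outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
```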

For your Case 2, if 1 user sends 6 requests of 60,000 tokens each, and max-num-seqs is 6, all 6 requests can be scheduled concurrently, but if max-num-batched-tokens is 50,000, only part of each request can be processed per batch (due to chunked prefill), so the requests will be split and processed over multiple iterations. The number of concurrent requests is still capped by max-num-seqs, not by the token batching. See this explanation, this discussion, and the docs for details.
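To make the Case 2 arithmetic concrete, here is a small back-of-the-envelope sketch (plain Python, no vLLM needed). It only counts prefill tokens, so the step count is a lower bound, exactly as you noted:

```python
import math

max_num_seqs = 6
max_num_batched_tokens = 50_000
prompt_len = 60_000          # assumed prompt length per request (from the question)
num_requests = 6

# All 6 requests fit under max_num_seqs, so none of them wait in the queue.
concurrent = min(num_requests, max_num_seqs)   # -> 6
waiting = num_requests - concurrent            # -> 0

# Total prefill work, chunked into at most max_num_batched_tokens per scheduler step.
total_prefill_tokens = num_requests * prompt_len                                # 360,000
min_prefill_steps = math.ceil(total_prefill_tokens / max_num_batched_tokens)   # 360,000 / 50,000 = 7.2 -> 8

print(concurrent, waiting, min_prefill_steps)  # 6 0 8
```

So the token budget stretches the work over more scheduler iterations (at least 8 steps for prefill alone, plus the decode steps), but it does not reduce how many requests are in flight; that is governed only by max-num-seqs.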

Would you like a step-by-step breakdown or code example?
