@RunLLM how to set the priority for batch jobs in vLLM
To set the priority for batch jobs in vLLM, use the priority parameter in your request. For the OpenAI-compatible API, include priority in the extra_body field of your request JSON. Lower values mean higher priority (i.e., processed earlier). You must also start the vLLM server with --scheduling-policy priority to enable priority scheduling. For example:
```python
from openai import OpenAI

# Assumes a vLLM server started with --scheduling-policy priority;
# adjust base_url for your deployment (api_key is unused by vLLM)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Prompt"}],
    extra_body={"priority": 10},  # lower value = scheduled earlier
)
```
This will assign a priority of 10 to the request. Note: Priority scheduling is only effective if enabled, and a lower priority value means higher scheduling precedence. For more details, see the discussion and usage in PR #5958 and related issue comments.
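For reference, the server side might be launched like this (the model name is a placeholder; substitute your own):

```shell
# Start the OpenAI-compatible vLLM server with priority scheduling enabled
vllm serve your-model --scheduling-policy priority
```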
Would you like more detail or code examples for other interfaces?
Hint: Mention @RunLLM in the post for followups.
@RunLLM I mean how to set the value? Is there a guide on this?
There is no official guide specifying how to choose or set the exact value for the priority parameter; it is user-defined. The only rule is that lower values are treated as higher priority (processed earlier), and ties are broken by arrival time. You can assign any integer value based on your application’s needs (e.g., 0 for highest, 10 for lower, etc.)—the scale and meaning are up to you. See PR #5958 and API doc.
Would you like more detail or best practices for designing a priority scheme?
@RunLLM yes, what are the default values, and how do I make sure that my batch jobs do not affect interactive requests?
By default, the priority value is 0 if not specified. Lower values mean higher priority (e.g., -1 is higher priority than 0, and 10 is lower priority). To ensure your batch jobs do not affect interactive requests, assign higher (less urgent) priority values to batch jobs (e.g., priority=10 or higher), and use lower (more urgent) values (e.g., priority=0 or priority=-1) for interactive requests. The scheduler will always process lower-value (higher-priority) requests first when using --scheduling-policy priority (docs, PR #5958).
For best practices:
- Define a clear mapping (e.g., `priority=0` for interactive, `priority=10` for batch, `priority=20` for background).
- Always start vLLM with `--scheduling-policy priority`.
- Optionally, use a Service Level Agreement (SLA) tiering approach (see SLA-Tiered Scheduling RFC) for more granular control, though this is still under discussion.
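As a minimal sketch of such a mapping (the tier names and helper are illustrative, not part of vLLM):

```python
# Illustrative priority tiers; in vLLM, a lower value is scheduled earlier
PRIORITY_TIERS = {
    "interactive": 0,
    "batch": 10,
    "background": 20,
}

def extra_body_for(tier: str) -> dict:
    """Build the extra_body payload for a request of the given tier."""
    return {"priority": PRIORITY_TIERS[tier]}

print(extra_body_for("batch"))  # {'priority': 10}
```

You would then pass `extra_body=extra_body_for("batch")` (or `"interactive"`, etc.) in each `client.chat.completions.create(...)` call, keeping the tier definitions in one place.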
Would you like a step-by-step example or more on SLA-tiered scheduling?
what is the --scheduling-policy priority? @RunLLM
The --scheduling-policy priority flag tells vLLM to schedule requests based on their assigned priority value, with lower values being processed first. If two requests have the same priority, the one that arrived earlier is processed first. The default policy is fcfs (first-come-first-served), but setting --scheduling-policy priority enables priority-based scheduling (docs).
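A minimal sketch of that ordering rule (plain Python, not vLLM internals): requests effectively sort by `(priority, arrival_order)`, so lower priority values go first and ties fall back to arrival order.

```python
# Each tuple is (priority, arrival_order, request_id) -- an illustration
# of the comparison the priority policy applies, not actual vLLM code.
requests = [
    (10, 0, "batch-A"),       # arrived first, but low urgency
    (0, 1, "interactive-B"),  # higher priority (lower value)
    (0, 2, "interactive-C"),  # same priority, arrived later
]

# Sorting by (priority, arrival_order) yields the scheduling order
order = [rid for _, _, rid in sorted(requests)]
print(order)  # ['interactive-B', 'interactive-C', 'batch-A']
```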
Would you like more detail on how to use this flag or how it affects scheduling?