Yes, you can have multiple clients sending requests to a vLLM server, but 500 Internal Server Error typically indicates the server is overloaded or has run out of system resources (e.g., file handles, memory, or shared memory). In your setup, launching 16 clients each with 4096 prompts at a high request rate (55) can easily overwhelm the server, especially if the system’s open file limit (ulimit -n) is not increased, or if Docker’s shared memory (--shm-size) is not set high enough. This is a common cause of 500 errors under heavy load, as discussed in this issue.
To resolve this, increase the open file limit (ulimit -n), and set a larger Docker shared memory size using --shm-size (e.g., --shm-size=24g). Also, ensure your server hardware and vLLM configuration (e.g., --api-server-count, OMP_NUM_THREADS) are tuned for high concurrency. For more details, see this discussion and this Docker deployment guide.
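As a quick sanity check before launching a large client sweep, you can inspect and raise the soft open-file limit from Python itself (the per-process equivalent of ulimit -n). This is a minimal sketch; the 65536 target is an illustrative number, not a vLLM requirement:

```python
import resource

def raise_nofile_limit(target: int = 65536) -> int:
    """Raise the soft open-file limit toward `target`, capped at the hard limit.

    Returns the soft limit actually in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The hard limit may be unlimited (RLIM_INFINITY); only cap when it is finite.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        soft = new_soft
    return soft
```

Raising the soft limit only helps up to the hard limit; going beyond that still requires root or a systemd/Docker-level change (e.g., docker run --ulimit nofile=...).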
Would you like a step-by-step breakdown of how to tune these parameters for your workload?
The --api-server-count parameter determines how many API server processes vLLM launches to handle incoming HTTP requests in parallel. Increasing this value can improve throughput by parallelizing input processing, but each process consumes CPU and system resources. If set too high relative to available CPU or system limits, it can cause resource exhaustion and errors, even with only a few clients, as noted in the optimization docs and related issues.
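One way to pick a starting value is to reserve cores for the engine and tokenizer work, then cap the API server processes at a modest number. This is a heuristic sketch under my own assumptions, not official vLLM guidance; the reserved_for_engine and cap defaults are illustrative:

```python
import os

def suggest_api_server_count(reserved_for_engine: int = 8, cap: int = 16) -> int:
    """Heuristic: leave `reserved_for_engine` cores for the model engine and
    other workers, then cap the API server process count at `cap` so it
    cannot exhaust CPUs on a large host (e.g., 64 cores)."""
    total = os.cpu_count() or 1
    available = max(1, total - reserved_for_engine)
    return max(1, min(available, cap))
```

The point of the cap is that API server processes parallelize request parsing and tokenization, which saturates well before one process per core; setting it to the full CPU count mostly adds contention.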
Would you like more detail on how to choose an appropriate value for --api-server-count?
Yeah, you can have multiple clients hitting a single vLLM server, but the 500s scream resource exhaustion more than anything else.
You’re running with --api-server-count=$NUM_CPU, which spins up 64 workers, and then hammering it with 16 clients at 55 rps each… that’s a lot of concurrent requests.
On top of that, the Llama 8B model eats a ton of GPU memory, so unless you’ve left headroom for KV cache growth under concurrent requests, you’ll start seeing crashes.
Best bet is either throttle your request-rate per client, reduce the number of parallel clients, or run multiple vllm server instances on different ports and do a round-robin.
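The throttling-plus-round-robin idea can be sketched with an asyncio semaphore. This is a transport-agnostic sketch: the endpoint URLs are hypothetical, and send(url, prompt) is injected so you can plug in whatever HTTP client you use (aiohttp, httpx, etc.):

```python
import asyncio
import itertools

# Hypothetical endpoints for two vLLM instances on different ports.
ENDPOINTS = [
    "http://localhost:8000/v1/completions",
    "http://localhost:8001/v1/completions",
]

async def run_clients(prompts, send, max_in_flight: int = 32):
    """Round-robin prompts across server instances while a semaphore
    caps how many requests are in flight at once."""
    sem = asyncio.Semaphore(max_in_flight)
    targets = itertools.cycle(ENDPOINTS)

    async def one(prompt, url):
        async with sem:
            return await send(url, prompt)

    tasks = [asyncio.create_task(one(p, next(targets))) for p in prompts]
    return await asyncio.gather(*tasks)
```

Capping in-flight requests client-side is usually gentler than a raw requests-per-second target, because it automatically backs off when the server slows down instead of piling up queued work.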
Also check the server logs for OOM or CUDA errors, because those 500s usually hide the real culprit in vLLM.
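If the logs are long, a quick grep for the usual suspects helps. A minimal sketch; the pattern list is a starting point I chose, not an exhaustive catalog of vLLM failure messages:

```python
import re

# Signatures that commonly accompany 500s under load; extend as needed.
PATTERNS = [
    r"CUDA out of memory",
    r"OutOfMemoryError",
    r"CUDA error",
    r"Too many open files",
]

def find_root_causes(log_text: str) -> list[str]:
    """Return the log lines matching any known failure signature."""
    rx = re.compile("|".join(PATTERNS), re.IGNORECASE)
    return [line for line in log_text.splitlines() if rx.search(line)]
```

Matching the client-side 500 timestamps against these server-side lines usually points straight at the real cause.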