Can I have multiple clients connected to a single server?

I set up a server with:

sudo docker run --rm -it -p 8000:8000 --cpuset-cpus="0-63" --gpus all \
  -v "$(pwd):$(pwd)" -w "$(pwd)" \
  rajesh550/gh200-vllm:0.10.2 bash
export NUM_CPU=64
export OMP_NUM_THREADS=$NUM_CPU
export VLLM_CPU_OMP_THREADS_BIND=0-$(($NUM_CPU-1))
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --api-server-count=$NUM_CPU \
  --port 8000

Now I’m trying to launch 16 instances of the client, with each invocation running:

vllm bench serve \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
  --dataset-name="sharegpt" \
  --dataset-path="ShareGPT_V3_unfiltered_cleaned_split.json" \
  --num-prompts 4096 \
  --base-url "http://192.222.51.7:8000" \
  --request-rate 55 \
  --result-filename "vllm_serve_separate_client.json" 
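For reference, here is a sketch of how those 16 invocations are fanned out in parallel (the per-client suffix on the result filename is an addition here, so that the 16 runs don't all write to the same file):

```shell
# Sketch: launch 16 bench clients in parallel; each gets its own
# result file so they don't overwrite each other.
NUM_CLIENTS=16
for i in $(seq 1 "$NUM_CLIENTS"); do
  vllm bench serve \
    --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
    --dataset-name="sharegpt" \
    --dataset-path="ShareGPT_V3_unfiltered_cleaned_split.json" \
    --num-prompts 4096 \
    --base-url "http://192.222.51.7:8000" \
    --request-rate 55 \
    --result-filename "vllm_serve_client_${i}.json" &
done
wait  # block until all 16 clients have finished
```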

However, I see that the server processes crash with a 500 Internal Server Error. What could be the reason for this? Can I have multiple clients?

Yes, you can have multiple clients sending requests to a vLLM server, but a 500 Internal Server Error typically indicates the server is overloaded or has run out of system resources (e.g., file handles, memory, or shared memory). In your setup, launching 16 clients each with 4096 prompts at a high request rate (55) can easily overwhelm the server, especially if the system’s open file limit (ulimit -n) is not increased, or if Docker’s shared memory (--shm-size) is not set high enough. This is a common cause of 500 errors under heavy load, as discussed in this issue.

To resolve this, increase the open file limit (ulimit -n), and set a larger Docker shared memory size using --shm-size (e.g., --shm-size=24g). Also, ensure your server hardware and vLLM configuration (e.g., --api-server-count, OMP_NUM_THREADS) are tuned for high concurrency. For more details, see this discussion and this Docker deployment guide.
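As a sketch, the container could be relaunched with those limits raised; the exact values below are illustrative starting points, not tuned recommendations:

```shell
# Sketch: same docker run as above, with the file-descriptor limit
# and shared-memory size raised (values are illustrative).
sudo docker run --rm -it \
  --ulimit nofile=65535:65535 \
  --shm-size=24g \
  -p 8000:8000 --cpuset-cpus="0-63" --gpus all \
  -v "$(pwd):$(pwd)" -w "$(pwd)" \
  rajesh550/gh200-vllm:0.10.2 bash
```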

Would you like a step-by-step breakdown of how to tune these parameters for your workload?

Hint: Mention @RunLLM in the post for followups.

@RunLLM, I have the same issue even if I am just launching two client processes concurrently. What does --api-server-count determine?

The --api-server-count parameter determines how many API server processes vLLM launches to handle incoming HTTP requests in parallel. Increasing this value can improve throughput by parallelizing input processing, but each process consumes CPU and system resources. If set too high relative to available CPU or system limits, it can cause resource exhaustion and errors, even with only a few clients, as noted in the optimization docs and related issues.
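For intuition, a more conservative relaunch might look like the following; the sizing heuristic (one API process per 16 cores) is an assumption for illustration, not an official recommendation:

```shell
# Sketch: keep the API front-end small instead of one process per core;
# the divisor below is an assumption, not a tuned value.
TOTAL_CPUS=64
API_SERVERS=$(( TOTAL_CPUS / 16 ))   # 4 API server processes
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-server-count=$API_SERVERS
```

This leaves most of the 64 cores free for the engine and tokenizer work instead of contending with 64 HTTP front-end processes.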

Would you like more detail on how to choose an appropriate value for --api-server-count?


@rajesh-s can you try sharing the GPU usage logs here?

It seems like a typical OOM error imo.

Yeah, you can have multiple clients hitting a single vLLM server, but the 500s scream resource exhaustion more than anything else.
You’re running with --api-server-count=$NUM_CPU, which spins up 64 workers, and then hammering it with 16 clients at 55 rps each… that’s a lot of concurrent requests.
On top of that, the Llama 3.1 8B model eats a ton of GPU memory, so unless you’ve left enough memory headroom, you’ll start seeing crashes.
Best bet is to either throttle the request rate per client, reduce the number of parallel clients, or run multiple vLLM server instances on different ports and round-robin between them.
Also check the logs for OOM or CUDA errors, because those 500s usually hide the real culprit in vLLM.
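If you go the multi-instance route, a minimal sketch looks like this (it assumes two GPUs are available; on a single-GPU box you’d instead split one GPU by memory fraction, and the ports/host here are just the ones from your setup):

```shell
# Sketch: one vLLM server per GPU (assumes two GPUs), each on its own
# port; clients are then pointed at alternating ports.
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8000 &
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8001 &

# Round-robin: client i targets port 8000 + (i % 2)
for i in $(seq 0 15); do
  PORT=$(( 8000 + i % 2 ))
  echo "client $i -> --base-url http://192.222.51.7:${PORT}"
done
wait
```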