Hey people, I'm setting up a vLLM server on a machine with 4 GPUs (A6000), and I'm getting some bizarre results while benchmarking the average request time for each model size.
My NVIDIA driver is 535.65, and the CUDA version from nvcc --version is 11.5.
The 30B model is split across 2 GPUs with tensor-parallel size 2.
I have a test script that averages the response time over 20 images for each of the models, and the results are very unintuitive to me. Does anyone know what the heck is going on?
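For context, here is a minimal sketch of the kind of averaging benchmark I mean, assuming an OpenAI-compatible vLLM server on localhost:8000; the model name, image paths, and prompt are placeholders, not my actual script:

```python
# Rough sketch of an average-latency benchmark against a vLLM OpenAI-compatible
# server. Model name, image paths, and prompt are placeholders.
import base64
import time
import requests

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "<served-model-name>"  # placeholder
IMAGE_PATHS = [f"images/{i:02d}.jpg" for i in range(20)]  # hypothetical test set

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

latencies = []
for path in IMAGE_PATHS:
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"}},
            ],
        }],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    requests.post(URL, json=payload, timeout=300).raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"avg request time: {sum(latencies) / len(latencies):.2f}s over {len(latencies)} images")
```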
Your results—where the 30B model is faster than the 8B and 4B models—are counterintuitive but can be explained by GPU utilization and parallelism. When you run the 30B model with tensor parallelism across 2 GPUs, you effectively double the compute and memory bandwidth available for that model, allowing it to process requests faster than the smaller models running on a single GPU. The 8B and 4B models, each on a single GPU, may be bottlenecked by compute or memory, especially if the A6000 is not fully saturated or if the batch size is small (as with --max_num_seqs 1). This effect is documented in vLLM discussions, where higher tensor parallelism can yield better throughput and lower latency for large models, while small models may not benefit and can even be slower due to underutilization or lack of parallelism (source, source).
Additionally, vLLM’s scheduling and batching are optimized for high throughput with larger models and higher parallelism, so single-sequence, single-GPU runs may not reflect optimal performance. For small models, running multiple concurrent requests or using data parallelism (multiple replicas) is recommended to better utilize the hardware (source). Would you like a step-by-step breakdown or tuning suggestions for your setup?
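As a rough illustration of the concurrency point, here is a sketch that compares sequential requests (roughly what a --max_num_seqs 1 style benchmark measures) with several concurrent requests against the same endpoint, which lets vLLM's continuous batching fill the GPU; the endpoint, model name, and prompts are placeholders:

```python
# Sketch: sequential vs. concurrent request timing against one vLLM endpoint.
# Endpoint, model name, and prompts are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "<served-model-name>"  # placeholder

def one_request(prompt: str) -> float:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    start = time.perf_counter()
    requests.post(URL, json=payload, timeout=300).raise_for_status()
    return time.perf_counter() - start

prompts = [f"Describe test case {i}." for i in range(20)]

# Sequential: one request at a time, similar to a single-sequence benchmark
t0 = time.perf_counter()
for p in prompts:
    one_request(p)
sequential = time.perf_counter() - t0

# Concurrent: multiple in-flight requests that the scheduler can batch together
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(one_request, prompts))
concurrent = time.perf_counter() - t0

print(f"sequential: {sequential:.1f}s  concurrent (8 workers): {concurrent:.1f}s")
```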
Yep yep, just figured it out by looking at the configs, didn't know that -A3B stands for 3B active parameters in an MoE. Very impressive that they only need 10% of the parameters!!
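For anyone else hitting this, a quick sketch to sanity-check the MoE setup from the Hugging Face config; the num_experts / num_experts_per_tok field names are assumptions based on Qwen-style MoE configs and may differ for your checkpoint, and the checkpoint path is a placeholder:

```python
# Sanity-check the MoE routing setup from the model config. Field names are
# assumptions (common in Qwen-style MoE configs); checkpoint path is a placeholder.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("<your-30B-A3B-checkpoint>")
experts_total = getattr(cfg, "num_experts", None)
experts_active = getattr(cfg, "num_experts_per_tok", None)
print(f"experts per layer: {experts_total}, active per token: {experts_active}")

# Back-of-the-envelope check of the "A3B" naming: ~3B active out of ~30B total
print(f"active parameter fraction: {3e9 / 30e9:.0%}")  # ~10%
```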