Can 7900 XTX do batching?

I have done some small testing with vLLM and 2 7900 XTX with gemma-3-12b unquantized.
I have tried to configure it for batching, but I'm not sure it actually does that. When I send 1 request it takes 26 seconds and uses both GPUs at full power, so there is no PCIe bandwidth bottleneck or similar. But when I send 2 requests at the same time, the work that took 26 seconds now takes almost 1 minute. My understanding is that batching should make this faster, but judging by the timing it seems to process the requests one after another?

this is how I create the docker:
docker pull rocm/vllm:latest

docker run -it \
  --dns=8.8.8.8 \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --device /dev/kfd \
  --device /dev/dri \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -e ROCM_VISIBLE_DEVICES=0 \
  -e HIP_VISIBLE_DEVICES=0 \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  -e PYTORCH_TUNABLEOP_ENABLED=1 \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  -e PYTORCH_ROCM_ARCH="gfx1100" \
  -e GPU_MAX_HW_QUEUES=1 \
  -v /home/ubuntu/vllm_models:/workspace/models \
  rocm/vllm:latest bash

vllm serve /workspace/models/gemma-3-12b-it/ \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --kv-cache-dtype auto \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --max-num-seqs 64 \
  --port 8000 \
  --served-model-name vllm/gemma-3 \
  --host 0.0.0.0

vLLM does support batching (continuous batching is enabled by default), but on AMD GPUs like the 7900 XTX, batching efficiency can be limited by the backend and the model implementation. If two concurrent requests take nearly double the time of a single request, the requests are probably not being batched or parallelized effectively, possibly due to limitations in the ROCm backend or in how the model is loaded and executed. Setting --max-num-seqs 64 allows up to 64 concurrent sequences, but actual batching depends on the scheduler and backend support. Also, VLLM_USE_TRITON_FLASH_ATTN=0 disables Triton FlashAttention, which may further reduce batched throughput on RDNA3 cards.

For optimal batching and throughput, make sure both GPUs are actually visible to vLLM: the command above sets ROCM_VISIBLE_DEVICES=0 and HIP_VISIBLE_DEVICES=0, which exposes only the first GPU, while --tensor-parallel-size 2 expects two. Set HIP_VISIBLE_DEVICES=0,1 so that --tensor-parallel-size matches the number of visible GPUs. Also check that your requests are sent in a way that allows vLLM to batch them (e.g., genuinely concurrent requests with similar prompt lengths). If you still see serial processing, it may be a limitation of the current ROCm support or of the specific model backend; the vLLM ROCm installation guide has more details and troubleshooting steps. Would you like a step-by-step checklist to maximize batching on your setup?
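Concretely, for the container above that means changing only the *_VISIBLE_DEVICES lines relative to the original docker run command (a sketch, keeping every other flag as-is):

```shell
# Expose both 7900 XTX cards to the container, so that
# --tensor-parallel-size 2 actually has two devices to shard across.
-e ROCM_VISIBLE_DEVICES=0,1 \
-e HIP_VISIBLE_DEVICES=0,1 \
```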


With PCIe cards like the 7900 XTX you might find you get better throughput with pipeline parallelism instead of tensor parallelism, because less data needs to be moved between the GPUs. It's possible that the end-to-end latency will be higher, though; you'll have to experiment.

Could you share how you’re sending your parallel requests? And what you see in the vLLM logs while they’re processing?
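For reference, a pattern that actually gives the server concurrent requests (as opposed to a loop that waits for each response before sending the next) looks roughly like this. `send_request` here is a stand-in for your real HTTP call to the OpenAI-compatible endpoint:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt: str) -> str:
    """Stand-in for a real call, e.g. POST /v1/completions on the vLLM
    server. The sleep simulates server-side generation time."""
    time.sleep(0.2)
    return f"completion for: {prompt}"

prompts = ["first prompt", "second prompt"]

start = time.perf_counter()
# Both requests are in flight at the same time, which is what lets the
# server's continuous batching put them in the same batch.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    results = list(pool.map(send_request, prompts))
elapsed = time.perf_counter() - start

print(f"{len(results)} results in {elapsed:.2f}s")  # ~0.2s, not ~0.4s
```

If instead you call `send_request` in a plain for loop, the requests reach the server one at a time and there is nothing to batch, which would also produce the roughly 2x timing you saw.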

You’re correct: with PCIe-connected GPUs like the 7900 XTX, pipeline parallelism can sometimes outperform tensor parallelism due to reduced inter-GPU communication, especially when PCIe bandwidth is a bottleneck. However, pipeline parallelism may increase end-to-end latency for individual requests, so results can vary and should be benchmarked for your workload, as described in vLLM’s parallelism docs.
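If you want to try pipeline parallelism, the change is one flag on the serve command (a sketch; all other flags stay as in the original command, and --pipeline-parallel-size replaces --tensor-parallel-size):

```shell
vllm serve /workspace/models/gemma-3-12b-it/ \
  --pipeline-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --port 8000
```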

To further diagnose batching and parallelism, it’s important to know how you’re sending parallel requests (e.g., are you using async clients, multiple processes, or a load-testing tool?) and to review the vLLM server logs during request processing. The logs should indicate how many requests are running, waiting, and the GPU KV cache usage, which helps determine if requests are being batched or processed sequentially. Would you like guidance on interpreting these logs or on how to structure your parallel requests for optimal batching?
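As a sketch of what to look for: vLLM periodically prints a stats line with running/waiting request counts and KV cache usage (the exact wording varies between vLLM versions; the sample line below is illustrative). Pulling the counts out is a one-line regex:

```python
import re

# Illustrative metrics line; the exact format differs across vLLM
# versions, but the periodic stats log reports counts like these.
log_line = ("Avg generation throughput: 41.3 tokens/s, "
            "Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.1%")

match = re.search(r"Running: (\d+) reqs, Waiting: (\d+) reqs", log_line)
running, waiting = (int(match.group(1)), int(match.group(2))) if match else (0, 0)

# Running: 2 while both of your requests are in flight means they share a
# batch; Running: 1 with Waiting: 1 would mean serial processing.
print(f"running={running} waiting={waiting}")
```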
