With PCIe cards like the 7900 XTX, you might get better performance from pipeline parallelism than from tensor parallelism, because less data needs to be moved between the GPUs. The end-to-end latency may be higher, though, so you’ll have to experiment.
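To make that experiment concrete, here is a rough sketch that builds the two `vllm serve` command lines to compare. The `--tensor-parallel-size` and `--pipeline-parallel-size` flags are vLLM's own; the model name and the GPU count of 2 are placeholders I made up, so swap in your own, and whether pipeline parallelism works for your particular model and build is something you'd need to check.

```python
import shlex


def vllm_serve_cmd(model, tensor_parallel=1, pipeline_parallel=1):
    """Build a `vllm serve` argument list for a given parallelism layout."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel),
        "--pipeline-parallel-size", str(pipeline_parallel),
    ]


MODEL = "your-org/your-model"  # hypothetical placeholder, not a real model id
NUM_GPUS = 2                   # assumption: two 7900 XTX cards

# Layout A: split every layer across the GPUs (more inter-GPU traffic).
tp_cmd = vllm_serve_cmd(MODEL, tensor_parallel=NUM_GPUS)

# Layout B: put different layers on different GPUs (less inter-GPU traffic).
pp_cmd = vllm_serve_cmd(MODEL, pipeline_parallel=NUM_GPUS)

print(shlex.join(tp_cmd))
print(shlex.join(pp_cmd))
```

Run each printed command in turn with the same workload and compare throughput and latency.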
Could you share how you’re sending your parallel requests? And what you see in the vLLM logs while they’re processing?
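If it helps as a baseline to compare against, this is one minimal way to send requests concurrently and time them using only the standard library. It assumes the OpenAI-compatible server is listening on `localhost:8000` and uses a placeholder model name; `send_completion` and `run_parallel` are names I made up for the sketch, not anything from vLLM.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # assumption: default host/port
MODEL = "your-org/your-model"                 # hypothetical placeholder


def send_completion(prompt):
    """POST one completion request and return the parsed JSON response."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "max_tokens": 64}).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_parallel(prompts, send=send_completion):
    """Send all prompts at once; return (per-request latencies, wall time)."""
    def timed(prompt):
        start = time.perf_counter()
        send(prompt)
        return time.perf_counter() - start

    start = time.perf_counter()
    # One thread per prompt so every request is in flight at the same time.
    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        latencies = list(pool.map(timed, prompts))
    return latencies, time.perf_counter() - start


# Example usage (needs a running server):
# latencies, wall = run_parallel(["Hello"] * 8)
# print(f"max latency {max(latencies):.2f}s, wall time {wall:.2f}s")
```

If the wall time is close to the sum of the individual latencies rather than the maximum, the requests are probably being processed one after another instead of being batched.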