Serving capacity and compute efficiency of vLLM

machine July 16, 2025, 9:39am 1

Hi, what is the maximum QPS that the bge-m3 model can achieve when deployed with vLLM on an H20 GPU, with an input length of 4096 tokens?
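
No figure is given in this thread, and the achievable QPS generally depends on concurrency, batching, and precision settings, so measuring it on the target hardware is one way to get a number. Here is a minimal sketch of a load-test harness in Python. `measure_qps` and `fake_embed_request` are hypothetical names introduced for illustration only; the stub just sleeps and does not call vLLM, so it would need to be replaced with a real request against your own deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_qps(send_request, num_requests=64, concurrency=8):
    """Call `send_request` num_requests times using `concurrency` threads
    and return completed requests per second of wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() waits for every call to finish and re-raises worker errors.
        list(pool.map(lambda _: send_request(), range(num_requests)))
    elapsed = time.perf_counter() - start
    return num_requests / elapsed


def fake_embed_request():
    # ASSUMPTION: hypothetical stand-in for one embedding request with a
    # 4096-token input. It only sleeps; replace it with a real client call.
    time.sleep(0.01)


if __name__ == "__main__":
    # Sweep concurrency to see where throughput stops improving.
    for c in (1, 4, 16):
        qps = measure_qps(fake_embed_request, num_requests=64, concurrency=c)
        print(f"concurrency={c}: {qps:.1f} QPS")
```

Sweeping the concurrency level until QPS stops increasing gives a rough saturation point for a given setup; the numbers printed by the stub say nothing about bge-m3 or the H20.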
