vLLM and TensorRT

Hi, I was wondering whether vLLM would ever speed up its kernels by using something similar to TensorRT or another high-end inference library?

Thank you

vLLM does not currently support direct integration with TensorRT-LLM or running TensorRT-compiled *.engine files, as that binary format is specific to TensorRT and not open. However, vLLM achieves high performance through its own optimized CUDA kernels, quantization methods (e.g., GPTQ, AWQ, INT4, INT8, FP8), and features like FlashAttention, speculative decoding, and chunked prefill. These optimizations provide state-of-the-art throughput comparable to or exceeding that of other inference frameworks in many scenarios, without relying on external high-end inference libraries like TensorRT-LLM. For more, see the docs and related discussions.
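To make this concrete, here is a minimal sketch of how a user might turn on some of the optimizations mentioned above through vLLM's offline `LLM` API. The model name is just an example of an AWQ-quantized checkpoint, and the exact argument names (e.g., `enable_chunked_prefill`) can differ between vLLM versions, so treat this as illustrative rather than canonical.

```python
# Illustrative sketch only: enabling a few of vLLM's built-in optimizations.
# Argument availability depends on your vLLM version; check the docs for your release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",   # example AWQ-quantized checkpoint
    quantization="awq",                # route linear layers through vLLM's fused AWQ kernels
    enable_chunked_prefill=True,       # split long prompts into smaller prefill chunks
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```

FlashAttention-style attention kernels are selected automatically when supported on the hardware, so no extra flag is needed for that.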

Would you like more detail on vLLM’s kernel optimizations or its roadmap for future performance improvements?

@RunLLM does it use the standard torch kernels, or better kernels that have fused ops and are more optimized?

vLLM uses highly optimized custom CUDA kernels, not just standard PyTorch kernels. It includes fused and specialized kernels for key operations such as attention (PagedAttention, FlashAttention), rotary embedding, quantized linear layers (e.g., GPTQ, AWQ, INT8, FP8), and fused layernorms, all designed for throughput and efficiency beyond what standard PyTorch provides. See the docs and the kernel PRs (e.g., the rotary embedding kernel).
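To illustrate what "fused" means here, below is a small reference sketch in plain PyTorch of the computation that a fused add-and-RMSNorm kernel performs. This is not vLLM's kernel code; it is the unfused math, which in stock PyTorch launches several separate kernels (add, square, mean, rsqrt, multiply), each with its own round-trip through GPU memory, whereas a fused CUDA kernel does the same work in a single pass.

```python
# Conceptual sketch: the math a fused "add + RMSNorm" kernel computes in one pass.
# This is a reference implementation of the computation, not vLLM's actual kernel.
import torch

def fused_add_rms_norm_reference(
    x: torch.Tensor,          # hidden states, shape (num_tokens, hidden_size)
    residual: torch.Tensor,   # residual stream, same shape as x
    weight: torch.Tensor,     # learned per-channel scale, shape (hidden_size,)
    eps: float = 1e-6,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Step 1: residual add (first separate kernel in the unfused version).
    residual = residual + x
    # Step 2: RMS normalization (several more elementwise/reduction kernels).
    variance = residual.pow(2).mean(dim=-1, keepdim=True)
    normed = residual * torch.rsqrt(variance + eps)
    # Step 3: apply the learned scale; return both the output and updated residual.
    return normed * weight, residual

x = torch.randn(4, 4096)
res = torch.zeros(4, 4096)
w = torch.ones(4096)
out, new_res = fused_add_rms_norm_reference(x, res, w)
print(out.shape, new_res.shape)
```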

Would you like more detail on which specific fused kernels are used or how they compare to PyTorch?
