vLLM and TensorRT

Hi, I was wondering whether vLLM would ever speed up its kernels by using something similar to TensorRT or another high-end inference library?

Thank you

vLLM does not currently support direct integration with TensorRT-LLM or running TensorRT-compiled *.engine files, as that binary format is specific to TensorRT and not open. However, vLLM achieves high performance through its own optimized CUDA kernels, quantization methods (e.g., GPTQ, AWQ, INT4, INT8, FP8), and features like FlashAttention, speculative decoding, and chunked prefill. These optimizations provide state-of-the-art throughput comparable to or exceeding that of other inference frameworks in many scenarios, without relying on external high-end inference libraries like TensorRT-LLM. For more, see the docs and related discussions.
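To make this concrete, here is a minimal sketch of how a user might turn on some of the optimizations mentioned above through vLLM's offline `LLM` API. The model name is just an example of an AWQ-quantized checkpoint, and the exact argument names (e.g., `enable_chunked_prefill`) can differ between vLLM versions, so treat this as illustrative rather than canonical.

```python
# Illustrative sketch only: enabling a few of vLLM's built-in optimizations.
# Argument availability depends on your vLLM version; check the docs for your release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",   # example AWQ-quantized checkpoint
    quantization="awq",                # route linear layers through vLLM's fused AWQ kernels
    enable_chunked_prefill=True,       # split long prompts into smaller prefill chunks
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```

FlashAttention-style attention kernels are selected automatically when supported on the hardware, so no extra flag is needed for that.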

Would you like more detail on vLLM’s kernel optimizations or its roadmap for future performance improvements?

@RunLLM does it use the standard torch kernels, or better kernels that have fused ops and are more optimized?

vLLM uses highly optimized custom CUDA kernels, not just standard PyTorch kernels. It includes fused and specialized kernels for key operations such as attention (PagedAttention, FlashAttention), rotary embedding, quantized linear layers (e.g., GPTQ, AWQ, INT8, FP8), and fused layernorms, all designed for throughput and efficiency beyond what standard PyTorch provides. See the docs and the kernel PRs (e.g., the rotary embedding kernel).
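To illustrate what "fused" means here, below is a small reference sketch in plain PyTorch of the computation that a fused add-and-RMSNorm kernel performs. This is not vLLM's kernel code; it is the unfused math, which in stock PyTorch launches several separate kernels (add, square, mean, rsqrt, multiply), each with its own round-trip through GPU memory, whereas a fused CUDA kernel does the same work in a single pass.

```python
# Conceptual sketch: the math a fused "add + RMSNorm" kernel computes in one pass.
# This is a reference implementation of the computation, not vLLM's actual kernel.
import torch

def fused_add_rms_norm_reference(
    x: torch.Tensor,          # hidden states, shape (num_tokens, hidden_size)
    residual: torch.Tensor,   # residual stream, same shape as x
    weight: torch.Tensor,     # learned per-channel scale, shape (hidden_size,)
    eps: float = 1e-6,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Step 1: residual add (first separate kernel in the unfused version).
    residual = residual + x
    # Step 2: RMS normalization (several more elementwise/reduction kernels).
    variance = residual.pow(2).mean(dim=-1, keepdim=True)
    normed = residual * torch.rsqrt(variance + eps)
    # Step 3: apply the learned scale; return both the output and updated residual.
    return normed * weight, residual

x = torch.randn(4, 4096)
res = torch.zeros(4, 4096)
w = torch.ones(4096)
out, new_res = fused_add_rms_norm_reference(x, res, w)
print(out.shape, new_res.shape)
```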

Would you like more detail on which specific fused kernels are used or how they compare to PyTorch?
