Build issues when serving gpt-oss-20B on Tesla T4 GPUs with vLLM

Hi,

I’m trying to serve the gpt-oss-20B model using vLLM on 4 × Tesla T4 GPUs. Since this requires clustering, I’m planning to use Ray for orchestration.

However, I ran into a hardware compatibility issue: vLLM requires SM80, but Tesla T4 only supports up to SM75. To work around this, I modified the vLLM source (version 0.10.1.1) as suggested in issue #23203:

sed -i 's/return 80/return 75/' vllm/vllm/model_executor/layers/quantization/mxfp4.py && \
sed -i 's/has_device_capability(80)/has_device_capability(75)/' \
    vllm/vllm/model_executor/layers/quantization/utils/marlin_utils_fp4.py
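As a sanity check, the same sed-then-grep pattern can be exercised on a scratch file before touching the real tree (a self-contained sketch; the real paths above assume the clone layout):

```shell
# Demo of the patch-and-verify step: apply the same in-place sed edit
# to a scratch file, then grep for the patched value.
tmp=$(mktemp)
printf 'def get_min_capability(cls) -> int:\n    return 80\n' > "$tmp"
sed -i 's/return 80/return 75/' "$tmp"
grep -c 'return 75' "$tmp"   # prints 1 when the edit applied
rm -f "$tmp"
```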

Then I built a custom Docker image using this Dockerfile:

FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
WORKDIR /app
COPY ./vllm/vllm .

ENV VLLM_ATTENTION_BACKEND="XFORMERS"

RUN apt-get update && apt-get install -y \
        python3 \
        python3-pip \
        python3-dev \
        git \
        build-essential \
    && python3 -m pip install --upgrade pip \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

ENV CUDA_HOME=/usr/local/cuda
ENV PATH=$CUDA_HOME/bin:$PATH
ENV LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

RUN pip install --default-timeout=600 --no-cache-dir \
        "setuptools>=77.0.3,<80.0.0" \
        wheel \
        build \
        --prefer-binary "cmake>=3.26.1" \
        --prefer-binary "torch==2.7.1" \
        numpy \
        "setuptools-scm>=8" \
        "packaging>=24.2" \
        ninja \
        regex \
        cachetools \
        "jinja2>=3.1.6" \
        "ray[default]==2.49.0"

RUN MAX_JOBS=12 pip install --no-build-isolation --editable .

Problems I’m facing:

  1. Build time is extremely long and the final image size is very large.

  2. I’m not sure if this approach (manually patching SM80 → SM75) is the correct/recommended way. Has anyone successfully done this before?

  3. At the final stage of the build, I get the following error:

error: could not create 'vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so': No such file or directory

Any guidance on whether this is the right approach and how to resolve the build error would be greatly appreciated :folded_hands:

It seems likely that manually patching vLLM to downgrade SM80 to SM75 for Tesla T4 is not a recommended or robust solution, and may lead to build/runtime issues. The error you encountered (could not create ‘vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so’) suggests a build or path problem, possibly because the flash attention kernels or other CUDA extensions are not being built for SM75, or the build system expects SM80+ features that T4 lacks. Official vLLM builds and flash attention kernels are generally targeted at newer architectures (SM80+), and support for SM75 is not guaranteed in recent versions—see the discussion in issue #803 and installation docs.

My understanding is that the recommended approach for older GPUs like the T4 (SM75) is to use an older vLLM version and CUDA toolkit that explicitly support SM75, or to build from source with the correct CUDA arch flags (e.g., TORCH_CUDA_ARCH_LIST=7.5). However, even then, some features (like FlashAttention v2/v3) may not be available or performant. You may also want to check that your Dockerfile copies the entire vLLM source tree (not just vllm/vllm), as missing files can cause build errors. For more details, see the guidance on CUDA arch compatibility and the official Docker build instructions.

Would you like a step-by-step breakdown of how to build vLLM for SM75, or more detail on the build error?

Hint: Mention @RunLLM in the post for followups.

yes, please. Also, give me your suggested docker file to solve this problem.

@sarvin-m You have to mention @RunLLM to get a reply :slightly_smiling_face:

yes, please. Also, give me your suggested docker file to solve this problem.

@RunLLM

Step-by-step, here’s how you can build vLLM for Tesla T4 (SM75) and avoid the build/runtime issues you encountered:

1. Use an older vLLM version and CUDA 11.8:
Recent vLLM and flash attention kernels are not guaranteed to support SM75. vLLM 0.2.x with CUDA 11.8 is known to work on T4. See issue #803 and installation docs.

2. Set the correct CUDA arch flags:
When building from source, set TORCH_CUDA_ARCH_LIST=7.5 to ensure kernels are compiled for T4.

3. Dockerfile suggestion:
Here’s a Dockerfile that should work for T4, using vLLM 0.2.4 and CUDA 11.8:

FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

WORKDIR /app

RUN apt-get update && apt-get install -y \
    python3 python3-pip python3-dev git build-essential \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

ENV CUDA_HOME=/usr/local/cuda
ENV PATH=$CUDA_HOME/bin:$PATH
ENV LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
ENV TORCH_CUDA_ARCH_LIST=7.5

RUN pip3 install --upgrade pip
RUN pip3 install torch==2.1.2+cu118 torchvision==0.16.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
RUN pip3 install vllm==0.2.4

# (Optional) Install Ray if you need clustering
RUN pip3 install "ray[default]==2.9.3"

# Expose port for vLLM server
EXPOSE 8000

CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", "gpt-oss-20b", "--tensor-parallel-size", "4"]

4. Notes:

  • Do not patch SM80→SM75 in the source; use the correct arch flags and compatible versions instead.
  • If you need a newer vLLM, you must patch and build from source, but many features (like flash attention v2/v3) will not work on T4.
  • Always copy the full vLLM source tree if building from source, not just vllm/vllm.
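On note 3: the `could not create 'vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so'` error is a typical symptom of copying only the inner `vllm/vllm` package, which leaves `setup.py`, `CMakeLists.txt`, and the `csrc/` kernel sources behind. A corrected fragment might look like this (paths are assumptions based on the Dockerfile earlier in the thread):

```dockerfile
# Copy the WHOLE repository checkout (setup.py, CMakeLists.txt, csrc/, ...),
# not just the inner python package, or the CUDA extensions such as
# vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so cannot be produced.
COPY ./vllm /app/vllm
WORKDIR /app/vllm

# Compile kernels for Tesla T4 only (SM75).
ENV TORCH_CUDA_ARCH_LIST=7.5
RUN MAX_JOBS=4 pip install --no-build-isolation --editable .
```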

5. Build and run:

docker build -t vllm-t4 .
docker run --gpus all -p 8000:8000 vllm-t4

This approach avoids the build errors and is the most robust for T4. Let me know if you need more detail or want a Dockerfile for a different vLLM version.

@RunLLM I decided to serve gpt-oss-20B with FP16, so I don’t need to build vLLM from source. But I have a problem choosing a suitable backend for it. When I choose XFormers, I get the error below:
(RayWorkerWrapper pid=854, ip=10.233.116.201) ERROR 09-28 22:28:39 [worker_base.py:619] Error executing method 'load_model'. This might cause deadlock in distributed execution. TypeError: XFormersImpl.__init__() got an unexpected keyword argument 'sinks'.

Please help me to choose the best backend for this scenario.

The error TypeError: XFormersImpl.__init__() got an unexpected keyword argument 'sinks' occurs because the gpt-oss-20B model requires an attention backend that supports the “sinks” argument, which XFormers does not. For gpt-oss-20B, vLLM officially supports only newer GPUs (H100, H200, B200, MI300x, MI325x, MI355x, Radeon AI PRO R9700) with FlashAttention 3 or FlashInfer backends; XFormers and older attention backends are not compatible with the model’s requirements. There is no supported backend for the T4 (SM75) or similar GPUs for this model in vLLM, even with FP16, as per the official vLLM GPT-OSS guide and related issues.
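For what it’s worth, the failure itself is ordinary Python keyword plumbing, not a CUDA fault: the model passes a sinks keyword to a backend whose constructor never declared it. A minimal stand-in class (hypothetical, not vLLM’s real one) reproduces the shape of the error:

```shell
# Hypothetical stand-in for an attention backend that predates 'sinks';
# passing the extra keyword fails the same way XFormersImpl does.
python3 - <<'EOF'
class LegacyAttnImpl:
    def __init__(self, num_heads, head_size):
        self.num_heads, self.head_size = num_heads, head_size

try:
    LegacyAttnImpl(num_heads=8, head_size=64, sinks=[0.0])
except TypeError as err:
    print(err)  # unexpected keyword argument 'sinks'
EOF
```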

If you are not on supported hardware, there is currently no backend that will work for gpt-oss-20B in vLLM. You may need to use a different model or upgrade your hardware.

Would you like more detail on hardware requirements or alternative approaches?

Do you mean that it is absolutely impossible to serve the gpt-oss-20B model with vllm on 4 Tesla T4 GPUs?

@RunLLM

Yes, it is currently impossible to serve gpt-oss-20B with vLLM on 4 Tesla T4 GPUs. The model requires attention backends (FlashAttention 3 or FlashInfer) that are only supported on newer GPUs (H100, H200, B200, MI300x, etc.), and T4 (SM75) is not supported for these features, even with FP16. This is confirmed in the official vLLM GPT-OSS guide and multiple GitHub issues.

Would you like more detail or suggestions for alternative models?
