How to switch the pooling method of pooling models

I want to switch the pooling method to “first token hidden states” or “[CLS] hidden states”. How do I do that?

You can switch the pooling method in vLLM by setting pooling_type in the pooler configuration. For “first token hidden states,” use pooling_type="CLS" (in vLLM, CLS pooling takes the hidden state of the first token, which is the [CLS] token in models that use one). You can do this by passing --override-pooler-config '{"pooling_type": "CLS"}' on the command line, or in code via a PoolerConfig object, as in the sketch below. Note that CLS pooling simply selects the first token's hidden state, so make sure the first token really is [CLS] for your model. See the pooling models documentation and the PoolerConfig API for details.

Example command-line usage:

--override-pooler-config '{"pooling_type": "CLS"}'
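
Equivalently in Python, a minimal sketch (this assumes PoolerConfig is importable from vllm.config; the model name is just a placeholder):

from vllm import LLM
from vllm.config import PoolerConfig

# CLS pooling takes the hidden state of the first token
# ([CLS] in BERT-style models)
llm = LLM(
    model="BAAI/bge-base-en-v1.5",  # placeholder embedding model
    task="embed",
    override_pooler_config=PoolerConfig(pooling_type="CLS")
)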

Would you like more detail or a code example?

If I want to perform a classification task, what structure should my model have?

Your model should be a sequence classification model, typically with a single linear classification head on top of the base model (e.g., *ForSequenceClassification architectures). vLLM can also automatically convert compatible generative models (like *ForCausalLM) into classification models using the as_seq_cls_model adapter, as long as the classification head is a single linear layer named score in the model. See supported models and model conversion details.
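
As a quick check, you can inspect a checkpoint's structure with Transformers before handing it to vLLM (a sketch; the path is a placeholder, and the head attribute is typically classifier for encoder models and score for converted decoder models):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("path/to/your-model")
print(type(model).__name__)     # should end in ...ForSequenceClassification
print(model.config.num_labels)  # number of output classes

# The classification head should be a single linear layer:
head = getattr(model, "score", None) or getattr(model, "classifier", None)
print(head)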

Would you like more detail or a code example?

How can I change the activation function from softmax to sigmoid in a classification task?

vLLM applies softmax by default; to get sigmoid behavior, disable the built-in activation in the pooler config with --override-pooler-config '{"activation": false}' or by setting activation=False in PoolerConfig/PoolingParams. The model then returns raw logits, to which you can apply sigmoid yourself, as sketched below. See the PoolingParams API and the related issue.
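
For example, applying sigmoid to the returned logits yourself (a sketch; the logit values are made up):

import math

logits = [1.2, -0.7]  # hypothetical logits returned with activation disabled
probs = [1 / (1 + math.exp(-x)) for x in logits]  # element-wise sigmoid
print(probs)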

Would you like more detail or a code example?

Now I have a bge-m3 model with a linear head for classification. Could I use vLLM to deploy the model to do classification?

Yes, you can use vLLM to deploy a BGE-M3 model with a linear classification head for classification, provided your model follows the *ForSequenceClassification architecture or can be adapted using as_seq_cls_model. vLLM supports automatic conversion for compatible models and will extract class probabilities from the linear head. See supported models and pooling models documentation.
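
For online serving, the command looks roughly like this (the exact flags depend on your vLLM version; newer releases replace --task with --runner/--convert):

vllm serve path/to/your-bge-m3-classifier --task classify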

Would you like more detail or a code example?

And I also need to change the pooling method to “CLS” and the activation to “None”. Then please give me a code example.

You can set the pooling method to “CLS” and disable the activation (so you get logits, not probabilities) by overriding the pooler config when initializing your LLM. Here’s a code example:

from vllm import LLM

llm = LLM(
    model="your-bge-m3-model-path-or-hf-name",
    task="classify",
    # CLS pooling + no activation: the head's raw logits are returned
    override_pooler_config={"pooling_type": "CLS", "activation": False}
)

prompts = ["Example input text"]
outputs = llm.classify(prompts)
logits = outputs[0].outputs.data  # raw logits, since the activation is disabled
print("Logits:", logits)

This configures the model to use CLS pooling and disables the activation function, returning logits. See pooling models documentation.

Would you like more explanation?

I also need the model without normalization. Also, what type of model checkpoint should I provide to vLLM?

Turn normalization off by adding "normalize": False to your pooler config:

llm = LLM(
    model="your-bge-m3-model-path-or-hf-name",
    task="classify",
    override_pooler_config={"pooling_type": "CLS", "activation": False, "normalize": False}
)

For the checkpoint, provide a Hugging Face Transformers-compatible directory with config.json, model weights (e.g., pytorch_model.bin or model.safetensors), and tokenizer files. The model should be saved using save_pretrained and follow the *ForSequenceClassification architecture or be convertible as described in supported models.

Would you like more detail on checkpoint structure or saving?

Which parameter determines the categories of a classification model?

The number and names of categories (class labels) in a classification model are determined by the num_labels and id2label (or label2id) fields in the model’s Hugging Face config (config.json). See supported models and HF config usage.
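
As a sketch, you can set these fields with Transformers before saving (the label names here are invented; note this edits metadata only, since the head's size is fixed when the model is trained):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("path/to/your-model")
config.num_labels = 3
config.id2label = {0: "negative", 1: "neutral", 2: "positive"}  # hypothetical labels
config.label2id = {v: k for k, v in config.id2label.items()}
config.save_pretrained("path/to/your-model")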

Would you like more detail or a code example?

Does vLLM support inference for custom classification models?

Yes, vLLM supports inference for custom classification models as long as the model is compatible with Hugging Face Transformers (i.e., has a proper config.json, weights, and follows the *ForSequenceClassification architecture or can be adapted). You can provide a local directory or Hugging Face model path. See supported models and custom model instructions.

Would you like more detail on how to prepare or load your custom model?

Thanks. Please give me more detail.

Certainly! To use a custom classification model with vLLM, your model must be saved in the Hugging Face Transformers format. This means your model directory should contain at least:

  • config.json (with architectures set to something like YourModelForSequenceClassification, and fields like num_labels, id2label, and label2id if you want custom class names)
  • Model weights (pytorch_model.bin or model.safetensors)
  • Tokenizer files (e.g., tokenizer.json, vocab.txt, tokenizer_config.json)

How to prepare your model:

  1. If you trained your model using Hugging Face Transformers, use model.save_pretrained("your_model_dir") and tokenizer.save_pretrained("your_model_dir") (see the sketch after this list).
  2. Make sure your model class is compatible with AutoModelForSequenceClassification or can be adapted (vLLM can auto-convert some generative models).
  3. If your model is not natively supported, you can still use it by setting trust_remote_code=True and providing the path to your model directory.
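
A minimal saving sketch under these assumptions (BAAI/bge-m3 as the base and two labels; the freshly initialized head would still need fine-tuning):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Adds a randomly initialized classification head on top of the base model
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-m3", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

# ... fine-tune the model here ...

model.save_pretrained("your_model_dir")
tokenizer.save_pretrained("your_model_dir")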

How to load and run inference in vLLM:

from vllm import LLM

llm = LLM(
    model="path/to/your_model_dir",
    task="classify",  # or use runner/convert as per latest docs
    trust_remote_code=True  # if using a custom model class
)

prompts = ["Your input text here"]
outputs = llm.classify(prompts)
print(outputs[0].outputs.probs)  # Probabilities for each class

You can also override pooling/activation as needed (see previous answers).

For more, see custom model instructions and classification docs.
