I’m running vLLM serve v0.8.2 with Open WebUI. It runs well, but I noticed in the logs that vLLM keeps generating tokens even though the request is already done.
Here’s my configuration:
- vllm/vllm-openai:v0.8.2 image
- Llama-3.1-8B-Instruct
- NVIDIA RTX 5000 ADA 16GB
- running with the following options (full launch command sketched below):
--trust-remote-code
--dtype float16
-q bitsandbytes
--load-format bitsandbytes
--max-model-len 61000
--chat-template /vllm-workspace/examples/tool_chat_template_llama3.1_json.jinja
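
Putting those options together, the launch command looks roughly like this (the exact port mapping and cache mount on my host may differ slightly; the serve flags are the ones listed above):

```bash
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.8.2 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --trust-remote-code \
  --dtype float16 \
  -q bitsandbytes \
  --load-format bitsandbytes \
  --max-model-len 61000 \
  --chat-template /vllm-workspace/examples/tool_chat_template_llama3.1_json.jinja
```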
On the frontend side I’m using the open-webui package from GitHub.