vLLM output vs Ollama

Hi,
I built a RAG-based summarization application that used Ollama as the inference engine. When I needed to push it to production, we decided to use vLLM, but it failed there. Whenever I ran the application from the terminal (it’s a FastAPI app, so using uvicorn) with some params, it gave me a good enough result. But when the app was dockerised, the outputs changed drastically: what had worked started producing a lot of verbosity and repetition at the end and would not stop until the max token count was reached. Note that the vLLM container was running separately in both cases and I was inferencing against the container’s IP.

Trying the same thing with Ollama works correctly. So can someone tell me how I can replicate Ollama’s consistent, perfect outputs in vLLM? I am not passing any chat templates or anything; things are as raw as the default Docker command from the website. I am passing temp, top_p, top_k, seed, repetition penalty and frequency penalty.
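For reference, the request amounts to roughly the following (host, model name and values are placeholders, not my exact config). vLLM’s OpenAI-compatible server takes the standard sampling parameters directly and the vLLM-specific ones (top_k, repetition_penalty) via extra_body:

```python
from openai import OpenAI

# vLLM is reached via the container's IP; "EMPTY" works as the key when no auth is configured.
client = OpenAI(base_url="http://<vllm-container-ip>:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="your-org/your-instruct-model",      # placeholder
    prompt="Summarize the retrieved context: ...",
    max_tokens=512,
    temperature=0.2,
    top_p=0.9,
    seed=42,
    frequency_penalty=0.5,
    # vLLM-only sampling params go through extra_body
    extra_body={"top_k": 40, "repetition_penalty": 1.1},
)
print(resp.choices[0].text)
```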

Can you shove Ollama into your production container and see how it responds via requests to the container IP? Maybe it breaks too in that container on production hardware, which is surely different from your dev hardware.

I believe something is wrong with your special token handling. Check your EOS token handling.
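One quick check (model name here is just a placeholder) is to look at which EOS token and chat template the model’s tokenizer defines; if vLLM isn’t stopping on that token, generation will run until max_tokens:

```python
from transformers import AutoTokenizer

# Placeholder model name; use the same model id the vLLM server was launched with.
tok = AutoTokenizer.from_pretrained("your-org/your-instruct-model")

print("EOS token:", tok.eos_token, "id:", tok.eos_token_id)
# For instruct models, the chat template shows which end-of-turn marker the model expects.
print(tok.chat_template)
```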

Yeah, I can try that. Docker is what seems to break things mostly, though using Ollama and accessing it from the production env does work. I will try it from inside a Docker container.

Oooh! EOS token was somewhat my next thought, conceptually, except I did not know the term “EOS token” :rofl:

Good content followed by repetition/mess sounds a lot like unwanted padding. I was not sure whether it might be the container, the API client, or maybe even an IPv6 vs IPv4 issue. Figured the easiest thing to try first would be running Ollama inside the container, to rule out the container and accessing-the-container as potential sources of the problem and narrow the focus to something in vLLM.

Even easier (well, less work than installing Ollama in the container) would be to attach to the container and try vLLM from inside it using curl. If responses are still good-then-bad, that would prove the problem is something with vLLM or its config, and you wouldn’t have to set up Ollama inside the container.
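The same check in Python, since the app is Python anyway (IP, model and prompt are placeholders), would be something like:

```python
import requests

resp = requests.post(
    "http://<vllm-container-ip>:8000/v1/completions",   # placeholder IP/port
    json={
        "model": "your-org/your-instruct-model",         # placeholder
        "prompt": "Summarize the following text: ...",
        "max_tokens": 512,
        "temperature": 0.2,
    },
    timeout=120,
)
choice = resp.json()["choices"][0]
# "length" means it ran all the way to max_tokens instead of stopping cleanly.
print(choice["finish_reason"])
print(choice["text"][:500])
```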

Hey @artist-genai, did you solve the problem? I am (new and) curious about problems and their solutions. Would love to hear about the solution.

Yeah, found one. The code I received used Ollama because they were testing it locally, and without a second thought I changed it to use VLLMOpenAI from langchain_community.llms.vllm. That class resolves the URL to the /v1/completions API. The model I used was an instruct type, so it would answer my question but then continue generating gibberish, and the cause was this API resolution. I understood this by hitting the API with the same payload in Postman and checking what the stop reason was. To hit the chat/completions API, you can either use ChatOpenAI from langchain_openai or just use the base openai library, which gives you all the control over how you create the LLM and the requests to it.
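A minimal sketch of the fix with the base openai library, assuming the vLLM server is unchanged (host and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://<vllm-container-ip>:8000/v1", api_key="EMPTY")

# chat.completions hits /v1/chat/completions, so vLLM applies the model's chat
# template and the instruct model stops at its end-of-turn token.
resp = client.chat.completions.create(
    model="your-org/your-instruct-model",   # placeholder
    messages=[{"role": "user", "content": "Summarize the retrieved context: ..."}],
    max_tokens=512,
    temperature=0.2,
)
print(resp.choices[0].finish_reason)        # should now be "stop" rather than "length"
print(resp.choices[0].message.content)
```

The LangChain route is the same idea: ChatOpenAI(base_url="http://<vllm-container-ip>:8000/v1", api_key="EMPTY", model="...") from langchain_openai also targets chat/completions.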

Hardware and networking were never the issue; I tried Ollama both in containers and normally.


Aha - thank you! “Use the base openai library” is now on my to-do list.

I’ll hazard a 30,000 ft guess that, to fix it, you must have tweaked the app’s API request somewhat to work with an Instruct model via /v1/completions, which must expect some value(s) to be supplied by the API caller, while the old API URL defaulted to the correctly-working value(s) unless overridden by the caller?