vLLM output vs Ollama

Hi,
I built a RAG-based summarization application that used Ollama as the inference engine. When I needed to push it to production, we decided to use vLLM, but it failed there. Whenever I ran the application from the terminal (it’s a FastAPI app, so using uvicorn) with some params, it gave me a good enough result. But when the app was dockerised, the outputs changed drastically: what had worked started producing a lot of verbosity and repetition at the end and would not stop until the max token count was reached. Note that the vLLM container was running separately in both cases and I was inferencing against the container’s IP.

Trying the same thing with Ollama works correctly. So can someone tell me how I can replicate Ollama’s consistent, perfect outputs in vLLM? I am not passing any chat templates or anything; things are as raw as the default Docker command from the website. I am passing temp, top_p, top_k, seed, repetition penalty and frequency penalty.
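For reference, the request amounts to roughly the following (host, model name and values are placeholders, not my exact config). vLLM’s OpenAI-compatible server takes the standard sampling parameters directly and the vLLM-specific ones (top_k, repetition_penalty) via extra_body:

```python
from openai import OpenAI

# vLLM is reached via the container's IP; "EMPTY" works as the key when no auth is configured.
client = OpenAI(base_url="http://<vllm-container-ip>:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="your-org/your-instruct-model",      # placeholder
    prompt="Summarize the retrieved context: ...",
    max_tokens=512,
    temperature=0.2,
    top_p=0.9,
    seed=42,
    frequency_penalty=0.5,
    # vLLM-only sampling params go through extra_body
    extra_body={"top_k": 40, "repetition_penalty": 1.1},
)
print(resp.choices[0].text)
```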

Can you shove Ollama into your production container and see how it responds via requests to the container IP? Maybe it breaks too in that container on production hardware, which is surely different from your dev hardware.

I believe something is wrong with your special token handling. Check your EOS token handling.
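One quick check (model name here is just a placeholder) is to look at which EOS token and chat template the model’s tokenizer defines; if vLLM isn’t stopping on that token, generation will run until max_tokens:

```python
from transformers import AutoTokenizer

# Placeholder model name; use the same model id the vLLM server was launched with.
tok = AutoTokenizer.from_pretrained("your-org/your-instruct-model")

print("EOS token:", tok.eos_token, "id:", tok.eos_token_id)
# For instruct models, the chat template shows which end-of-turn marker the model expects.
print(tok.chat_template)
```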

Yeah, I can try that. Docker is what seems to break things mostly, though using Ollama and accessing it from the production env does work. I will try it from inside a Docker container.

Oooh! EOS token was somewhat my next thought, conceptually, except I did not know the term “EOS token” :rofl:

Good content followed by repetition/mess sounds a lot like unwanted padding. I was not sure whether it might be the container, the API client, or maybe even an IPv6 vs IPv4 issue. Figured the easiest thing to try first would be running Ollama inside the container, to rule out the container and accessing-the-container as potential sources of the problem and narrow the focus to something in vLLM.

Even easier (well, less work than installing Ollama in the container) would be to attach to the container and try vLLM from inside it using curl. If responses are still good-then-bad, that would prove the problem is something with vLLM or its config, and you wouldn’t have to set up Ollama inside the container.
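The same check in Python, since the app is Python anyway (IP, model and prompt are placeholders), would be something like:

```python
import requests

resp = requests.post(
    "http://<vllm-container-ip>:8000/v1/completions",   # placeholder IP/port
    json={
        "model": "your-org/your-instruct-model",         # placeholder
        "prompt": "Summarize the following text: ...",
        "max_tokens": 512,
        "temperature": 0.2,
    },
    timeout=120,
)
choice = resp.json()["choices"][0]
# "length" means it ran all the way to max_tokens instead of stopping cleanly.
print(choice["finish_reason"])
print(choice["text"][:500])
```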

Hey @artist-genai, did you solve the problem? I am (new and) curious about problems and their solutions. Would love to hear about the solution.

Yeah, found one. The code I received used Ollama because they were testing it locally, and without a second thought I changed it to use VLLMOpenAI from langchain_community.llms.vllm. That class resolves the URL to the /v1/completions API. The model I used was an instruct type, so it would answer my question but then continue generating gibberish, and the cause was this API resolution. I understood this by hitting the API with the same payload in Postman and checking what the stop reason was. To hit the chat/completions API, you can either use ChatOpenAI from langchain_openai or just use the base openai library, which gives you all the control over how you create the LLM and the requests to it.
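A minimal sketch of the fix with the base openai library, assuming the vLLM server is unchanged (host and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://<vllm-container-ip>:8000/v1", api_key="EMPTY")

# chat.completions hits /v1/chat/completions, so vLLM applies the model's chat
# template and the instruct model stops at its end-of-turn token.
resp = client.chat.completions.create(
    model="your-org/your-instruct-model",   # placeholder
    messages=[{"role": "user", "content": "Summarize the retrieved context: ..."}],
    max_tokens=512,
    temperature=0.2,
)
print(resp.choices[0].finish_reason)        # should now be "stop" rather than "length"
print(resp.choices[0].message.content)
```

The LangChain route is the same idea: ChatOpenAI(base_url="http://<vllm-container-ip>:8000/v1", api_key="EMPTY", model="...") from langchain_openai also targets chat/completions.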

Hardware and networking were never the issue; I tried Ollama both in containers and normally.


Aha - thank you! “Use the base openai library” is now on my to-do list.

I’ll hazard a 30,000 ft guess that, to fix it, you must have tweaked the app’s API request somewhat to work with an Instruct model via /v1/completions, which must expect some value(s) to be supplied by the API caller, while the old API URL defaulted to the correctly-working value(s) unless overridden by the caller?