Your observed 500 ms network latency is plausibly typical for India ↔ US East traffic over the public internet, where intercontinental round-trip times often fall in the 300–500 ms range. Switching from HTTP to WebSockets is unlikely to reduce raw network latency much, because physical distance and routing are the main contributors. WebSockets can cut per-message connection overhead when you send many small or streaming messages, but they cannot get around the speed-of-light constraint. The vLLM documentation describes a standard HTTP API; the provided sources do not mention WebSocket support or its effect on latency.
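To make the distinction between connection overhead and propagation delay concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the function name and the example numbers (250 ms round trip, 50 ms server time) are illustrative assumptions, not measurements from your setup, and it ignores DNS lookups, bandwidth, and packet loss. It assumes a new connection costs one round trip for the TCP handshake plus one for a full TLS 1.3 handshake.

```python
def request_latency_ms(rtt_ms: float, server_ms: float,
                       new_connection: bool, tls: bool = True) -> float:
    """Rough time to receive a response for one request, in milliseconds."""
    setup_round_trips = 0
    if new_connection:
        # TCP handshake: one round trip. Full TLS 1.3 handshake: one more.
        setup_round_trips = 1 + (1 if tls else 0)
    # The request/response exchange itself always costs one round trip.
    return (setup_round_trips + 1) * rtt_ms + server_ms


# Illustrative (assumed) numbers: 250 ms cross-continent RTT, 50 ms server time.
fresh = request_latency_ms(250, 50, new_connection=True)      # new connection per request
reused = request_latency_ms(250, 50, new_connection=False)    # keep-alive or WebSocket
colocated = request_latency_ms(2, 50, new_connection=False)   # same region, assumed 2 ms RTT
print(fresh, reused, colocated)
```

Under these assumptions, reusing a connection (which is what both HTTP keep-alive and a WebSocket give you) removes the setup round trips, but at least one full round trip remains on every request. Only shortening the path, for example by colocating, shrinks that term.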
Colocating your FastAPI backend and vLLM server in the same region or VPC would likely remove most of the network portion of your total response time and bring you closer to your <300 ms goal. vLLM itself does not provide built-in networking optimizations. General techniques such as persistent HTTP connections, HTTP/2, and Anycast routing can reduce overhead, but they cannot overcome the underlying physical latency. The data sources do not mention successful cross-continent LLM inference optimizations or vLLM-specific networking tricks; colocating services, however, is a widely recommended way to minimize latency in distributed systems. Would you like more detail on any specific aspect, such as persistent connections or deployment architectures?
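On the persistent-connection option, here is a minimal sketch using only Python's standard library. The `KeepAliveClient` class is a hypothetical helper, not part of vLLM or FastAPI. It keeps one `http.client.HTTPConnection` open so that repeated requests skip the TCP handshake. For an HTTPS endpoint you would swap in `http.client.HTTPSConnection`, and in an async FastAPI app you would more likely share a single async HTTP client across requests; the idea is the same.

```python
import http.client
import json


class KeepAliveClient:
    """Hypothetical helper: reuses one TCP connection for many JSON POSTs."""

    def __init__(self, host: str, port: int, timeout: float = 10.0):
        # The socket is opened lazily on the first request and then kept open.
        self._conn = http.client.HTTPConnection(host, port, timeout=timeout)

    def post_json(self, path: str, payload: dict) -> dict:
        body = json.dumps(payload)
        headers = {"Content-Type": "application/json"}
        self._conn.request("POST", path, body=body, headers=headers)
        response = self._conn.getresponse()
        # Read the whole body so the connection is free for the next request.
        data = response.read()
        return json.loads(data)

    def close(self) -> None:
        self._conn.close()


# Example usage (host and model name are placeholders, assumed for illustration):
# client = KeepAliveClient("vllm.example.internal", 8000)
# result = client.post_json(
#     "/v1/completions",
#     {"model": "my-model", "prompt": "Hello", "max_tokens": 16},
# )
# client.close()
```

Creating the client once at application startup and reusing it for every request is what saves the handshake; constructing a new client per request would give up the benefit.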