How do I successfully load a model across multiple GPUs ("multi-card") in vLLM? As a beginner, how can I learn about and understand this? Is there a recommended blog or other study resource?
You can try the examples in vllm/examples/offline_inference/basic at main · vllm-project/vllm · GitHub.
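Roughly, the basic example in that folder boils down to something like this (a simplified sketch; the prompts and model name here are just placeholders, the real script has its own defaults):

```python
from vllm import LLM, SamplingParams

# Placeholder prompts and a small model that fits on a single GPU.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```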
Thanks, I’ll give it a try
Is this it, from README.md? "Some of these models are likely to be too large for a single GPU. You can split them across multiple GPUs by setting --tensor-parallel-size to the number of required GPUs."
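Yes, that's the one. In the offline Python API the same option is the tensor_parallel_size argument to LLM. A minimal sketch (the model name is only an example of something that usually needs more than one GPU; use one you actually have weights for):

```python
from vllm import LLM, SamplingParams

# Sketch: shard a larger model across 2 GPUs with tensor parallelism.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    tensor_parallel_size=2,  # number of GPUs to split the weights across
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

When running the OpenAI-compatible server it is the same idea, just passed as --tensor-parallel-size 2 on the command line.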
(I take “multi-card” to mean “2 or more GPUs” inside one computer)
This is probably something different, but just yesterday I found Accelerate.
I think Accelerate is for assembling different computers into one big virtual resource for PyTorch to consume. Is that right, just very generally speaking?
Well, that’s pretty much what it means.
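In case a concrete picture helps: the usual Accelerate pattern looks roughly like this (a sketch based on the Hugging Face Accelerate docs, training-oriented and not specific to vLLM; the model, optimizer, and data here are stand-ins):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up GPUs / machines from the `accelerate launch` config

model = torch.nn.Linear(512, 2)                                      # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 512), batch_size=8)

# prepare() moves everything to the right devices and wraps it for distributed use.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()  # dummy loss, just to show the API
    accelerator.backward(loss)         # replaces loss.backward()
    optimizer.step()
```

You then start it with `accelerate launch script.py`, and the same script can run on one GPU, several GPUs, or several machines depending on the launch configuration.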
For the multi-GPU question, I've recently been looking at vllm/vllm/distributed/parallel_state.py to learn about distribution, specifically tensor parallelism and pipeline parallelism, which I think is closer to how the distributed execution actually works.
At present I am looking at how to do pipeline parallelism in vLLM, i.e. splitting the model and placing the stages on different GPUs; my remaining question is how to change the partitioning strategy of the pipeline.
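For reference, turning pipeline parallelism on from the user-facing API looks roughly like this. A minimal sketch, assuming a vLLM version where the offline LLM class accepts pipeline_parallel_size (in older versions pipeline parallelism was only wired up for the online server via --pipeline-parallel-size), and the model name is just an example:

```python
from vllm import LLM, SamplingParams

# Sketch: 4 GPUs = 2 pipeline stages x 2-way tensor parallelism.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    pipeline_parallel_size=2,  # layers are split into 2 consecutive stages
    tensor_parallel_size=2,    # each stage's weights are sharded across 2 GPUs
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

As far as I can tell, the actual layer-to-stage split is computed in vllm/distributed/utils.py (get_pp_indices), which parallel_state.py builds on; if I remember right there is also an environment variable (VLLM_PP_LAYER_PARTITION, a comma-separated list of layer counts per stage) to override the default even split, but double-check vllm/envs.py in your version.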
Ah, thank you. I guess you are an advanced beginner.
I am still only a diapers-level beginner.