When processing an image of a white kitten in a field, the output is drastically cut off. How can this be? The full console output is below:
/home/bdck/PROJECTS_WSL/VLM/venv_vlm/bin/python /home/bdck/PROJECTS_WSL/VLM/VLM_redoUsingVllm.py
INFO 05-27 15:28:28 [__init__.py:239] Automatically detected platform cuda.
INFO 05-27 15:28:35 [config.py:600] This model supports multiple tasks: {'classify', 'embed', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 05-27 15:28:35 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 05-27 15:28:35 [cuda.py:96] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 05-27 15:28:36 [utils.py:2273] We must use the spawn multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See Troubleshooting - vLLM for more information. Reason: CUDA is initialized
INFO 05-27 15:28:39 [__init__.py:239] Automatically detected platform cuda.
INFO 05-27 15:28:40 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='OpenGVLab/InternVL3-2B', speculative_config=None, tokenizer='OpenGVLab/InternVL3-2B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=OpenGVLab/InternVL3-2B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
2025-05-27 15:28:40,701 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
WARNING 05-27 15:28:40 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f07415a1990>
INFO 05-27 15:28:41 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
WARNING 05-27 15:28:41 [interface.py:304] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 05-27 15:28:41 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 05-27 15:28:41 [gpu_model_runner.py:1258] Starting to load model OpenGVLab/InternVL3-2B…
INFO 05-27 15:28:41 [config.py:3334] cudagraph sizes specified by model runner is overridden by config
INFO 05-27 15:28:42 [topk_topp_sampler.py:44] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling.
INFO 05-27 15:28:42 [weight_utils.py:265] Using model weights format ['*.safetensors']
INFO 05-27 15:28:42 [weight_utils.py:315] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.70it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.70it/s]
INFO 05-27 15:28:43 [loader.py:447] Loading weights took 0.62 seconds
INFO 05-27 15:28:43 [gpu_model_runner.py:1273] Model loading took 3.9110 GiB and 1.253380 seconds
INFO 05-27 15:28:43 [gpu_model_runner.py:1542] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 image items of the maximum feature size.
INFO 05-27 15:28:45 [kv_cache_utils.py:578] GPU KV cache size: 18,528 tokens
INFO 05-27 15:28:45 [kv_cache_utils.py:581] Maximum concurrency for 512 tokens per request: 36.19x
INFO 05-27 15:28:45 [core.py:162] init engine (profile, create kv cache, warmup model) took 2.03 seconds
The content of this image is in English. It shows a picture of a small
[rank0]:[W527 15:28:46.848639031 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see Distributed communication package - torch.distributed — PyTorch 2.7 documentation (function operator())
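For reference, here is a minimal sketch of the kind of call that produces a log like this, assuming the script follows vLLM's standard offline multimodal API (the actual VLM_redoUsingVllm.py is not shown; the image path and prompt below are placeholders). The two settings that bound output length are `max_model_len` on the engine (512 here, per `max_seq_len=512` in the log) and `max_tokens` in `SamplingParams`, which defaults to only 16 tokens when left unset:

```python
# Hypothetical reconstruction of the setup, not the actual VLM_redoUsingVllm.py.
from vllm import LLM, SamplingParams
from PIL import Image

# Engine config mirroring the values visible in the log above.
llm = LLM(
    model="OpenGVLab/InternVL3-2B",
    max_model_len=512,      # shows up as max_seq_len=512 in the engine config
    enforce_eager=True,
)

image = Image.open("white_kitten.jpg")      # placeholder file name
prompt = "<image>\nDescribe this picture."  # placeholder; the real script may apply the model's chat template

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=256,  # if this is omitted, vLLM's default of 16 tokens cuts generation short
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params,
)
print(outputs[0].outputs[0].text)
```

Note that the truncated reply in the log ("The content of this image is in English. It shows a picture of a small") is roughly 16 tokens long, which is consistent with the `SamplingParams` default; the 512-token `max_model_len` is the other ceiling to check, since the image placeholder tokens count against it.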