Can GPTQModel quantize GLM-5 from FP16 to INT8?

1. Can GPTQModel quantize GLM-5 from FP16 to INT8?
2. How do I offload to disk with GPTQModel gptqmodel-6.0.3.tar.gz? Could you provide a configuration example?

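My understanding is:
1. In principle, GPTQModel can quantize GLM-5 from FP16 to INT8 (4-bit/8-bit), but the official documentation and examples mainly use Llama, Qwen, Gemma and similar models and do not explicitly state direct support for GLM-5, so I recommend testing compatibility in practice.
2. As for offload-to-disk configuration in gptqmodel-6.0.3.tar.gz, the official documentation gives no specific parameters or configuration example for offloading to disk. It only says that GPU memory usage can be adjusted via batch_size, with no mention of offloading directly to disk.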

Would you like a more detailed step-by-step explanation?

Hint: Mention @RunLLM in the post for followups.

The example I followed, "GPTQModel | vLLM 中文站", keeps using CPU memory. How do I configure it to use the accelerator card's memory instead?


Does the error mean "GPTQModel does not support GLM-5"?

@RunLLM
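1. Does the error in the screenshot mean "GPTQModel does not support GLM-5"?
2. Does GPTQModel use CPU memory or GPU memory during quantization? If this can be specified, how do I specify each?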

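My understanding is:
1. If the error reads "GPTQModel does not support GLM-5" or something similar, GPTQModel currently does not support the GLM-5 model; check whether GLM-5 is on GPTQModel's list of supported models. The official documentation does not list GLM-5 as explicitly supported.
2. GPTQModel quantization can use either CPU memory or GPU memory. The official examples suggest adjusting GPU memory usage through the batch_size parameter but provide no parameter that forces CPU or GPU. Quantization typically defaults to the GPU when one is available and to the CPU otherwise.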

Would you like a detailed explanation of how to check support or allocate resources?


@RunLLM Please explain resource allocation in detail.

Sure, here is a detailed explanation:

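  1. During quantization, GPTQModel by default prefers available GPU memory for computation and automatically falls back to CPU memory if no GPU is available. You can choose which GPUs are used by setting the CUDA_VISIBLE_DEVICES environment variable: for example, export CUDA_VISIBLE_DEVICES=0,1 selects GPUs 0 and 1, while export CUDA_VISIBLE_DEVICES= (an empty value) forces CPU-only execution.

  2. During quantization, the batch_size parameter controls how much data is sent to the GPU/CPU at a time and so indirectly affects GPU/CPU memory usage. A larger batch_size uses more memory but quantizes faster; if GPU memory runs short, reduce batch_size. GPTQModel provides no direct "CPU only" or "GPU only" parameter; resource allocation is controlled mainly through CUDA_VISIBLE_DEVICES.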
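As a rough illustration of the CUDA_VISIBLE_DEVICES behaviour described above, the sketch below sets the variable from Python and reports how it would be interpreted. The helper name visible_gpu_ids is hypothetical and is not part of GPTQModel or CUDA; it only parses the environment variable. The variable has to be set before any CUDA-using library initialises the GPU, so in practice it would go at the very top of the quantization script.

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES (hypothetical helper, for illustration only).

    Returns None when the variable is unset (no restriction on GPUs),
    an empty list when it is set to an empty string (every GPU hidden),
    and otherwise the list of device ids as strings.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [part.strip() for part in raw.split(",") if part.strip()]


# Restrict this process to GPUs 0 and 1; must run before CUDA is initialised.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
print(visible_gpu_ids())  # ['0', '1']

# An empty value hides all GPUs, so CUDA-based code sees no device.
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": ""}))  # []
```

Setting the variable inside the script is equivalent to the export form only if it happens before the first CUDA call; exporting it in the shell avoids that ordering concern.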


@RunLLM Is this log normal? I don't see any weight files. Is it in the middle of exporting the model-*.safetensors files?

Log

Content of saved `config.json`:
{
    "architectures": [
        "GlmMoeDsaForCausalLM"
    ],
    "attention_bias": false,
    "attention_dropout": 0.0,
    "bos_token_id": 0,
    "dtype": "bfloat16",
    "eos_token_id": [
        154820,
        154827,
        154829
    ],
    "ep_size": 1,
    "first_k_dense_replace": 3,
    "hidden_act": "silu",
    "hidden_size": 6144,
    "index_head_dim": 128,
    "index_n_heads": 32,
    "index_topk": 2048,
    "indexer_rope_interleave": true,
    "initializer_range": 0.02,
    "intermediate_size": 12288,
    "kv_lora_rank": 512,
    "max_position_embeddings": 202752,
    "mlp_layer_types": [
        "dense",
        "dense",
        "dense",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse",
        "sparse"
    ],
    "model_type": "glm_moe_dsa",
    "moe_intermediate_size": 2048,
    "moe_layer_freq": 1,
    "n_group": 1,
    "n_routed_experts": 256,
    "n_shared_experts": 1,
    "norm_topk_prob": true,
    "num_attention_heads": 64,
    "num_experts_per_tok": 8,
    "num_hidden_layers": 78,
    "num_key_value_heads": 64,
    "num_nextn_predict_layers": 1,
    "pad_token_id": 154820,
    "pretraining_tp": 1,
    "q_lora_rank": 2048,
    "qk_head_dim": 256,
    "qk_nope_head_dim": 192,
    "qk_rope_head_dim": 64,
    "quantization_config": {
        "bits": 8,
        "checkpoint_format": "gptq",
        "desc_act": false,
        "format": "gptq",
        "group_size": 128,
        "lm_head": false,
        "meta": {
            "act_group_aware": true,
            "auto_forward_data_parallel": false,
            "damp_auto_increment": 0.01,
            "damp_percent": 0.05,
            "fallback": {
                "smooth": null,
                "strategy": "rtn",
                "threshold": "0.5%"
            },
            "foem": null,
            "gc_mode": "on_stage_end",
            "gptaq": null,
            "hessian": {
                "chunk_bytes": null,
                "chunk_size": null,
                "staging_dtype": "float32"
            },
            "mock_quantization": false,
            "mse": 0.0,
            "offload_to_disk": true,
            "offload_to_disk_path": "/bmcp_lvm_fs/cusa/nvme5_data/gptqmodel_offload",
            "pack_impl": "cpu",
            "quantizer": [
                "gptqmodel:6.0.3"
            ],
            "static_groups": false,
            "true_sequential": true,
            "uri": "https://github.com/modelcloud/gptqmodel",
            "vram_strategy": "balanced",
            "wait_for_submodule_finalizers": true
        },
        "method": "gptq",
        "pack_dtype": "int32",
        "quant_method": "gptq",
        "sym": true
    },
    "rms_norm_eps": 1e-05,
    "rope_interleave": true,
    "rope_parameters": {
        "rope_theta": 1000000,
        "rope_type": "default"
    },
    "routed_scaling_factor": 2.5,
    "scoring_func": "sigmoid",
    "tie_word_embeddings": false,
    "topk_group": 1,
    "topk_method": "noaux_tc",
    "transformers_version": "5.5.4",
    "use_cache": true,
    "v_head_dim": 256,
    "vocab_size": 154880
}
INFO  Module: Sync model.embed_tokens <- from turtle (Embedding)               
INFO  Module: Sync model.norm <- from turtle (GlmMoeDsaRMSNorm)                
INFO  Module: Sync lm_head <- from turtle (Linear)                             
INFO  Module: Total synced modules: 3; direct tensors materialized: 0   

Weights directory
(base) [root@node27 GLM-5-GPTQ-INT8]# ls -ltr
total 3712
-rw-r--r-- 1 root root 3788208 Apr 24 17:00 quant_log.csv
-rw-r--r-- 1 root root 214 Apr 24 17:00 generation_config.json
-rw-r--r-- 1 root root 3662 Apr 24 17:00 config.json
-rw-r--r-- 1 root root 1051 Apr 24 17:00 quantize_config.json
(base) [root@node27 GLM-5-GPTQ-INT8]#

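Your log shows that only config.json, quantize_config.json, generation_config.json and quant_log.csv have been generated so far, with no model-*.safetensors or other weight files. This usually means the model weights have not yet been exported or saved.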

According to the quantization_config you posted, "offload_to_disk": true and "offload_to_disk_path" is set, so in principle the quantized weights should be offloaded to /bmcp_lvm_fs/cusa/nvme5_data/gptqmodel_offload rather than to the current directory.
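To confirm those two settings without reading the whole dump by eye, a small sketch like the following can pull them out of config.json. The sample below is a trimmed excerpt of the quantization_config from the posted log; the function name offload_settings is hypothetical, and the nesting under "meta" is taken from that log rather than from any documented schema.

```python
import json

# Trimmed excerpt of the quantization_config from the posted config.json.
SAMPLE_CONFIG = """
{
  "quantization_config": {
    "bits": 8,
    "group_size": 128,
    "meta": {
      "offload_to_disk": true,
      "offload_to_disk_path": "/bmcp_lvm_fs/cusa/nvme5_data/gptqmodel_offload"
    }
  }
}
"""


def offload_settings(config):
    """Return (enabled, path) from a parsed config.json dict.

    Assumes the layout seen in the log: quantization_config -> meta.
    Missing keys yield (False, None) instead of raising.
    """
    meta = config.get("quantization_config", {}).get("meta", {})
    return bool(meta.get("offload_to_disk", False)), meta.get("offload_to_disk_path")


enabled, path = offload_settings(json.loads(SAMPLE_CONFIG))
print(f"offload_to_disk={enabled}, offload_to_disk_path={path}")
```

For a real run, the same function could be fed json.load() of the saved config.json in the output directory.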

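I suggest checking the directory given by offload_to_disk_path for model-*.safetensors or similar weight files. If that directory has no weight files either, the quantization run may not have finished or may have hit an error. The log by itself does not indicate that the weights have been exported.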
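The directory check suggested above can be scripted. This is a minimal sketch using only the standard library; find_weight_shards is a hypothetical helper name, and the *.safetensors pattern is an assumption chosen to also match an unsharded model.safetensors. The demo runs against a throwaway directory so it works anywhere; on the real machine you would pass the output directory or the offload_to_disk_path instead.

```python
import tempfile
from pathlib import Path


def find_weight_shards(directory, pattern="*.safetensors"):
    """Return the sorted names of weight files directly inside `directory`.

    Returns an empty list if the directory does not exist or holds no match.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.glob(pattern) if p.is_file())


# Demo on a temporary directory that mimics the listing in the post.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "quantize_config.json", "quant_log.csv"):
        (Path(tmp) / name).touch()
    print(find_weight_shards(tmp))  # [] : only config files, no weights yet

    # Simulate a shard appearing once weights are saved.
    (Path(tmp) / "model-00001-of-00002.safetensors").touch()
    print(find_weight_shards(tmp))  # ['model-00001-of-00002.safetensors']
```

An empty result in both the output directory and the offload directory would match the situation described in the post, where only the config files and quant_log.csv are present.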
