1. Can GPTQModel quantize GLM-5 from FP16 to INT8?
2. How do I configure disk offload with GPTQModel (gptqmodel-6.0.3.tar.gz)? Could you provide a configuration example?
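My understanding is:
1. In principle, GPTQModel can quantize GLM-5 from FP16 to INT8, since it supports both 4-bit and 8-bit quantization. However, the official documentation and examples mainly cover Llama, Qwen, Gemma, and similar models and do not explicitly state direct support for GLM-5, so compatibility should be tested in practice.
2. As for disk offload in gptqmodel-6.0.3.tar.gz, the official documentation gives no specific parameters or configuration example for it. It only mentions that GPU memory usage can be adjusted through batch_size, with no description of offloading directly to disk.
Would you like a more detailed step-by-step explanation?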
Hint: Mention @RunLLM in the post for followups.
The example I followed (GPTQModel | vLLM 中文站) keeps using CPU memory. How do I configure it to use the accelerator card's device memory instead?
@RunLLM
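1. Does the error in the screenshot mean "GPTQModel does not support GLM-5"?
2. Does GPTQModel use CPU memory or GPU memory during quantization? If this can be specified, how do I select each one?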
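My understanding is:
1. If the error says "GPTQModel does not support GLM-5" or something similar, GPTQModel currently does not support the GLM-5 model, and you should check whether GLM-5 appears in GPTQModel's list of supported models. The official documentation does not list GLM-5 as explicitly supported.
2. GPTQModel quantization can use either CPU memory or GPU memory. The official examples suggest adjusting memory usage through the batch_size parameter but provide no parameter that forces CPU or GPU. Typically, quantization uses the GPU by default when one is available and falls back to the CPU otherwise.
Would you like a detailed explanation of how to check model support or how resources are allocated?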
@RunLLM Please explain resource allocation in detail.
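Sure, here is a detailed explanation:
- During quantization, GPTQModel by default prefers available GPU memory for computation. If no GPU is available, it automatically falls back to CPU memory. You can set the CUDA_VISIBLE_DEVICES environment variable to choose which GPUs are used: for example, export CUDA_VISIBLE_DEVICES=0,1 selects GPU 0 and GPU 1, while export CUDA_VISIBLE_DEVICES= (an empty value) hides all GPUs and forces CPU-only execution.
- The batch_size parameter controls how much data is sent to the GPU/CPU at a time, which indirectly affects GPU/CPU memory usage. A larger batch_size uses more memory but quantizes faster. If GPU memory is insufficient, reduce batch_size. GPTQModel does not provide a parameter that directly selects "CPU only" or "GPU only"; resource allocation is controlled mainly through CUDA_VISIBLE_DEVICES.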
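As a minimal sketch of the environment-variable approach described above, assuming a CUDA-based setup: the helper name select_devices is hypothetical and is not part of GPTQModel. The variable generally needs to be set before any CUDA-using library (such as torch or gptqmodel) is imported in the same process.

```python
import os


def select_devices(gpu_ids):
    """Set CUDA_VISIBLE_DEVICES for the current process.

    gpu_ids: list of integer GPU indices. An empty list hides all GPUs,
    so CUDA-based libraries fall back to the CPU.
    Hypothetical helper for illustration only, not a GPTQModel API.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this before importing torch or gptqmodel; setting it later may have no effect.
select_devices([0, 1])   # expose GPU 0 and GPU 1 only
# select_devices([])     # expose no GPU, i.e. CPU only
```

This is equivalent to running export CUDA_VISIBLE_DEVICES=0,1 in the shell before starting the quantization script. On accelerator cards that are not CUDA devices, a different vendor-specific variable may apply.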
@RunLLM Is this log normal? I don't see any weight files. Is it still exporting the model-*.safetensors files?
Log
Content of saved `config.json`:
{
"architectures": [
"GlmMoeDsaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 0,
"dtype": "bfloat16",
"eos_token_id": [
154820,
154827,
154829
],
"ep_size": 1,
"first_k_dense_replace": 3,
"hidden_act": "silu",
"hidden_size": 6144,
"index_head_dim": 128,
"index_n_heads": 32,
"index_topk": 2048,
"indexer_rope_interleave": true,
"initializer_range": 0.02,
"intermediate_size": 12288,
"kv_lora_rank": 512,
"max_position_embeddings": 202752,
"mlp_layer_types": [
"dense",
"dense",
"dense",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse",
"sparse"
],
"model_type": "glm_moe_dsa",
"moe_intermediate_size": 2048,
"moe_layer_freq": 1,
"n_group": 1,
"n_routed_experts": 256,
"n_shared_experts": 1,
"norm_topk_prob": true,
"num_attention_heads": 64,
"num_experts_per_tok": 8,
"num_hidden_layers": 78,
"num_key_value_heads": 64,
"num_nextn_predict_layers": 1,
"pad_token_id": 154820,
"pretraining_tp": 1,
"q_lora_rank": 2048,
"qk_head_dim": 256,
"qk_nope_head_dim": 192,
"qk_rope_head_dim": 64,
"quantization_config": {
"bits": 8,
"checkpoint_format": "gptq",
"desc_act": false,
"format": "gptq",
"group_size": 128,
"lm_head": false,
"meta": {
"act_group_aware": true,
"auto_forward_data_parallel": false,
"damp_auto_increment": 0.01,
"damp_percent": 0.05,
"fallback": {
"smooth": null,
"strategy": "rtn",
"threshold": "0.5%"
},
"foem": null,
"gc_mode": "on_stage_end",
"gptaq": null,
"hessian": {
"chunk_bytes": null,
"chunk_size": null,
"staging_dtype": "float32"
},
"mock_quantization": false,
"mse": 0.0,
"offload_to_disk": true,
"offload_to_disk_path": "/bmcp_lvm_fs/cusa/nvme5_data/gptqmodel_offload",
"pack_impl": "cpu",
"quantizer": [
"gptqmodel:6.0.3"
],
"static_groups": false,
"true_sequential": true,
"uri": "https://github.com/modelcloud/gptqmodel",
"vram_strategy": "balanced",
"wait_for_submodule_finalizers": true
},
"method": "gptq",
"pack_dtype": "int32",
"quant_method": "gptq",
"sym": true
},
"rms_norm_eps": 1e-05,
"rope_interleave": true,
"rope_parameters": {
"rope_theta": 1000000,
"rope_type": "default"
},
"routed_scaling_factor": 2.5,
"scoring_func": "sigmoid",
"tie_word_embeddings": false,
"topk_group": 1,
"topk_method": "noaux_tc",
"transformers_version": "5.5.4",
"use_cache": true,
"v_head_dim": 256,
"vocab_size": 154880
}
INFO Module: Sync model.embed_tokens <- from turtle (Embedding)
INFO Module: Sync model.norm <- from turtle (GlmMoeDsaRMSNorm)
INFO Module: Sync lm_head <- from turtle (Linear)
INFO Module: Total synced modules: 3; direct tensors materialized: 0
Weight file directory
(base) [root@node27 GLM-5-GPTQ-INT8]# ls -ltr
total 3712
-rw-r--r-- 1 root root 3788208 Apr 24 17:00 quant_log.csv
-rw-r--r-- 1 root root 214 Apr 24 17:00 generation_config.json
-rw-r--r-- 1 root root 3662 Apr 24 17:00 config.json
-rw-r--r-- 1 root root 1051 Apr 24 17:00 quantize_config.json
(base) [root@node27 GLM-5-GPTQ-INT8]#
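Your log shows that only config.json, quantize_config.json, generation_config.json, and quant_log.csv have been generated so far, with no model-*.safetensors or other weight files. This usually means the model weights have not been exported or saved yet.
According to the quantization_config you provided, "offload_to_disk": true is set and "offload_to_disk_path" is configured, so in theory the quantized weights would be offloaded to /bmcp_lvm_fs/cusa/nvme5_data/gptqmodel_offload rather than written to the current directory.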
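To confirm which offload settings were actually recorded, the saved config.json can be read directly. The following is a small sketch that assumes the layout shown in the log above (quantization_config, then meta, then offload_to_disk and offload_to_disk_path); the function name read_offload_settings is hypothetical.

```python
import json
from pathlib import Path


def read_offload_settings(config_path):
    """Return (offload_to_disk, offload_to_disk_path) from a saved config.json.

    Assumes the keys live under quantization_config -> meta, as in the
    log above. Missing keys yield (False, None).
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    meta = config.get("quantization_config", {}).get("meta", {})
    enabled = bool(meta.get("offload_to_disk", False))
    return enabled, meta.get("offload_to_disk_path")
```

For example, calling read_offload_settings("config.json") inside the GLM-5-GPTQ-INT8 output directory should return the flag and the path that the quantization run recorded.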
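I suggest checking the directory given by offload_to_disk_path for model-*.safetensors or similar weight files. If that directory contains no weight files either, the quantization run may not have finished or may have hit an error. This log alone does not show that the weights have been exported.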
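The directory check can be scripted. This is a rough sketch using only the Python standard library; find_weight_files is a hypothetical helper, and the assumption that weight shards end in .safetensors comes from the model-*.safetensors naming mentioned in the question.

```python
from pathlib import Path


def find_weight_files(directory, pattern="*.safetensors"):
    """Return sorted (relative_path, size_in_bytes) pairs for weight files.

    Searches the directory recursively, since files may sit in
    subdirectories. Returns an empty list if the directory does not exist.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob(pattern)
        if p.is_file()
    )
```

Running it on both the output directory and the offload_to_disk_path directory shows where, if anywhere, weight files have been written. An empty list for both suggests the save step has not completed.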
