Model fails to load on LM Studio
#5, opened by dinhngtu
I'm trying out the UD2.0 quants of Gemma 3 12B.
With gemma-3-12b-it-UD-Q2_K_XL.gguf and the LM Studio CUDA runtime v1.27.1 (based on llama.cpp b5132), LM Studio fails to load the model. The debug log below ends in a GGML_ASSERT failure.
Do I need a newer llama.cpp?
[DEBUG] ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes
[DEBUG] CUDA : ARCHS = 500,610,750,800 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 2060) - 5087 MiB free
[DEBUG] llama_model_loader: loaded meta data with 40 key-value pairs and 626 tensors from gemma-3-12b-it-GGUF\gemma-3-12b-it-UD-Q2_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-12B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-12B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 12B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 3840
llama_model_loader: - kv 10: gemma3.block_count u32 = 48
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 15360
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 16
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 256
[DEBUG] llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
[DEBUG] llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
[DEBUG] llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
[DEBUG] llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 10
llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-12b-it-GGUF/imatrix_unsloth.dat
[DEBUG] llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-12b-it.txt
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 336
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 43
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q2_K: 152 tensors
llama_model_loader: - type q3_K: 129 tensors
llama_model_loader: - type q4_K: 5 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq2_xs: 10 tensors
llama_model_loader: - type iq3_xxs: 10 tensors
llama_model_loader: - type iq3_s: 15 tensors
llama_model_loader: - type iq2_s: 10 tensors
llama_model_loader: - type iq4_xs: 5 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q2_K - Medium
print_info: file size = 4.52 GiB (3.30 BPW)
[DEBUG] load: special tokens cache size = 6415
[DEBUG] load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3840
print_info: n_layer = 48
print_info: n_head = 16
print_info: n_head_kv = 8
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: n_swa_pattern = 6
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
[DEBUG] print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 15360
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 12B
print_info: model params = 11.77 B
print_info: general.name = Gemma-3-12B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
[DEBUG] load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloaded 28/49 layers to GPU
load_tensors: CUDA0 model buffer size = 2200.06 MiB
load_tensors: CPU_Mapped model buffer size = 2429.07 MiB
[DEBUG] llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[DEBUG] llama_context: CPU output buffer size = 1.00 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
[DEBUG] init: CUDA0 KV buffer size = 896.00 MiB
[DEBUG] init: CPU KV buffer size = 640.00 MiB
llama_context: KV self size = 1536.00 MiB, K (f16): 768.00 MiB, V (f16): 768.00 MiB
[DEBUG] llama_context: CUDA0 compute buffer size = 1307.32 MiB
llama_context: CUDA_Host compute buffer size = 24.01 MiB
llama_context: graph nodes = 2023
llama_context: graph splits = 304 (with bs=512), 3 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[DEBUG] GgmlThreadpools: llama threadpool init = n_threads = 6
Image cache size:
[DEBUG] 10
clip_ctx: CLIP using CUDA0 backend
[DEBUG] clip_model_loader: model name: Gemma-3-12B-It
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 439
clip_model_loader: n_kv: 23
load_hparams: text_encoder: 0
load_hparams: vision_encoder: 1
load_hparams: llava_projector: 0
load_hparams: minicpmv_projector: 0
load_hparams: minicpmv_version: 2
load_hparams: glm_projector: 0
load_hparams: model size: 134186558.66 MiB
load_hparams: metadata size: 0.15 MiB
[DEBUG] llama.cpp abort:2743: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
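For anyone wondering what that last assertion checks: as far as I can tell from the ggml source, `ggml_can_mul_mat(a, b)` requires the two tensors to share their first dimension, and `b`'s two outer (batch) dimensions to be multiples of `a`'s. The Python sketch below mirrors that rule. The function name `can_mul_mat` and the example shapes are illustrative assumptions, not values taken from this log.

```python
def can_mul_mat(a_ne, b_ne):
    """Mirror of ggml's shape rule for mul_mat.

    This follows my reading of the ggml source, so treat it as an assumption.
    a_ne and b_ne are 4-element shapes in ggml order (ne[0] is the innermost dimension).
    """
    return (
        a_ne[0] == b_ne[0]            # shared inner (contraction) dimension
        and b_ne[2] % a_ne[2] == 0    # a must be broadcastable over b in dim 2
        and b_ne[3] % a_ne[3] == 0    # ... and in dim 3
    )


# Hypothetical shapes, chosen only to illustrate the rule.
weight = (1152, 3840, 1, 1)
good_input = (1152, 256, 1, 1)
bad_input = (1024, 256, 1, 1)

print(can_mul_mat(weight, good_input))  # True
print(can_mul_mat(weight, bad_input))   # False
```

A mismatch like the second case is the kind of thing that would trip the assertion if the runtime builds a graph with tensor shapes that the file does not match. The log alone does not show which tensors were involved.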
You need to wait for LM Studio to be updated.
dinhngtu changed discussion status to closed
Do you guys know if it works fine in llama.cpp itself?
It works with upstream llama.cpp b5193. (edit: still not working with LM Studio's b5173-based runtime)
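Putting the build numbers reported in this thread together, a small Python sketch can bracket where the fix would have landed. The helper `build_number` and the `reports` list are my own, purely illustrative. It assumes that llama.cpp `bNNNN` tags increase monotonically and that the LM Studio figures refer to the llama.cpp build each runtime is based on.

```python
import re


def build_number(tag):
    """Parse a llama.cpp-style build tag such as 'b5132' into an int."""
    m = re.fullmatch(r"b(\d+)", tag)
    if m is None:
        raise ValueError(f"not a build tag: {tag!r}")
    return int(m.group(1))


# Reports from this thread: (build tag, did the model load?)
reports = [("b5132", False), ("b5173", False), ("b5193", True)]

last_failing = max(build_number(t) for t, ok in reports if not ok)
first_working = min(build_number(t) for t, ok in reports if ok)

print(f"fix landed after b{last_failing}, no later than b{first_working}")
```

This only holds if the LM Studio runtimes behave like the upstream builds they are based on, which the thread does not confirm.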