Thank you for creating this.

by BingoBird - opened Sep 18, 2025

Getting 6 tokens/s on a Ryzen ThinkPad T495 with 33 layers offloaded to the GPU (really? hmm)
$ llama-server -m /media/sdb1/Models/granite-3.1-3b-a800m-instruct_Q6_K.gguf -t 5 -c 32768 --chat-template granite -ngl 33
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Vega 8 Graphics (RADV RAVEN) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 5346 (7f323a58) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
system info: n_threads = 5, n_threads_batch = 5, total_threads = 8

system_info: n_threads = 5 (n_threads_batch = 5) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 7
main: loading model
srv load_model: loading model '/media/sdb1/Models/granite-3.1-3b-a800m-instruct_Q6_K.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon Vega 8 Graphics (RADV RAVEN)) - 5994 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 322 tensors from /media/sdb1/Models/granite-3.1-3b-a800m-instruct_Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = granitemoe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Model Ibm Granite Granite 3.1 3b A800...
llama_model_loader: - kv 3: general.finetune str = instruct-cache
llama_model_loader: - kv 4: general.basename str = model-ibm-granite-granite-3.1
llama_model_loader: - kv 5: general.size_label str = 3B-a800M
llama_model_loader: - kv 6: granitemoe.block_count u32 = 32
llama_model_loader: - kv 7: granitemoe.context_length u32 = 131072
llama_model_loader: - kv 8: granitemoe.embedding_length u32 = 1536
llama_model_loader: - kv 9: granitemoe.feed_forward_length u32 = 512
llama_model_loader: - kv 10: granitemoe.attention.head_count u32 = 24
llama_model_loader: - kv 11: granitemoe.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: granitemoe.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 13: granitemoe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: granitemoe.expert_count u32 = 40
llama_model_loader: - kv 15: granitemoe.expert_used_count u32 = 8
llama_model_loader: - kv 16: general.file_type u32 = 18
llama_model_loader: - kv 17: granitemoe.vocab_size u32 = 49155
llama_model_loader: - kv 18: granitemoe.rope.dimension_count u32 = 64
llama_model_loader: - kv 19: granitemoe.attention.scale f32 = 0.015625
llama_model_loader: - kv 20: granitemoe.embedding_scale f32 = 12.000000
llama_model_loader: - kv 21: granitemoe.residual_scale f32 = 0.220000
llama_model_loader: - kv 22: granitemoe.logit_scale f32 = 6.000000
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = refact
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,49155] = ["<|end_of_text|>", "", "...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,49155] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,48891] = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if messages[0]['role'] == 'system...
llama_model_loader: - kv 34: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - type f32: 97 tensors
llama_model_loader: - type q6_K: 225 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q6_K
print_info: file size = 2.53 GiB (6.58 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.2826 MB
print_info: arch = granitemoe
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 1536
print_info: n_layer = 32
print_info: n_head = 24
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 3
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 6.0e+00
print_info: f_attn_scale = 1.6e-02
print_info: n_ff = 512
print_info: n_expert = 40
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 3B
print_info: model params = 3.30 B
print_info: general.name = Model Ibm Granite Granite 3.1 3b A800M Instruct Cache
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale = 0.220000
print_info: f_attention_scale = 0.015625
print_info: vocab type = BPE
print_info: n_vocab = 49155
print_info: n_merges = 48891
print_info: BOS token = 0 '<|end_of_text|>'
print_info: EOS token = 0 '<|end_of_text|>'
print_info: UNK token = 0 '<|end_of_text|>'
print_info: PAD token = 0 '<|end_of_text|>'
print_info: LF token = 203 'Ċ'
print_info: FIM PRE token = 1 ''
print_info: FIM SUF token = 3 ''
print_info: FIM MID token = 2 ''
print_info: FIM PAD token = 4 ''
print_info: FIM REP token = 18 ''
print_info: EOG token = 0 '<|end_of_text|>'
print_info: EOG token = 4 ''
print_info: EOG token = 18 ''
print_info: max token length = 512
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: Vulkan0 model buffer size = 2586.95 MiB
load_tensors: CPU_Mapped model buffer size = 59.07 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.19 MiB
llama_kv_cache_unified: kv_size = 32768, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
llama_kv_cache_unified: Vulkan0 KV buffer size = 2048.00 MiB
llama_kv_cache_unified: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context: Vulkan0 compute buffer size = 1612.00 MiB
llama_context: Vulkan_Host compute buffer size = 67.01 MiB
llama_context: graph nodes = 2024
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Failed to infer a tool call example (possible template bug)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 32768
main: model loaded
main: chat template, chat_template: granite, example_format: '<|start_of_role|>system<|end_of_role|>You are a helpful assistant<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>Hi there<|end_of_text|>
<|start_of_role|>user<|end_of_role|>How are you?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
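
A quick sanity check on the 2048 MiB KV buffer reported in the log above. This is only a sketch: it assumes the cache holds one K and one V vector of width n_embd_k_gqa / n_embd_v_gqa per layer per context position, stored as f16 (2 bytes). That formula is my assumption, not something taken from the llama.cpp source, but it reproduces the logged numbers. The function name `kv_cache_mib` is made up for illustration.

```python
# Values copied from the log above.
N_CTX = 32768        # llama_kv_cache_unified: kv_size
N_LAYER = 32         # print_info: n_layer
N_EMBD_K_GQA = 512   # print_info: n_embd_k_gqa
N_EMBD_V_GQA = 512   # print_info: n_embd_v_gqa
BYTES_F16 = 2        # type_k = type_v = 'f16'


def kv_cache_mib(n_ctx, n_layer, n_embd_gqa, bytes_per_elem):
    """Size in MiB of one half (K or V) of the cache under the assumed layout."""
    return n_ctx * n_layer * n_embd_gqa * bytes_per_elem / (1024 ** 2)


k_mib = kv_cache_mib(N_CTX, N_LAYER, N_EMBD_K_GQA, BYTES_F16)
v_mib = kv_cache_mib(N_CTX, N_LAYER, N_EMBD_V_GQA, BYTES_F16)

print(f"K: {k_mib:.2f} MiB, V: {v_mib:.2f} MiB, total: {k_mib + v_mib:.2f} MiB")
```

Under that assumption the size scales linearly with the context length, so halving -c to 16384 should roughly halve the KV buffer.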

At the beginning I'm getting 8.5-9 t/s.

It is amazing that this thing, at 2.7 GB, knows more than most humans.
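
For reference, the 2.7 GB figure lines up with the log's "file size = 2.53 GiB (6.58 BPW)" once GiB is converted to decimal GB. A rough sketch of the arithmetic, using the rounded values from the log (so the results are approximate); the variable names are just for illustration:

```python
# Rounded values copied from the log above.
FILE_SIZE_GIB = 2.53   # print_info: file size
MODEL_PARAMS = 3.30e9  # print_info: model params = 3.30 B

size_bytes = FILE_SIZE_GIB * 1024 ** 3
size_gb = size_bytes / 1e9                       # decimal gigabytes
bits_per_weight = size_bytes * 8 / MODEL_PARAMS  # average bits per parameter

print(f"{size_gb:.2f} GB, {bits_per_weight:.2f} bits per weight")
```

The bits-per-weight value computed this way can differ from the logged 6.58 in the last digit, since both inputs are already rounded.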
