Upload FP8Qwen3ForCausalLM

Files changed (9)

README.md +199 -0
config.json +86 -0
configuration_fp8_qwen3.py +154 -0
generation_config.json +6 -0
model-00001-of-00003.safetensors +3 -0
model-00002-of-00003.safetensors +3 -0
model-00003-of-00003.safetensors +3 -0
model.safetensors.index.json +659 -0
modeling_fp8_qwen3.py +518 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

config.json ADDED

	@@ -0,0 +1,86 @@

+{
+  "architectures": [
+    "FP8Qwen3ForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_fp8_qwen3.FP8Qwen3Config",
+    "AutoModel": "modeling_fp8_qwen3.FP8Qwen3Model",
+    "AutoModelForCausalLM": "modeling_fp8_qwen3.FP8Qwen3ForCausalLM",
+    "AutoModelForQuestionAnswering": "modeling_fp8_qwen3.FP8Qwen3ForQuestionAnswering",
+    "AutoModelForSequenceClassification": "modeling_fp8_qwen3.FP8Qwen3ForSequenceClassification",
+    "AutoModelForTokenClassification": "modeling_fp8_qwen3.FP8Qwen3ForTokenClassification"
+  },
+  "bos_token_id": 151643,
+  "dtype": "float8_e4m3fn",
+  "eos_token_id": 151645,
+  "fp8_config": {
+    "act_block_size": 16,
+    "float8_dtype": "torch.float8_e4m3fn",
+    "layer_name": "",
+    "mm_block_size": 128,
+    "quant_type": "DIV",
+    "training_mode": false
+  },
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "initializer_range": 0.02,
+  "intermediate_size": 12288,
+  "layer_types": [
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention"
+  ],
+  "max_position_embeddings": 40960,
+  "max_window_layers": 36,
+  "model_name_orig": "Qwen/Qwen3-8B",
+  "model_name_quant": null,
+  "model_type": "fp8_qwen3",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 36,
+  "num_key_value_heads": 8,
+  "rms_norm_eps": 1e-06,
+  "rope_scaling": null,
+  "rope_theta": 1000000,
+  "sliding_window": null,
+  "tie_word_embeddings": false,
+  "transformers_version": "4.57.0",
+  "use_cache": true,
+  "use_sliding_window": false,
+  "vocab_size": 151936
+}
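The shapes in this config can be cross-checked against the `total_parameters` and `total_size` values recorded in `model.safetensors.index.json` in this same commit. The sketch below is a back-of-the-envelope check, not the author's method. It assumes one `weight_scale_inv` entry per `mm_block_size` x `mm_block_size` weight tile, 1 byte per FP8 weight, 4 bytes for every other tensor, and a final `model.norm.weight` of size `hidden_size` (standard for Qwen3, but not visible in the part of the index shown here). The helper names `linear_shapes` and `count` are hypothetical.

```python
import math

# Values copied from config.json above.
cfg = {
    "vocab_size": 151936,
    "hidden_size": 4096,
    "intermediate_size": 12288,
    "num_hidden_layers": 36,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "mm_block_size": 128,
}


def linear_shapes(c):
    """(out_features, in_features) of each quantized projection in one layer."""
    h, inter = c["hidden_size"], c["intermediate_size"]
    q = c["num_attention_heads"] * c["head_dim"]
    kv = c["num_key_value_heads"] * c["head_dim"]
    return {
        "q_proj": (q, h), "k_proj": (kv, h), "v_proj": (kv, h), "o_proj": (h, q),
        "gate_proj": (inter, h), "up_proj": (inter, h), "down_proj": (h, inter),
    }


def count(c):
    layers, block = c["num_hidden_layers"], c["mm_block_size"]
    shapes = list(linear_shapes(c).values())
    # FP8 weights of all linear projections.
    fp8 = layers * sum(o * i for o, i in shapes)
    # Assumption: one scale per (block x block) tile of each weight.
    scales = layers * sum(math.ceil(o / block) * math.ceil(i / block) for o, i in shapes)
    # Two RMSNorms over hidden_size plus q_norm/k_norm over head_dim per layer,
    # plus an assumed final model.norm.
    norms = layers * (2 * c["hidden_size"] + 2 * c["head_dim"]) + c["hidden_size"]
    # embed_tokens and lm_head are separate because tie_word_embeddings is false.
    embeds = 2 * c["vocab_size"] * c["hidden_size"]
    params = fp8 + scales + norms + embeds
    # Assumption: 1 byte per FP8 weight, 4 bytes for everything else.
    size_bytes = fp8 + 4 * (scales + norms + embeds)
    return params, size_bytes


params, size_bytes = count(cfg)
print(params, size_bytes)
```

Under these assumptions the two numbers agree with the index metadata, which suggests that only the linear projections are stored in FP8 while embeddings, norms, and scales are kept in a 4-byte dtype.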

configuration_fp8_qwen3.py ADDED

	@@ -0,0 +1,154 @@

+# coding=utf-8
+# Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Qwen3 model configuration"""
+import torch
+from typing import Optional
+from dataclasses import dataclass, asdict
+from enum import Enum
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import logging
+from transformers.models.qwen3.configuration_qwen3 import Qwen3Config
+from quasar.kernel.configs import QuantType
+logger = logging.get_logger(__name__)
+@dataclass
+class FP8Config:
+    """
+    Configuration for FP8 quantization.
+    """
+    float8_dtype: torch.dtype = torch.float8_e4m3fn
+    quant_type: QuantType = QuantType.DIV
+    layer_name: str = ""
+    act_block_size: int = 16
+    mm_block_size: int = 128
+    training_mode: bool = True
+    """
+    If True, the linear layer will use high-precision weight.
+    If False, the linear layer will use per-block quantized weight.
+    """
+class FP8Qwen3Config(Qwen3Config):
+    model_type = "fp8_qwen3"
+    fp8_config: FP8Config = FP8Config()
+    model_name_orig: str = ""
+    model_name_quant: str = ""
+    """Pass the name of the BF16 model"""
+    def __init__(
+        self,
+        vocab_size=151936,
+        hidden_size=4096,
+        intermediate_size=22016,
+        num_hidden_layers=32,
+        num_attention_heads=32,
+        num_key_value_heads=32,
+        head_dim=128,
+        hidden_act="silu",
+        max_position_embeddings=32768,
+        initializer_range=0.02,
+        rms_norm_eps=1e-6,
+        use_cache=True,
+        tie_word_embeddings=False,
+        rope_theta=10000.0,
+        rope_scaling=None,
+        attention_bias=False,
+        use_sliding_window=False,
+        sliding_window=4096,
+        max_window_layers=28,
+        layer_types=None,
+        attention_dropout=0.0,
+        # Custom config options begin here
+        fp8_config=None,
+        model_name_orig="",
+        model_name_quant="",
+        **kwargs,
+    ):
+        super().__init__(
+            vocab_size=vocab_size,
+            hidden_size=hidden_size,
+            intermediate_size=intermediate_size,
+            num_hidden_layers=num_hidden_layers,
+            num_attention_heads=num_attention_heads,
+            num_key_value_heads=num_key_value_heads,
+            head_dim=head_dim,
+            hidden_act=hidden_act,
+            max_position_embeddings=max_position_embeddings,
+            initializer_range=initializer_range,
+            rms_norm_eps=rms_norm_eps,
+            use_cache=use_cache,
+            tie_word_embeddings=tie_word_embeddings,
+            rope_theta=rope_theta,
+            rope_scaling=rope_scaling,
+            attention_bias=attention_bias,
+            use_sliding_window=use_sliding_window,
+            sliding_window=sliding_window,
+            max_window_layers=max_window_layers,
+            layer_types=layer_types,
+            attention_dropout=attention_dropout,
+            **kwargs,
+        )
+        # Convert it from dict to FP8Config (dataclass)
+        if fp8_config is not None:
+            self.fp8_config = fp8_config if isinstance(fp8_config, FP8Config) else FP8Config(**fp8_config)
+        else:
+            self.fp8_config = FP8Config()
+        self.model_name_orig = model_name_orig
+        self.model_name_quant = model_name_quant
+    def to_dict(self):
+        output = super().to_dict()
+        if hasattr(self.fp8_config, "__dataclass_fields__"):
+            cfg_dict = asdict(self.fp8_config)
+            for k, v in cfg_dict.items():
+                if isinstance(v, torch.dtype): # float8_dtype
+                    cfg_dict[k] = str(v)  # save as 'torch.float8_e4m3fn'
+                elif isinstance(v, Enum): # quant_type
+                    cfg_dict[k] = v.name   # save as 'DIV'
+            output["fp8_config"] = cfg_dict
+        else:
+            output["fp8_config"] = self.fp8_config
+        return output
+    @classmethod
+    def from_dict(cls, config_dict, **kwargs):
+        config = super().from_dict(config_dict, **kwargs)
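+        # Work on a copy so the caller's config_dict is not mutated.
+        fp8_config = dict(config_dict.get("fp8_config") or {})
+        for k, v in fp8_config.items():
+            if k == "float8_dtype" and isinstance(v, str):
+                # Stored as e.g. 'torch.float8_e4m3fn'; map it back to the torch.dtype.
+                assert v.startswith("torch."), f"Invalid float8_dtype: {v}"
+                fp8_config[k] = getattr(torch, v[len("torch."):])
+            elif k == "quant_type" and isinstance(v, str):
+                # Stored as the enum member name, e.g. 'DIV'.
+                fp8_config[k] = getattr(QuantType, v)
+        config.fp8_config = FP8Config(**fp8_config)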
+        return config
+__all__ = ["FP8Qwen3Config"]
+FP8Qwen3Config.register_for_auto_class()
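The `to_dict` / `from_dict` pair above stores the dtype as its `str()` form and the enum as its member name, then maps both back on load. The following is a minimal sketch of that round trip in isolation. It runs without `torch` or `quasar`, so `DType`, `torch_like`, the one-member `QuantType`, and the `encode` / `decode` helpers are hypothetical stand-ins for illustration and are not part of this repository.

```python
from dataclasses import dataclass, asdict
from enum import Enum
from types import SimpleNamespace


class DType:
    """Hypothetical stand-in for torch.dtype; str() mimics 'torch.<name>'."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return f"torch.{self.name}"


# Hypothetical stand-in for the torch module namespace.
torch_like = SimpleNamespace(float8_e4m3fn=DType("float8_e4m3fn"))


class QuantType(Enum):
    """Hypothetical stand-in for quasar.kernel.configs.QuantType."""
    DIV = 0


@dataclass
class FP8Config:
    float8_dtype: DType = torch_like.float8_e4m3fn
    quant_type: QuantType = QuantType.DIV
    act_block_size: int = 16
    mm_block_size: int = 128


def encode(cfg):
    """Turn the dataclass into a JSON-friendly dict."""
    out = asdict(cfg)
    for key, value in out.items():
        if isinstance(value, DType):
            out[key] = str(value)      # 'torch.float8_e4m3fn'
        elif isinstance(value, Enum):
            out[key] = value.name      # 'DIV'
    return out


def decode(data):
    """Rebuild the dataclass without mutating the input dict."""
    data = dict(data)
    dtype = data.get("float8_dtype")
    if isinstance(dtype, str):
        if not dtype.startswith("torch."):
            raise ValueError(f"Invalid float8_dtype: {dtype}")
        data["float8_dtype"] = getattr(torch_like, dtype[len("torch."):])
    quant = data.get("quant_type")
    if isinstance(quant, str):
        data["quant_type"] = QuantType[quant]
    return FP8Config(**data)


encoded = encode(FP8Config())
decoded = decode(encoded)
print(encoded)
```

Decoding from a copy keeps the serialized dict reusable, so the same dict can be decoded more than once.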

generation_config.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 151643,
+  "eos_token_id": 151645,
+  "transformers_version": "4.57.0"
+}

model-00001-of-00003.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a1108f22b4218aa531975c5aaaaaf66fcd34121cc7633ee9ed8fb5e68426fdb4
+size 4968167136

model-00002-of-00003.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ec64b885c65202b773fdd91b533270e52d02c1494d5bb3b1f3aca861665df198
+size 4469923408

model-00003-of-00003.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9108ac78362e4b2f53ec98ed190a2f03d618e11ec60257528131c0a5df5be465
+size 2489319552
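The three shard entries above are Git LFS pointer files: each holds a `version` line, an `oid` line with the hash algorithm and digest, and a `size` line in bytes. As a rough illustration, the sketch below parses the pointers (copied from the diff without the leading `+`) and compares the summed shard size with the `total_size` value from `model.safetensors.index.json`. The helper name `parse_lfs_pointer` is hypothetical.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the three shard diffs above.
POINTERS = [
    """version https://git-lfs.github.com/spec/v1
oid sha256:a1108f22b4218aa531975c5aaaaaf66fcd34121cc7633ee9ed8fb5e68426fdb4
size 4968167136""",
    """version https://git-lfs.github.com/spec/v1
oid sha256:ec64b885c65202b773fdd91b533270e52d02c1494d5bb3b1f3aca861665df198
size 4469923408""",
    """version https://git-lfs.github.com/spec/v1
oid sha256:9108ac78362e4b2f53ec98ed190a2f03d618e11ec60257528131c0a5df5be465
size 2489319552""",
]

# total_size as recorded in model.safetensors.index.json in this commit.
INDEX_TOTAL_SIZE = 11927334912

shards = [parse_lfs_pointer(p) for p in POINTERS]
shard_bytes = sum(s["size"] for s in shards)
overhead = shard_bytes - INDEX_TOTAL_SIZE
print(shard_bytes, overhead)
```

The files on disk come out slightly larger than the tensor total. The difference is presumably the per-shard safetensors headers, though that is an inference and not something the diff states.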

model.safetensors.index.json ADDED

	@@ -0,0 +1,659 @@

+{
+  "metadata": {
+    "total_parameters": 8191159296,
+    "total_size": 11927334912
+  },
+  "weight_map": {
+    "lm_head.weight": "model-00003-of-00003.safetensors",
+    "model.embed_tokens.weight": "model-00002-of-00003.safetensors",
+    "model.layers.0.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.0.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.0.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.0.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.0.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.0.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.0.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.0.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.0.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.0.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.0.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.0.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.1.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.1.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.1.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.1.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.1.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.1.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.1.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.1.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.1.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.1.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.1.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.1.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.10.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.10.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.10.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.10.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.10.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.10.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.10.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.10.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.10.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.10.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.10.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.10.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.11.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.11.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.11.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.11.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.11.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.11.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.11.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.11.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.11.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.11.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.11.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.11.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.12.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.12.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.12.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.12.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.12.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.12.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.12.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.12.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.12.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.12.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.12.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.12.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.13.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.13.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.13.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.13.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.13.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.13.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.13.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.13.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.13.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.13.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.13.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.13.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.14.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.14.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.14.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.14.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.14.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.14.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.14.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.14.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.14.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.14.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.14.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.14.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.14.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.14.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.14.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.15.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.15.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.15.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.15.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.15.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.15.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.15.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.15.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.15.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.15.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.15.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.15.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.15.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.15.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.15.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.15.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.15.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.15.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.16.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.16.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.16.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.16.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.16.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.16.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.16.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.16.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.16.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.16.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.16.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.16.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.16.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.16.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.16.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.16.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.16.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.17.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.17.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.17.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.17.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.17.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.17.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.17.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.17.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.17.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.17.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.17.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.17.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.18.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.18.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.18.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.18.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.18.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.18.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.18.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.18.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.18.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.18.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.18.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.18.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.18.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.18.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.19.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.19.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.19.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.19.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.19.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.19.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.19.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.19.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.19.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.19.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.19.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.19.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.19.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.19.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.19.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.19.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.19.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.19.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.2.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.2.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.2.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.2.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.2.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.2.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.2.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.2.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.2.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.2.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.2.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.2.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.20.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.20.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.20.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.20.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.20.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.20.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.20.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.20.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.20.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.20.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.20.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.20.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.20.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.20.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.20.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.20.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.20.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.20.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.21.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.21.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.21.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.21.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.21.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.21.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.21.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.21.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.21.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.21.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.21.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.21.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.21.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.21.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.21.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.21.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.21.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.21.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.22.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.22.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.22.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.22.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.22.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.22.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.22.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.22.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.22.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.22.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.22.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.22.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.22.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.22.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.22.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.22.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.22.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.22.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.23.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.23.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.23.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.23.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.23.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.23.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.23.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.23.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.23.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.23.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.23.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.23.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.23.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.23.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.23.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.23.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.23.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.23.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.24.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.24.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.24.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.24.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.24.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.24.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.24.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.24.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.24.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.24.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.24.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.24.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.24.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.24.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.24.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.24.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.24.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.24.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.25.input_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.25.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.25.mlp.down_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.25.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.25.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.25.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.25.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.25.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.25.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.25.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.25.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.25.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.25.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.25.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.25.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.25.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.25.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.25.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.26.input_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.26.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.26.mlp.down_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.26.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.26.mlp.gate_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.26.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.26.mlp.up_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.26.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.26.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.26.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.26.self_attn.k_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.26.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.26.self_attn.o_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.26.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.26.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.26.self_attn.q_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.26.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.26.self_attn.v_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.27.input_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.27.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.27.mlp.down_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.27.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.27.mlp.gate_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.27.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.27.mlp.up_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.27.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.27.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.27.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.27.self_attn.k_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.27.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.27.self_attn.o_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.27.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.27.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.27.self_attn.q_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.27.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.27.self_attn.v_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.28.input_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.28.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.28.mlp.down_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.28.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.28.mlp.gate_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.28.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.28.mlp.up_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.28.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.28.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.28.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.28.self_attn.k_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.28.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.28.self_attn.o_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.28.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.28.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.28.self_attn.q_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.28.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.28.self_attn.v_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.29.input_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.29.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.29.mlp.down_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.29.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.29.mlp.gate_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.29.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.29.mlp.up_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.29.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.29.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.29.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.29.self_attn.k_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.29.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.29.self_attn.o_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.29.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.29.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.29.self_attn.q_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.29.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.29.self_attn.v_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.3.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.3.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.3.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.3.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.3.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.3.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.3.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.3.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.3.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.3.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.3.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.3.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.30.input_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.30.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.30.mlp.down_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.30.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.30.mlp.gate_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.30.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.30.mlp.up_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.30.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.30.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.30.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.30.self_attn.k_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.30.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.30.self_attn.o_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.30.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.30.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.30.self_attn.q_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.30.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.30.self_attn.v_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.31.input_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.31.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.31.mlp.down_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.31.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.31.mlp.gate_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.31.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.31.mlp.up_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.31.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.31.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.31.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.31.self_attn.k_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.31.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.31.self_attn.o_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.31.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.31.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.31.self_attn.q_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.31.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.31.self_attn.v_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.32.input_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.32.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.32.mlp.down_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.32.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.32.mlp.gate_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.32.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.32.mlp.up_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.32.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.32.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.32.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.32.self_attn.k_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.32.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.32.self_attn.o_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.32.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.32.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.32.self_attn.q_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.32.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.32.self_attn.v_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.33.input_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.33.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.33.mlp.down_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.33.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.33.mlp.gate_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.33.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.33.mlp.up_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.33.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.33.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.33.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.33.self_attn.k_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.33.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.33.self_attn.o_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.33.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.33.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.33.self_attn.q_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.33.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.33.self_attn.v_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.34.input_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.34.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.34.mlp.down_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.34.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.34.mlp.gate_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.34.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.34.mlp.up_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.34.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.34.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.34.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.34.self_attn.k_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.34.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.34.self_attn.o_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.34.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.34.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.34.self_attn.q_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.34.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.34.self_attn.v_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.35.input_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.35.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.35.mlp.down_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.35.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.35.mlp.gate_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.35.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.35.mlp.up_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.35.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.35.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.35.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.35.self_attn.k_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.35.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.35.self_attn.o_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.35.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
+    "model.layers.35.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.35.self_attn.q_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.35.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+    "model.layers.35.self_attn.v_proj.weight_scale_inv": "model-00002-of-00003.safetensors",
+    "model.layers.4.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.4.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.4.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.4.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.4.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.4.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.4.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.4.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.4.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.4.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.4.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.4.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.5.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.5.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.5.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.5.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.5.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.5.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.5.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.5.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.5.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.5.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.5.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.5.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.6.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.6.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.6.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.6.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.6.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.6.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.6.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.6.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.6.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.6.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.6.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.6.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.7.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.7.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.7.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.7.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.7.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.7.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.7.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.7.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.7.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.7.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.7.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.7.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.8.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.8.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.8.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.8.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.8.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.8.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.8.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.8.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.8.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.8.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.8.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.8.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.9.input_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.9.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.9.mlp.down_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.9.mlp.gate_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.9.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.9.mlp.up_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.9.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.9.self_attn.k_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.9.self_attn.o_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.9.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
+    "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.9.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+    "model.layers.9.self_attn.v_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
+    "model.norm.weight": "model-00002-of-00003.safetensors"
+  }
+}
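
The `weight_map` above pairs each quantized projection's `weight` with a `weight_scale_inv` entry and assigns every tensor to a shard file. A minimal sketch of how such an index could be inspected is shown below, assuming the file has the usual top-level `weight_map` key. The helper names `group_by_shard` and `unpaired_scales` are hypothetical, and the inline JSON is a four-entry excerpt rather than the full index.

```python
import json
from collections import defaultdict

# Short excerpt in the same shape as model.safetensors.index.json (illustrative only).
INDEX_JSON = """
{
  "weight_map": {
    "model.layers.9.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
    "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
    "model.layers.9.self_attn.q_proj.weight_scale_inv": "model-00001-of-00003.safetensors",
    "model.norm.weight": "model-00002-of-00003.safetensors"
  }
}
"""


def group_by_shard(weight_map):
    """Return {shard file name: [tensor names stored in that shard]}."""
    shards = defaultdict(list)
    for name, shard in weight_map.items():
        shards[shard].append(name)
    return dict(shards)


def unpaired_scales(weight_map):
    """Return scale entries whose matching `.weight` entry is missing."""
    suffix = "_scale_inv"
    return [
        name
        for name in weight_map
        if name.endswith(".weight" + suffix) and name[: -len(suffix)] not in weight_map
    ]


weight_map = json.loads(INDEX_JSON)["weight_map"]
for shard, names in sorted(group_by_shard(weight_map).items()):
    print(shard, len(names))
print("unpaired scales:", unpaired_scales(weight_map))
```

Running the same two helpers on the real index file would show how tensors are distributed over the three shards and whether every scale tensor has its weight.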

modeling_fp8_qwen3.py ADDED

	@@ -0,0 +1,518 @@

+# Copyright 2025 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Tutorial: https://huggingface.co/docs/transformers/en/custom_models
+"""
+from typing import Callable, Optional, Union
+import torch
+from torch import nn
+from transformers.generation import GenerationMixin
+from transformers.cache_utils import Cache
+from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
+from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+from transformers.utils import TransformersKwargs, can_return_tuple
+from transformers.processing_utils import Unpack
+from transformers.utils import auto_docstring, logging
+from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+from transformers.models.qwen3.modeling_qwen3 import (
+    Qwen3MLP,
+    Qwen3Attention,
+    apply_rotary_pos_emb,
+    eager_attention_forward,
+    Qwen3RMSNorm,
+    Qwen3RotaryEmbedding,
+    Qwen3Model,
+    Qwen3ForCausalLM,
+)
+from transformers.modeling_layers import (
+    GenericForQuestionAnswering,
+    GenericForSequenceClassification,
+    GenericForTokenClassification,
+    GradientCheckpointingLayer,
+)
+from .configuration_fp8_qwen3 import FP8Qwen3Config
+from torchao.float8.float8_training_tensor import Float8TrainingTensor
+from quasar.module import (
+    FP8Quant,
+    FP8RMSNorm,
+    FP8DSLinearWithCoat,
+    FP8DSLinearWithCoatWeightBlock,
+    FP8FusedSiLUMul,
+    FP8Identity,
+)
+from quasar.kernel.configs import FP8RMSNormConfig, QuantType, FP8MulConfig, FP8DSLinearWithCoatConfig, FP8QuantConfig
+from quasar.kernel.quant.quantize_hp2pb import fp8_quantize_hp2pb
+from quasar.kernel.quant.dequantize_pb2hp import fp8_dequantize_pb2hp
+logger = logging.get_logger(__name__)
+class FP8Qwen3MLP(Qwen3MLP):
+    def __init__(self, config: FP8Qwen3Config):
+        super().__init__(config)
+        linear_module = FP8DSLinearWithCoat if config.fp8_config.training_mode else FP8DSLinearWithCoatWeightBlock
+        self.gate_proj = linear_module(
+            self.hidden_size, self.intermediate_size, bias=False,
+            dsgemm_config=FP8DSLinearWithCoatConfig(layer_name="gate_proj", scale_dtype=torch.float32)
+        )
+        self.up_proj = linear_module(
+            self.hidden_size, self.intermediate_size, bias=False,
+            dsgemm_config=FP8DSLinearWithCoatConfig(layer_name="up_proj", scale_dtype=torch.float32)
+        )
+        self.down_proj = linear_module(
+            self.intermediate_size, self.hidden_size, bias=False,
+            dsgemm_config=FP8DSLinearWithCoatConfig(layer_name="down_proj", scale_dtype=torch.float32)
+        )
+        if config.hidden_act == "silu":
+            mul_config = FP8MulConfig(
+                quant_type=QuantType.MUL,
+                scale_dtype=torch.float32,
+            )
+            self.act_fn = FP8FusedSiLUMul(mul_config)
+        else:
+            raise ValueError(f"Unsupported activation function: {config.hidden_act}")
+    def forward(self, x):
+        gate_x = self.gate_proj(x)
+        up_x = self.up_proj(x)
+        mul_x = self.act_fn(gate_x, up_x)
+        down_proj = self.down_proj(mul_x)
+        return down_proj
+class FP8Qwen3Attention(Qwen3Attention):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+    def __init__(self, config: FP8Qwen3Config, layer_idx: int):
+        super().__init__(config, layer_idx)
+        linear_module = FP8DSLinearWithCoat if config.fp8_config.training_mode else FP8DSLinearWithCoatWeightBlock
+        self.q_proj = linear_module(
+            config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias,
+            dsgemm_config=FP8DSLinearWithCoatConfig(layer_name="q_proj", scale_dtype=torch.float32)
+        )
+        self.k_proj = linear_module(
+            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias,
+            dsgemm_config=FP8DSLinearWithCoatConfig(layer_name="k_proj", scale_dtype=torch.float32)
+        )
+        self.v_proj = linear_module(
+            config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias,
+            dsgemm_config=FP8DSLinearWithCoatConfig(layer_name="v_proj", scale_dtype=torch.float32)
+        )
+        # In both training and inference, we quantize the output of the attention layer.
+        self.o_proj_quant = FP8Quant(
+            quant_config=FP8QuantConfig(
+                float8_dtype=config.fp8_config.float8_dtype,
+                quant_type=QuantType.DIV,
+                fwd_block_size=config.fp8_config.mm_block_size,
+                layer_name="o_proj_quant",
+                scale_dtype=torch.float32,
+            )
+        )
+        self.o_proj = linear_module(
+            config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias,
+            dsgemm_config=FP8DSLinearWithCoatConfig(
+                fwd_input_quant_type=QuantType.DIV,
+                layer_name="o_proj",
+                scale_dtype=torch.float32,
+            )
+        )
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: tuple[torch.Tensor, torch.Tensor],
+        attention_mask: Optional[torch.Tensor],
+        past_key_value: Optional[Cache] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
+        if isinstance(hidden_states, Float8TrainingTensor):
+            # A Float8TrainingTensor's last dim is the quantization group size, not the hidden size, so drop the last two dims.
+            input_shape = hidden_states.shape[:-2]
+        else:
+            input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+        # QKV-Proj
+        query_states = self.q_proj(hidden_states).view(hidden_shape)
+        key_states = self.k_proj(hidden_states).view(hidden_shape)
+        value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        # QK-Norm
+        query_states = self.q_norm(query_states).transpose(1, 2)
+        key_states = self.k_norm(key_states).transpose(1, 2)
+        # RoPE
+        cos, sin = position_embeddings
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        # TODO: Add quantization
+        # Past-KV
+        if past_key_value is not None:
+            # sin and cos are specific to RoPE models; cache_position needed for the static cache
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+        attention_interface: Callable = eager_attention_forward
+        if self.config._attn_implementation != "eager":
+            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+        attn_output, attn_weights = attention_interface(
+            self,
+            query_states,
+            key_states,
+            value_states,
+            attention_mask,
+            dropout=0.0 if not self.training else self.attention_dropout,
+            scaling=self.scaling,
+            sliding_window=self.sliding_window,  # diff with Llama
+            **kwargs,
+        )
+        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+        # Quantize the output of the attention layer.
+        attn_output = self.o_proj_quant(attn_output)
+        attn_output = self.o_proj(attn_output)
+        return attn_output, attn_weights
+class FP8Qwen3DecoderLayer(GradientCheckpointingLayer):
+    def __init__(self, config: FP8Qwen3Config, layer_idx: int):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.self_attn = FP8Qwen3Attention(config=config, layer_idx=layer_idx)
+        self.mlp = FP8Qwen3MLP(config)
+        self.input_layernorm = FP8RMSNorm(
+            config.hidden_size,
+            eps=config.rms_norm_eps,
+            norm_config=FP8RMSNormConfig(
+                mm_block_size=config.fp8_config.mm_block_size,
+                quant_type=QuantType.MUL,
+                scale_dtype=torch.float32,
+                save_fp8_input=True,
+            ),
+        )
+        self.post_attention_layernorm = FP8RMSNorm(
+            config.hidden_size,
+            eps=config.rms_norm_eps,
+            norm_config=FP8RMSNormConfig(
+                mm_block_size=config.fp8_config.mm_block_size,
+                quant_type=QuantType.MUL,
+                scale_dtype=torch.float32,
+                save_fp8_input=True,
+            ),
+        )
+        self.attention_type = config.layer_types[layer_idx]
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Cache] = None,
+        use_cache: Optional[bool] = False,
+        cache_position: Optional[torch.LongTensor] = None,
+        position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> torch.Tensor:
+        residual = hidden_states
+        hidden_states = self.input_layernorm(hidden_states)
+        # Self Attention
+        hidden_states, self_attn_weights = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            position_embeddings=position_embeddings,
+            **kwargs,
+        )
+        hidden_states = residual + hidden_states
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+        return hidden_states
+@auto_docstring
+class FP8Qwen3PreTrainedModel(PreTrainedModel):
+    config_class = FP8Qwen3Config
+    config: FP8Qwen3Config
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["FP8Qwen3DecoderLayer"]
+    _skip_keys_device_placement = ["past_key_values"]
+    _supports_flash_attn = True
+    _supports_sdpa = True
+    _supports_flex_attn = True
+    _can_compile_fullgraph = True
+    _supports_attention_backend = True
+    _can_record_outputs = {
+        "hidden_states": FP8Qwen3DecoderLayer,
+        "attentions": FP8Qwen3Attention,
+    }
+    def _init_weights(self, module):
+        std = self.config.initializer_range
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+        elif isinstance(module, FP8RMSNorm):
+            module.weight.data.fill_(1.0)
+@auto_docstring
+class FP8Qwen3Model(FP8Qwen3PreTrainedModel):
+    config_class = FP8Qwen3Config
+    def __init__(self, config: FP8Qwen3Config):
+        super().__init__(config)
+        self.layers = nn.ModuleList(
+            [FP8Qwen3DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+        )
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        self.norm = Qwen3RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.rotary_emb = Qwen3RotaryEmbedding(config=config)
+        self.gradient_checkpointing = False
+        self.has_sliding_layers = "sliding_attention" in self.config.layer_types
+        # Initialize weights and apply final processing
+        self.post_init()
+    forward = Qwen3Model.forward
+@auto_docstring
+class FP8Qwen3ForCausalLM(FP8Qwen3PreTrainedModel, GenerationMixin):
+    config_class = FP8Qwen3Config
+    _tied_weights_keys = ["lm_head.weight"]
+    _tp_plan = {"lm_head": "colwise_rep"}
+    _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = FP8Qwen3Model(config)
+        self.vocab_size = config.vocab_size
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        # Initialize weights and apply final processing
+        self.post_init()
+    set_decoder = Qwen3ForCausalLM.set_decoder
+    get_decoder = Qwen3ForCausalLM.get_decoder
+    # forward = Qwen3ForCausalLM.forward
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Cache] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        cache_position: Optional[torch.LongTensor] = None,
+        logits_to_keep: Union[int, torch.Tensor] = 0,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> CausalLMOutputWithPast:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+        Example:
+        ```python
+        >>> from transformers import AutoTokenizer, Qwen3ForCausalLM
+        >>> model = Qwen3ForCausalLM.from_pretrained("Qwen/Qwen3-8B")
+        >>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+        outputs: BaseModelOutputWithPast = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            **kwargs,
+        )
+        hidden_states = outputs.last_hidden_state
+        # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+        logits = self.lm_head(hidden_states[:, slice_indices, :])
+        loss = None
+        if labels is not None:
+            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+class FP8Qwen3ForSequenceClassification(GenericForSequenceClassification, FP8Qwen3PreTrainedModel):
+    pass
+class FP8Qwen3ForTokenClassification(GenericForTokenClassification, FP8Qwen3PreTrainedModel):
+    pass
+class FP8Qwen3ForQuestionAnswering(GenericForQuestionAnswering, FP8Qwen3PreTrainedModel):
+    base_model_prefix = "transformer"  # For BC, where `transformer` was used instead of `model`
+__all__ = [
+    "FP8Qwen3Model",
+    "FP8Qwen3PreTrainedModel",
+    "FP8Qwen3ForCausalLM",
+    "FP8Qwen3ForSequenceClassification",
+    "FP8Qwen3ForTokenClassification",
+    "FP8Qwen3ForQuestionAnswering",
+]
+FP8Qwen3Model.register_for_auto_class("AutoModel")
+FP8Qwen3ForCausalLM.register_for_auto_class("AutoModelForCausalLM")
+def make_state_dict_compatible_with_hf(
+    state_dict: dict[str, torch.Tensor],
+    linear_keys: list[str],
+    undesired_linear_keys: list[str],
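+    config: Optional[FP8Qwen3Config] = None,
+    already_fp8: bool = False,
+) -> dict[str, torch.Tensor]:
+    """
+    Make the state dict compatible with HuggingFace.
+    Weights matching `linear_keys` are stored as FP8 with a `weight_scale_inv` entry;
+    weights matching `undesired_linear_keys` are kept in (or dequantized back to) high precision.
+    """
+    # Build the default config per call rather than once at import time.
+    if config is None:
+        config = FP8Qwen3Config()
+    # Assert linear keys and undesired linear keys are non-overlapping
+    assert set(linear_keys).isdisjoint(set(undesired_linear_keys))
+    compatible_state_dict = {}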
+    for key in state_dict.keys():
+        if any(k in key for k in linear_keys):
+            weight = state_dict[key]
+            if already_fp8:
+                # The name (either weight or weight_scale_inv) is the same as the original key.
+                compatible_state_dict[key] = weight
+            else:
+                # We need to use float32 for the scale, since we are using DeepGEMM.
+                tmp_quant_cfg = FP8QuantConfig(
+                    float8_dtype=config.fp8_config.float8_dtype,
+                    quant_type=config.fp8_config.quant_type,
+                    fwd_block_size=config.fp8_config.mm_block_size,
+                    scale_dtype=torch.float32,
+                )
+                quant_weight, scale_weight = fp8_quantize_hp2pb(
+                    weight, tmp_quant_cfg, block_size=config.fp8_config.mm_block_size
+                )
+                name_quant = key  # The quantized weight keeps the original key.
+                name_scale = key.replace("weight", "weight_scale_inv")
+                compatible_state_dict[name_quant] = quant_weight
+                compatible_state_dict[name_scale] = scale_weight
+        elif any(k in key for k in undesired_linear_keys):
+            # Dequantize the weight
+            if already_fp8:
+                # We only do the dequantization once. When encountering the weight, we skip it.
+                if "weight_scale_inv" in key:
+                    name_quant = key.replace("weight_scale_inv", "weight")
+                    quant_weight = state_dict[name_quant]
+                    scale_weight = state_dict[key]
+                    weight = fp8_dequantize_pb2hp(quant_weight, scale_weight, config.fp8_config, block_size=config.fp8_config.mm_block_size)
+                    compatible_state_dict[name_quant] = weight
+            else:
+                # Do not quantize the weight.
+                compatible_state_dict[key] = state_dict[key]
+        else:
+            compatible_state_dict[key] = state_dict[key]
+    return compatible_state_dict
+def set_named_weight_to_fp8(
+    model: Qwen3ForCausalLM,
+    linear_keys: tuple[str, ...] = ("q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"),
+):
+    """
+    Set the dtype of the weight of the linear layers to FP8.
+    Also set layer name for debugging.
+    """
+    for name, module in model.named_modules():
+        # Match the name of the last module.
+        if name.split(".")[-1] in linear_keys:
+            module.weight.data = module.weight.data.to(torch.float8_e4m3fn)
+            module.weight_scale_inv.data = module.weight_scale_inv.data.to(torch.float32)
+            module.layer_name = name
+    return model
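
The branch structure of `make_state_dict_compatible_with_hf` can be hard to follow because the tensor work is interleaved with the key matching. The sketch below reproduces only the key routing, using plain strings and no tensors, so the outcome for each key can be checked by hand. `plan_conversion` and the action labels (`"quantize"`, `"dequantize"`, `"copy"`) are hypothetical names introduced for illustration; they are not part of the uploaded file or of `quasar`.

```python
def plan_conversion(keys, linear_keys, undesired_linear_keys, already_fp8=False):
    """Return {output key: action}, mirroring the branches of the converter."""
    assert set(linear_keys).isdisjoint(undesired_linear_keys)
    plan = {}
    for key in keys:
        if any(k in key for k in linear_keys):
            # FP8 checkpoints are passed through; high-precision weights get quantized
            # (the real function also emits a matching `weight_scale_inv` entry).
            plan[key] = "copy" if already_fp8 else "quantize"
        elif any(k in key for k in undesired_linear_keys):
            if not already_fp8:
                plan[key] = "copy"
            elif "weight_scale_inv" in key:
                # Dequantize once, when the scale is seen; the scale itself is dropped.
                plan[key.replace("weight_scale_inv", "weight")] = "dequantize"
            # An FP8 weight without the scale suffix is skipped here on purpose.
        else:
            plan[key] = "copy"
    return plan


# Substring matching: "q_proj" does not match the "q_norm" key.
hp_keys = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_norm.weight",
    "lm_head.weight",
]
print(plan_conversion(hp_keys, ["q_proj"], ["lm_head"]))

fp8_keys = ["lm_head.weight", "lm_head.weight_scale_inv"]
print(plan_conversion(fp8_keys, ["q_proj"], ["lm_head"], already_fp8=True))
```

Because matching is by substring, any non-weight key that contains a linear key (for example a bias, if `attention_bias` were enabled) would take the same branch as the weight, which is worth keeping in mind when choosing `linear_keys`.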