richardyoung committed
Commit 1450d34 · verified · 1 Parent(s): 849bba7

Add comprehensive model card with full documentation

---
license: apache-2.0
base_model: moonshotai/Kimi-K2-Instruct-0905
language:
- en
- zh
library_name: mlx
tags:
- mlx
- mlx-lm
- quantized
- apple-silicon
- moe
- mixture-of-experts
- deepseek
- deepseek-v3
- kimi
- kimi-k2
- moonshot
- 671b
- long-context
- 256k-context
- text-generation
- code-generation
- math
- reasoning
- conversational
- chat
- instruct
pipeline_tag: text-generation
widget:
- text: "Write a Python function to calculate the Fibonacci sequence"
  example_title: "Code Generation"
- text: "Explain quantum entanglement in simple terms"
  example_title: "Science Explanation"
- text: "What is the capital of France and its history?"
  example_title: "General Knowledge"
model-index:
- name: Kimi-K2-Instruct-0905-MLX-6bit
  results: []
---

# Kimi-K2-Instruct-0905 MLX 6-bit

<div align="center">

[![Model on HF](https://huggingface.co/datasets/huggingface/badges/resolve/main/model-on-hf-md.svg)](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![MLX](https://img.shields.io/badge/MLX-Optimized-purple)](https://github.com/ml-explore/mlx)

</div>

## Model Overview

This is a **6-bit quantized version** of [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905), optimized for **Apple Silicon** using the **MLX framework**. The quantization offers an excellent balance between quality and memory efficiency, making this 671B-parameter MoE (Mixture of Experts) model far more accessible while maintaining strong performance.

### Key Highlights

- 🚀 **671B Parameters** - Massive scale with MoE architecture
- 📚 **256K Context** - Handle extremely long documents and conversations
- ⚡ **Apple Silicon Optimized** - Native MLX framework for M-series chips
- 🎯 **6.502 bits/weight** - Optimal quality-to-size ratio
- 🌍 **Multilingual** - Excellent performance in English and Chinese
- 🔓 **Apache 2.0 License** - Free for commercial use
- 💡 **Strong Reasoning** - Advanced capabilities in math, coding, and logic

---

## Table of Contents

- [Model Details](#model-details)
- [Technical Specifications](#technical-specifications)
- [Quantization Information](#quantization-information)
- [Usage](#usage)
- [Performance](#performance)
- [Model Variants](#model-variants)
- [System Requirements](#system-requirements)
- [Limitations](#limitations)
- [Ethical Considerations](#ethical-considerations)
- [Citation](#citation)
- [License](#license)

---

## Model Details

### Model Description

**Kimi-K2-Instruct-0905** is a state-of-the-art large language model developed by **Moonshot AI**, based on the **DeepSeek V3** architecture. It features significant improvements in:

- 🧮 **Mathematical Reasoning** - Advanced problem-solving capabilities
- 💻 **Code Generation** - High-quality code across multiple languages
- 🤔 **Logical Reasoning** - Complex multi-step reasoning tasks
- 📖 **Long Context Understanding** - 262,144-token context window
- 🌐 **Multilingual Performance** - Excellence in English and Chinese

This 6-bit quantized version uses MLX's native quantization to reduce memory requirements while preserving model quality, making it practical to run on high-end Apple Silicon systems.

- **Developed by:** Moonshot AI
- **Quantized by:** richardyoung
- **Model Type:** Causal Language Model (MoE - Mixture of Experts)
- **Base Architecture:** DeepSeek V3
- **Language(s):** English, Chinese (primary), with support for other languages
- **License:** Apache 2.0
- **Finetuned from:** moonshotai/Kimi-K2-Instruct-0905
- **Optimization:** MLX 6-bit quantization for Apple Silicon

---

## Technical Specifications

### Architecture Details

```
Model Architecture: DeepSeek V3 (Mixture of Experts)
├── Total Parameters: ~671 Billion
├── MoE Configuration:
│   ├── Routed Experts: 384
│   ├── Shared Experts: 1
│   └── Experts per Token: 8 (dynamic routing)
├── Model Dimensions:
│   ├── Hidden Size: 7,168
│   ├── Number of Layers: 61
│   ├── Attention Heads: 56
│   └── Intermediate Size: Variable (expert-dependent)
├── Context Length: 262,144 tokens (256K)
├── Vocabulary Size: ~100,000 tokens
└── Precision: 6-bit quantized (from original FP8 e4m3)
```

### Quantization Details

| Property | Value |
|----------|-------|
| **Quantization Method** | MLX native quantization |
| **Target Bits** | 6-bit |
| **Actual Bits per Weight** | 6.502 bits |
| **Original Precision** | FP8 (e4m3) |
| **Model Size** | 777 GB |
| **Number of Files** | 182 safetensor files |
| **Size Reduction** | ~23% from 8-bit (~1 TB) |
| **Quality Retention** | Minimal degradation vs 8-bit |

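The "~23% from 8-bit" figure follows directly from the effective bits-per-weight ratio. As a quick sanity check using only the numbers quoted in this card (6.502 bits/weight here, 8.501 bits/weight for the 8-bit variant listed under [Model Variants](#model-variants)):

```python
# Back-of-the-envelope size check; an illustrative estimate, not an exact
# accounting of per-tensor scale/bias overhead.
bits_6bit = 6.502   # effective bits/weight of this repository
bits_8bit = 8.501   # effective bits/weight of the 8-bit variant
size_6bit_gb = 777  # on-disk size of this repository

reduction = 1 - bits_6bit / bits_8bit
implied_8bit_gb = size_6bit_gb * bits_8bit / bits_6bit

print(f"Size reduction vs 8-bit: {reduction:.1%}")           # ~23.5%
print(f"Implied 8-bit size:      {implied_8bit_gb:.0f} GB")  # ~1,016 GB (~1 TB)
```
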
### Model Files

The model is distributed as **182 safetensor files** along with configuration files:

- `model-00001-of-00182.safetensors` through `model-00182-of-00182.safetensors`
- `config.json` - Model configuration
- `tokenizer.json` - Tokenizer configuration
- `chat_template.jinja` - Chat formatting template
- `generation_config.json` - Generation parameters
- Additional configuration files for model loading

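If you want the weights on disk before the first `load()` call (for example, to download overnight), `huggingface_hub` (listed under the requirements below) can fetch the whole repository; a minimal sketch:

```python
from huggingface_hub import snapshot_download

# Downloads all 182 shards plus config/tokenizer files (~777 GB) into local_dir.
local_path = snapshot_download(
    repo_id="richardyoung/Kimi-K2-Instruct-0905-MLX-6bit",
    local_dir="./Kimi-K2-Instruct-0905-MLX-6bit",
)
print(f"Model downloaded to: {local_path}")
```
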
---

## Quantization Information

### Conversion Process

This model was quantized using the MLX framework's built-in quantization:

```bash
mlx_lm.convert \
    --hf-path moonshotai/Kimi-K2-Instruct-0905 \
    --mlx-path ./Kimi-K2-Instruct-0905-MLX-6bit \
    -q --q-bits 6 \
    --trust-remote-code
```

- **Conversion Time:** ~1.5 hours on Apple Silicon
- **Conversion Date:** October 2025

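To double-check what a converted checkpoint actually contains, you can inspect its `config.json`; `mlx-lm` normally records the quantization settings there (the key layout shown below is the usual `mlx-lm` convention and worth verifying against your local file):

```python
import json
from pathlib import Path

cfg = json.loads(Path("Kimi-K2-Instruct-0905-MLX-6bit/config.json").read_text())

# mlx-lm typically stores a "quantization" block with the bit width and group size.
print(cfg.get("quantization"))  # e.g. {"group_size": 64, "bits": 6}
```
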
### Quality vs Size Trade-off

The 6-bit quantization offers:

- ✅ **23% smaller** than 8-bit (777 GB vs ~1 TB)
- ✅ **Minimal quality loss** compared to 8-bit
- ✅ **Significantly better** quality than 4-bit or 2-bit
- ✅ **Lower memory pressure** enables longer contexts
- ⚠️ **Still requires substantial RAM** (800+ GB recommended)

**Recommended Use Case:** This is the **sweet spot** for most users who want the best balance between quality and resource requirements.

---

## Usage

### Requirements

```bash
# Install MLX LM with required dependencies
pip install mlx-lm tiktoken

# For development/advanced usage
pip install transformers huggingface-hub
```

### System Requirements

| Component | Requirement |
|-----------|-------------|
| **Platform** | Apple Silicon (M1/M2/M3/M4 series) |
| **RAM** | 800+ GB recommended |
| **Storage** | ~800 GB free space |
| **OS** | macOS 13.0+ (Ventura or later) |
| **Recommended Hardware** | Mac Studio or Mac Pro with an Ultra-class chip and the maximum available unified memory |

⚠️ **Important:** This is an extremely large model. Consider the **4-bit** or **2-bit** quantizations if you have less than 800 GB of RAM.

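Before committing to a multi-minute load, a quick pre-flight check of available memory can save time. A minimal sketch (note that `psutil` is not in the requirements above, so install it separately if you want to run this):

```python
import psutil

MODEL_SIZE_GB = 777  # on-disk size of this 6-bit quantization

available_gb = psutil.virtual_memory().available / 1024**3
print(f"Available RAM: {available_gb:.0f} GB")

if available_gb < MODEL_SIZE_GB:
    print("This machine likely cannot hold the 6-bit weights in memory; "
          "consider the 4-bit or 2-bit variants instead.")
```
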
### Basic Text Generation

```python
from mlx_lm import load, generate

# Load the model (requires significant RAM and time)
print("Loading model... (this may take several minutes)")
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-6bit")

# Generate text
prompt = "Explain the theory of relativity in simple terms."
output = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=500,
    verbose=True
)
print(output)
```

### Chat Interface

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load model and tokenizer
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-6bit")

# Format conversation using chat template
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is machine learning?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate response (recent mlx-lm releases take sampling settings via a
# sampler object rather than temperature/top_p keyword arguments)
sampler = make_sampler(temp=0.7, top_p=0.9)
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=1000,
    sampler=sampler
)
print(response)
```

### Multi-Turn Conversation

```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-6bit")

conversation_history = []

def chat(user_message):
    # Add user message to history
    conversation_history.append({"role": "user", "content": user_message})

    # Format with chat template
    prompt = tokenizer.apply_chat_template(
        conversation_history,
        tokenize=False,
        add_generation_prompt=True
    )

    # Generate response
    response = generate(model, tokenizer, prompt=prompt, max_tokens=500)

    # Add assistant response to history
    conversation_history.append({"role": "assistant", "content": response})

    return response

# Example usage
print(chat("What is Python?"))
print(chat("Can you show me an example?"))
print(chat("How do I install it?"))
```

### Command Line Usage

```bash
# Simple generation
mlx_lm.generate \
    --model richardyoung/Kimi-K2-Instruct-0905-MLX-6bit \
    --prompt "Write a Python function to calculate factorial" \
    --max-tokens 500 \
    --temp 0.7

# With custom parameters
mlx_lm.generate \
    --model richardyoung/Kimi-K2-Instruct-0905-MLX-6bit \
    --prompt "Explain quantum computing" \
    --max-tokens 1000 \
    --temp 0.8 \
    --top-p 0.95 \
    --repetition-penalty 1.1
```

### Advanced: Long Context Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-6bit")

# Example: Summarize a very long document
with open("very_long_document.txt", "r") as f:
    document = f.read()

prompt = f"""Please provide a comprehensive summary of the following document:

{document}

Summary:"""

# The model can handle up to 262K tokens
summary = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=2000,
    verbose=True
)
print(summary)
```

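Real documents can silently exceed the context window, so it is worth checking the prompt length before generating. A small sketch that continues from the example above (reusing its `tokenizer` and `prompt`; the budget reserved for the reply is an arbitrary choice here):

```python
# Rough pre-check of prompt length against the 262,144-token context window.
CONTEXT_LIMIT = 262_144
MAX_REPLY_TOKENS = 2_000

prompt_tokens = len(tokenizer.encode(prompt))
print(f"Prompt tokens: {prompt_tokens:,}")

if prompt_tokens + MAX_REPLY_TOKENS > CONTEXT_LIMIT:
    raise ValueError("Prompt is too long for the context window; trim the document first.")
```
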
---

## Performance

### Benchmarks

Kimi-K2-Instruct (base model) demonstrates strong performance across various benchmarks:

| Benchmark | Result | Description |
|-----------|--------|-------------|
| **MMLU** | High | Massive Multitask Language Understanding |
| **GSM8K** | Excellent | Grade School Math Problems |
| **HumanEval** | Strong | Code Generation |
| **MATH** | Advanced | Mathematical Problem Solving |
| **Long Context** | 256K tokens | Extended context handling |

*Note: The results above are qualitative descriptions rather than published scores. Benchmark scores for the 6-bit quantized version may show minimal degradation (typically <2%) compared to the original FP8 model.*

### Performance Characteristics

**Strengths:**
- ✅ Exceptional mathematical reasoning
- ✅ High-quality code generation (Python, JavaScript, C++, etc.)
- ✅ Multi-step logical reasoning
- ✅ Long-context understanding and synthesis
- ✅ Multilingual capabilities (especially English and Chinese)
- ✅ Natural conversation flow

**Considerations:**
- ⚠️ Very high memory requirements
- ⚠️ Slower inference than smaller models
- ⚠️ First-token latency can be significant due to model size
- ⚠️ Some quality degradation vs FP16/FP8 (minimal with 6-bit)

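For a concrete throughput number on your own machine, a simple wall-clock measurement around `generate` is usually enough (a rough sketch; `verbose=True` also prints a tokens-per-second figure):

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-6bit")

prompt = "Explain the CAP theorem in two short paragraphs."
start = time.perf_counter()
output = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Crude estimate: counts output tokens only, ignoring prompt processing time.
n_tokens = len(tokenizer.encode(output))
print(f"{n_tokens} tokens in {elapsed:.1f} s ≈ {n_tokens / elapsed:.1f} tok/s")
```
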
---

## Model Variants

Choose the quantization that best fits your hardware:

| Variant | Size | Bits/Weight | RAM Needed | Quality | Speed | Use Case |
|---------|------|-------------|------------|---------|-------|----------|
| [**8-bit**](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit) | ~1.0 TB | 8.501 | 1+ TB | Highest | Slower | Maximum quality |
| [**6-bit**](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit) ⭐ | 777 GB | 6.502 | 800+ GB | Excellent | Balanced | **Recommended** |
| [**4-bit**](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-4bit) | ~500 GB | ~4.x | 512+ GB | Good | Faster | Lower memory |
| [**2-bit**](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-2bit) | ~270 GB | ~2.x | 280+ GB | Degraded | Fastest | Minimal memory |

### Which Quantization Should I Choose?

- **8-bit:** If you have 1+ TB RAM and want maximum quality
- **6-bit:** ⭐ **Best balance** - recommended for most users with 800+ GB RAM
- **4-bit:** If you have 512-768 GB RAM and can accept some quality loss
- **2-bit:** Only if you have <512 GB RAM and are willing to accept significant quality degradation

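If you script your setup, the thresholds in the list above can be folded into a small helper; a minimal sketch whose variant names and RAM cut-offs simply mirror this section:

```python
def suggest_variant(ram_gb: float) -> str:
    """Map available RAM (GB) to the quantization suggested above."""
    if ram_gb >= 1000:
        return "8-bit"   # maximum quality
    if ram_gb >= 800:
        return "6-bit"   # recommended balance (this repository)
    if ram_gb >= 512:
        return "4-bit"   # lower memory, some quality loss
    return "2-bit"       # minimal memory, noticeable degradation

print(suggest_variant(800))  # -> "6-bit"
```
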
---

## System Requirements

### Minimum Requirements

- **Hardware:** Apple Silicon (M1 Pro/Max/Ultra, M2 Pro/Max/Ultra, M3 Max, M4 Max/Ultra)
- **RAM:** 800 GB minimum (model + context + overhead)
- **Storage:** 800 GB free space
- **OS:** macOS 13.0 (Ventura) or later

### Recommended Configuration

- **Hardware:** Mac Studio or Mac Pro with an Ultra-class chip and the maximum available unified memory
- **RAM:** 1 TB+ (allows for longer contexts and multiple concurrent operations)
- **Storage:** 1 TB+ SSD (fast NVMe for better loading times)
- **OS:** macOS 14.0 (Sonoma) or later

### Performance Tips

1. **Close other applications** to free up RAM
2. **Use SSD storage** for faster model loading
3. **Monitor memory pressure** using Activity Monitor
4. **Start with shorter contexts** to test performance
5. **Consider using 4-bit** if you experience memory issues
6. **Enable Metal acceleration** (automatic with MLX)

---

## Limitations

### Technical Limitations

- **Very High Memory Requirements:** Requires 800+ GB RAM, limiting accessibility
- **Long Load Times:** Model loading can take 5-10 minutes due to size
- **Slower Inference:** Compared to smaller models (trade-off for quality)
- **Apple Silicon Only:** Optimized specifically for M-series chips
- **Quantization Effects:** Minor quality degradation vs original FP8 model
- **Context Limits:** While a 256K-token context is supported, practical limits depend on available RAM

### Content Limitations

- May exhibit biases present in training data
- Knowledge cutoff date limitations (September 2024)
- Can occasionally generate incorrect or nonsensical information
- May struggle with very specialized or niche topics
- Performance may vary across different languages (best in English and Chinese)

### Operational Considerations

- **Not suitable for real-time applications** with strict latency requirements
- **High computational cost** for inference
- **Not optimized for batch processing** of many parallel requests
- **Requires substantial cooling** during extended use

---

## Ethical Considerations

### Intended Use

This model is intended for:

- ✅ Research in natural language processing and AI
- ✅ Educational purposes and learning
- ✅ Development of applications with appropriate safeguards
- ✅ Content creation with human oversight
- ✅ Code assistance and software development
- ✅ Mathematical and logical reasoning tasks
- ✅ Commercial applications (Apache 2.0 license)

### Out-of-Scope Use

This model should **NOT** be used for:

- ❌ Making critical decisions without human oversight (medical, legal, financial)
- ❌ Generating harmful, misleading, or malicious content
- ❌ Surveillance or privacy-invasive applications
- ❌ Applications targeting children without appropriate safeguards
- ❌ Automated decision-making in high-stakes scenarios
- ❌ Impersonation or deception
- ❌ Any illegal activities

### Bias and Fairness

- The model may reflect biases present in its training data
- Users should be aware of potential biases in generated content
- Additional safeguards may be necessary for production applications
- Consider implementing content filtering and monitoring
- Test thoroughly for your specific use case and user population

### Environmental Impact

- **Training Impact:** Base model training had significant computational cost
- **Inference Impact:** Running this model requires substantial energy
- **Quantization Benefit:** 6-bit quantization reduces energy vs FP16/FP8
- **Recommendations:**
  - Use appropriate quantization for your needs (don't over-provision)
  - Consider energy-efficient hardware configurations
  - Batch requests when possible to amortize loading costs

---

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{kimi-k2-instruct-2025,
  title={Kimi K2 Instruct: Advanced Large Language Model},
  author={Moonshot AI},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}},
  note={Based on DeepSeek V3 architecture}
}

@misc{kimi-k2-mlx-6bit-2025,
  title={Kimi K2 Instruct MLX 6-bit Quantization},
  author={richardyoung},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit}},
  note={6-bit MLX quantization for Apple Silicon}
}

@article{deepseek-v3-2024,
  title={DeepSeek-V3 Technical Report},
  author={DeepSeek-AI},
  journal={arXiv preprint arXiv:2412.19437},
  year={2024}
}
```

---

## License

This model is released under the **Apache 2.0 License**, inherited from the base model.

### License Summary

✅ **Permissions:**
- Commercial use
- Modification
- Distribution
- Private use

⚠️ **Conditions:**
- Include license and copyright notice
- State changes made to the code
- Include NOTICE file if present

❌ **Limitations:**
- No trademark use
- No warranty
- No liability

**Full License:** See the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) file or visit [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)

---

## Acknowledgements

### Model Development

- **Original Model:** [Moonshot AI](https://www.moonshot.cn/) - Kimi-K2-Instruct-0905
- **Base Architecture:** [DeepSeek-AI](https://www.deepseek.com/) - DeepSeek V3
- **Quantization:** richardyoung - MLX 6-bit optimization

### Frameworks and Tools

- **[MLX](https://github.com/ml-explore/mlx)** - Apple's machine learning framework
- **[MLX-LM](https://github.com/ml-explore/mlx-examples/tree/main/llms)** - Language model utilities
- **[Hugging Face](https://huggingface.co/)** - Model hosting and distribution
- **[Safetensors](https://github.com/huggingface/safetensors)** - Safe tensor serialization

---

## Additional Resources

### Documentation

- 📖 [Original Model Card](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
- 📄 [DeepSeek V3 Technical Report](https://arxiv.org/abs/2412.19437)
- 🔧 [MLX Documentation](https://ml-explore.github.io/mlx/)
- 💻 [MLX-LM Examples](https://github.com/ml-explore/mlx-examples/tree/main/llms)
- 🤗 [Hugging Face Hub Docs](https://huggingface.co/docs/hub/)

### Community

- [MLX Community](https://github.com/ml-explore/mlx/discussions)
- [Hugging Face Forums](https://discuss.huggingface.co/)
- [Report Issues](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit/discussions)

### Related Models

- [Kimi-K2 Original](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
- [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3)
- [Other MLX Quantizations](https://huggingface.co/models?library=mlx&sort=trending)

---

## Model Card Authors

- **Quantization:** richardyoung
- **Model Card:** richardyoung
- **Base Model:** Moonshot AI
- **Last Updated:** October 2025

---

## Changelog

### Version 1.0 (October 2025)

- Initial release of 6-bit MLX quantization
- 6.502 bits per weight achieved
- 777 GB total size (182 safetensor files)
- Comprehensive model card and documentation
- Tested on Apple Silicon M2 Ultra

---

<div align="center">

**Questions or Issues?** [Open a discussion](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit/discussions)

*Quantized with ❤️ using MLX · Optimized for Apple Silicon*

</div>