---
title: Qwen2.5 0.5B Instruct MLC-LLM
language:
- en
library_name: mlc-llm
tags:
- qwen2.5
- mlc-llm
- webgpu
- webllm
- quantized
model_type: qwen2
license: apache-2.0
---

# Qwen2.5 0.5B Instruct MLC-LLM

This is a quantized and optimized version of Qwen2.5 0.5B Instruct, compiled with MLC-LLM for WebGPU deployment in browsers.

## Model Details

- **Base Model**: Qwen2.5 0.5B Instruct
- **Quantization**: q4f32_1 (4-bit quantization with float32 scales)
- **Context Window**: 2048 tokens
- **Prefill Chunk Size**: 512 tokens
- **Target**: WebGPU deployment via WebLLM
- **Memory Usage**: ~800 MB VRAM
- **Total Parameters**: 494,032,768
- **Bits per Parameter**: 5.004 (see the quick check below)

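The bits-per-parameter figure can be reproduced from the sizes listed under Memory Requirements below (a quick check; reading the 294.70 MB parameter size as MiB is an assumption):

```javascript
// Quick check: quantized parameter bytes * 8 / parameter count.
// Assumption: the "294.70 MB" under Memory Requirements is MiB.
const paramBytes = 294.70 * 1024 * 1024;
const totalParams = 494_032_768;
console.log(((paramBytes * 8) / totalParams).toFixed(3)); // ≈ 5.004
```
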
## Usage

### With WebLLM

```javascript
import * as webllm from '@mlc-ai/web-llm';

const customModels = [
  {
    model: "https://huggingface.co/rubenz-org/qwen2-5-0-5b-instruct-mlc",
    model_id: "qwen2.5-0.5b-instruct-custom",
    // Use official WebLLM WASM runtime for compatibility
    model_lib: "https://raw.githubusercontent.com/mlc-ai/binary-mlc-llm-libs/main/Qwen2.5-0.5B-Instruct-q4f32_1-MLC-1k.wasm",
    vram_required_MB: 800,
    low_resource_required: true,
    overrides: {
      context_window_size: 2048,
      prefill_chunk_size: 512,
    },
  },
];

const engine = await webllm.CreateMLCEngine("qwen2.5-0.5b-instruct-custom", {
  appConfig: { model_list: customModels }
});
```
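
The first load streams roughly 300 MB of parameter shards over the network, so surfacing progress is worthwhile. WebLLM's engine config also accepts an `initProgressCallback` for this (a minimal sketch reusing `customModels` from above):

```javascript
// Same engine creation as above, plus a progress callback so the UI
// can show shard download and compilation status during first load.
const engine = await webllm.CreateMLCEngine("qwen2.5-0.5b-instruct-custom", {
  appConfig: { model_list: customModels },
  initProgressCallback: (report) => {
    console.log(report.text); // human-readable status line
  },
});
```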

### Example Chat

```javascript
const response = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello! How can you help me today?' }],
  stream: true,
  temperature: 0.7,
  max_tokens: 512,
});

for await (const chunk of response) {
  const content = chunk.choices[0]?.delta?.content || '';
  if (content) {
    console.log(content);
  }
}
```
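
For one-shot replies, the same endpoint works without streaming; the response follows the OpenAI chat-completions shape (a sketch):

```javascript
// Non-streaming variant: resolves once the full reply is generated.
const reply = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Summarize WebGPU in one sentence.' }],
  temperature: 0.7,
  max_tokens: 512,
});
console.log(reply.choices[0].message.content);
```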

## Files

- `mlc-chat-config.json`: Model configuration for MLC-LLM
- `params_shard_*.bin`: Quantized model parameters (8 shards)
- `tokenizer.json`, `vocab.json`, `merges.txt`: Tokenizer files
- `tensor-cache.json`: Parameter metadata cache

The model uses the official WebLLM WASM runtime for compatibility (the `model_lib` URL in the usage example above) rather than a repo-specific compiled library.

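The shipped configuration can be inspected directly from the repository (an illustrative sketch; the URL follows Hugging Face's standard `resolve/main` pattern, and the printed fields are standard MLC-LLM config keys):

```javascript
// Illustrative: fetch and inspect the MLC-LLM chat config.
const base = "https://huggingface.co/rubenz-org/qwen2-5-0-5b-instruct-mlc/resolve/main";
const config = await (await fetch(`${base}/mlc-chat-config.json`)).json();
console.log(config.context_window_size, config.prefill_chunk_size); // expected per the settings above: 2048 512
```
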
## Performance

This model is optimized for browser deployment with:
- Reduced memory footprint through 4-bit quantization
- WebGPU acceleration for efficient inference
- Chunked prefill for better memory management
- Low resource requirements for edge deployment
- Compatibility with the official WebLLM runtime libraries

## Browser Compatibility

- Chrome 113+ with WebGPU enabled
- Edge 113+ with WebGPU enabled
- Firefox with experimental WebGPU support

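WebGPU availability can be checked at runtime before creating the engine (a minimal sketch using the standard WebGPU API):

```javascript
// navigator.gpu is only defined in browsers with WebGPU support;
// requestAdapter() can still return null on unsupported hardware.
async function hasWebGPU() {
  if (!('gpu' in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

if (await hasWebGPU()) {
  // Safe to call webllm.CreateMLCEngine(...) here.
} else {
  console.warn('WebGPU is not available in this browser.');
}
```
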
## Memory Requirements

- **Without KV cache**: 629.45 MB
- **With 4K KV cache**: 725.45 MB
- **Parameters**: 294.70 MB
- **Temporary buffer**: 334.75 MB

The first figure is the sum of parameters and temporary buffer (294.70 MB + 334.75 MB = 629.45 MB); a 4K-token KV cache adds a further 96 MB on top of that.

## License

This model follows the same license as the original Qwen2.5 model (Apache 2.0).