---
title: Qwen2.5 0.5B Instruct MLC-LLM
language:
- en
library_name: mlc-llm
tags:
- qwen2.5
- mlc-llm
- webgpu
- webllm
- quantized
model_type: qwen2
license: apache-2.0
---

# Qwen2.5 0.5B Instruct MLC-LLM

This is a quantized and optimized version of Qwen2.5 0.5B Instruct, compiled with MLC-LLM for WebGPU deployment in browsers.

## Model Details

- **Base Model**: Qwen2.5 0.5B Instruct
- **Quantization**: q4f32_1 (4-bit quantization with float32 scales)
- **Context Window**: 2048 tokens
- **Prefill Chunk Size**: 512 tokens
- **Target**: WebGPU deployment via WebLLM
- **Memory Usage**: ~800 MB VRAM
- **Total Parameters**: 494,032,768
- **Bits per Parameter**: 5.004 (see the quick check below)

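The bits-per-parameter figure can be reproduced from the sizes listed under Memory Requirements below (a quick check; reading the 294.70 MB parameter size as MiB is an assumption):

```javascript
// Quick check: quantized parameter bytes * 8 / parameter count.
// Assumption: the "294.70 MB" under Memory Requirements is MiB.
const paramBytes = 294.70 * 1024 * 1024;
const totalParams = 494_032_768;
console.log(((paramBytes * 8) / totalParams).toFixed(3)); // ≈ 5.004
```
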
## Usage

### With WebLLM

```javascript
import * as webllm from '@mlc-ai/web-llm';

const customModels = [
  {
    model: "https://huggingface.co/rubenz-org/qwen2-5-0-5b-instruct-mlc",
    model_id: "qwen2.5-0.5b-instruct-custom",
    // Use official WebLLM WASM runtime for compatibility
    model_lib: "https://raw.githubusercontent.com/mlc-ai/binary-mlc-llm-libs/main/Qwen2.5-0.5B-Instruct-q4f32_1-MLC-1k.wasm",
    vram_required_MB: 800,
    low_resource_required: true,
    overrides: {
      context_window_size: 2048,
      prefill_chunk_size: 512,
    },
  },
];

const engine = await webllm.CreateMLCEngine("qwen2.5-0.5b-instruct-custom", {
  appConfig: { model_list: customModels }
});
```
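
The first load streams roughly 300 MB of parameter shards over the network, so surfacing progress is worthwhile. WebLLM's engine config also accepts an `initProgressCallback` for this (a minimal sketch reusing `customModels` from above):

```javascript
// Same engine creation as above, plus a progress callback so the UI
// can show shard download and compilation status during first load.
const engine = await webllm.CreateMLCEngine("qwen2.5-0.5b-instruct-custom", {
  appConfig: { model_list: customModels },
  initProgressCallback: (report) => {
    console.log(report.text); // human-readable status line
  },
});
```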

### Example Chat

```javascript
const response = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello! How can you help me today?' }],
  stream: true,
  temperature: 0.7,
  max_tokens: 512,
});

for await (const chunk of response) {
  const content = chunk.choices[0]?.delta?.content || '';
  if (content) {
    console.log(content);
  }
}
```
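
For one-shot replies, the same endpoint works without streaming; the response follows the OpenAI chat-completions shape (a sketch):

```javascript
// Non-streaming variant: resolves once the full reply is generated.
const reply = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Summarize WebGPU in one sentence.' }],
  temperature: 0.7,
  max_tokens: 512,
});
console.log(reply.choices[0].message.content);
```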

## Files

- `mlc-chat-config.json`: Model configuration for MLC-LLM
- `params_shard_*.bin`: Quantized model parameters (8 shards)
- `tokenizer.json`, `vocab.json`, `merges.txt`: Tokenizer files
- `tensor-cache.json`: Parameter metadata cache

The model uses the official WebLLM WASM runtime for compatibility (the `model_lib` URL in the usage example above) rather than a repo-specific compiled library.

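The shipped configuration can be inspected directly from the repository (an illustrative sketch; the URL follows Hugging Face's standard `resolve/main` pattern, and the printed fields are standard MLC-LLM config keys):

```javascript
// Illustrative: fetch and inspect the MLC-LLM chat config.
const base = "https://huggingface.co/rubenz-org/qwen2-5-0-5b-instruct-mlc/resolve/main";
const config = await (await fetch(`${base}/mlc-chat-config.json`)).json();
console.log(config.context_window_size, config.prefill_chunk_size); // expected per the settings above: 2048 512
```
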
## Performance

This model is optimized for browser deployment with:
- Reduced memory footprint through 4-bit quantization
- WebGPU acceleration for efficient inference
- Chunked prefill for better memory management
- Low resource requirements for edge deployment
- Compatibility with the official WebLLM runtime libraries

## Browser Compatibility

- Chrome 113+ with WebGPU enabled
- Edge 113+ with WebGPU enabled
- Firefox with experimental WebGPU support

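WebGPU availability can be checked at runtime before creating the engine (a minimal sketch using the standard WebGPU API):

```javascript
// navigator.gpu is only defined in browsers with WebGPU support;
// requestAdapter() can still return null on unsupported hardware.
async function hasWebGPU() {
  if (!('gpu' in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

if (await hasWebGPU()) {
  // Safe to call webllm.CreateMLCEngine(...) here.
} else {
  console.warn('WebGPU is not available in this browser.');
}
```
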
## Memory Requirements

- **Without KV cache**: 629.45 MB
- **With 4K KV cache**: 725.45 MB
- **Parameters**: 294.70 MB
- **Temporary buffer**: 334.75 MB

The first figure is the sum of parameters and temporary buffer (294.70 MB + 334.75 MB = 629.45 MB); a 4K-token KV cache adds a further 96 MB on top of that.

## License

This model follows the same license as the original Qwen2.5 model (Apache 2.0).