Llama-3.2-3B-Instruct: Optimized for SiMa.ai Modalix
Overview
This repository contains the Llama-3.2-3B-Instruct model, optimized and compiled for the SiMa.ai Modalix platform for text-only inference.
- Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture (3.21B parameters).
- Quantization: Hybrid (see the memory estimate below)
  - Prompt Processing: A16W8 (16-bit activations, 8-bit weights)
  - Token Generation: A16W4 (16-bit activations, 4-bit weights)
- Maximum Context Length: 2048 tokens
- Source Model: meta-llama/Llama-3.2-3B-Instruct
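The hybrid scheme trades a little precision for a large memory and bandwidth saving during token generation, where weight loads dominate. A rough back-of-envelope weight-memory estimate (illustrative only; actual on-device usage also depends on the compiled format, KV cache, and runtime overhead):

```bash
# Approximate weight memory for 3.21B parameters (illustrative only)
awk 'BEGIN {
  params = 3.21e9
  printf "A16W4 weights: ~%.1f GB\n", params * 0.5 / 1e9   # 4-bit  = 0.5 bytes/param
  printf "FP16 weights:  ~%.1f GB\n", params * 2.0 / 1e9   # 16-bit = 2.0 bytes/param
}'
```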
Performance
The following performance metrics were measured with an input sequence length of 128 tokens.
| Model | Precision | Device | Response Rate (tokens/sec) | Time To First Token (sec) |
|---|---|---|---|---|
| Llama-3.2-3B-Instruct | A16W8/A16W4 | Modalix | 19.2 | 0.12 |
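These two figures combine into a simple end-to-end estimate: total response time ≈ TTFT + output tokens / response rate. A minimal sketch, using an arbitrary 256-token response length:

```bash
# Back-of-envelope latency for a 256-token response (illustrative only):
# total ≈ TTFT + tokens / rate = 0.12 + 256 / 19.2 ≈ 13.5 s
awk 'BEGIN { printf "~%.1f s\n", 0.12 + 256 / 19.2 }'
```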
Prerequisites
To run this model, you need:
- SiMa.ai Modalix Device
- SiMa.ai CLI (`sima-cli`): Installed on your Modalix device.
- Hugging Face CLI (`hf`): For downloading the model (install sketch below).
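If you do not have the Hugging Face CLI yet, it ships with the `huggingface_hub` Python package. A minimal install sketch (log in only if the repository is gated):

```bash
# Install the Hugging Face CLI (provides the `hf` command)
pip install -U huggingface_hub

# Authenticate only if the repository requires it (gated models)
hf auth login
```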
Installation & Deployment
Follow these steps to deploy the model to your Modalix device.
1. Install LLiMa Demo Application
Note: This is a one-time setup. If you have already installed the LLiMa demo application (e.g., for another model), you can skip this step and continue with the model download.
On your Modalix device, install the LLiMa demo application using the sima-cli:
```bash
# Create a directory for LLiMa
cd /media/nvme
mkdir llima
cd llima

# Install the LLiMa runtime code
sima-cli install -v 2.0.0 samples/llima -t select
```
Note: To download only the LLiMa runtime code, select 🚫 Skip when prompted.
2. Download the Model
Download the compiled model assets from this repository directly to your device.
```bash
# Download the model to a local directory
cd /media/nvme/llima
hf download simaai/Llama-3.2-3B-Instruct-a16w4 --local-dir Llama-3.2-3B-Instruct-a16w4
```
Alternatively, you can download the compiled model to a host machine and copy it to the Modalix device:

```bash
hf download simaai/Llama-3.2-3B-Instruct-a16w4 --local-dir Llama-3.2-3B-Instruct-a16w4
scp -r Llama-3.2-3B-Instruct-a16w4 sima@<modalix-ip>:/media/nvme/llima/
```
Replace <modalix-ip> with the IP address of your Modalix device.
Expected Directory Structure:
```
/media/nvme/llima/
├── simaai-genai-demo/            # The demo app
└── Llama-3.2-3B-Instruct-a16w4/  # Your downloaded model
```
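A quick sanity check before launching, to confirm both directories sit side by side:

```bash
ls /media/nvme/llima/
# Expected: Llama-3.2-3B-Instruct-a16w4  simaai-genai-demo
```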
Usage
Run the Application
Navigate to the demo directory and start the application:
```bash
cd /media/nvme/llima/simaai-genai-demo
./run.sh
```
The script will detect the installed model(s) and prompt you to select one.
Once the application is running, open a browser and navigate to:
https://<modalix-ip>:5000/
Replace <modalix-ip> with the IP address of your Modalix device.
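If the page does not load, you can first confirm that the server is reachable from another machine. A quick check (the `-k` flag skips TLS certificate verification, matching the curl example below; a self-signed certificate is assumed):

```bash
# Prints the HTTP status code if the server is up
curl -k -s -o /dev/null -w "%{http_code}\n" https://<modalix-ip>:5000/
```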
API Usage
To use the OpenAI-compatible API, run the model in API mode:
```bash
cd /media/nvme/llima/simaai-genai-demo
./run.sh --httponly --api-only
```
You can interact with it using curl or Python.
Example: Chat Completion
```bash
curl -N -k -X POST "https://<modalix-ip>:5000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Why is the sky blue?" }
    ],
    "stream": true
  }'
```
Replace <modalix-ip> with the IP address of your Modalix device.
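For a non-streaming call, setting the standard OpenAI-style `"stream": false` and piping through `jq` extracts just the reply text. This is a sketch that assumes the demo honors non-streaming mode and the usual OpenAI response shape (not confirmed here), and it requires `jq` on the client:

```bash
# Hypothetical non-streaming request; assumes an OpenAI-style response body
curl -k -s -X POST "https://<modalix-ip>:5000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Give me one fun fact about the ocean." }
    ],
    "stream": false
  }' | jq -r '.choices[0].message.content'
```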
Limitations
- Quantization: This model is quantized (A16W8/A16W4) for efficient execution on embedded devices. While accuracy remains high, minor deviations from the full-precision model may occur.
Troubleshooting
- `sima-cli` not found: Ensure that `sima-cli` is installed on your Modalix device.
- Model can't be run: Verify that the model directory is directly inside `/media/nvme/llima/` and not nested (e.g., `/media/nvme/llima/Llama-3.2-3B-Instruct-a16w4/Llama-3.2-3B-Instruct-a16w4`).
- Permission denied: Ensure you have read/write permissions for the `/media/nvme` directory.