Delete README.md

README.md (DELETED)

---
license: apache-2.0
language:
- en
library_name: pytorch
tags:
- transformer
- decoder-only
- pointer-networks
- knowledge-distillation
- sparse-attention
- pytorch
pipeline_tag: text-generation
---

# Pointer: Decoder-only Transformer with Relational Routing

Pointer is a decoder-only transformer architecture that implements relational routing through a sparse pointer mechanism. The core idea is to write relational "edges" into the weights while dereferencing node vectors at runtime, with FFN blocks providing the non-linear transformations.

## Model Architecture

### Core Innovation: Pointer Block

The PointerBlock is the heart of this architecture, implementing:

- **Sparse Address Generation**: creates sparse address distributions through top-k selection
- **Multi-head Attention**: uses multiple attention heads for pointer computation
- **Dynamic Vector Aggregation**: aggregates neighbor vectors based on pointer probabilities
- **Pointer-of-Pointer Chaining**: enables hierarchical knowledge addressing across layers (see the sketch below)
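
Below is a minimal, self-contained sketch of the sparse addressing idea, assuming a standard q/k/v parameterization; `TinyPointerBlock` and its internals are illustrative and are not the repository's `pointer_block.py` implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPointerBlock(nn.Module):
    """Illustrative sparse pointer block: top-k addressing + weighted aggregation."""

    def __init__(self, d: int, n_heads: int, top_k: int):
        super().__init__()
        assert d % n_heads == 0
        self.n_heads, self.d_head, self.top_k = n_heads, d // n_heads, top_k
        self.q_proj, self.k_proj = nn.Linear(d, d), nn.Linear(d, d)
        self.v_proj, self.out_proj = nn.Linear(d, d), nn.Linear(d, d)

    def forward(self, x: torch.Tensor):
        B, T, _ = x.shape
        # Project and split into heads: (B, H, T, d_head)
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        # Causal pointer scores over previous positions
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5        # (B, H, T, T)
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~causal, float("-inf"))

        # Sparse address generation: keep only the top-k positions per query
        k_eff = min(self.top_k, T)
        top_scores, top_idx = scores.topk(k_eff, dim=-1)             # (B, H, T, k)
        probs = F.softmax(top_scores, dim=-1)

        # Dynamic aggregation: dereference the selected node vectors and sum them
        idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, -1, self.d_head)
        neighbors = v.unsqueeze(2).expand(-1, -1, T, -1, -1).gather(3, idx)
        out = (probs.unsqueeze(-1) * neighbors).sum(dim=3)           # (B, H, T, d_head)

        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out), top_idx  # indices can feed the next layer's chain
```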

### Architecture Components

```
TokenEmbedding → [PointerLayer × N] → LayerNorm → LM Head

PointerLayer:
├── LayerNorm
├── PointerBlock (sparse addressing + aggregation)
├── Gate + Residual Connection
├── LayerNorm
└── FFN (d → d_ff → d)
```
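
Continuing the sketch above, a PointerLayer can be wired as follows; the gate placement and FFN activation are assumptions and this is not the repository's exact `pointer_layer.py`:

```python
import torch
import torch.nn as nn

class TinyPointerLayer(nn.Module):
    """Illustrative PointerLayer: pre-norm pointer block with a gated residual, then FFN."""

    def __init__(self, d: int, n_heads: int, top_k: int, r: float = 2.7, dropout: float = 0.1):
        super().__init__()
        d_ff = int(d * r)
        self.norm1 = nn.LayerNorm(d)
        self.pointer = TinyPointerBlock(d, n_heads, top_k)   # from the sketch above
        self.gate = nn.Linear(d, d)                          # learned gate on the pointer path
        self.norm2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(
            nn.Linear(d, d_ff), nn.GELU(), nn.Dropout(dropout), nn.Linear(d_ff, d)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _indices = self.pointer(self.norm1(x))
        x = x + torch.sigmoid(self.gate(x)) * h              # Gate + Residual Connection
        x = x + self.ffn(self.norm2(x))                      # FFN (d -> d_ff -> d)
        return x
```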

### Key Features

- **Relational Routing**: only the "edges" are written into weights; node vectors are dereferenced at runtime
- **Sparse Attention**: top-k selection mechanism for efficient computation
- **Knowledge Address Chains**: higher layers reference increasingly abstract relationship patterns
- **KV Caching**: efficient inference with dynamic cache expansion

## Model Specifications

| Parameter | Value |
|-----------|-------|
| Architecture | Decoder-only Transformer |
| Model Size | Pointer-300M |
| Vocabulary Size | Dynamic (based on tokenizer) |
| Hidden Dimension (d) | 1,024 |
| Number of Layers | 24 |
| Attention Heads | 16 |
| Top-k Selection | 2 |
| FFN Expansion Ratio | 2.7 |
| Maximum Sequence Length | 4,096 |
| Parameters | ~300M |
| Dropout | 0.1 |
| FP16 Training | Yes |
| Tied Embeddings | Yes |
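
As a sanity check on the "~300M" figure, here is a rough, hedged estimate from the numbers in the table; it assumes four d×d projections per PointerBlock, a plain two-matrix FFN, and a ~50k-token vocabulary, none of which are specified exactly above:

```python
# Back-of-envelope parameter count for Pointer-300M (illustrative only)
d, n_layers, r = 1024, 24, 2.7
vocab = 50_000                            # assumed; the real vocabulary is tokenizer-dependent
d_ff = int(d * r)

per_layer = 4 * d * d + 2 * d * d_ff      # pointer projections + FFN matrices
embeddings = vocab * d                    # tied input/output embeddings counted once
total = n_layers * per_layer + embeddings

print(f"~{total / 1e6:.0f}M parameters")  # ≈ 288M, consistent with the ~300M in the table
```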

## Training Details

### Mix-Distillation Strategy

The model was trained with Mix-Distillation, following the approach of "Small Models Struggle to Learn from Strong Reasoners":

- **Teacher Model**: DeepSeek-R1
- **Training Data**: Mix-Long strategy with a Long-CoT : Short-CoT ratio of 0.2 : 0.8
- **Training Steps**: 10,000 steps with gradient accumulation
- **Precision**: FP16 with numerical stability protections

### Training Hyperparameters

```yaml
num_epochs: 2
per_device_batch_size: 4
gradient_accumulation_steps: 4
effective_batch_size: 16  # 4 * 4
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.05
weight_decay: 0.01
save_steps: 1000
eval_steps: 500
logging_steps: 50
fp16: true
```
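
A minimal sketch of how the FP16 and gradient-accumulation settings combine into one optimizer step (this is not the repository's training script; computing the loss directly from the logits is an assumption):

```python
import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()
accum_steps = 4                                   # gradient_accumulation_steps

def optimizer_step(model, optimizer, micro_batches):
    """Accumulate 4 micro-batches of size 4 for an effective batch size of 16."""
    optimizer.zero_grad()
    for input_ids, labels in micro_batches:
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            logits = model(input_ids)             # (B, T, vocab)
            loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1),
            ) / accum_steps
        scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```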

### Distillation Configuration

```yaml
temperature: 2.0
alpha: 0.5   # KD loss weight
beta: 1.0    # CE loss weight
gamma: 0.5   # additional loss weight
use_kd_loss: true
use_ce_loss: true
use_hidden_mse: false
use_pointer_kl: false
```
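
The enabled terms above correspond to a standard soft-target distillation objective; a hedged sketch follows (the exact weighting in the repository may differ):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5, beta=1.0):
    """Temperature-scaled KL against the teacher plus hard-label cross-entropy."""
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                          # usual T^2 rescaling for KD
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * kd + beta * ce                 # alpha: KD weight, beta: CE weight
```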

### Training Data

- **Dataset Size**: 110,000 samples from Chinese-DeepSeek-R1-Distill
- **CoT Distribution**:
  - Long-CoT: 22,000 samples (20%)
  - Short-CoT: 88,000 samples (80%)
- **Sequence Length**: 21-2,048 tokens (mean: 885, median: 721)
- **Quality Scores**: 7-10 (mean: 9.09)

### Loss Components

- **Cross-Entropy Loss**: standard language modeling objective
- **Hidden State MSE**: knowledge distillation from teacher hidden states
- **Pointer KL Divergence**: alignment of pointer attention distributions
- **Pointer Cross-Entropy**: hard distillation for pointer indices

Note that `use_hidden_mse` and `use_pointer_kl` are set to `false` in the distillation configuration above, so those two terms are inactive in the released training setup.

## Key Innovations

### 1. Pointer-of-Pointer Mechanism

Each layer produces pointer indices into previous positions, and the next layer uses those indices to form "pointer-of-pointer" chains, enabling hierarchical knowledge-addressing patterns.
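
A toy illustration of the chaining (index composition only; the real model composes indices per head and per batch, and this is not the repository's implementation):

```python
import torch

T, k = 8, 2
layer1_idx = torch.randint(0, T, (T, k))   # pointers produced by layer 1 (random, shapes only)
layer2_idx = torch.randint(0, T, (T, k))   # pointers produced by layer 2

# Follow layer 2's pointers, then layer 1's pointers from the positions reached:
# each position now addresses k*k second-order targets, two hops back.
two_hop = layer1_idx[layer2_idx]           # advanced indexing dereferences the chain
print(two_hop.shape)                       # torch.Size([8, 2, 2])
```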

### 2. Sparse Relational Routing

Instead of dense attention, the model uses sparse top-k selection to identify the most relevant connections, making computation more efficient while maintaining expressiveness.

### 3. Runtime Vector Dereferencing

Unlike traditional transformers that compute attention over all positions, Pointer writes relationship patterns into weights and dereferences specific node vectors only when needed.

### 4. Numerical Stability for FP16

Extensive NaN detection and handling is applied throughout the forward pass, including:

- Input validation in embeddings
- Attention score clamping
- Emergency NaN repairs (see the sketch below)
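
A sketch of the kind of guards this refers to (function names are illustrative, not the repository's helpers):

```python
import torch

FP16_MAX = 65_504.0  # largest finite float16 value

def clamp_scores(scores: torch.Tensor) -> torch.Tensor:
    """Keep attention/pointer scores inside a safe fp16 range before softmax."""
    return scores.clamp(min=-FP16_MAX / 2, max=FP16_MAX / 2)

def repair_nans(x: torch.Tensor) -> torch.Tensor:
    """Emergency repair: replace NaN/Inf activations with finite values."""
    if not torch.isfinite(x).all():
        x = torch.nan_to_num(x, nan=0.0, posinf=FP16_MAX / 2, neginf=-FP16_MAX / 2)
    return x
```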

## Usage

```python
import torch
from src.model.pointer_model import PointerDecoder

vocab_size = 50_000  # use your tokenizer's vocabulary size here

# Initialize a Pointer-300M model with your config
model = PointerDecoder(
    vocab_size=vocab_size,    # dynamic, based on the tokenizer
    d=1024,                   # hidden dimension
    n_layers=24,              # number of layers
    n_heads=16,               # attention heads
    top_k=2,                  # pointer selection
    r=2.7,                    # FFN expansion ratio
    max_seq_len=4096,         # maximum sequence length
    dropout=0.1,              # dropout rate
    tie_embeddings=True,      # tie input/output embeddings
    fp16=True,                # FP16 training
)

# Forward pass
input_ids = torch.randint(0, vocab_size, (1, 100))
logits = model(input_ids)

# Incremental inference with KV caching
cache = model.init_cache(batch_size=1)
for token in input_ids[0]:                 # feed tokens one at a time
    logits, cache = model.step(token, cache)
```

## File Structure

```
src/
├── layers/
│   ├── embedding.py        # TokenEmbedding with vocab reduction support
│   ├── rotary.py           # Rotary positional encoding
│   ├── pointer_block.py    # Core PointerBlock implementation
│   ├── ffn.py              # Feed-forward network
│   └── pointer_layer.py    # PointerBlock + FFN + residual connections
└── model/
    └── pointer_model.py    # Complete PointerDecoder implementation
```

## Supported Languages

- English
- Chinese (Simplified)

## Limitations

- Currently supports only left-to-right generation (no bidirectional encoding)
- Requires careful FP16 training due to numerical stability considerations
- The top-k selection parameter needs tuning for different tasks
- At ~300M parameters, capacity is limited compared with larger language models
- Trained primarily on Chinese data distilled from DeepSeek-R1

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{pointer300m2025,
  title={Pointer-Mini: Decoder-only Transformer with Relational Routing},
  author={Noesis Lab},
  year={2025},
  howpublished={\url{https://huggingface.co/NoesisLab/Pointer-Mini}}
}
```

## License

This model is released under the Apache 2.0 License.