OzTianlu committed
Commit 623b961 · verified · 1 Parent(s): 11018b7

Delete README.md

Files changed (1)
  1. README.md +0 -207
README.md DELETED
@@ -1,207 +0,0 @@
---
license: apache-2.0
language:
- en
- zh
library_name: pytorch
tags:
- transformer
- decoder-only
- pointer-networks
- knowledge-distillation
- sparse-attention
- pytorch
pipeline_tag: text-generation
---

# Pointer: Decoder-only Transformer with Relational Routing

Pointer is a decoder-only transformer architecture that implements relational routing through sparse pointer mechanisms. The core innovation is that relational "edges" are written into the weights while node vectors are dereferenced at runtime, with FFN blocks providing the non-linear transformations.

## Model Architecture

### Core Innovation: Pointer Block
The PointerBlock is the heart of this architecture. It implements the following (a simplified sketch follows this list):
- **Sparse Address Generation**: Creates sparse address distributions through top-k selection
- **Multi-head Attention**: Uses multiple attention heads for pointer computation
- **Dynamic Vector Aggregation**: Aggregates neighbor vectors based on pointer probabilities
- **Pointer-of-Pointer Chaining**: Enables hierarchical knowledge addressing across layers

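A minimal PyTorch sketch of the sparse addressing and aggregation described above, shown single-head for brevity; the function and variable names are illustrative, not the repository implementation:

```python
import torch
import torch.nn.functional as F

def sparse_pointer_address(query, keys, values, top_k=2):
    """Top-k pointer addressing: score all earlier positions, keep only the
    k strongest edges, renormalize them, and aggregate the selected vectors."""
    d = query.size(-1)
    # Dense scores against every earlier position ("node")
    scores = torch.einsum("bd,btd->bt", query, keys) / d ** 0.5      # (batch, T)
    # Sparse address distribution: keep only the top-k edges
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)               # (batch, k)
    probs = F.softmax(topk_scores, dim=-1)
    # Dereference the selected node vectors and aggregate by pointer probability
    gathered = values.gather(1, topk_idx.unsqueeze(-1).expand(-1, -1, d))
    pointer_out = (probs.unsqueeze(-1) * gathered).sum(dim=1)        # (batch, d)
    return pointer_out, topk_idx, probs

# Example: batch of 2, context of 5 earlier positions, d = 8
q = torch.randn(2, 8)
k = v = torch.randn(2, 5, 8)
out, idx, p = sparse_pointer_address(q, k, v, top_k=2)
```

The returned `topk_idx` is what a later layer can reuse to chain pointers, as described under "Key Innovations" below.
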
### Architecture Components

```
TokenEmbedding → [PointerLayer × N] → LayerNorm → LM Head

PointerLayer:
├── LayerNorm
├── PointerBlock (sparse addressing + aggregation)
├── Gate + Residual Connection
├── LayerNorm
└── FFN (d → d_ff → d)
```

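Read as code, the per-layer composition is roughly the following pre-norm sketch (module names, the gate form, and the GELU activation are assumptions, not the repository implementation):

```python
import torch
import torch.nn as nn

class PointerLayerSketch(nn.Module):
    """Illustrative PointerLayer: gated PointerBlock residual, then an FFN block."""
    def __init__(self, d, d_ff, pointer_block):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.pointer = pointer_block                    # sparse addressing + aggregation
        self.gate = nn.Linear(d, d)                     # learned gate on the pointer path
        self.norm2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.GELU(), nn.Linear(d_ff, d))

    def forward(self, x):                               # x: (batch, T, d)
        h = self.pointer(self.norm1(x))                 # relational routing
        x = x + torch.sigmoid(self.gate(x)) * h         # gate + residual connection
        x = x + self.ffn(self.norm2(x))                 # FFN (d -> d_ff -> d)
        return x
```
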
### Key Features
- **Relational Routing**: Only the relational "edges" are written into weights; node vectors are dereferenced at runtime
- **Sparse Attention**: Top-k selection mechanism for efficient computation
- **Knowledge Address Chains**: Higher layers reference increasingly abstract relationship patterns
- **KV Caching**: Efficient inference with dynamic cache expansion (sketched below)

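A rough illustration of dynamic cache expansion during incremental decoding; the actual cache layout used by `PointerDecoder.init_cache` / `step` may differ:

```python
import torch

class NodeCache:
    """Grows cached key/value ("node") tensors one position at a time so that
    newly decoded tokens become addressable by later pointer lookups."""
    def __init__(self):
        self.keys = None     # (batch, T, d)
        self.values = None   # (batch, T, d)

    def append(self, k_t, v_t):
        # k_t, v_t: (batch, 1, d) for the newly decoded position
        if self.keys is None:
            self.keys, self.values = k_t, v_t
        else:
            self.keys = torch.cat([self.keys, k_t], dim=1)
            self.values = torch.cat([self.values, v_t], dim=1)
        return self.keys, self.values
```
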
## Model Specifications

| Parameter | Value |
|-----------|-------|
| Architecture | Decoder-only Transformer |
| Model Size | Pointer-300M |
| Vocabulary Size | Dynamic (based on tokenizer) |
| Hidden Dimension (d) | 1,024 |
| Number of Layers | 24 |
| Attention Heads | 16 |
| Top-k Selection | 2 |
| FFN Expansion Ratio | 2.7 |
| Maximum Sequence Length | 4,096 |
| Parameters | ~300M |
| Dropout | 0.1 |
| FP16 Training | Yes |
| Tied Embeddings | Yes |

## Training Details

### Mix-Distillation Strategy
The model was trained with Mix-Distillation, following the approach of "Small Models Struggle to Learn from Strong Reasoners":

- **Teacher Model**: DeepSeek-R1
- **Training Data**: Mix-Long strategy with a Long-CoT : Short-CoT ratio of 0.2 : 0.8 (see the sketch after this list)
- **Training Steps**: 10,000 steps with gradient accumulation
- **Precision**: FP16 with numerical stability protections

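A sketch of how such a 0.2 : 0.8 mixture could be assembled; the sampling helper below is illustrative, not the project's actual data pipeline:

```python
import random

def build_mix_long(long_cot, short_cot, total, long_ratio=0.2, seed=0):
    """Sample a Long-CoT : Short-CoT mixture at the given ratio and shuffle it."""
    rng = random.Random(seed)
    n_long = int(total * long_ratio)    # e.g. 22,000 of 110,000 samples
    n_short = total - n_long            # e.g. 88,000 of 110,000 samples
    mixed = rng.sample(long_cot, n_long) + rng.sample(short_cot, n_short)
    rng.shuffle(mixed)
    return mixed
```
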
### Training Hyperparameters
```yaml
num_epochs: 2
per_device_batch_size: 4
gradient_accumulation_steps: 4
effective_batch_size: 16  # 4 * 4
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.05
weight_decay: 0.01
save_steps: 1000
eval_steps: 500
logging_steps: 50
fp16: true
```

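These settings map onto a standard PyTorch setup; a sketch assuming AdamW and the `transformers` cosine-warmup scheduler (the actual training script may differ):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer(model, total_steps, lr=2e-4, weight_decay=0.01, warmup_ratio=0.05):
    # AdamW with weight decay 0.01; cosine decay after a 5% warmup
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(total_steps * warmup_ratio),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```
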
### Distillation Configuration
```yaml
temperature: 2.0
alpha: 0.5  # KD loss weight
beta: 1.0   # CE loss weight
gamma: 0.5  # Additional loss weight
use_kd_loss: true
use_ce_loss: true
use_hidden_mse: false
use_pointer_kl: false
```

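A sketch of how `temperature`, `alpha`, and `beta` could combine the enabled losses (soft-target KD plus hard-label CE); `gamma` presumably weights the optional hidden-MSE / pointer losses, which are disabled above, so it is omitted here. This is an assumption based on the comments, not the project's training code:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5, beta=1.0):
    """Temperature-scaled KD (KL divergence) combined with cross-entropy."""
    # Hard-label cross-entropy against ground-truth tokens
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    # KL between temperature-softened teacher and student distributions
    t = temperature
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    return alpha * kd + beta * ce
```
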
### Training Data
- **Dataset Size**: 110,000 samples from Chinese-DeepSeek-R1-Distill
- **CoT Distribution**:
  - Long-CoT: 22,000 samples (20%)
  - Short-CoT: 88,000 samples (80%)
- **Sequence Length**: 21-2,048 tokens (mean: 885, median: 721)
- **Quality Scores**: 7-10 (mean: 9.09)

### Loss Components
- **Cross-Entropy Loss**: Standard language modeling objective
- **Hidden State MSE**: Knowledge distillation from teacher hidden states
- **Pointer KL Divergence**: Alignment of pointer attention distributions
- **Pointer Cross-Entropy**: Hard distillation for pointer indices

## Key Innovations

### 1. Pointer-of-Pointer Mechanism
Each layer produces pointer indices to previous positions, and the next layer uses these indices to create "pointer-of-pointer" chains, enabling hierarchical knowledge addressing patterns.

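As a toy illustration of the chaining idea (indices and shapes are made up for the example):

```python
import torch

# Pointer indices produced by layer l: for each position t, the top-2 earlier
# positions it points to (shape: T=4 positions x k=2 edges).
idx_layer_l = torch.tensor([[0, 0],
                            [0, 0],
                            [1, 0],
                            [2, 1]])

# Suppose layer l+1 points position t=3 at position 2. Following position 2's
# own layer-l pointers dereferences a two-hop "pointer of pointer" chain.
target = 2
two_hop = idx_layer_l[target]   # tensor([1, 0]): positions reached in two hops
```
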
### 2. Sparse Relational Routing
Instead of dense attention, the model uses sparse top-k selection to identify the most relevant connections, making computation more efficient while maintaining expressiveness.

### 3. Runtime Vector Dereferencing
Unlike traditional transformers that compute attention over all positions, Pointer writes relationship patterns into weights and dereferences specific node vectors only when needed.

### 4. Numerical Stability for FP16
The forward pass includes extensive NaN detection and handling, including:
- Input validation in embeddings
- Attention score clamping
- Emergency NaN repairs

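For example, the clamping and repair steps can look roughly like this (illustrative; the repository's exact thresholds and placement are not documented here):

```python
import torch

def stabilize_scores(scores, clamp_value=1e4):
    """Clamp attention/pointer scores to a safe fp16 range and repair NaN/Inf."""
    scores = torch.clamp(scores, min=-clamp_value, max=clamp_value)
    # Emergency repair: replace any NaN/Inf produced by earlier fp16 ops
    scores = torch.nan_to_num(scores, nan=0.0, posinf=clamp_value, neginf=-clamp_value)
    return scores
```
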
## Usage

```python
import torch
from src.model.pointer_model import PointerDecoder

# Initialize the Pointer-300M model with your config
# (assumes a `tokenizer` object, e.g. a Hugging Face tokenizer, is already loaded)
model = PointerDecoder(
    vocab_size=tokenizer.vocab_size,  # Dynamic, based on the tokenizer
    d=1024,                 # Hidden dimension
    n_layers=24,            # Number of layers
    n_heads=16,             # Attention heads
    top_k=2,                # Pointer selection
    r=2.7,                  # FFN expansion ratio
    max_seq_len=4096,       # Max sequence length
    dropout=0.1,            # Dropout rate
    tie_embeddings=True,    # Tie input/output embeddings
    fp16=True               # FP16 training
)

# Forward pass
input_ids = torch.randint(0, tokenizer.vocab_size, (1, 100))
logits = model(input_ids)

# Inference with caching (`input_sequence` is an iterable of token ids)
cache = model.init_cache(batch_size=1)
for token in input_sequence:
    logits, cache = model.step(token, cache)
```

## File Structure

```
src/
├── layers/
│   ├── embedding.py       # TokenEmbedding with vocab reduction support
│   ├── rotary.py          # Rotary positional encoding
│   ├── pointer_block.py   # Core PointerBlock implementation
│   ├── ffn.py             # Feed-forward network
│   └── pointer_layer.py   # PointerBlock + FFN + residual connections
└── model/
    └── pointer_model.py   # Complete PointerDecoder implementation
```

## Supported Languages

- English
- Chinese (Simplified)

## Limitations

- Currently supports only left-to-right generation (no bidirectional attention)
- Requires careful FP16 training due to numerical stability considerations
- The top-k selection parameter needs tuning for different tasks
- At ~300M parameters, capability is limited compared with larger language models
- Trained primarily on Chinese data via DeepSeek-R1 distillation

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{pointer300m2025,
  title={Pointer-Mini: Decoder-only Transformer with Relational Routing},
  author={Noesis Lab},
  year={2025},
  howpublished={\url{https://huggingface.co/NoesisLab/Pointer-Mini}}
}
```

## License

This model is released under the Apache 2.0 License.