# adaptive_rag / bert_encoder_detailed_explained.py
"""
BERT Encoder 12ๅฑ‚่ฏฆ็ป†่งฃๆž
ๅฑ•็คบ Vectara HHEM ไธญ BERT ็ผ–็ ๅ™จ็š„ๆฏไธ€ๅฑ‚ๅค„็†่ฟ‡็จ‹
ไฝฟ็”จ็œŸๅฎž่ฎญ็ปƒๆ ทๆœฌๆผ”็คบๆ•ฐๆฎๆต่ฝฌ
"""
import numpy as np
print("=" * 80)
print("BERT Encoder 12ๅฑ‚ๅฎŒๆ•ด่งฃๆž - ่”ๅˆ็ผ–็ ๅนป่ง‰ๆฃ€ๆต‹")
print("=" * 80)
# ============================================================================
# Part 1: Training sample
# ============================================================================
print("\n" + "=" * 80)
print("๐Ÿ“š Part 1: ่ฎญ็ปƒๆ ทๆœฌ")
print("=" * 80)
print("""
่ฎญ็ปƒๆ ทๆœฌ๏ผˆๅนป่ง‰ๆฃ€ๆต‹๏ผ‰๏ผš
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Documents (ๆ–‡ๆกฃ):
"AlphaCodium ๆ˜ฏไธ€็งไปฃ็ ็”Ÿๆˆๆ–นๆณ•๏ผŒ้€š่ฟ‡่ฟญไปฃๆ”น่ฟ›ๆๅ‡ๆ€ง่ƒฝใ€‚"
Generation (LLM็”Ÿๆˆ):
"AlphaCodium ๆ˜ฏ Google ๅœจ 2024 ๅนดๅ‘ๅธƒ็š„ไปฃ็ ็”Ÿๆˆๅทฅๅ…ทใ€‚"
Label (ๆ ‡็ญพ):
Hallucinated โŒ
ๅŽŸๅ› :
- "Google" ๅœจๆ–‡ๆกฃไธญๆฒกๆœ‰ โ†’ ๅนป่ง‰
- "2024 ๅนด" ๅœจๆ–‡ๆกฃไธญๆฒกๆœ‰ โ†’ ๅนป่ง‰
- "ๅทฅๅ…ท" vs "ๆ–นๆณ•" โ†’ ่ฏ่ฏญไธ็ฒพ็กฎ
""")
# ============================================================================
# Part 2: Tokenization and initial embeddings
# ============================================================================
print("\n" + "=" * 80)
print("๐Ÿ”ง Part 2: ่พ“ๅ…ฅๅ‡†ๅค‡ - Tokenization")
print("=" * 80)
print("""
Step 1: ๆ–‡ๆœฌๆ‹ผๆŽฅ
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
่พ“ๅ…ฅๆ ผๅผ:
[CLS] Documents [SEP] Generation [SEP]
ๅฎž้™…ๆ‹ผๆŽฅๅŽ:
[CLS] AlphaCodium ๆ˜ฏไธ€็งไปฃ็ ็”Ÿๆˆๆ–นๆณ•๏ผŒ้€š่ฟ‡่ฟญไปฃๆ”น่ฟ›ๆๅ‡ๆ€ง่ƒฝใ€‚
[SEP] AlphaCodium ๆ˜ฏ Google ๅœจ 2024 ๅนดๅ‘ๅธƒ็š„ไปฃ็ ็”Ÿๆˆๅทฅๅ…ทใ€‚
[SEP]
Step 2: Tokenization (BERT WordPiece ๅˆ†่ฏ)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
ๅˆ†่ฏ็ป“ๆžœ๏ผˆ็ฎ€ๅŒ–๏ผŒๅฎž้™…ไผšๆ›ด็ป†๏ผ‰:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
ไฝ็ฝฎ Token Token ID Segment ID
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
0 [CLS] 101 0
1 Alpha 2945 0
2 ##Codium 3421 0
3 ๆ˜ฏ 2003 0
4 ไธ€็ง 1037 0
5 ไปฃ็  4521 0
6 ็”Ÿๆˆ 3156 0
7 ๆ–นๆณ• 2567 0
8 ๏ผŒ 110 0
9 ้€š่ฟ‡ 2134 0
10 ่ฟญไปฃ 3789 0
11 ๆ”น่ฟ› 2891 0
12 ๆๅ‡ 4123 0
13 ๆ€ง่ƒฝ 3456 0
14 ใ€‚ 119 0
15 [SEP] 102 0 โ† ็ฌฌไธ€ไธชๅˆ†้š”็ฌฆ
16 Alpha 2945 1 โ† Segment ID ๅ˜ไธบ 1
17 ##Codium 3421 1
18 ๆ˜ฏ 2003 1
19 Google 5678 1
20 ๅœจ 2156 1
21 2024 4532 1
22 ๅนด 3267 1
23 ๅ‘ๅธƒ 2789 1
24 ็š„ 1998 1
25 ไปฃ็  4521 1
26 ็”Ÿๆˆ 3156 1
27 ๅทฅๅ…ท 3890 1
28 ใ€‚ 119 1
29 [SEP] 102 1 โ† ็ฌฌไบŒไธชๅˆ†้š”็ฌฆ
ๆ€ปๅ…ฑ: 30 ไธช tokens
Step 3: ๅˆๅง‹ Embeddings
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
BERT ็š„่พ“ๅ…ฅ = Token Embedding + Segment Embedding + Position Embedding
ๅฏนไบŽๆฏไธช token๏ผŒ่Žทๅ–ไธ‰ไธช embedding ๅนถ็›ธๅŠ ๏ผš
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
ไปฅ Token 0 "[CLS]" ไธบไพ‹:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
1. Token Embedding (่ฏๅตŒๅ…ฅ่กจๆŸฅ่ฏข)
Token ID: 101
โ†’ Embedding Table[101] = [0.12, -0.34, 0.56, ..., 0.78] (768็ปด)
2. Segment Embedding (ๆฎต่ฝๅตŒๅ…ฅ)
Segment ID: 0 (ๅฑžไบŽ Documents ้ƒจๅˆ†)
โ†’ Segment Table[0] = [0.05, 0.02, -0.01, ..., 0.03] (768็ปด)
3. Position Embedding (ไฝ็ฝฎๅตŒๅ…ฅ)
Position: 0 (็ฌฌไธ€ไธชไฝ็ฝฎ)
โ†’ Position Table[0] = [0.08, -0.12, 0.15, ..., -0.05] (768็ปด)
4. ็›ธๅŠ ๅพ—ๅˆฐๅˆๅง‹ๅ‘้‡
Initial Embedding[0] = Token + Segment + Position
= [0.12, -0.34, 0.56, ..., 0.78]
+ [0.05, 0.02, -0.01, ..., 0.03]
+ [0.08, -0.12, 0.15, ..., -0.05]
= [0.25, -0.44, 0.70, ..., 0.76] (768็ปด)
ๆ‰€ๆœ‰ tokens ็š„ๅˆๅง‹ๅ‘้‡็Ÿฉ้˜ต:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Hโฐ = [
[0.25, -0.44, 0.70, ..., 0.76], โ† Token 0: [CLS]
[0.15, 0.32, -0.23, ..., 0.45], โ† Token 1: Alpha
[0.28, -0.15, 0.41, ..., 0.52], โ† Token 2: ##Codium
...
[0.19, 0.27, -0.38, ..., 0.61] โ† Token 29: [SEP]
]
ๅฝข็Šถ: (30, 768)
โ†‘ โ†‘
30ไธชtokens ๆฏไธช768็ปด
""")
# ============================================================================
# Part 3: BERT Encoder layer structure
# ============================================================================
print("\n" + "=" * 80)
print("๐Ÿ—๏ธ Part 3: BERT Encoder Layer ็ป“ๆž„๏ผˆๅ•ๅฑ‚่ฏฆ่งฃ๏ผ‰")
print("=" * 80)
print("""
ๆฏไธ€ๅฑ‚ BERT Encoder ็š„็ป“ๆž„:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
่พ“ๅ…ฅ: H^(l-1) (ไธŠไธ€ๅฑ‚็š„่พ“ๅ‡บ๏ผŒๅฝข็Šถ: 30ร—768)
่พ“ๅ‡บ: H^l (ๆœฌๅฑ‚็š„่พ“ๅ‡บ๏ผŒๅฝข็Šถ: 30ร—768)
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ BERT Encoder Layer โ”‚
โ”‚ โ”‚
โ”‚ ่พ“ๅ…ฅ: H^(l-1) (30, 768) โ”‚
โ”‚ โ†“ โ”‚
โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚ โ”‚ Sub-Layer 1: Multi-Head Self-Attention โ”‚ โ”‚
โ”‚ โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”‚ โ”‚
โ”‚ โ”‚ โ”‚ โ”‚
โ”‚ โ”‚ 1.1 ่ฎก็ฎ— Q, K, V โ”‚ โ”‚
โ”‚ โ”‚ Q = H^(l-1) ร— W^Q (30ร—768 ร— 768ร—768) โ”‚ โ”‚
โ”‚ โ”‚ K = H^(l-1) ร— W^K (30ร—768 ร— 768ร—768) โ”‚ โ”‚
โ”‚ โ”‚ V = H^(l-1) ร— W^V (30ร—768 ร— 768ร—768) โ”‚ โ”‚
โ”‚ โ”‚ โ”‚ โ”‚
โ”‚ โ”‚ 1.2 ๅˆ†ๆˆ 12 ไธช Head โ”‚ โ”‚
โ”‚ โ”‚ ๆฏไธช Head: 768 / 12 = 64 ็ปด โ”‚ โ”‚
โ”‚ โ”‚ โ”‚ โ”‚
โ”‚ โ”‚ 1.3 ๆฏไธช Head ่ฎก็ฎ— Attention โ”‚ โ”‚
โ”‚ โ”‚ Attention = softmax(Qร—K^T / โˆš64) ร— V โ”‚ โ”‚
โ”‚ โ”‚ โ”‚ โ”‚
โ”‚ โ”‚ 1.4 Concat ๆ‰€ๆœ‰ Heads โ”‚ โ”‚
โ”‚ โ”‚ Output = Concat(headโ‚, ..., headโ‚โ‚‚) โ”‚ โ”‚
โ”‚ โ”‚ โ”‚ โ”‚
โ”‚ โ”‚ 1.5 ็บฟๆ€งๅ˜ๆข โ”‚ โ”‚
โ”‚ โ”‚ Output = Output ร— W^O โ”‚ โ”‚
โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚ โ†“ โ”‚
โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚ โ”‚ Add & Norm 1 โ”‚ โ”‚
โ”‚ โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”‚ โ”‚
โ”‚ โ”‚ H_att = LayerNorm(H^(l-1) + Attention_Output) โ”‚ โ”‚
โ”‚ โ”‚ โ†‘ ๆฎ‹ๅทฎ่ฟžๆŽฅ โ†‘ Attention ่พ“ๅ‡บ โ”‚ โ”‚
โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚ โ†“ โ”‚
โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚ โ”‚ Sub-Layer 2: Feed Forward Network โ”‚ โ”‚
โ”‚ โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”‚ โ”‚
โ”‚ โ”‚ โ”‚ โ”‚
โ”‚ โ”‚ 2.1 ็ฌฌไธ€ๅฑ‚ๅ…จ่ฟžๆŽฅ + ReLU โ”‚ โ”‚
โ”‚ โ”‚ FFNโ‚ = ReLU(H_att ร— Wโ‚ + bโ‚) โ”‚ โ”‚
โ”‚ โ”‚ (30ร—768 ร— 768ร—3072 = 30ร—3072) โ”‚ โ”‚
โ”‚ โ”‚ โ”‚ โ”‚
โ”‚ โ”‚ 2.2 ็ฌฌไบŒๅฑ‚ๅ…จ่ฟžๆŽฅ โ”‚ โ”‚
โ”‚ โ”‚ FFNโ‚‚ = FFNโ‚ ร— Wโ‚‚ + bโ‚‚ โ”‚ โ”‚
โ”‚ โ”‚ (30ร—3072 ร— 3072ร—768 = 30ร—768) โ”‚ โ”‚
โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚ โ†“ โ”‚
โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚ โ”‚ Add & Norm 2 โ”‚ โ”‚
โ”‚ โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”‚ โ”‚
โ”‚ โ”‚ H^l = LayerNorm(H_att + FFNโ‚‚) โ”‚ โ”‚
โ”‚ โ”‚ โ†‘ ๆฎ‹ๅทฎ่ฟžๆŽฅ โ†‘ FFN ่พ“ๅ‡บ โ”‚ โ”‚
โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚ โ†“ โ”‚
โ”‚ ่พ“ๅ‡บ: H^l (30, 768) โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
ๅ…ณ้”ฎๅ‚ๆ•ฐ:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
- Hidden Size: 768
- Attention Heads: 12
- Head Dimension: 768 / 12 = 64
- Intermediate Size (FFN): 3072
- Dropout: 0.1
""")
# ============================================================================
# Part 4: Multi-Head Self-Attention computation
# ============================================================================
print("\n" + "=" * 80)
print("๐Ÿ” Part 4: Multi-Head Self-Attention ่ฏฆ็ป†่ฎก็ฎ—่ฟ‡็จ‹")
print("=" * 80)
print("""
ไปฅ Layer 1 ไธบไพ‹๏ผŒ่ฏฆ็ป†ๅฑ•็คบ Attention ่ฎก็ฎ—:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
่พ“ๅ…ฅ: Hโฐ (30, 768) # ๅˆๅง‹ embeddings
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Step 1: ่ฎก็ฎ— Q, K, V
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Q = Hโฐ ร— W^Q
= (30, 768) ร— (768, 768)
= (30, 768)
K = Hโฐ ร— W^K
= (30, 768) ร— (768, 768)
= (30, 768)
V = Hโฐ ร— W^V
= (30, 768) ร— (768, 768)
= (30, 768)
ๅฎž้™…ๆ•ฐๅ€ผ็คบไพ‹๏ผˆๅชๅฑ•็คบๅ‰3ไธชtoken็š„ๅ‰8็ปด๏ผ‰:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Q = [
[0.15, -0.23, 0.34, 0.12, -0.45, 0.67, 0.89, -0.12, ...], โ† [CLS]
[0.22, 0.18, -0.31, 0.45, 0.23, -0.56, 0.34, 0.78, ...], โ† Alpha
[0.34, -0.12, 0.45, -0.23, 0.67, 0.12, -0.89, 0.45, ...], โ† ##Codium
...
]
K ๅ’Œ V ็ฑปไผผ็ป“ๆž„
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Step 2: ๅˆ†ๆˆ 12 ไธช Head
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
ๅฐ† 768 ็ปดๅˆ†ๆˆ 12 ไปฝ๏ผŒๆฏไปฝ 64 ็ปด๏ผš
Head 0: Q[:, 0:64], K[:, 0:64], V[:, 0:64]
Head 1: Q[:, 64:128], K[:, 64:128], V[:, 64:128]
...
Head 11: Q[:, 704:768], K[:, 704:768], V[:, 704:768]
ไปฅ Head 0 ไธบไพ‹:
Qโ‚€ = Q[:, 0:64] # (30, 64)
Kโ‚€ = K[:, 0:64] # (30, 64)
Vโ‚€ = V[:, 0:64] # (30, 64)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Step 3: ่ฎก็ฎ— Attention Scores (Head 0)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Scores = Qโ‚€ ร— Kโ‚€^T / โˆš64
= (30, 64) ร— (64, 30) / 8
= (30, 30) / 8
็ป“ๆžœ็Ÿฉ้˜ต Scores (30, 30):
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
ๆฏไธชๅ…ƒ็ด  Scores[i][j] ่กจ็คบ token i ๅฏน token j ็š„ๆณจๆ„ๅŠ›ๅˆ†ๆ•ฐ
็คบไพ‹๏ผˆๅ‰5x5๏ผ‰:
โ†“ Key tokens
[CLS] Alpha ##Cod ๆ˜ฏ ไธ€็ง
[CLS] [2.3 1.5 1.8 0.9 0.7 ...] โ† Query: [CLS]
Alpha [1.2 3.1 2.9 1.1 0.8 ...] โ† Query: Alpha
##Cod [1.0 2.8 3.5 1.3 0.9 ...] โ† Query: ##Codium
ๆ˜ฏ [0.8 1.2 1.4 2.1 1.5 ...] โ† Query: ๆ˜ฏ
ไธ€็ง [0.6 0.9 1.0 1.6 2.3 ...] โ† Query: ไธ€็ง
...
่งฃ้‡Š:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Scores[0][0] = 2.3 โ†’ [CLS] ๅฏน่‡ชๅทฑ็š„ๆณจๆ„ๅŠ›
Scores[1][2] = 2.9 โ†’ "Alpha" ๅฏน "##Codium" ็š„ๆณจๆ„ๅŠ›๏ผˆๅพˆ้ซ˜๏ผŒๅ› ไธบๆ˜ฏๅŒไธ€ไธช่ฏ๏ผ‰
Scores[19][1] = 1.8 โ†’ "Google"(pos 19) ๅฏน "Alpha"(pos 1) ็š„ๆณจๆ„ๅŠ›
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Step 4: Softmax ๅฝ’ไธ€ๅŒ–
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Attention_Weights = softmax(Scores, dim=-1)
ๅฏนๆฏไธ€่กŒๅš softmax๏ผˆๅ’Œไธบ1๏ผ‰:
็คบไพ‹๏ผˆๅ‰5x5๏ผŒๅฝ’ไธ€ๅŒ–ๅŽ๏ผ‰:
โ†“ Key tokens
[CLS] Alpha ##Cod ๆ˜ฏ ไธ€็ง ...
[CLS] [0.35 0.15 0.20 0.08 0.05 ...] โ† ๆ€ปๅ’Œ=1.0
Alpha [0.10 0.40 0.35 0.08 0.04 ...] โ† ๆ€ปๅ’Œ=1.0
##Cod [0.08 0.28 0.45 0.10 0.06 ...] โ† ๆ€ปๅ’Œ=1.0
ๆ˜ฏ [0.12 0.18 0.20 0.30 0.15 ...] โ† ๆ€ปๅ’Œ=1.0
ไธ€็ง [0.10 0.14 0.16 0.22 0.32 ...] โ† ๆ€ปๅ’Œ=1.0
...
ๅ…ณ้”ฎ่ง‚ๅฏŸ:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
- "Alpha" ๅฏน "##Codium" ็š„ๆƒ้‡ = 0.35๏ผˆ้ซ˜๏ผ๏ผ‰
โ†’ ่ฏดๆ˜Žๆจกๅž‹ๅญฆไผšไบ†ๅฎƒไปฌๆ˜ฏๅŒไธ€ไธช่ฏ
- "Google" (pos 19) ๅฏน Documents ไธญ็š„ tokens ๆƒ้‡่พƒไฝŽ
โ†’ ๅ› ไธบ Documents ไธญๆฒกๆœ‰ "Google"
โ†’ ่ฟ™ไธชไฟกๆฏไผš่ขซ็”จไบŽๅˆคๆ–ญๅนป่ง‰๏ผ
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Step 5: ๅŠ ๆƒๆฑ‚ๅ’Œ V
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Outputโ‚€ = Attention_Weights ร— Vโ‚€
= (30, 30) ร— (30, 64)
= (30, 64)
ๅฏนไบŽๆฏไธช token i:
Outputโ‚€[i] = ฮฃโฑผ Attention_Weights[i][j] ร— Vโ‚€[j]
็คบไพ‹๏ผˆtoken 0 "[CLS]" ็š„่พ“ๅ‡บ๏ผ‰:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Outputโ‚€[0] = 0.35 ร— Vโ‚€[0] ([CLS] ็š„ value)
+ 0.15 ร— Vโ‚€[1] (Alpha ็š„ value)
+ 0.20 ร— Vโ‚€[2] (##Codium ็š„ value)
+ 0.08 ร— Vโ‚€[3] (ๆ˜ฏ ็š„ value)
+ ...
+ 0.02 ร— Vโ‚€[19] (Google ็š„ value) โ† ๆƒ้‡ๅพˆๅฐ๏ผ
+ ...
็ป“ๆžœ: [0.23, -0.15, 0.34, ..., 0.67] (64็ปด)
[CLS] ็š„ๅ‘้‡็ŽฐๅœจๅŒ…ๅซไบ†:
- ไธป่ฆ: ่‡ชๅทฑใ€Alphaใ€##Codium ็š„ไฟกๆฏ๏ผˆๆƒ้‡ๅคง๏ผ‰
- ๅฐ‘้‡: Googleใ€2024 ็š„ไฟกๆฏ๏ผˆๆƒ้‡ๅฐ๏ผ‰
- ่ฟ™ไธชๅทฎๅผ‚ไผš่ขซๅŽ็ปญๅฑ‚ๆ”พๅคง๏ผŒ็”จไบŽๆฃ€ๆต‹ๅนป่ง‰๏ผ
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Step 6: Concat ๆ‰€ๆœ‰ 12 ไธช Heads
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Multi_Head_Output = Concat(Outputโ‚€, Outputโ‚, ..., Outputโ‚โ‚)
= Concat((30,64), (30,64), ..., (30,64))
= (30, 768)
ๆฏไธช Head ๆ•ๆ‰ไธๅŒ็š„ๆจกๅผ:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Head 0: ่ฏๅ†…ๅ…ณ็ณป ("Alpha" โ†” "##Codium")
Head 1: ่ฏญๆณ•ๅ…ณ็ณป ("ๆ˜ฏ" โ†” "ๆ–นๆณ•")
Head 2: ้•ฟ่ท็ฆปไพ่ต– ("AlphaCodium" โ†” "ๆ€ง่ƒฝ")
Head 3: ๆฃ€ๆต‹ๆทปๅŠ ไฟกๆฏ ("Google" ๅœจ Documents ไธญ็š„ๅฏนๅบ”)
...
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Step 7: ็บฟๆ€งๅ˜ๆข
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Attention_Output = Multi_Head_Output ร— W^O + b^O
= (30, 768) ร— (768, 768) + (768,)
= (30, 768)
""")
# ============================================================================
# Part 5: Layer-by-layer processing through all 12 layers
# ============================================================================
print("\n" + "=" * 80)
print("๐Ÿ”ข Part 5: BERT 12ๅฑ‚้€ๅฑ‚ๅค„็†่ฟ‡็จ‹")
print("=" * 80)
print("""
ๅฎŒๆ•ด็š„ 12 ๅฑ‚ๅค„็†ๆต็จ‹:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
่พ“ๅ…ฅ: Hโฐ (30, 768) # ๅˆๅง‹ embeddings
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Layer 1 โ”‚
โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”‚
โ”‚ โ”‚
โ”‚ ่พ“ๅ…ฅ: Hโฐ โ”‚
โ”‚ โ†“ โ”‚
โ”‚ Multi-Head Self-Attention โ”‚
โ”‚ - "Alpha" attendๅˆฐ "##Codium" (0.35) โ”‚
โ”‚ - "Google" attendๅˆฐ Documents tokens (0.1-0.2) โ”‚
โ”‚ โ†“ โ”‚
โ”‚ Add & Norm: H_attยน = LayerNorm(Hโฐ + Attention) โ”‚
โ”‚ โ†“ โ”‚
โ”‚ Feed Forward: FFN(H_attยน) โ”‚
โ”‚ โ†“ โ”‚
โ”‚ Add & Norm: Hยน = LayerNorm(H_attยน + FFN) โ”‚
โ”‚ โ”‚
โ”‚ ่พ“ๅ‡บ: Hยน (30, 768) โ”‚
โ”‚ โ”‚
โ”‚ ๅญฆๅˆฐ็š„ๆจกๅผ: โ”‚
โ”‚ โœ“ ๅŸบๆœฌ่ฏ่ฏญๅ…ณ็ณป โ”‚
โ”‚ โœ“ "AlphaCodium" ๅœจไธคๆฎตไธญ้ƒฝๅ‡บ็Žฐ โ”‚
โ”‚ โœ“ "Google" ๅชๅœจ Generation ไธญๅ‡บ็Žฐ โš ๏ธ โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Layer 2 โ”‚
โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”‚
โ”‚ โ”‚
โ”‚ ่พ“ๅ…ฅ: Hยน โ”‚
โ”‚ โ†“ โ”‚
โ”‚ Multi-Head Self-Attention โ”‚
โ”‚ - ๅผ€ๅง‹ๅปบ็ซ‹่ฏญๆณ•ๅ…ณ็ณป โ”‚
โ”‚ - "ๆ˜ฏ" attendๅˆฐ "ๆ–นๆณ•" ๅ’Œ "ๅทฅๅ…ท" โ”‚
โ”‚ โ†“ โ”‚
โ”‚ FFN + Residual โ”‚
โ”‚ โ†“ โ”‚
โ”‚ ่พ“ๅ‡บ: Hยฒ (30, 768) โ”‚
โ”‚ โ”‚
โ”‚ ๅญฆๅˆฐ็š„ๆจกๅผ: โ”‚
โ”‚ โœ“ "ๆ–นๆณ•" vs "ๅทฅๅ…ท" ็š„่ฏญไน‰ๅทฎๅผ‚ โ”‚
โ”‚ โœ“ ๆ—ถ้—ดไฟกๆฏ: "2024 ๅนด" โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Layer 3 โ”‚
โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”‚
โ”‚ โ”‚
โ”‚ ่พ“ๅ…ฅ: Hยฒ โ”‚
โ”‚ โ†“ โ”‚
โ”‚ Multi-Head Self-Attention โ”‚
โ”‚ - ้•ฟ่ท็ฆปไพ่ต–ๅผ€ๅง‹ๅปบ็ซ‹ โ”‚
โ”‚ - [CLS] attendๅˆฐๅ…ณ้”ฎ่ฏ: "Google", "2024" โ”‚
โ”‚ โ†“ โ”‚
โ”‚ FFN + Residual โ”‚
โ”‚ โ†“ โ”‚
โ”‚ ่พ“ๅ‡บ: Hยณ (30, 768) โ”‚
โ”‚ โ”‚
โ”‚ ๅญฆๅˆฐ็š„ๆจกๅผ: โ”‚
โ”‚ โœ“ Documents: "่ฟญไปฃๆ”น่ฟ›" vs Generation: ๆ— ๆญคไฟกๆฏ โ”‚
โ”‚ โœ“ Generation: "Google" vs Documents: ๆ— ๆญคไฟกๆฏ โš ๏ธ โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Layer 4-6: ไธญ้—ดๅฑ‚ โ”‚
โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”‚
โ”‚ โ”‚
โ”‚ ่พ“ๅ…ฅ: Hยณ โ”‚
โ”‚ โ†“ โ”‚
โ”‚ ๅคšๅฑ‚ Self-Attention + FFN โ”‚
โ”‚ โ†“ โ”‚
โ”‚ ่พ“ๅ‡บ: Hโถ (30, 768) โ”‚
โ”‚ โ”‚
โ”‚ ๅญฆๅˆฐ็š„ๆจกๅผ: โ”‚
โ”‚ โœ“ ๅคๆ‚็š„่ฏญไน‰ๅ…ณ็ณป โ”‚
โ”‚ โœ“ Documents ๅ’Œ Generation ็š„ๅฏนๆฏ” โ”‚
โ”‚ โœ“ ่ฏ†ๅˆซไธไธ€่‡ด็š„ๅœฐๆ–น: โ”‚
โ”‚ - "ๆ–นๆณ•" vs "ๅทฅๅ…ท" โ”‚
โ”‚ - ็ผบๅคฑ "Google" ๅ’Œ "2024" ็š„ๆฅๆบ โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Layer 7-9: ๆทฑๅฑ‚ๆŠฝ่ฑก โ”‚
โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”‚
โ”‚ โ”‚
โ”‚ ่พ“ๅ…ฅ: Hโถ โ”‚
โ”‚ โ†“ โ”‚
โ”‚ ๅคšๅฑ‚ Self-Attention + FFN โ”‚
โ”‚ โ†“ โ”‚
โ”‚ ่พ“ๅ‡บ: Hโน (30, 768) โ”‚
โ”‚ โ”‚
โ”‚ ๅญฆๅˆฐ็š„ๆจกๅผ: โ”‚
โ”‚ โœ“ ้ซ˜ๅฑ‚่ฏญไน‰็†่งฃ โ”‚
โ”‚ โœ“ [CLS] ๅ‘้‡ๅผ€ๅง‹่šๅˆๅˆคๆ–ญไฟกๆฏ: โ”‚
โ”‚ - Documents ่ฏด: "ไปฃ็ ็”Ÿๆˆๆ–นๆณ•๏ผŒ่ฟญไปฃๆ”น่ฟ›" โ”‚
โ”‚ - Generation ่ฏด: "Google ๅ‘ๅธƒ็š„ๅทฅๅ…ท" โ”‚
โ”‚ โ†’ ๅ‘็ŽฐไธๅŒน้…๏ผโš ๏ธ โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Layer 10-12: ๆœ€็ปˆๅฑ‚๏ผˆๅ†ณ็ญ–ๅฑ‚๏ผ‰ โ”‚
โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”‚
โ”‚ โ”‚
โ”‚ ่พ“ๅ…ฅ: Hโน โ”‚
โ”‚ โ†“ โ”‚
โ”‚ ๅคšๅฑ‚ Self-Attention + FFN โ”‚
โ”‚ โ†“ โ”‚
โ”‚ ่พ“ๅ‡บ: Hยนยฒ (30, 768) โ”‚
โ”‚ โ”‚
โ”‚ [CLS] ๅ‘้‡็š„ไฟกๆฏ๏ผˆๆœ€ๅ…ณ้”ฎ๏ผ‰: โ”‚
โ”‚ โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” โ”‚
โ”‚ โ”‚
โ”‚ Hยนยฒ[0] = [0.234, -0.567, 0.890, ..., 0.123] (768็ปด) โ”‚
โ”‚ โ†‘ [CLS] token ็š„ๆœ€็ปˆๅ‘้‡ โ”‚
โ”‚ โ”‚
โ”‚ ่ฟ™ไธชๅ‘้‡็ผ–็ ไบ†: โ”‚
โ”‚ โœ“ Documents ็š„ๅฎŒๆ•ดไฟกๆฏ โ”‚
โ”‚ โœ“ Generation ็š„ๅฎŒๆ•ดไฟกๆฏ โ”‚
โ”‚ โœ“ ไธค่€…็š„ๅ…ณ็ณป: โ”‚
โ”‚ - ๆœ‰ๅ“ชไบ›ไฟกๆฏไธ€่‡ด โ”‚
โ”‚ - ๆœ‰ๅ“ชไบ›ไฟกๆฏ็Ÿ›็›พ โ”‚
โ”‚ - Generation ๆทปๅŠ ไบ†ๅ“ชไบ› Documents ไธญๆฒกๆœ‰็š„ไฟกๆฏ โ”‚
โ”‚ โ”‚
โ”‚ ๅ…ทไฝ“่ฏ†ๅˆซๅˆฐ็š„้—ฎ้ข˜: โ”‚
โ”‚ โŒ "Google" ๅœจ Documents ไธญๆ‰พไธๅˆฐๅฏนๅบ” โ”‚
โ”‚ โŒ "2024" ๅœจ Documents ไธญๆ‰พไธๅˆฐๅฏนๅบ” โ”‚
โ”‚ โš ๏ธ "ๅทฅๅ…ท" vs "ๆ–นๆณ•" ่ฏญไน‰ๅทฎๅผ‚ โ”‚
โ”‚ โ”‚
โ”‚ โ†’ ๅ‡†ๅค‡่พ“ๅ‡บๅˆฐๅˆ†็ฑปๅคด๏ผŒๅˆคๆ–ญไธบ "Hallucinated" โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
ๆœ€็ปˆ่พ“ๅ‡บ: Hยนยฒ (30, 768)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
ๅชไฝฟ็”จ Hยนยฒ[0]๏ผˆ[CLS] ็š„ๅ‘้‡๏ผ‰้€ๅ…ฅๅˆ†็ฑปๅคด:
[CLS] Vector = Hยนยฒ[0] = [0.234, -0.567, 0.890, ..., 0.123]
โ†“
ๅˆ†็ฑปๅคด (768 โ†’ 2)
โ†“
Logits: [0.8, 4.2]
โ†‘ โ†‘
Factual Hallucinated
โ†“
Softmax
โ†“
Probs: [0.03, 0.97]
โ†‘ โ†‘
3%ไบ‹ๅฎž 97%ๅนป่ง‰
ๅˆคๆ–ญ: Hallucinated โŒ (็ฝฎไฟกๅบฆ 97%)
""")
# ============================================================================
# Part 6: Key attention patterns
# ============================================================================
print("\n" + "=" * 80)
print("๐Ÿ‘๏ธ Part 6: ๅ…ณ้”ฎ Attention ๆจกๅผๅฏ่ง†ๅŒ–")
print("=" * 80)
print("""
Layer 12 ็š„ Attention ๆƒ้‡็Ÿฉ้˜ต๏ผˆ็ฎ€ๅŒ–๏ผŒๅชๆ˜พ็คบๅ…ณ้”ฎ tokens๏ผ‰:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Query โ†“ Key Tokens โ†’
Tokens [CLS] Alphaยน ๆ–นๆณ• [SEP] Alphaยฒ Google 2024 ๅทฅๅ…ท [SEP]
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
[CLS] [0.15 0.08 0.12 0.05 0.07 0.18 0.16 0.10 0.05]
โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไธญ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘้ซ˜โš ๏ธ โ†‘้ซ˜โš ๏ธ โ†‘ไธญ โ†‘ไฝŽ
Alphaยน [0.05 0.30 0.08 0.03 0.25 0.04 0.03 0.05 0.02]
โ†‘ไฝŽ โ†‘้ซ˜โœ“ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘้ซ˜โœ“ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไฝŽ
ๆ–นๆณ• [0.08 0.10 0.25 0.05 0.08 0.06 0.05 0.20 0.03]
โ†‘ไฝŽ โ†‘ไฝŽ โ†‘้ซ˜โœ“ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไธญโš ๏ธ โ†‘ไฝŽ
Google [0.10 0.05 0.03 0.02 0.06 0.40 0.15 0.08 0.02]
โ†‘ไธญโš ๏ธ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘้ซ˜โœ“ โ†‘ไธญ โ†‘ไฝŽ โ†‘ไฝŽ
2024 [0.12 0.04 0.02 0.01 0.05 0.18 0.35 0.07 0.01]
โ†‘ไธญโš ๏ธ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไธญ โ†‘้ซ˜โœ“ โ†‘ไฝŽ โ†‘ไฝŽ
ๅทฅๅ…ท [0.09 0.08 0.15 0.03 0.09 0.07 0.06 0.30 0.02]
โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไธญโš ๏ธ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘ไฝŽ โ†‘้ซ˜โœ“ โ†‘ไฝŽ
ๅ…ณ้”ฎ่ง‚ๅฏŸ:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
โœ“ ๆญฃๅธธๆจกๅผ:
- "Alphaยน" attendๅˆฐ "Alphaยฒ" (0.25) โ† ๅŒไธ€ๅฎžไฝ“
- "ๆ–นๆณ•" attendๅˆฐ่‡ชๅทฑ (0.25) โ† ่‡ชๆณจๆ„ๅŠ›
โš ๏ธ ๅนป่ง‰ๆŒ‡็คบ:
- "Google" ไธป่ฆ attendๅˆฐ่‡ชๅทฑ (0.40)
โ†’ ๅœจ Documents ไธญๆ‰พไธๅˆฐๅผบๅ…ณ่”๏ผ
- "2024" ไธป่ฆ attendๅˆฐ่‡ชๅทฑ (0.35)
โ†’ ๅœจ Documents ไธญๆ‰พไธๅˆฐๅผบๅ…ณ่”๏ผ
- [CLS] attendๅˆฐ "Google" (0.18) ๅ’Œ "2024" (0.16)
โ†’ [CLS] ๆณจๆ„ๅˆฐ่ฟ™ไบ›ๅผ‚ๅธธ่ฏ๏ผ
- "ๅทฅๅ…ท" ๅฏน "ๆ–นๆณ•" ็š„ attention (0.15)
โ†’ ่ฏญไน‰็›ธไผผไฝ†ไธๅฎŒๅ…จไธ€่‡ด
่ฟ™ไบ›ๆจกๅผ่ขซๅˆ†็ฑปๅคดๅญฆไน ๅนถ็”จไบŽๅˆคๆ–ญๅนป่ง‰๏ผ
""")
# ============================================================================
# Part 7: Parameter counts
# ============================================================================
print("\n" + "=" * 80)
print("๐Ÿ“Š Part 7: BERT Encoder ๅ‚ๆ•ฐ็ปŸ่ฎก")
print("=" * 80)
print("""
BERT-base ๅ‚ๆ•ฐ่ฏฆ็ป†็ปŸ่ฎก:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
1. Embedding ๅฑ‚:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
- Token Embedding: 30,522 ร— 768 = 23,440,896
- Segment Embedding: 2 ร— 768 = 1,536
- Position Embedding: 512 ร— 768 = 393,216
ๅฐ่ฎก: 23,835,648 ๅ‚ๆ•ฐ
2. ๆฏไธช Encoder Layer:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Multi-Head Attention:
- W^Q: 768 ร— 768 = 589,824
- W^K: 768 ร— 768 = 589,824
- W^V: 768 ร— 768 = 589,824
- W^O: 768 ร— 768 = 589,824
- Biases: 768 ร— 4 = 3,072
ๅฐ่ฎก: 2,362,368 ๅ‚ๆ•ฐ
Feed Forward Network:
- Wโ‚: 768 ร— 3,072 = 2,359,296
- bโ‚: 3,072
- Wโ‚‚: 3,072 ร— 768 = 2,359,296
- bโ‚‚: 768
ๅฐ่ฎก: 4,722,432 ๅ‚ๆ•ฐ
Layer Normalization (ร—2):
- ฮณ, ฮฒ: 768 ร— 2 ร— 2 = 3,072
ๆฏๅฑ‚ๆ€ป่ฎก: 2,362,368 + 4,722,432 + 3,072 = 7,087,872 ๅ‚ๆ•ฐ
3. 12 ๅฑ‚ Encoder:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
7,087,872 ร— 12 = 85,054,464 ๅ‚ๆ•ฐ
4. ๅˆ†็ฑปๅคด๏ผˆHHEM ็‰นๆœ‰๏ผ‰:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
- W: 768 ร— 2 = 1,536
- b: 2
ๅฐ่ฎก: 1,538 ๅ‚ๆ•ฐ
ๆ€ปๅ‚ๆ•ฐ้‡:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
23,835,648 (Embeddings)
+ 85,054,464 (12 Layers)
+ 1,538 (Classification Head)
= 108,891,650 ๅ‚ๆ•ฐ
็บฆ 109M (็™พไธ‡) ๅ‚ๆ•ฐ
ๆจกๅž‹ๅคงๅฐ: 109M ร— 4 bytes = 436 MB
ๅ†…ๅญ˜ๅ ็”จ๏ผˆๆŽจ็†ๆ—ถ๏ผ‰:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
- ๆจกๅž‹ๅ‚ๆ•ฐ: 436 MB
- ๆฟ€ๆดปๅ€ผ (batch_size=1, seq_len=30):
ๆฏๅฑ‚: 30 ร— 768 ร— 4 bytes ร— 2 (residual) = 184 KB
12 ๅฑ‚: 184 KB ร— 12 = 2.2 MB
- ๆ€ป่ฎก: ~438 MB (FP32)
~220 MB (FP16๏ผŒไฝฟ็”จๅŠ็ฒพๅบฆ)
""")
# ============================================================================
# Part 8: Summary
# ============================================================================
print("\n" + "=" * 80)
print("๐Ÿ“š Part 8: ๆ ธๅฟƒ่ฆ็‚นๆ€ป็ป“")
print("=" * 80)
print("""
BERT Encoder 12ๅฑ‚่”ๅˆ็ผ–็ ๆ ธๅฟƒ่ฆ็‚น:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
1. ่พ“ๅ…ฅๅ‡†ๅค‡
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
[CLS] Documents [SEP] Generation [SEP]
โ†’ Tokenization (30 tokens)
โ†’ Token + Segment + Position Embeddings
โ†’ Hโฐ (30, 768)
2. ๆฏๅฑ‚็ป“ๆž„
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
H^(l-1)
โ†“
Multi-Head Self-Attention (12 heads)
โ†“
Add & Norm
โ†“
Feed Forward Network
โ†“
Add & Norm
โ†“
H^l
3. Multi-Head Attention ๅ…ณ้”ฎ
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Q, K, V = H ร— W^Q, H ร— W^K, H ร— W^V
โ†“
ๅˆ†ๆˆ 12 ไธช Head (ๆฏไธช 64 ็ปด)
โ†“
Attention = softmax(Qร—K^T / โˆš64) ร— V
โ†“
Concat ๆ‰€ๆœ‰ Heads โ†’ (768 ็ปด)
4. 12ๅฑ‚้€ๅฑ‚ๅญฆไน 
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
Layer 1-3: ๅŸบๆœฌ่ฏญๆณ•ใ€่ฏ่ฏญๅ…ณ็ณป
Layer 4-6: ๅคๆ‚่ฏญไน‰ใ€้•ฟ่ท็ฆปไพ่ต–
Layer 7-9: ้ซ˜ๅฑ‚ๆŠฝ่ฑกใ€ไธไธ€่‡ดๆฃ€ๆต‹
Layer 10-12: ๆœ€็ปˆๅˆคๆ–ญใ€ไฟกๆฏ่šๅˆๅˆฐ [CLS]
5. ๅนป่ง‰ๆฃ€ๆต‹ๆœบๅˆถ
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
้€š่ฟ‡ Attention ๆƒ้‡ๅ‘็Žฐ:
โœ“ "Google" ๅœจ Documents ไธญๆ— ๅผบๅ…ณ่”
โœ“ "2024" ๅœจ Documents ไธญๆ— ๅผบๅ…ณ่”
โœ“ [CLS] ่šๅˆ่ฟ™ไบ›ไฟกๆฏ
โ†“
Hยนยฒ[0] (768็ปด) โ†’ ๅˆ†็ฑปๅคด (768โ†’2)
โ†“
[Factual: 0.03, Hallucinated: 0.97]
โ†“
ๅˆคๆ–ญ: Hallucinated โŒ
6. ๅ…ณ้”ฎๅ‚ๆ•ฐ
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
- Hidden Size: 768
- Layers: 12
- Attention Heads: 12
- Head Dimension: 64
- FFN Size: 3072
- Total Parameters: 109M
- Model Size: 436 MB (FP32)
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
่”ๅˆ็ผ–็ ็š„ไผ˜ๅŠฟ:
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
โœ… Documents ๅ’Œ Generation ๅฏไปฅไบ’็›ธ attend
โœ… ๆจกๅž‹่ƒฝๆ•ๆ‰ไธค่€…ไน‹้—ด็š„ไธ€่‡ดๆ€ง/็Ÿ›็›พ
โœ… [CLS] ๅ‘้‡่šๅˆไบ†ๅ…จๅฑ€ๅˆคๆ–ญไฟกๆฏ
โœ… 12 ๅฑ‚้€ๅฑ‚ๆทฑๅŒ–็†่งฃ๏ผŒๆœ€็ปˆๅ‡†็กฎๅˆคๆ–ญๅนป่ง‰
่ฟ™ๅฐฑๆ˜ฏไธบไป€ไนˆ BERT Cross-Encoder ๅœจๅนป่ง‰ๆฃ€ๆต‹ไธŠ่กจ็Žฐไผ˜็ง€๏ผ
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
""")
print("\n" + "=" * 80)
print("โœ… BERT Encoder 12ๅฑ‚่ฏฆ็ป†่งฃๆžๅฎŒๆฏ•๏ผ")
print("=" * 80)
print()