"""
Vectorization and Chroma storage, explained
The full pipeline from split documents to the vector database
"""
print("=" * 80)
print("Vectorization and Chroma storage, explained")
print("=" * 80)
# ============================================================================
# Part 1: End-to-end overview
# ============================================================================
print("\n" + "=" * 80)
print("📊 Part 1: End-to-end overview")
print("=" * 80)
print("""
The full pipeline from document splitting to the vector database:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 1: Document splitting
    Raw document → RecursiveCharacterTextSplitter → about 20 chunks
    (5000 tokens)                                   (250 tokens each, ignoring overlap)
Step 2: Vectorization (embedding)
    Each chunk → HuggingFace model → vector (384 dimensions)
    "Artificial intelligence is..." → [0.12, -0.34, 0.56, ...]
Step 3: Store in Chroma
    Vector + original text + metadata → Chroma database
                                        └─ persisted to disk when persist_directory is set
Step 4: Build the index
    Chroma → HNSW index → fast approximate retrieval
             (hierarchical graph structure)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
""")
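# A minimal sketch of the chunk-count arithmetic behind Step 1, assuming a
# plain sliding window over token positions. RecursiveCharacterTextSplitter
# also respects separators, so real chunk counts can differ. `token_windows`
# is a hypothetical helper, not a LangChain API.
def token_windows(n_tokens, chunk_size, overlap=0):
    """Return (start, end) token spans for a sliding window with overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last window reached the end of the document
        start += stride
    return spans

# 5000 tokens in 250-token chunks: overlap raises the number of chunks.
print("chunks without overlap:", len(token_windows(5000, 250)))
print("chunks with overlap 50:", len(token_windows(5000, 250, overlap=50)))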
# ============================================================================
# Part 2: The embedding model in detail
# ============================================================================
print("\n" + "=" * 80)
print("🤖 Part 2: Embedding model - HuggingFaceEmbeddings")
print("=" * 80)
print("""
Your project's configuration:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
self.embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': device},              # CPU or GPU
    encode_kwargs={'normalize_embeddings': True}  # L2-normalize the output
)
About the model:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Model name: all-MiniLM-L6-v2
├─ Type: Sentence-BERT (bi-encoder)
├─ Parameters: 22M (lightweight)
├─ Output: 384-dimensional vectors
├─ Training data: 1B+ sentence pairs
└─ Strengths: fast, accurate, well suited to semantic retrieval
How it works:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Input text: "Artificial intelligence is a branch of computer science"
    ↓
Tokenization
    ↓
Token IDs (illustrative): [101, 782, 1435, 1819, 2510, 3221, ...]
    ↓
BERT-style encoder (6 Transformer layers)
    ↓
Mean pooling over the token vectors
    ↓
384-dimensional vector: [0.123, -0.456, 0.789, ...]
    ↓
L2 normalization (normalize_embeddings=True)
    ↓
Final vector: ||v|| = 1 (unit vector)
""")
# ============================================================================
# Part 3: The vectorization process, step by step
# ============================================================================
print("\n" + "=" * 80)
print("🔍 Part 3: Vectorization - step by step")
print("=" * 80)
print("""
Suppose we have 3 chunks:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Chunk 1: "Artificial intelligence is a branch of computer science. It aims to..."
Chunk 2: "Machine learning is a subfield of AI. It lets computers..."
Chunk 3: "Deep learning uses multi-layer neural networks to process complex..."
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Vectorization (batched):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
embeddings.embed_documents([chunk1, chunk2, chunk3])
    ↓
┌──────────────────────────────────────────────────────────┐
│ HuggingFace embedding model                              │
│ (sentence-transformers/all-MiniLM-L6-v2)                 │
└──────────────────────────────────────────────────────────┘
    ↓
Internal processing (per chunk):
    ↓
┌────────────────────────────────────────────┐
│ Step 1: Tokenization                       │
│ "Artificial..." → [101, 782, 1435, ...]    │
└────────────────────────────────────────────┘
    ↓
┌────────────────────────────────────────────┐
│ Step 2: Convert to token embeddings        │
│ Token IDs → initial vector representation  │
└────────────────────────────────────────────┘
    ↓
┌────────────────────────────────────────────┐
│ Step 3: BERT-style encoder (6 layers)      │
│ Self-attention + feed-forward              │
│ Each layer extracts deeper semantics       │
└────────────────────────────────────────────┘
    ↓
┌────────────────────────────────────────────┐
│ Step 4: Mean pooling                       │
│ Mean of token vectors → sentence vector    │
└────────────────────────────────────────────┘
    ↓
┌────────────────────────────────────────────┐
│ Step 5: L2 normalization                   │
│ Scale the vector to unit length            │
└────────────────────────────────────────────┘
    ↓
Output: 3 vectors
    ↓
┌──────────────────────────────────────────────────────────┐
│ Vector 1: [0.123, -0.456, 0.789, ..., 0.321] (384-dim)   │
│ Vector 2: [0.234, 0.567, -0.890, ..., 0.432] (384-dim)   │
│ Vector 3: [-0.345, 0.678, 0.901, ..., -0.543] (384-dim) │
└──────────────────────────────────────────────────────────┘
Key points:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Each chunk → 1 vector of fixed dimension (384)
✅ Semantically similar texts → vectors that lie close together
✅ After normalization, cosine similarity reduces to a dot product
""")
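# A minimal sketch of Steps 4 and 5 above in pure Python, assuming the encoder
# has already produced one vector per token. The padding mask and the helper
# names (`mean_pool`, `l2_normalize`) are illustrative assumptions; the real
# model performs these steps on tensors inside sentence-transformers.
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Two real tokens plus one padding row that the mask excludes.
_token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
_sentence_vector = l2_normalize(mean_pool(_token_vectors, [1, 1, 0]))
print("sentence vector:", _sentence_vector)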
# ============================================================================
# Part 4: How Chroma stores the data
# ============================================================================
print("\n" + "=" * 80)
print("💾 Part 4: How Chroma stores the data")
print("=" * 80)
print("""
What Chroma.from_documents() does:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Chroma.from_documents(
    documents=doc_splits,          # about 20 chunks
    collection_name="rag-chroma",  # collection name
    embedding=self.embeddings      # embedding function
)
Internal flow:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 1: Create or open the collection
┌──────────────────────────────────────┐
│ Collection: "rag-chroma"             │
│ Vector dimension: 384                │
└──────────────────────────────────────┘
Step 2: Batch vectorization
    vectors = embeddings.embed_documents(
        [doc.page_content for doc in doc_splits]
    )
    ↓
Step 3: Store one record per chunk
(IDs and metadata keys below are illustrative; LangChain generates UUIDs unless ids are passed)
┌──────────────────────────────────────────────────────────────────┐
│ ID: "chunk_1"                                                    │
│ ├─ Vector: [0.123, -0.456, ..., 0.321] (384-dim)                 │
│ ├─ Document: "Artificial intelligence is a branch..."            │
│ └─ Metadata: {                                                   │
│       "source": "https://...",                                   │
│       "chunk_index": 0,                                          │
│       "total_chunks": 20                                         │
│    }                                                             │
├──────────────────────────────────────────────────────────────────┤
│ ID: "chunk_2"                                                    │
│ ├─ Vector: [0.234, 0.567, ..., 0.432]                            │
│ ├─ Document: "Machine learning is a subfield of AI..."           │
│ └─ Metadata: {...}                                               │
├──────────────────────────────────────────────────────────────────┤
│ ID: "chunk_3"                                                    │
│ ├─ Vector: [-0.345, 0.678, ..., -0.543]                          │
│ ├─ Document: "Deep learning uses multi-layer neural networks..." │
│ └─ Metadata: {...}                                               │
└──────────────────────────────────────────────────────────────────┘
Step 4: Build the HNSW index
    Vectors → HNSW graph → fast retrieval
              (Hierarchical Navigable Small World graph)
ๅญ˜ๅ‚จไฝ็ฝฎ๏ผš
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
้ป˜่ฎค่ทฏๅพ„: ./chroma/ (ๆœฌๅœฐ็›ฎๅฝ•)
โ”œโ”€ collections/
โ”‚ โ””โ”€ rag-chroma/
โ”‚ โ”œโ”€ data.parquet # ๅ‘้‡ๆ•ฐๆฎ
โ”‚ โ”œโ”€ metadata.json # ๅ…ƒๆ•ฐๆฎ
โ”‚ โ””โ”€ index.bin # HNSW ็ดขๅผ•
โ””โ”€ chroma.sqlite3 # SQLite ๆ•ฐๆฎๅบ“
""")
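# A toy in-memory stand-in for the record layout shown in Step 3, assuming
# brute-force search instead of HNSW. `ToyVectorStore` is hypothetical: it only
# illustrates what is kept per chunk (ID, vector, text, metadata) and is not
# how Chroma is implemented.
import uuid

class ToyVectorStore:
    def __init__(self):
        self.records = {}

    def add(self, vector, document, metadata=None, record_id=None):
        """Store one chunk; generate a UUID when no ID is supplied."""
        record_id = record_id or str(uuid.uuid4())
        self.records[record_id] = {
            "vector": list(vector),
            "document": document,
            "metadata": dict(metadata or {}),
        }
        return record_id

    def query(self, vector, k=1):
        """Return (id, document) for the k highest dot products (unit vectors assumed)."""
        def score(item):
            return sum(a * b for a, b in zip(vector, item[1]["vector"]))
        ranked = sorted(self.records.items(), key=score, reverse=True)
        return [(rid, rec["document"]) for rid, rec in ranked[:k]]

_toy_store = ToyVectorStore()
_toy_store.add([1.0, 0.0], "Artificial intelligence is a branch...", {"chunk_index": 0}, "chunk_1")
_toy_store.add([0.0, 1.0], "Machine learning is a subfield of AI...", {"chunk_index": 1}, "chunk_2")
print("best match:", _toy_store.query([0.9, 0.1], k=1))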
# ============================================================================
# Part 5: How the HNSW index works
# ============================================================================
print("\n" + "=" * 80)
print("🔗 Part 5: HNSW index - the secret behind fast retrieval")
print("=" * 80)
print("""
HNSW = Hierarchical Navigable Small World
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Why is an index needed?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Brute-force search: O(n) - compare the query vector against every stored vector
    └─ 10,000 vectors → 10,000 distance computations
        └─ Slow at scale
HNSW index: roughly O(log n) - navigate a layered graph
    └─ 10,000 vectors → only a small fraction of the nodes is examined
       (the exact count depends on index parameters such as M and ef)
        └─ Can be 100x or more faster on large collections
HNSW structure (simplified example):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Layer 2 (sparsest)
    V₁ ←──────→ V₅ ←──────→ V₁₂
    ↓           ↓           ↓
Layer 1
    V₁ ←→ V₃ ←→ V₅ ←→ V₈ ←→ V₁₂
    ↓     ↓     ↓     ↓     ↓
Layer 0 (densest)
    V₁ ←→ V₂ ←→ V₃ ←→ V₄ ←→ V₅ ←→ V₆ ←→ ... ←→ V₁₂
    Every vector is present in this layer
Search procedure:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Query vector: Q = [0.2, -0.3, 0.5, ...]
Step 1: Start at Layer 2 (coarse search)
    Entry point: V₁
    → compare dist(Q, V₁), dist(Q, V₅), dist(Q, V₁₂)
    → V₅ is closest → move to V₅
Step 2: Descend to Layer 1 (medium precision)
    Start from V₅
    → check neighbors V₃, V₈
    → V₈ is closest → move to V₈
Step 3: Descend to Layer 0 (high precision)
    Start from V₈
    → check its neighbors
    → collect the K nearest vectors
Result: the Top K most similar chunks
Speed comparison (illustrative figures, not measurements):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Brute-force search: 10,000 distance computations → 100 ms
HNSW index: far fewer distance computations → 1 ms ← about 100x faster
""")
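# A toy sketch of the greedy descent described above, assuming a hand-built
# three-layer graph over 1-D points. It tracks a single current node and
# returns one nearest neighbor; real HNSW keeps a candidate list (ef) and
# assigns nodes to layers randomly. All names here are illustrative.
def greedy_layered_search(layers, points, query, entry):
    """Descend from the sparsest layer to layer 0, hopping to closer neighbors."""
    def dist(node):
        return abs(points[node] - query)

    current = entry
    for graph in layers:  # sparsest layer first, layer 0 last
        while True:
            neighbors = graph.get(current, [])
            best = min(neighbors, key=dist, default=current)
            if dist(best) < dist(current):
                current = best  # hop to the closer neighbor
            else:
                break  # local minimum on this layer: drop down one layer
    return current

_hnsw_points = {i: float(i) for i in range(1, 13)}  # V1..V12 placed at 1.0..12.0
_hnsw_layers = [
    {1: [5], 5: [1, 12], 12: [5]},
    {1: [3], 3: [1, 5], 5: [3, 8], 8: [5, 12], 12: [8]},
    {i: [j for j in (i - 1, i + 1) if 1 <= j <= 12] for i in range(1, 13)},
]
print("nearest node:", greedy_layered_search(_hnsw_layers, _hnsw_points, query=7.2, entry=1))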
# ============================================================================
# Part 6: The retrieval process in detail
# ============================================================================
print("\n" + "=" * 80)
print("🔍 Part 6: Retrieval - from query to results")
print("=" * 80)
print("""
User query: "What is machine learning?"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 1: Vectorize the query
──────────────────────────────────────────────────────
"What is machine learning?"
    ↓
embeddings.embed_query("What is machine learning?")
    ↓
Query vector: [0.345, -0.678, 0.234, ...] (384-dim)
Step 2: Approximate search with HNSW
──────────────────────────────────────────────────────
vectorstore.similarity_search(
    query="What is machine learning?",
    k=20  # return the Top 20
)
    ↓
Inside Chroma:
    1. Vectorize the query
    2. Navigate the HNSW graph
    3. Compute distances (squared L2 by default; for unit vectors
       the ranking matches cosine similarity)
    ↓
Returns the Top 20 chunks (scores are illustrative):
┌──────────┬─────────┬────────────────────────────────────┐
│ Chunk ID │ Score   │ Content                            │
├──────────┼─────────┼────────────────────────────────────┤
│ chunk_5  │ 0.92    │ "Machine learning is a..."         │
│ chunk_2  │ 0.88    │ "AI includes machine learning..."  │
│ chunk_11 │ 0.85    │ "Supervised learning is a..."      │
│ ...      │ ...     │ ...                                │
└──────────┴─────────┴────────────────────────────────────┘
Step 3: CrossEncoder reranking (a feature of your project)
──────────────────────────────────────────────────────
reranker.rerank(query, top_20_chunks, top_k=5)
    ↓
Each chunk is rescored against the query (full query-chunk interaction)
    ↓
Final Top 5:
┌──────────┬─────────┬────────────────────────────────────┐
│ Chunk ID │ Score   │ Content                            │
├──────────┼─────────┼────────────────────────────────────┤
│ chunk_5  │ 8.45    │ "Machine learning is a..."         │
│ chunk_11 │ 7.89    │ "Supervised learning is a..."      │
│ chunk_2  │ 7.23    │ "AI includes machine learning..."  │
│ chunk_14 │ 6.78    │ "Deep learning is a..."            │
│ chunk_8  │ 6.12    │ "Reinforcement learning lets..."   │
└──────────┴─────────┴────────────────────────────────────┘
Step 4: Hand the context to the LLM
──────────────────────────────────────────────────────
context = "\\n\\n".join([chunk.page_content for chunk in top_5])
    ↓
The LLM generates the answer
""")
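# A minimal sketch of the two-stage flow above: a cheap first pass, a costlier
# rerank, then context assembly. The two scoring functions are placeholder
# assumptions standing in for the bi-encoder and the CrossEncoder; they are
# word-overlap toys, not real models.
def first_pass_score(query, text):
    """Cheap score: number of lowercase words shared by query and chunk."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def rerank_score(query, text):
    """Costlier stand-in: shared words normalized by chunk length."""
    words = text.lower().split()
    return first_pass_score(query, text) / len(words) if words else 0.0

def retrieve_then_rerank(query, chunks, k=20, top_k=5):
    """Keep k candidates from the first pass, rerank them, join the top_k."""
    candidates = sorted(chunks, key=lambda c: first_pass_score(query, c), reverse=True)[:k]
    reranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)[:top_k]
    return "\n\n".join(reranked)

_demo_chunks = [
    "machine learning is a subfield of artificial intelligence",
    "cooking pasta takes ten minutes",
    "supervised learning is one kind of machine learning used in many products",
]
print(retrieve_then_rerank("what is machine learning", _demo_chunks, k=2, top_k=1))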
# ============================================================================
# Part 7: Key technical details
# ============================================================================
print("\n" + "=" * 80)
print("⚙️ Part 7: Key technical details")
print("=" * 80)
print("""
1. Why normalize the vectors?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
encode_kwargs={'normalize_embeddings': True}
Raw vector:        [1.23, -4.56, 7.89, ...]  # arbitrary length
After normalizing: [0.12, -0.45, 0.78, ...]  # length = 1
Benefits:
✅ Cosine similarity = dot product (cheaper to compute)
✅ All vectors share the same scale
✅ Vector length no longer affects the similarity score
2. Cosine similarity vs. Euclidean distance
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Cosine similarity (the measure your project relies on) ⭐:
    similarity = v₁ · v₂ / (||v₁|| × ||v₂||)
    Range: [-1, 1]; 1 means the same direction
    Property: compares direction and ignores length
Euclidean distance:
    distance = √( Σᵢ (v₁ᵢ - v₂ᵢ)² )
    Range: [0, ∞); 0 means identical
    Property: compares absolute positions
For unit vectors the two rank results identically: distance² = 2 - 2 × similarity
(Chroma's default metric is squared L2 unless "hnsw:space" is set to "cosine")
3. Batching
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Not recommended (slow):
    for chunk in chunks:
        vector = embed_documents([chunk])  # one call per chunk
Recommended (up to about 10x faster, depending on hardware) ⭐:
    vectors = embed_documents(chunks)  # one batched call
    └─ parallel computation on the GPU
    └─ less per-call overhead
4. Memory
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Choosing the vector dimension:
    384 dims (all-MiniLM-L6-v2) ← your project ⭐
    └─ a balance between accuracy and storage
    768 dims (BERT-base)
    └─ often more accurate, but 2x the storage
    1024 dims (large models)
    └─ often the most accurate, but about 2.7x the storage
Storage estimate (float32 values):
    20 chunks × 384 dims × 4 bytes = 30 KB
    1000 chunks × 384 dims × 4 bytes = 1.5 MB
    └─ very compact
""")
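# A quick numeric check of the claim in item 2, assuming two arbitrary example
# vectors: once both are scaled to unit length, cosine similarity equals the
# dot product and squared Euclidean distance equals 2 - 2 * cosine. The last
# helper redoes the storage arithmetic from item 4, assuming 4 bytes per value.
import math

def to_unit(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def embedding_bytes(n_chunks, dim=384, bytes_per_value=4):
    """Raw size of the vectors alone; index and metadata overhead are ignored."""
    return n_chunks * dim * bytes_per_value

_unit_a = to_unit([1.23, -4.56, 7.89])
_unit_b = to_unit([0.5, 2.0, 1.5])
_cosine = dot_product(_unit_a, _unit_b)
print("squared L2 distance:", squared_l2(_unit_a, _unit_b))
print("2 - 2 * cosine     :", 2 - 2 * _cosine)
print("bytes for 20 chunks:", embedding_bytes(20))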
# ============================================================================
# Part 8: The complete code flow
# ============================================================================
print("\n" + "=" * 80)
print("💻 Part 8: Summary of the complete code flow")
print("=" * 80)
print("""
Your project's complete flow:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# 1. Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)
# 2. Split the documents
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250,
    chunk_overlap=50  # ← the value you just changed
)
doc_splits = text_splitter.split_documents(docs)
# 3. Vectorize + store in Chroma
vectorstore = Chroma.from_documents(
    documents=doc_splits,          # input: the chunks from step 2
    collection_name="rag-chroma",
    embedding=embeddings           # embedding function
)
# ↓ Done automatically inside:
#   - batch vectorization: chunks → 384-dim vectors
#   - storage: vector + original text + metadata
#   - HNSW index construction
# 4. Create the retriever (k=20 matches the Top 20 in Part 6)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
# 5. Retrieve (get_relevant_documents in older LangChain versions)
query = "What is machine learning?"
candidates = retriever.invoke(query)
# ↓ Internal flow:
#   - vectorize the query
#   - fast HNSW search
#   - return the Top K chunks
# 6. CrossEncoder reranking (optional; your project has it)
reranked = crossencoder.rerank(query, candidates, top_k=5)
# 7. Feed the reranked chunks to the LLM to generate the answer
answer = llm.generate(context=reranked, question=query)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
""")
# ============================================================================
# Part 9: Performance tuning suggestions
# ============================================================================
print("\n" + "=" * 80)
print("🚀 Part 9: Performance tuning suggestions")
print("=" * 80)
print("""
Scorecard for the current configuration:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Embedding model: all-MiniLM-L6-v2 (lightweight, efficient) ⭐⭐⭐⭐⭐
✅ Vector normalization: True (cosine similarity as a dot product) ⭐⭐⭐⭐⭐
✅ Index type: HNSW (fast retrieval) ⭐⭐⭐⭐⭐
✅ Chunk overlap: 50 (preserves context across chunks) ⭐⭐⭐⭐⭐
✅ CrossEncoder reranking (precise ordering) ⭐⭐⭐⭐⭐
Overall: 🏆 a production-grade configuration!
Optional optimizations (if you need more):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. GPU acceleration
    model_kwargs={'device': 'cuda'}  # roughly 10x faster vectorization (illustrative)
2. A larger embedding model (if you need higher accuracy)
    "BAAI/bge-large-en-v1.5"  # 1024 dims, accuracy about +5% (illustrative)
3. Batch size tuning
    encode_kwargs={'batch_size': 32}  # sentence-transformers default is 32; raise it if memory allows
4. Chroma persistence
    persist_directory="./chroma_db"  # avoids re-vectorizing on every run
""")
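# A small sketch of what the batch size in item 3 controls, assuming the
# embedding call accepts a list of texts. `batched` is a hypothetical helper;
# with HuggingFaceEmbeddings the batching happens inside sentence-transformers.
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 100 chunks at batch_size=32 need 4 model calls instead of 100.
_demo_texts = [f"chunk {i}" for i in range(100)]
_model_calls = sum(1 for _ in batched(_demo_texts, 32))
print("model calls for 100 chunks at batch_size=32:", _model_calls)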
print("\n" + "=" * 80)
print("✅ Walkthrough complete! You now understand the full pipeline from splitting to the vector database")
print("=" * 80)
print()