| """ | |
| Vectorization and Chroma storage, explained | |
| The complete flow from split documents to the vector database | |
| """ | |
| print("=" * 80) | |
| print("Vectorization and Chroma storage, explained") | |
| print("=" * 80) | |
| # ============================================================================ | |
| # Part 1: Full pipeline overview | |
| # ============================================================================ | |
| print("\n" + "=" * 80) | |
| print("Part 1: Full pipeline overview") | |
| print("=" * 80) | |
| print(""" | |
| The complete flow from document splitting to the vector database: | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Step 1: Document splitting | |
| Original document → RecursiveCharacterTextSplitter → 20 chunks | |
| (5000 tokens) (250 tokens each) | |
| Step 2: Vectorization (embedding) | |
| Each chunk → HuggingFace model → vector (384-dim) | |
| "Artificial intelligence is..." → [0.12, -0.34, 0.56, ...] | |
| Step 3: Store in Chroma | |
| Vector + original text + metadata → Chroma database | |
| └─ persisted to disk when persist_directory is set | |
| Step 4: Build the index | |
| Chroma → HNSW index → fast approximate retrieval | |
| (hierarchical graph structure) | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| """) | |
| # ============================================================================ | |
| # Part 2: The embedding model in detail | |
| # ============================================================================ | |
| print("\n" + "=" * 80) | |
| print("🤖 Part 2: Embedding model - HuggingFaceEmbeddings") | |
| print("=" * 80) | |
| print(""" | |
| Your project configuration: | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| self.embeddings = HuggingFaceEmbeddings( | |
| model_name="sentence-transformers/all-MiniLM-L6-v2", | |
| model_kwargs={'device': device}, # CPU or GPU | |
| encode_kwargs={'normalize_embeddings': True} # L2-normalize the output | |
| ) | |
| Model notes: | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Model name: all-MiniLM-L6-v2 | |
| ├─ Type: Sentence-BERT (bi-encoder) | |
| ├─ Parameters: 22M (lightweight) | |
| ├─ Output dimension: 384 | |
| ├─ Training data: 1 billion+ sentence pairs | |
| └─ Traits: fast, accurate, well suited to semantic retrieval | |
| How it works: | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Input text: "Artificial intelligence is a branch of computer science" | |
| ↓ | |
| Tokenization | |
| ↓ | |
| Token IDs: [101, 782, 1435, 1819, 2510, 3221, ...] (illustrative) | |
| ↓ | |
| BERT encoder (6 Transformer layers) | |
| ↓ | |
| Mean pooling over the token vectors | |
| ↓ | |
| 384-dim vector: [0.123, -0.456, 0.789, ...] | |
| ↓ | |
| L2 normalization (normalize_embeddings=True) | |
| ↓ | |
| Final vector: ||v|| = 1 (unit vector) | |
| """) | |
| # ============================================================================ | |
| # Part 3: The vectorization process, step by step | |
| # ============================================================================ | |
| print("\n" + "=" * 80) | |
| print("Part 3: Vectorization - a step-by-step walkthrough") | |
| print("=" * 80) | |
| print(""" | |
| Suppose we have 3 chunks: | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Chunk 1: "Artificial intelligence is a branch of computer science. It aims to..." | |
| Chunk 2: "Machine learning is a subfield of artificial intelligence. It enables computers..." | |
| Chunk 3: "Deep learning uses multi-layer neural networks to process complex..." | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Vectorization (batched): | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| embeddings.embed_documents([chunk1, chunk2, chunk3]) | |
| ↓ | |
| ┌──────────────────────────────────────────────┐ | |
| │ HuggingFace embedding model                  │ | |
| │ (sentence-transformers/all-MiniLM-L6-v2)     │ | |
| └──────────────────────────────────────────────┘ | |
| ↓ | |
| Internal processing (per chunk): | |
| ↓ | |
| ┌──────────────────────────────────────────────────────┐ | |
| │ Step 1: Tokenization                                 │ | |
| │ "Artificial intelligence..." → [101, 782, 1435, ...] │ | |
| └──────────────────────────────────────────────────────┘ | |
| ↓ | |
| ┌──────────────────────────────────────────────────────┐ | |
| │ Step 2: Convert to token embeddings                  │ | |
| │ Token IDs → initial vector representations           │ | |
| └──────────────────────────────────────────────────────┘ | |
| ↓ | |
| ┌──────────────────────────────────────────────────────┐ | |
| │ Step 3: BERT encoder (6 layers)                      │ | |
| │ Self-attention + feed-forward                        │ | |
| │ Each layer extracts deeper semantics                 │ | |
| └──────────────────────────────────────────────────────┘ | |
| ↓ | |
| ┌──────────────────────────────────────────────────────┐ | |
| │ Step 4: Mean pooling                                 │ | |
| │ Mean of all token vectors → sentence vector          │ | |
| └──────────────────────────────────────────────────────┘ | |
| ↓ | |
| ┌──────────────────────────────────────────────────────┐ | |
| │ Step 5: L2 normalization                             │ | |
| │ Scale the vector to unit length                      │ | |
| └──────────────────────────────────────────────────────┘ | |
| ↓ | |
| Output: 3 vectors | |
| ↓ | |
| ┌─────────────────────────────────────────────────────────┐ | |
| │ Vector 1: [0.123, -0.456, 0.789, ..., 0.321] (384-dim)  │ | |
| │ Vector 2: [0.234, 0.567, -0.890, ..., 0.432] (384-dim)  │ | |
| │ Vector 3: [-0.345, 0.678, 0.901, ..., -0.543] (384-dim) │ | |
| └─────────────────────────────────────────────────────────┘ | |
| Key points: | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| ✓ Each chunk → 1 fixed-dimension vector (384-dim) | |
| ✓ Semantically similar texts → vectors that lie close together | |
| ✓ After normalization, cosine similarity reduces to a dot product, so comparisons are cheap | |
| """) | |
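Steps 4 and 5 of the walkthrough (mean pooling, then L2 normalization) can be sketched in plain Python. This is a toy illustration under stated assumptions: the vectors have 4 dimensions instead of 384, the token values are made up, and the helper names `mean_pool` and `l2_normalize` are hypothetical. The real model performs the same arithmetic on tensors inside sentence-transformers.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, x in enumerate(vec):
                total[i] += x
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [x / count for x in total]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


# Toy input: 3 real tokens plus 1 padding token, 4 dimensions each.
token_vectors = [
    [1.0, 0.0, 2.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding, masked out below
]
attention_mask = [1, 1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
sentence_vector = l2_normalize(pooled)
length = math.sqrt(sum(x * x for x in sentence_vector))

print(pooled)           # the mean of the three unmasked token vectors
print(sentence_vector)  # same direction, unit length
print(length)
```

Masking matters because padded positions would otherwise drag the average toward meaningless values.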
| # ============================================================================ | |
| # Part 4: Chroma database storage structure | |
| # ============================================================================ | |
| print("\n" + "=" * 80) | |
| print("💾 Part 4: Chroma database storage structure") | |
| print("=" * 80) | |
| print(""" | |
| What Chroma.from_documents() does: | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Chroma.from_documents( | |
| documents=doc_splits, # 20 chunks | |
| collection_name="rag-chroma", # collection name | |
| embedding=self.embeddings # embedding function | |
| ) | |
| Internal flow: | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Step 1: Create/open the collection | |
| ┌──────────────────────────────────────┐ | |
| │ Collection: "rag-chroma"             │ | |
| │ Metadata: embedding_dimension=384    │ | |
| └──────────────────────────────────────┘ | |
| Step 2: Batch vectorization | |
| texts = [doc.page_content for doc in doc_splits] | |
| vectors = embeddings.embed_documents(texts) # one batched call | |
| ↓ | |
| Step 3: Store the data (per chunk; the IDs and metadata keys shown are illustrative) | |
| ┌──────────────────────────────────────────────────────────┐ | |
| │ ID: "chunk_1"                                            │ | |
| │ ├─ Vector: [0.123, -0.456, ..., 0.321] (384-dim)         │ | |
| │ ├─ Document: "Artificial intelligence is a branch of..." │ | |
| │ └─ Metadata: {                                           │ | |
| │      "source": "https://...",                            │ | |
| │      "chunk_index": 0,                                   │ | |
| │      "total_chunks": 20                                  │ | |
| │    }                                                     │ | |
| ├──────────────────────────────────────────────────────────┤ | |
| │ ID: "chunk_2"                                            │ | |
| │ ├─ Vector: [0.234, 0.567, ..., 0.432]                    │ | |
| │ ├─ Document: "Machine learning is a subfield of AI..."   │ | |
| │ └─ Metadata: {...}                                       │ | |
| ├──────────────────────────────────────────────────────────┤ | |
| │ ID: "chunk_3"                                            │ | |
| │ ├─ Vector: [-0.345, 0.678, ..., -0.543]                  │ | |
| │ ├─ Document: "Deep learning uses multi-layer neural..."  │ | |
| │ └─ Metadata: {...}                                       │ | |
| └──────────────────────────────────────────────────────────┘ | |
| Step 4: Build the HNSW index | |
| Vectors → HNSW graph → fast retrieval | |
| (hierarchical navigable small world graph) | |
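| Storage location: | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| No persist_directory (as in the call above): the collection lives in memory only | |
| With persist_directory="./chroma_db" (chromadb 0.4+; exact layout varies by version): | |
| ./chroma_db/ | |
| ├─ chroma.sqlite3 # documents, metadata, collection info | |
| └─ <segment-uuid>/ # HNSW index files | |
|    ├─ data_level0.bin | |
|    ├─ header.bin | |
|    ├─ length.bin | |
|    └─ link_lists.bin | |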
| """) | |
| # ============================================================================ | |
| # Part 5: How the HNSW index works | |
| # ============================================================================ | |
| print("\n" + "=" * 80) | |
| print("Part 5: The HNSW index - the secret to fast retrieval") | |
| print("=" * 80) | |
| print(""" | |
| HNSW = Hierarchical Navigable Small World | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Why do we need an index? | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Brute-force search: O(n) - compute the distance from the query to every vector | |
| ├─ 10000 vectors → 10000 distance computations | |
| └─ Too slow at scale! | |
| HNSW index: roughly O(log n) - navigate a hierarchical graph | |
| ├─ 10000 vectors → only a small fraction of nodes is visited (about 20-30 in this simplified picture) | |
| └─ Potentially 100x+ faster on large collections | |
| HNSW structure (simplified example): | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Layer 2 (sparsest) | |
| V₁ ──────── V₅ ──────── V₁₀ | |
| ↓ ↓ ↓ | |
| Layer 1 | |
| V₁ ── V₃ ── V₅ ── V₇ ── V₁₀ | |
| ↓ ↓ ↓ ↓ ↓ | |
| Layer 0 (densest) | |
| V₁ ─ V₂ ─ V₃ ─ V₄ ─ V₅ ─ V₆ ─ ... ─ V₁₀ | |
| All vectors live in this layer | |
| Retrieval process: | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Query vector: Q = [0.2, -0.3, 0.5, ...] | |
| Step 1: Start at Layer 2 (coarse search) | |
| Entry point: V₁ | |
| → compute dist(Q, V₁), dist(Q, V₅), dist(Q, V₁₀) | |
| → V₅ is closest → jump to V₅ | |
| Step 2: Descend to Layer 1 (medium precision) | |
| Start from V₅ | |
| → check neighbors V₃, V₇ | |
| → V₇ is closest → jump to V₇ | |
| Step 3: Descend to Layer 0 (high precision) | |
| Start from V₇ | |
| → check all neighbors | |
| → find the K nearest vectors | |
| Result: the top-K most similar chunks | |
| Speed comparison (illustrative figures; real numbers depend on hardware and index parameters): | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Brute-force search: 10000 distance computations ≈ 100 ms | |
| HNSW index: 20-30 distance computations ≈ 1 ms ← about 100x faster | |
| """) | |
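The three-step descent described in Part 5 can be sketched as a toy layered search. This is a simplified illustration, not Chroma's implementation: the graph is wired by hand (ten points on a line, with upper-layer nodes 1, 3, 5, 7, 10 chosen for the example), and the function names `greedy_step`, `search_layer0`, and `hnsw_search` are hypothetical. A real HNSW index builds its layers randomly and tunes parameters such as `M` and `ef`.

```python
import heapq
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_step(layer, vectors, query, start):
    """Upper layers: keep moving to a closer neighbor until none is closer."""
    current = start
    improved = True
    while improved:
        improved = False
        for neighbor in layer[current]:
            if dist(vectors[neighbor], query) < dist(vectors[current], query):
                current = neighbor
                improved = True
    return current


def search_layer0(layer, vectors, query, entry, k, ef):
    """Bottom layer: best-first search that keeps the ef closest nodes seen."""
    d0 = dist(vectors[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap of nodes still to expand
    best = [(-d0, entry)]       # max-heap (negated distances) of results
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break  # the closest unexpanded node cannot improve the results
        for neighbor in layer[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            nd = dist(vectors[neighbor], query)
            if len(best) < ef or nd < -best[0][0]:
                heapq.heappush(candidates, (nd, neighbor))
                heapq.heappush(best, (-nd, neighbor))
                if len(best) > ef:
                    heapq.heappop(best)
    ranked = sorted((-neg_d, node) for neg_d, node in best)
    return [node for _d, node in ranked[:k]]


def hnsw_search(layers, vectors, query, entry, k=3, ef=5):
    """layers is ordered from the sparsest (top) to the densest (layer 0)."""
    current = entry
    for layer in layers[:-1]:
        current = greedy_step(layer, vectors, query, current)
    return search_layer0(layers[-1], vectors, query, current, k, ef)


# Toy data: node i sits at position (i, 0).
vectors = {i: [float(i), 0.0] for i in range(1, 11)}
layer2 = {1: [5], 5: [1, 10], 10: [5]}
layer1 = {1: [3], 3: [1, 5], 5: [3, 7], 7: [5, 10], 10: [7]}
layer0 = {i: [j for j in (i - 1, i + 1) if 1 <= j <= 10] for i in range(1, 11)}
layers = [layer2, layer1, layer0]

result = hnsw_search(layers, vectors, [6.8, 0.0], entry=1, k=3, ef=5)
print(result)  # → [7, 6, 8]
```

With the query at 6.8, the search jumps 1 → 5 on the top layer, 5 → 7 on the middle layer, and only then explores the dense bottom layer around node 7.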
| # ============================================================================ | |
| # Part 6: The retrieval process in detail | |
| # ============================================================================ | |
| print("\n" + "=" * 80) | |
| print("Part 6: Retrieval - from query to results") | |
| print("=" * 80) | |
| print(""" | |
| User query: "What is machine learning?" | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Step 1: Vectorize the query | |
| ─────────────────────────────────────────────────────── | |
| "What is machine learning?" | |
| ↓ | |
| embeddings.embed_query("What is machine learning?") | |
| ↓ | |
| Query vector: [0.345, -0.678, 0.234, ...] (384-dim) | |
| Step 2: HNSW approximate search | |
| ─────────────────────────────────────────────────────── | |
| vectorstore.similarity_search( | |
| query="What is machine learning?", | |
| k=20 # return the top 20 | |
| ) | |
| ↓ | |
| Inside Chroma: | |
| 1. Vectorize the query | |
| 2. Navigate the HNSW graph | |
| 3. Compute distances (Chroma defaults to squared L2; with unit vectors this ranks the same as cosine similarity) | |
| ↓ | |
| Returns the top 20 chunks: | |
| ┌──────────┬───────┬─────────────────────────────────────┐ | |
| │ Chunk ID │ Score │ Content                             │ | |
| ├──────────┼───────┼─────────────────────────────────────┤ | |
| │ chunk_5  │ 0.92  │ "Machine learning is a subfield..." │ | |
| │ chunk_2  │ 0.88  │ "AI includes machine learning..."   │ | |
| │ chunk_11 │ 0.85  │ "Supervised learning is a form..."  │ | |
| │ ...      │ ...   │ ...                                 │ | |
| └──────────┴───────┴─────────────────────────────────────┘ | |
| Step 3: CrossEncoder reranking (a feature of your project) | |
| ─────────────────────────────────────────────────────── | |
| reranker.rerank(query, top_20_chunks, top_k=5) | |
| ↓ | |
| Each chunk is re-scored against the query (deep query-chunk interaction) | |
| ↓ | |
| Final top 5: | |
| ┌──────────┬───────┬─────────────────────────────────────┐ | |
| │ Chunk ID │ Score │ Content                             │ | |
| ├──────────┼───────┼─────────────────────────────────────┤ | |
| │ chunk_5  │ 8.45  │ "Machine learning is a subfield..." │ | |
| │ chunk_11 │ 7.89  │ "Supervised learning is a form..."  │ | |
| │ chunk_2  │ 7.23  │ "AI includes machine learning..."   │ | |
| │ chunk_14 │ 6.78  │ "Deep learning is a form of..."     │ | |
| │ chunk_8  │ 6.12  │ "Reinforcement learning allows..."  │ | |
| └──────────┴───────┴─────────────────────────────────────┘ | |
| Step 4: Pass to the LLM | |
| ─────────────────────────────────────────────────────── | |
| context = "\\n\\n".join([chunk.page_content for chunk in top_5]) | |
| ↓ | |
| The LLM generates the answer | |
| """) | |
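The retrieve-then-rerank shape of Part 6 can be sketched without any model at all. Everything here is a stand-in chosen for illustration: the 2-dimensional vectors are made up, `retrieve` scans every chunk instead of using an index, and `overlap_score` is a crude word-overlap scorer standing in for a CrossEncoder. Only the two-stage structure is the point.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec, store, k):
    """Stage 1: rank every stored chunk by vector similarity, keep the top k."""
    scored = [(dot(query_vec, vec), chunk_id) for chunk_id, (vec, _text) in store.items()]
    scored.sort(reverse=True)
    return [chunk_id for _score, chunk_id in scored[:k]]


def rerank(query, chunk_ids, store, score_fn, top_k):
    """Stage 2: re-score each (query, chunk text) pair with a slower scorer."""
    scored = [(score_fn(query, store[cid][1]), cid) for cid in chunk_ids]
    scored.sort(reverse=True)
    return [cid for _score, cid in scored[:top_k]]


def overlap_score(query, text):
    """Hypothetical stand-in for a cross-encoder: counts shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Toy store: chunk id -> (unit vector, text). All values are invented.
store = {
    "chunk_2": ([0.8, 0.6], "artificial intelligence includes machine learning"),
    "chunk_5": ([1.0, 0.0], "machine learning is what lets computers learn from data"),
    "chunk_8": ([0.0, 1.0], "reinforcement learning allows agents to act"),
    "chunk_11": ([0.6, 0.8], "supervised learning is machine learning with labels"),
}

query = "what is machine learning"
query_vec = [1.0, 0.0]  # pretend embedding of the query

candidates = retrieve(query_vec, store, k=3)
final = rerank(query, candidates, store, overlap_score, top_k=2)

print(candidates)  # → ['chunk_5', 'chunk_2', 'chunk_11']
print(final)       # → ['chunk_5', 'chunk_11']
```

As in the tables above, the second stage can reorder the first stage's candidates: `chunk_11` overtakes `chunk_2` once the text itself is compared with the query.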
| # ============================================================================ | |
| # Part 7: Key technical details | |
| # ============================================================================ | |
| print("\n" + "=" * 80) | |
| print("⚙️ Part 7: Key technical details") | |
| print("=" * 80) | |
| print(""" | |
| 1. Why normalize the vectors? | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| encode_kwargs={'normalize_embeddings': True} | |
| Raw vector: [1.23, -4.56, 7.89, ...] # arbitrary length | |
| Normalized: [0.12, -0.45, 0.78, ...] # length = 1 | |
| Benefits: | |
| ✓ Cosine similarity = dot product (cheaper to compute) | |
| ✓ All vectors are on the same scale | |
| ✓ Vector length no longer affects the similarity score | |
| 2. Cosine similarity vs. Euclidean distance | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Cosine similarity (effectively what your project ranks by) ⭐: | |
| similarity = v₁ · v₂ / (||v₁|| × ||v₂||) | |
| Range: [-1, 1]; 1 means the same direction | |
| Character: looks at direction, ignores length | |
| Euclidean distance: | |
| distance = √(Σᵢ (v₁ᵢ - v₂ᵢ)²) | |
| Range: [0, ∞); 0 means identical | |
| Character: looks at absolute positional difference | |
| After normalization the two rank results identically: ||v₁ - v₂||² = 2 - 2 × similarity | |
| 3. Batch processing | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Not recommended (slow): | |
| for chunk in chunks: | |
| vector = embed_documents([chunk]) # one call per chunk | |
| Recommended (often several times faster) ⭐: | |
| vectors = embed_documents(chunks) # one batched call | |
| ├─ parallel computation on the GPU | |
| └─ less per-call overhead | |
| 4. Memory | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| Choosing the vector dimension: | |
| 384-dim (all-MiniLM-L6-v2) ← your project ⭐ | |
| └─ a balance between accuracy and storage | |
| 768-dim (BERT-base) | |
| └─ usually more accurate, but 2x the storage | |
| 1024-dim (large models) | |
| └─ often the most accurate, but about 2.7x the storage | |
| Storage estimate (float32 vectors only, excluding index and text): | |
| 20 chunks × 384 dims × 4 bytes = 30 KB | |
| 1000 chunks × 384 dims × 4 bytes ≈ 1.5 MB | |
| └─ very compact! | |
| """) | |
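Two claims from Part 7 are easy to check numerically: that Euclidean distance and cosine similarity carry the same information once vectors are unit length (squared distance = 2 - 2 × cosine), and the storage arithmetic. The sketch below uses made-up 3-dimensional vectors and hypothetical helper names; it assumes 4-byte float32 values, as in the estimate above.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def embedding_bytes(num_chunks, dim, bytes_per_value=4):
    """Raw size of the vectors alone, assuming float32 values."""
    return num_chunks * dim * bytes_per_value


a = l2_normalize([1.23, -4.56, 7.89])
b = l2_normalize([0.5, 2.0, -1.0])

squared_distance = euclidean(a, b) ** 2
from_cosine = 2 - 2 * cosine(a, b)
print(abs(squared_distance - from_cosine) < 1e-9)  # → True

print(embedding_bytes(20, 384))    # → 30720 bytes (30 KB)
print(embedding_bytes(1000, 384))  # → 1536000 bytes (about 1.5 MB)
```

Because the squared distance is a decreasing function of the cosine, sorting by smallest distance and sorting by largest cosine give the same order for unit vectors.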
| # ============================================================================ | |
| # Part 8: The complete code flow | |
| # ============================================================================ | |
| print("\n" + "=" * 80) | |
| print("💻 Part 8: Complete code flow, summarized") | |
| print("=" * 80) | |
| print(""" | |
| Your project's complete flow: | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| # 1. Initialize the embedding model | |
| embeddings = HuggingFaceEmbeddings( | |
| model_name="sentence-transformers/all-MiniLM-L6-v2", | |
| model_kwargs={'device': 'cpu'}, | |
| encode_kwargs={'normalize_embeddings': True} | |
| ) | |
| # 2. Split the documents | |
| text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder( | |
| chunk_size=250, | |
| chunk_overlap=50 # ← the value you just changed | |
| ) | |
| doc_splits = text_splitter.split_documents(docs) | |
| # 3. Vectorize and store in Chroma | |
| vectorstore = Chroma.from_documents( | |
| documents=doc_splits, # input: 20 chunks | |
| collection_name="rag-chroma", | |
| embedding=embeddings # embedding function | |
| ) | |
| # ↓ Done automatically inside: | |
| # - batch vectorization: chunks → 384-dim vectors | |
| # - storage: vector + original text + metadata | |
| # - HNSW index construction | |
| # 4. Create the retriever | |
| retriever = vectorstore.as_retriever() | |
| # 5. Retrieve | |
| query = "What is machine learning?" | |
| retrieved = retriever.get_relevant_documents(query) | |
| # ↓ Internally: | |
| # - vectorize the query | |
| # - fast HNSW retrieval | |
| # - return the top-K chunks | |
| # 6. CrossEncoder reranking (optional; your project has it) | |
| reranked = crossencoder.rerank(query, retrieved, top_k=5) | |
| # 7. Feed the reranked chunks to the LLM to generate the answer | |
| answer = llm.generate(context=reranked, question=query) | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| """) | |
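The effect of `chunk_overlap` on the chunk count can be estimated with an idealized fixed-window model. This is a rough sketch under a strong assumption: `RecursiveCharacterTextSplitter` prefers to cut at separators such as paragraph breaks, so its real chunk counts will differ. The helper name `window_starts` is hypothetical.

```python
def window_starts(total_tokens, chunk_size, chunk_overlap):
    """Start offsets of fixed-size windows that advance by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= 0:
        return []
    step = chunk_size - chunk_overlap
    starts = []
    start = 0
    while True:
        starts.append(start)
        if start + chunk_size >= total_tokens:
            break  # this window already reaches the end of the document
        start += step
    return starts


no_overlap = window_starts(5000, chunk_size=250, chunk_overlap=0)
with_overlap = window_starts(5000, chunk_size=250, chunk_overlap=50)

print(len(no_overlap))    # → 20
print(len(with_overlap))  # → 25
```

Under this model, the 20 chunks quoted in the overview correspond to no overlap; with `chunk_overlap=50` each window advances only 200 tokens, so a 5000-token document yields about 25 chunks and correspondingly more vectors to store.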
| # ============================================================================ | |
| # Part 9: Performance tuning suggestions | |
| # ============================================================================ | |
| print("\n" + "=" * 80) | |
| print("Part 9: Performance tuning suggestions") | |
| print("=" * 80) | |
| print(""" | |
| Rating of the current configuration: | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| ✅ Embedding model: all-MiniLM-L6-v2 (lightweight and efficient) ⭐⭐⭐⭐⭐ | |
| ✅ Vector normalization: True (optimized for cosine similarity) ⭐⭐⭐⭐⭐ | |
| ✅ Index type: HNSW (fast retrieval) ⭐⭐⭐⭐⭐ | |
| ✅ Chunk overlap: 50 (preserves context) ⭐⭐⭐⭐⭐ | |
| ✅ CrossEncoder reranking (precise ordering) ⭐⭐⭐⭐⭐ | |
| Overall: production-grade configuration! | |
| Optional optimizations (if you need more): | |
| ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ | |
| 1. GPU acceleration | |
| model_kwargs={'device': 'cuda'} # vectorization up to ~10x faster, depending on hardware | |
| 2. A larger embedding model (if you need higher accuracy) | |
| "BAAI/bge-large-en-v1.5" # 1024-dim; usually more accurate, but slower and larger | |
| 3. Batch size tuning | |
| encode_kwargs={'normalize_embeddings': True, 'batch_size': 64} # sentence-transformers defaults to 32 | |
| 4. Chroma persistence | |
| persist_directory="./chroma_db" # avoids re-vectorizing on every run | |
| """) | |
| print("\n" + "=" * 80) | |
| print("✅ Walkthrough complete! You now understand the full flow from splitting to the vector database") | |
| print("=" * 80) | |
| print() | |