| """ | |
| ๆๅญ่ฝฌๅ้็ๅ ทไฝๅฎ็ฐๆญฅ้ชค๏ผไปฃ็ ๅฑ้ข๏ผ | |
| ๅฑ็คบ HuggingFace Embeddings ๅ ้จ็ๅฎ้ ๆไฝ | |
| """ | |
| print("=" * 80) | |
| print("ๆๅญ โ ๅ้็ๅ ทไฝๅฎ็ฐๆญฅ้ชค") | |
| print("=" * 80) | |
# ============================================================================
# Setup: walking through the complete vectorization process
# ============================================================================
print("\n" + "=" * 80)
print("🔧 Setup: install and import the required libraries")
print("=" * 80)
print("""
Required libraries:
──────────────────────────────────────────────────────
pip install transformers torch sentence-transformers

Imports:
──────────────────────────────────────────────────────
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
""")
# ============================================================================
# Step 1: Load the model and tokenizer
# ============================================================================
print("\n" + "=" * 80)
print("Step 1: Load the pretrained model and tokenizer")
print("=" * 80)
print("""
Code:
──────────────────────────────────────────────────────
from transformers import AutoTokenizer, AutoModel

model_name = "sentence-transformers/all-MiniLM-L6-v2"

# 1. Load the tokenizer (handles text → IDs)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Load the model (handles IDs → vectors)
model = AutoModel.from_pretrained(model_name)
model.eval()  # switch to evaluation mode (no training)
──────────────────────────────────────────────────────

What does each of these two pieces do?
──────────────────────────────────────────────────────
Tokenizer:
├─ Vocabulary: 30,000+ tokens
│  e.g. {"hello": 1, "world": 2, "machine": 3698, ...}
└─ Tokenization rules: how to split text into tokens

Model:
├─ Embedding layer: vocabulary → initial vectors
│  a 30,522 × 384 matrix (one 384-dim vector per token)
├─ Transformer layers: a 6-layer BERT encoder
│  each layer contains Self-Attention + Feed Forward
└─ Parameter count: 22M (22 million numbers)
""")
# ============================================================================
# Step 2: Tokenization
# ============================================================================
print("\n" + "=" * 80)
print("Step 2: Tokenization - turning text into Token IDs")
print("=" * 80)
print("""
Input text:
──────────────────────────────────────────────────────
text = "Machine learning is a subset of artificial intelligence"

Code:
──────────────────────────────────────────────────────
# Tokenize and convert to the model's input format
encoded_input = tokenizer(
    text,
    padding=True,        # pad to a common length
    truncation=True,     # truncate if too long
    max_length=512,      # maximum length
    return_tensors='pt'  # return PyTorch tensors
)
print(encoded_input)
──────────────────────────────────────────────────────

Output (what encoded_input contains):
──────────────────────────────────────────────────────
{
    'input_ids': tensor([[
        101,   # [CLS] special marker
        3698,  # "machine"
        4083,  # "learning"
        2003,  # "is"
        1037,  # "a"
        2042,  # "subset"
        1997,  # "of"
        7976,  # "artificial"
        4454,  # "intelligence"
        102    # [SEP] special marker
    ]]),
    'attention_mask': tensor([[
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1  # every position is valid (1 = attend, 0 = ignore)
    ]])
}

In detail:
──────────────────────────────────────────────────────
input_ids:
    Each number stands for one token
    101  = [CLS] (start-of-sentence marker)
    3698 = "machine"
    102  = [SEP] (end-of-sentence marker)

attention_mask:
    Tells the model which positions are real content (1) and which are padding (0)
    e.g. [1, 1, 1, 0, 0] means the first 3 are real tokens and the last 2 are padding
""")
# ============================================================================
# Step 3: Token IDs → initial vectors via the Embedding layer
# ============================================================================
print("\n" + "=" * 80)
print("Step 3: Token IDs → initial vectors (Embedding layer)")
print("=" * 80)
print("""
This step happens inside the model:
──────────────────────────────────────────────────────
input_ids = [101, 3698, 4083, 2003, ...]
    ↓
Embedding table lookup
    ↓

The Embedding table (simplified):
──────────────────────────────────────────────────────
It is one giant matrix: 30,522 × 384
(30,522 is the vocabulary size, 384 is the vector dimension)

  ID   | dim 1  dim 2  dim 3  ...  dim 384
  ─────────────────────────────────────────
  101  |  0.12  -0.34   0.56  ...   0.78    ← [CLS]
  3698 |  0.23   0.45  -0.67  ...   0.89    ← "machine"
  4083 |  0.34  -0.56   0.78  ...  -0.90    ← "learning"
  2003 |  0.45   0.67  -0.89  ...   0.12    ← "is"
  ...

The lookup (just like a dictionary lookup):
──────────────────────────────────────────────────────
ID 101  → look up row → [0.12, -0.34, 0.56, ..., 0.78]
ID 3698 → look up row → [0.23, 0.45, -0.67, ..., 0.89]
ID 4083 → look up row → [0.34, -0.56, 0.78, ..., -0.90]
...

Result:
──────────────────────────────────────────────────────
token_embeddings = [
    [0.12, -0.34, 0.56, ..., 0.78],   # [CLS]
    [0.23, 0.45, -0.67, ..., 0.89],   # "machine"
    [0.34, -0.56, 0.78, ..., -0.90],  # "learning"
    [0.45, 0.67, -0.89, ..., 0.12],   # "is"
    ...
]
Shape: (10, 384)  # 10 tokens, 384 dimensions each

⚠️ Note: these are NOT the final vectors yet - they still have to go
through the Transformer!
""")
# ============================================================================
# Step 4: Transformer processing (the core)
# ============================================================================
print("\n" + "=" * 80)
print("Step 4: Transformer processing - Self-Attention (the core step)")
print("=" * 80)
print("""
Code:
──────────────────────────────────────────────────────
with torch.no_grad():  # no gradients needed (we are not training)
    outputs = model(**encoded_input)

# outputs.last_hidden_state is the Transformer's output
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # torch.Size([1, 10, 384])
                               #             batch  tokens  dims
──────────────────────────────────────────────────────

What does the Transformer do internally? (6 layers of processing)
──────────────────────────────────────────────────────
Input: the initial embeddings
    [CLS]:    [0.12, -0.34, 0.56, ...]
    machine:  [0.23, 0.45, -0.67, ...]
    learning: [0.34, -0.56, 0.78, ...]
    is:       [0.45, 0.67, -0.89, ...]
    ...
    ↓
┌────────────────────────────────────────────────────────────┐
│ Layer 1: Self-Attention                                    │
│ ────────────────────────────────────────────────────────── │
│                                                            │
│ Every token "looks at" all the other tokens and updates    │
│ its own vector:                                            │
│                                                            │
│ "machine"  sees "learning"     → learns this is a phrase   │
│ "learning" sees "artificial"   → learns it is AI-related   │
│ "is"       sees its neighbours → learns it is a connective │
│                                                            │
│ The updated vectors now carry contextual information       │
└────────────────────────────────────────────────────────────┘
    ↓
┌────────────────────────────────────────────────────────────┐
│ Layer 2: Self-Attention                                    │
│ ────────────────────────────────────────────────────────── │
│ Understanding deepens...                                   │
│ "machine learning" is now treated as a single concept      │
└────────────────────────────────────────────────────────────┘
    ↓
... (Layers 3, 4, 5) ...
    ↓
┌────────────────────────────────────────────────────────────┐
│ Layer 6: Self-Attention (final layer)                      │
│ ────────────────────────────────────────────────────────── │
│ Each token's vector now contains:                          │
│ - its own semantics                                        │
│ - contextual information                                   │
│ - the meaning of the whole sentence                        │
└────────────────────────────────────────────────────────────┘
    ↓
Final output:
    [CLS]:    [0.234, 0.567, -0.890, ...]  # updated: carries whole-sentence info
    machine:  [0.345, -0.678, 0.123, ...]  # carries information from "learning"
    learning: [0.456, 0.789, -0.234, ...]  # carries information from "machine"
    ...

Shape: (1, 10, 384)
        batch  tokens  dims
""")
# ============================================================================
# Step 5: Mean Pooling - merging into a single sentence vector
# ============================================================================
print("\n" + "=" * 80)
print("Step 5: Mean Pooling - merging many token vectors into one sentence vector")
print("=" * 80)
print("""
The problem: we now have 10 tokens, each with its own vector.
How do we turn them into 1 sentence vector?
──────────────────────────────────────────────────────

Code:
──────────────────────────────────────────────────────
def mean_pooling(token_embeddings, attention_mask):
    \"\"\"
    Average over all token vectors (taking attention_mask into account)
    \"\"\"
    # token_embeddings: (1, 10, 384)
    # attention_mask:   (1, 10)

    # Expand the mask to match the embeddings:
    # (1, 10) → (1, 10, 1) → (1, 10, 384)
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(
        token_embeddings.size()
    ).float()

    # Multiply the embeddings by the mask (zeroing out padding),
    # then sum over all tokens
    sum_embeddings = torch.sum(
        token_embeddings * input_mask_expanded,
        dim=1  # sum over the token dimension
    )

    # Count the number of valid tokens
    sum_mask = torch.clamp(
        input_mask_expanded.sum(dim=1),
        min=1e-9  # avoid division by zero
    )

    # Take the mean
    mean_embeddings = sum_embeddings / sum_mask
    return mean_embeddings

# Usage
sentence_embedding = mean_pooling(
    token_embeddings,
    encoded_input['attention_mask']
)
print(sentence_embedding.shape)  # torch.Size([1, 384])
                                 #             batch  dims
──────────────────────────────────────────────────────

The actual computation (simplified example):
──────────────────────────────────────────────────────
10 token vectors, 384 dimensions each:
Token 1:  [0.234, 0.567, -0.890, ..., 0.123]
Token 2:  [0.345, -0.678, 0.123, ..., 0.234]
Token 3:  [0.456, 0.789, -0.234, ..., 0.345]
...
Token 10: [0.567, 0.890, 0.345, ..., 0.456]

Averaging (each dimension independently):
dim 1:   (0.234 + 0.345 + 0.456 + ... + 0.567) / 10 = 0.412
dim 2:   (0.567 - 0.678 + 0.789 + ... + 0.890) / 10 = 0.523
dim 3:   (-0.890 + 0.123 - 0.234 + ... + 0.345) / 10 = -0.089
...
dim 384: (0.123 + 0.234 + 0.345 + ... + 0.456) / 10 = 0.289

Sentence vector = [0.412, 0.523, -0.089, ..., 0.289]  (384-dim)
""")
# ============================================================================
# Step 6: Normalization
# ============================================================================
print("\n" + "=" * 80)
print("Step 6: L2 normalization - scaling the vector to length 1")
print("=" * 80)
print("""
Code:
──────────────────────────────────────────────────────
import torch.nn.functional as F

# L2 normalization
sentence_embedding = F.normalize(
    sentence_embedding,
    p=2,    # L2 norm
    dim=1   # normalize along the feature dimension
)
print(sentence_embedding.shape)  # torch.Size([1, 384])
──────────────────────────────────────────────────────

What normalization does:
──────────────────────────────────────────────────────
Before:
v = [0.412, 0.523, -0.089, ..., 0.289]
length ||v|| = √(0.412² + 0.523² + ... + 0.289²) = 2.37

After:
v_norm = v / ||v||
v_norm = [0.412/2.37, 0.523/2.37, ..., 0.289/2.37]
       = [0.174, 0.221, -0.038, ..., 0.122]
length ||v_norm|| = 1 ✓

Why it helps:
──────────────────────────────────────────────────────
✓ Every vector has the same length (1), which makes comparison easy
✓ Cosine similarity reduces to a dot product (faster to compute):
    cos_sim(a, b) = a·b / (||a|| × ||b||)
    if normalized: cos_sim(a, b) = a·b  → much simpler!
✓ Removes the influence of vector length - only direction matters
""")
# ============================================================================
# Step 7: Final output
# ============================================================================
print("\n" + "=" * 80)
print("Step 7: The final sentence vector")
print("=" * 80)
print("""
The end result:
──────────────────────────────────────────────────────
# Convert to a numpy array (easier to work with)
final_vector = sentence_embedding.cpu().numpy()[0]
print(final_vector.shape)  # (384,)
print(final_vector[:5])    # first 5 numbers
# [0.174, 0.221, -0.038, 0.095, 0.312]
──────────────────────────────────────────────────────

This is the final sentence vector!
──────────────────────────────────────────────────────
Input:  "Machine learning is a subset of artificial intelligence"
Output: [0.174, 0.221, -0.038, ..., 0.122]  (384 numbers)

The vector captures:
✓ the semantics of each word
✓ the relationships between words
✓ the meaning of the whole sentence

It can be used to:
✓ compute similarity with other sentences
✓ store in a vector database
✓ run semantic search
""")
# ============================================================================
# Complete code summary
# ============================================================================
print("\n" + "=" * 80)
print("📝 Complete code summary (actually runnable)")
print("=" * 80)
print("""
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
import numpy as np

def text_to_vector(text):
    \"\"\"
    The complete text-to-vector pipeline
    \"\"\"
    # Step 1: load the model
    # (in practice, load it once outside the function and reuse it)
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Step 2: tokenize
    encoded_input = tokenizer(
        text,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors='pt'
    )

    # Steps 3 & 4: run the model (Embedding + Transformer)
    with torch.no_grad():
        outputs = model(**encoded_input)
    token_embeddings = outputs.last_hidden_state

    # Step 5: mean pooling
    attention_mask = encoded_input['attention_mask']
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(
        token_embeddings.size()
    ).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
    sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
    sentence_embedding = sum_embeddings / sum_mask

    # Step 6: normalize
    sentence_embedding = F.normalize(sentence_embedding, p=2, dim=1)

    # Step 7: convert to numpy
    return sentence_embedding.cpu().numpy()[0]

# Usage example:
text = "Machine learning is a subset of artificial intelligence"
vector = text_to_vector(text)
print(f"Input: {text}")
print(f"Vector shape: {vector.shape}")             # (384,)
print(f"First 10 numbers: {vector[:10]}")
print(f"Vector length: {np.linalg.norm(vector)}")  # should be 1.0
──────────────────────────────────────────────────────

The one-line call in your project:
──────────────────────────────────────────────────────
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vector = embeddings.embed_query(text)
# ← this single line runs all 7 steps above internally!
──────────────────────────────────────────────────────
""")
# ============================================================================
# Timing analysis of the key steps
# ============================================================================
print("\n" + "=" * 80)
print("⏱️ Per-step timing analysis")
print("=" * 80)
print("""
Assume we process one sentence (10 tokens):
──────────────────────────────────────────────────────
Step 1: load model         0.5-2 s   (only once; reused afterwards)
Step 2: tokenize           <1 ms     (very fast)
Step 3: embedding lookup   <1 ms     (a matrix index)
Step 4: Transformer        10-50 ms  (6 layers of computation - the slowest)
Step 5: mean pooling       <1 ms     (a simple average)
Step 6: normalization      <1 ms     (a simple division)
Step 7: format conversion  <1 ms

Total: 10-50 ms (GPU) or 50-200 ms (CPU)

Batch processing (20 sentences):
one at a time: 20 × 50 ms = 1000 ms
in one batch:  100 ms  → 10× faster! (GPU parallelism)

This is why vectorization should be batched!
""")
| print("\n" + "=" * 80) | |
| print("โ ๆๅญ่ฝฌๅ้็ๅฎ็ฐๆญฅ้ชค่ฎฒ่งฃๅฎๆฏ๏ผ") | |
| print("=" * 80) | |
| print(""" | |
| ๆ ธๅฟๆญฅ้ชคๅ้กพ๏ผ | |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| ๆๅญ | |
| โ Step 1: ๅ ่ฝฝๆจกๅ | |
| Tokenizer + Model | |
| โ Step 2: ๅ่ฏ | |
| Token IDs: [101, 3698, 4083, ...] | |
| โ Step 3: Embedding ๆฅ่กจ | |
| ๅๅงๅ้: [(10, 384)] | |
| โ Step 4: Transformer ๅค็ | |
| ๆดๆฐๅ้: [(10, 384)] ๅ ๅซไธไธๆไฟกๆฏ | |
| โ Step 5: Mean Pooling | |
| ๅฅๅญๅ้: [(1, 384)] | |
| โ Step 6: ๅฝไธๅ | |
| ๅฝไธๅๅ้: [(1, 384)] ้ฟๅบฆ=1 | |
| โ Step 7: ่พๅบ | |
| ๆ็ปๅ้: [0.174, 0.221, ..., 0.122] | |
| โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ | |
| ็ฐๅจไฝ ็ฅ้ไบๆฏไธๆญฅ็ๅ ทไฝๆไฝ๏ผ | |
| """) | |
| print() | |