VoVanPhuc/sup-SimCSE-VietNamese-phobert-base

@@ -3,12 +3,11 @@
 1. [Introduction](#introduction)
 2. [Pretrain model](#models)
 3. [Using SimeCSE_Vietnamese with `sentences-transformers`](#sentences-transformers)
-\\t- [Installation](#install1)
-\\t- [Example usage](#usage1)
 4. [Using SimeCSE_Vietnamese with `transformers`](#transformers)
-\\t- [Installation](#install2)
-\\t- [Example usage](#usage2)
 # <a name="introduction"></a> SimeCSE_Vietnamese: Simple Contrastive Learning of Sentence Embeddings with Vietnamese
 Pre-trained SimeCSE_Vietnamese models are the state-of-the-art of Sentence Embeddings with Vietnamese :
@@ -20,7 +19,7 @@ Pre-trained SimeCSE_Vietnamese models are the state-of-the-art of Sentence Embed
 ## Pre-trained models <a name="models"></a>
-Model | #params | Arch.\\t
 ---|---|---
 [`VoVanPhuc/sup-SimCSE-VietNamese-phobert-base`](https://huggingface.co/VoVanPhuc/sup-SimCSE-VietNamese-phobert-base) | 135M | base
 [`VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base`](https://huggingface.co/VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base) | 135M | base
@@ -31,13 +30,19 @@ Model | #params | Arch.\\t
 ### Installation <a name="install1"></a>
  -  Install `sentence-transformers`:
-\\t- `pip install -U sentence-transformers`
-\\t
 ### Example usage <a name="usage1"></a>
 ```python
 from sentence_transformers import SentenceTransformer
 model = SentenceTransformer('VoVanPhuc/sup-SimCSE-VietNamese-phobert-base')
 sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
@@ -52,6 +57,7 @@ sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
           'Bắn chết người trong cuộc rượt đuổi trên sông.'
           ]
 embeddings = model.encode(sentences)
 ```
@@ -59,16 +65,22 @@ embeddings = model.encode(sentences)
 ### Installation <a name="install2"></a>
  -  Install `transformers`:
-\\t- `pip install -U transformers`
-\\t
 ### Example usage <a name="usage2"></a>
 ```python
 import torch
 from transformers import AutoModel, AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
 model = AutoModel.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
 sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
@@ -82,7 +94,10 @@ sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
           'Chủ ki-ốt bị đâm chết trong chợ đầu mối lớn nhất Thanh Hoá.',
           'Bắn chết người trong cuộc rượt đuổi trên sông.'
           ]
-inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
 with torch.no_grad():
     embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
@@ -94,12 +109,12 @@ with torch.no_grad():
 ## Citation
-\\t@article{gao2021simcse,
-\\t   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
-\\t   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
-\\t   journal={arXiv preprint arXiv:2104.08821},
-\\t   year={2021}
-\\t}
     @inproceedings{phobert,
     title     = {{PhoBERT: Pre-trained language models for Vietnamese}},

 1. [Introduction](#introduction)
 2. [Pretrain model](#models)
 3. [Using SimeCSE_Vietnamese with `sentences-transformers`](#sentences-transformers)
+	- [Installation](#install1)
+	- [Example usage](#usage1)
 4. [Using SimeCSE_Vietnamese with `transformers`](#transformers)
+	- [Installation](#install2)
+	- [Example usage](#usage2)
 # <a name="introduction"></a> SimeCSE_Vietnamese: Simple Contrastive Learning of Sentence Embeddings with Vietnamese
 Pre-trained SimeCSE_Vietnamese models are the state-of-the-art of Sentence Embeddings with Vietnamese :
 ## Pre-trained models <a name="models"></a>
+Model | #params | Arch.
 ---|---|---
 [`VoVanPhuc/sup-SimCSE-VietNamese-phobert-base`](https://huggingface.co/VoVanPhuc/sup-SimCSE-VietNamese-phobert-base) | 135M | base
 [`VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base`](https://huggingface.co/VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base) | 135M | base
 ### Installation <a name="install1"></a>
  -  Install `sentence-transformers`:
+	- `pip install -U sentence-transformers`
+ - Install `pyvi` to word segment:
+	- `pip install pyvi`
 ### Example usage <a name="usage1"></a>
 ```python
 from sentence_transformers import SentenceTransformer
+from pyvi.ViTokenizer import tokenize
 model = SentenceTransformer('VoVanPhuc/sup-SimCSE-VietNamese-phobert-base')
 sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
           'Bắn chết người trong cuộc rượt đuổi trên sông.'
           ]
+sentences = [tokenize(sentence) for sentence in sentences]
 embeddings = model.encode(sentences)
 ```
 ### Installation <a name="install2"></a>
  -  Install `transformers`:
+	- `pip install -U transformers`
+ - Install `pyvi` to word segment:
+	- `pip install pyvi`
 ### Example usage <a name="usage2"></a>
 ```python
 import torch
 from transformers import AutoModel, AutoTokenizer
+from pyvi.ViTokenizer import tokenize
+PhobertTokenizer = AutoTokenizer.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
 model = AutoModel.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
 sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
           'Chủ ki-ốt bị đâm chết trong chợ đầu mối lớn nhất Thanh Hoá.',
           'Bắn chết người trong cuộc rượt đuổi trên sông.'
           ]
+sentences = [tokenize(sentence) for sentence in sentences]
+inputs = PhobertTokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
 with torch.no_grad():
     embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
 ## Citation
+	@article{gao2021simcse,
+	   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
+	   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
+	   journal={arXiv preprint arXiv:2104.08821},
+	   year={2021}
+	}
     @inproceedings{phobert,
     title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
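
The updated examples in this diff stop once `embeddings` is computed. As a minimal sketch of one common next step, assuming the goal is to compare sentences by cosine similarity (the README itself does not say so), the helper below builds a pairwise similarity matrix. The names `cosine_similarity`, `similarity_matrix`, and `toy_embeddings` are hypothetical, and the toy vectors only stand in for real model output.

```python
import math
from typing import Iterable, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_matrix(embeddings: Iterable[Iterable[float]]) -> List[List[float]]:
    """Pairwise cosine similarities between all rows of `embeddings`."""
    # Convert every row to plain Python floats so the input can be any
    # iterable of numeric rows.
    rows = [[float(x) for x in row] for row in embeddings]
    return [[cosine_similarity(a, b) for b in rows] for a in rows]


# Toy vectors (assumption: placeholders, not real sentence embeddings).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]
matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(value, 3) for value in row])
```

With real output, passing the array returned by `model.encode(sentences)`, or `embeddings.tolist()` for the `transformers` tensor, should work, since the helper only needs rows of numbers.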

pytorch_model.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a336f5527992f386d57f2c2721e90dad9b63bacabc0103866bddea2c1d469c6b
 size 542443775

 version https://git-lfs.github.com/spec/v1
+oid sha256:920246a089ab078ab493cf03c42c6a6d788683d319d97a48e4dcae8eeed2220a
 size 542443775