amirhofo committed 34dcb72 (verified) · 1 parent: 6534adc

Update README.md

Files changed (1): README.md (+50 −42)
---
license: mit
tags:
- persian
- bpe
- tokenizer
---

# Persian BPE Tokenizer (30K)

A Byte-Pair Encoding (BPE) tokenizer with a 30,000-token vocabulary for Persian NLP tasks, trained on ~2M Persian texts averaging 10,000 characters each.

## Usage

### Encoding
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("Persian_BPE_Tokenizer_30K.json")
encoded_text = tokenizer.encode("این یک متن آزمایشی است.")
print("Tokens:", encoded_text.tokens)
print("IDs:", encoded_text.ids)
```

### Decoding
```python
# Decode the full ID sequence back into text
decoded_text = tokenizer.decode(encoded_text.ids)
print("Decoded:", decoded_text)

# Or recover each token's surface form individually
per_token = tokenizer.decode_batch([[token_id] for token_id in encoded_text.ids])
print("Per-token:", per_token)
```

## Training Data
This tokenizer was trained on the following datasets:
- Wikipedia (20231101.fa): https://huggingface.co/datasets/wikimedia/wikipedia
- Persian Blog: https://huggingface.co/datasets/RohanAiLab/persian_blog
- HomoRich: https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian
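
The training script itself is not included in this repo. As a rough sketch only (the pre-tokenizer, special tokens, and trainer settings below are assumptions, not the actual configuration), a 30K BPE tokenizer can be trained on such corpora with the `tokenizers` library:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical mini-corpus; the real tokenizer was trained on ~2M Persian documents.
corpus = [
    "این یک متن آزمایشی است.",
    "توکنایزر روی متن فارسی آموزش داده شد.",
]

# Build a BPE model with an explicit unknown token (an assumption here)
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train merges up to the target vocabulary size
trainer = BpeTrainer(vocab_size=30000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# Serialize to the single-file JSON format loaded in the Usage section
tokenizer.save("Persian_BPE_Tokenizer_30K.json")
```

With a tiny corpus the trainer simply stops once no more merges are possible, so the resulting vocabulary is far smaller than 30,000.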

## License
Code and tokenizer: MIT License

## Evaluation Metrics
- UNK Rate: 0.0% (on 100,000 samples)
- Compression Ratio: 4.56 (on 100,000 samples)
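
As a sketch of how such metrics can be measured (the definitions here are assumptions: UNK rate as the share of emitted tokens equal to the unknown token, and compression ratio as input characters per token; the demo also trains a throwaway tokenizer rather than loading the released one):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

def evaluate(tok, texts, unk_token="[UNK]"):
    """Assumed metrics: UNK rate = fraction of produced tokens equal to
    unk_token; compression ratio = input characters per produced token."""
    total_tokens = unk_count = total_chars = 0
    for text in texts:
        tokens = tok.encode(text).tokens
        total_tokens += len(tokens)
        unk_count += sum(t == unk_token for t in tokens)
        total_chars += len(text)
    return unk_count / total_tokens, total_chars / total_tokens

# Throwaway demo tokenizer trained on the sample texts themselves
samples = ["این یک متن آزمایشی است.", "توکنایزر روی متن فارسی آموزش داده شد."]
demo = Tokenizer(BPE(unk_token="[UNK]"))
demo.pre_tokenizer = Whitespace()
demo.train_from_iterator(samples, BpeTrainer(vocab_size=500, special_tokens=["[UNK]"]))

unk_rate, ratio = evaluate(demo, samples)
print(f"UNK rate: {unk_rate:.2%}  Compression ratio: {ratio:.2f}")
```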

## Requirements
- **For using the tokenizer**:
  - Python >= 3.9
  - tokenizers
- **For training the tokenizer**:
  - pandas
  - datasets
  - requests
  - hazm