---
license: mit
tags:
- persian
- bpe
- tokenizer
language:
- fa
---

# Persian BPE Tokenizer (30K)

A Byte-Pair Encoding (BPE) tokenizer with a 30,000-token vocabulary, trained for Persian NLP tasks on roughly 2 million Persian texts averaging 10,000 characters each.

## Usage

### Encoding
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("Persian_BPE_Tokenizer_30K.json")
encoded_text = tokenizer.encode("این یک متن آزمایشی است.")
print("Tokens:", encoded_text.tokens)
print("IDs:", encoded_text.ids)
```

### Decoding
```python
# Decode each token ID individually; use tokenizer.decode(encoded_text.ids)
# to reconstruct the full text in one call instead.
decoded_tokens = tokenizer.decode_batch([[token_id] for token_id in encoded_text.ids])
print("Decoded:", decoded_tokens)
```

## Training Data
This tokenizer was trained on the following datasets:
- Wikipedia (20231101.fa): https://huggingface.co/datasets/wikimedia/wikipedia
- Persian Blog: https://huggingface.co/datasets/RohanAiLab/persian_blog
- HomoRich: https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian

## License
Code and tokenizer: MIT License

## Evaluation Metrics
- UNK Rate: 0.0% (on 100,000 samples)
- Compression Ratio: 4.56 (on 100,000 samples)
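The compression ratio here is most likely average characters per token. A minimal sketch of how such a figure could be computed (the function name and exact definition are assumptions, not the authors' evaluation script):

```python
def compression_ratio(texts, encode):
    """Average number of characters per token over a sample corpus.

    `encode` is any callable returning a list of tokens for a string.
    """
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens

# Toy example using whitespace splitting as a stand-in "tokenizer":
ratio = compression_ratio(["متن آزمایشی"], str.split)
```

With the real tokenizer, `encode` would be `lambda t: tokenizer.encode(t).ids`, and a higher ratio means each token covers more text.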
  
## Requirements
- **For using the tokenizer**:
  - Python >= 3.9
  - tokenizers
- **For training the tokenizer**:
  - pandas
  - datasets
  - requests
  - hazm
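The training dependencies above suggest a pipeline that loads and cleans the corpora (pandas, datasets, hazm) and then trains with the `tokenizers` library. A minimal, hypothetical training sketch (the toy corpus, `[UNK]` special token, and whitespace pre-tokenizer are assumptions; the actual run used the datasets listed earlier):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny stand-in corpus; the real training data is ~2M Persian texts.
corpus = ["این یک متن آزمایشی است.", "توکنایزر فارسی"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("Persian_BPE_Tokenizer_30K.json")
```

On a corpus this small the learned vocabulary stays far below 30,000; the trainer simply stops when no more merges are possible.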