---
license: mit
tags:
- persian
- bpe
- tokenizer
language:
- fa
---
# Persian BPE Tokenizer (30K)
A Byte-Pair Encoding (BPE) tokenizer for Persian NLP tasks, with a vocabulary of 30,000 tokens. It was trained on roughly 2 million Persian texts averaging 10,000 characters each.
## Usage
### Encoding
```python
from tokenizers import Tokenizer
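# Load the tokenizer from its JSON file (assumed to be in the working directory)
tokenizer = Tokenizer.from_file("Persian_BPE_Tokenizer_30K.json")
# Encode a sample sentence ("This is a test text.")
encoded_text = tokenizer.encode("این یک متن آزمایشی است.")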
print("Tokens:", encoded_text.tokens)
print("IDs:", encoded_text.ids)
```
### Decoding
```python
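# Decode each ID separately (reuses `tokenizer` and `encoded_text` from the Encoding example);
# the result is a list with one string per token
decoded_text = tokenizer.decode_batch([[token_id] for token_id in encoded_text.ids])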
print("Decoded:", decoded_text)
```
## Training Data
This tokenizer was trained on the following datasets:
- Wikipedia (20231101.fa): https://huggingface.co/datasets/wikimedia/wikipedia
- Persian Blog: https://huggingface.co/datasets/RohanAiLab/persian_blog
- HomoRich: https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian
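The exact training script is not included in this card. As a rough illustration only, a BPE tokenizer of this size could be trained with the `tokenizers` library along the following lines. The helper name `train_bpe`, the `[UNK]` special token, and the whitespace pre-tokenizer are all assumptions, and any text normalization (for example with hazm) is left out.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers


def train_bpe(texts, vocab_size=30_000, unk_token="[UNK]"):
    """Train a BPE tokenizer on an iterable of strings (hypothetical helper)."""
    # Assumption: "[UNK]" is used as the unknown token
    tokenizer = Tokenizer(models.BPE(unk_token=unk_token))
    # Assumption: split on whitespace and punctuation before learning merges
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=[unk_token],
        show_progress=False,
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer


# Tiny in-memory corpus, for illustration only
sample_texts = ["سلام دنیا", "سلام به همه"]
demo_tokenizer = train_bpe(sample_texts, vocab_size=50)
print(demo_tokenizer.encode("سلام دنیا").tokens)
```

For the real corpora listed above, `texts` would be an iterator over the text column of each dataset, and the trained tokenizer could be written out with `tokenizer.save("Persian_BPE_Tokenizer_30K.json")`.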
## License
Code and tokenizer: MIT License
## Evaluation Metrics
- UNK Rate: 0.0% (on 100,000 samples)
- Compression Ratio: 4.56 (on 100,000 samples)
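This card does not define how the two metrics are computed. The sketch below shows one common reading, assuming that UNK rate is the fraction of produced tokens equal to the unknown token and that compression ratio is the average number of characters per token. The function names and the `[UNK]` token string are hypothetical.

```python
def unk_rate(token_lists, unk_token="[UNK]"):
    """Fraction of all tokens that equal the unknown token (assumed definition)."""
    total = sum(len(tokens) for tokens in token_lists)
    unknown = sum(tokens.count(unk_token) for tokens in token_lists)
    return unknown / total if total else 0.0


def compression_ratio(texts, token_lists):
    """Average characters per token across all samples (assumed definition)."""
    total_chars = sum(len(text) for text in texts)
    total_tokens = sum(len(tokens) for tokens in token_lists)
    return total_chars / total_tokens if total_tokens else 0.0
```

With a loaded tokenizer, `token_lists` could be built as `[enc.tokens for enc in tokenizer.encode_batch(texts)]`.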
## Requirements
- **For using the tokenizer**:
  - Python >= 3.9
  - tokenizers
- **For training the tokenizer**:
  - pandas
  - datasets
  - requests
  - hazm