Update README.md
---
license: mit
tags:
- persian
- bpe
- tokenizer
---

# Persian BPE Tokenizer (30K)

A Byte-Pair Encoding (BPE) tokenizer for Persian NLP tasks, with a vocabulary size of 30,000, trained on ~2M Persian texts averaging 10,000 characters each.

## Usage

### Encoding

```python
from tokenizers import Tokenizer

# Load the tokenizer from its serialized JSON file.
tokenizer = Tokenizer.from_file("Persian_BPE_Tokenizer_30K.json")

# Encode a sample sentence ("This is a test text.").
encoded_text = tokenizer.encode("این یک متن آزمایشی است.")
print("Tokens:", encoded_text.tokens)
print("IDs:", encoded_text.ids)
```

### Decoding

```python
# Decode each id separately to see the string piece behind every token.
decoded_tokens = tokenizer.decode_batch([[token_id] for token_id in encoded_text.ids])
print("Decoded:", decoded_tokens)
```
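
To reconstruct the whole input string in a single call, `decode` also accepts the complete list of ids:

```python
print("Decoded text:", tokenizer.decode(encoded_text.ids))
```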

## Training Data

This tokenizer was trained on the following datasets (a training sketch follows the list):
- Wikipedia (20231101.fa): https://huggingface.co/datasets/wikimedia/wikipedia
- Persian Blog: https://huggingface.co/datasets/RohanAiLab/persian_blog
- HomoRich: https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian
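
A rough sketch of how a comparable 30K BPE tokenizer could be trained with the `tokenizers` library; the `corpus_iterator` helper, the `[UNK]` symbol, and the whitespace pre-tokenizer below are illustrative assumptions rather than the confirmed training configuration:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def corpus_iterator():
    # Hypothetical helper: yield raw Persian strings loaded from the
    # datasets listed above (e.g. via the `datasets` library).
    yield "این یک متن آزمایشی است."

# Assumed setup: BPE with an explicit unknown token and whitespace
# pre-tokenization; the original configuration may differ.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus_iterator(), trainer=trainer)
tokenizer.save("Persian_BPE_Tokenizer_30K.json")
```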

## License

Code and tokenizer: MIT License

## Evaluation Metrics

- UNK Rate: 0.0% (on 100,000 samples)
- Compression Ratio: 4.56 (on 100,000 samples; see the measurement sketch below)
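
A minimal sketch of how such numbers could be measured, assuming the UNK rate is the fraction of produced tokens equal to the unknown token (taken here as `[UNK]`) and the compression ratio is input characters per token:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("Persian_BPE_Tokenizer_30K.json")

def evaluate(texts, unk_token="[UNK]"):  # "[UNK]" is an assumed symbol
    total_chars = total_tokens = unk_count = 0
    for text in texts:
        encoding = tokenizer.encode(text)
        total_chars += len(text)
        total_tokens += len(encoding.tokens)
        unk_count += encoding.tokens.count(unk_token)
    return unk_count / total_tokens, total_chars / total_tokens

unk_rate, compression = evaluate(["این یک متن آزمایشی است."])
print(f"UNK Rate: {unk_rate:.2%}, Compression Ratio: {compression:.2f}")
```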

## Requirements

- **For using the tokenizer**:
  - Python >= 3.9
  - tokenizers
- **For training the tokenizer**:
  - pandas
  - datasets
  - requests
  - hazm