---
license: apache-2.0
language:
- lo
pipeline_tag: feature-extraction
library_name: transformers
---

# LaoNLP-Enhanced Tokenizer

This is an **enhanced Lao SentencePiece/WordLevel tokenizer**, built upon the original work of [savath/laonlp-enhanced](https://huggingface.co/savath/laonlp-enhanced).

🙏 Special thanks to **Savath** for providing the base tokenizer.

---

## 🔹 Update Notes

- Cleaned the vocabulary using dictionary-based validation.
- Preserved the encoding order of existing tokens.
- Added support for **new words not present in the dictionary**.
- Fully compatible with Hugging Face `PreTrainedTokenizerFast`.

---

## 🔹 Installation

```bash
pip install transformers
```

## 🔹 Usage

```python
from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

tokenizer = AutoTokenizer.from_pretrained("LuoYiSULIXAY/laonlp-enhanced-update")
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

tokens = tokenizer.tokenize(" ເປັນແນວໃດ ສະບາຍດີ ບໍເຈົ້າຮູ້ບໍວ່າຂ້ອຍ ແມ່ນ ໃຜ")
print(tokens)

tokens = tokenizer.tokenize("ນີ້ແມ່ນການທົດສອບ")
print(tokens)
```

## 🔹 Citation

If you use this tokenizer, please cite both the original repo and this updated version:

```bibtex
@misc{savath2024laonlp,
  title  = {LaoNLP-Enhanced},
  author = {Savath},
  year   = {2024},
  url    = {https://huggingface.co/savath/laonlp-enhanced}
}

@misc{luoyi2025laonlpupdate,
  title  = {LaoNLP-Enhanced-Update},
  author = {Sulixay Vilaiphone (LuoYi)},
  year   = {2025},
  url    = {https://huggingface.co/LuoYiSULIXAY/laonlp-enhanced-update}
}
```

✍️ Maintainer: Sulixay Vilaiphone (LuoYi)
📧 Contact: Sulixay2001@gmail.com
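
---

For reference, the update steps described in the notes (dictionary-based vocab cleaning with preserved encoding order, plus appending new words) can be sketched roughly as follows. This is a minimal illustration under assumed data structures (a `token -> id` vocab dict and a `dictionary` set of valid Lao words), not the actual cleaning script used for this release:

```python
# Illustrative sketch only: the real update may differ.
# Assumptions: `vocab` maps token -> id, `dictionary` is a set of valid
# Lao words, and special tokens must always be kept.

def clean_vocab(vocab: dict, dictionary: set, special_tokens: set) -> dict:
    # Keep special tokens and dictionary-validated tokens, preserving the
    # original encoding order (sorted by old id), then re-assign contiguous ids.
    kept = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])
            if tok in special_tokens or tok in dictionary]
    return {tok: new_id for new_id, tok in enumerate(kept)}

def add_new_words(vocab: dict, new_words: list) -> dict:
    # Append words not present in the vocab at the end, so that the ids of
    # existing tokens are unchanged.
    next_id = max(vocab.values(), default=-1) + 1
    for word in new_words:
        if word not in vocab:
            vocab[word] = next_id
            next_id += 1
    return vocab
```

Appending new words at the end (rather than re-sorting) is what keeps the encoding of previously existing tokens stable across updates.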