---
license: apache-2.0
language:
- lo
pipeline_tag: feature-extraction
library_name: transformers
---

# LaoNLP-Enhanced Tokenizer

This is an **enhanced Lao SentencePiece/WordLevel tokenizer**, built upon the original work of [savath/laonlp-enhanced](https://huggingface.co/savath/laonlp-enhanced).

🙏 Special thanks to **Savath** for providing the base tokenizer.

---

## 🔹 Update Notes

- Cleaned the vocabulary using dictionary-based validation.
- Preserved the encoding order of existing tokens.
- Added support for **new words not present in the dictionary**.
- Fully compatible with Hugging Face `PreTrainedTokenizerFast`.

---

## 🔹 Installation

```bash
pip install transformers
```

## 🔹 Usage

```python
from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

tokenizer = AutoTokenizer.from_pretrained("LuoYiSULIXAY/laonlp-enhanced-update")
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

tokens = tokenizer.tokenize(" ເປັນແນວໃດ ສະບາຍດີ ບໍເຈົ້າຮູ້ບໍວ່າຂ້ອຍ ແມ່ນ ໃຜ")
print(tokens)

tokens = tokenizer.tokenize("ນີ້ແມ່ນການທົດສອບ")
print(tokens)
```

## 🔹 Citation

If you use this tokenizer, please cite both the original repo and this updated version:

```bibtex
@misc{savath2024laonlp,
  title  = {LaoNLP-Enhanced},
  author = {Savath},
  year   = {2024},
  url    = {https://huggingface.co/savath/laonlp-enhanced}
}

@misc{luoyi2025laonlpupdate,
  title  = {LaoNLP-Enhanced-Update},
  author = {Sulixay Vilaiphone (LuoYi)},
  year   = {2025},
  url    = {https://huggingface.co/LuoYiSULIXAY/laonlp-enhanced-update}
}
```

✍️ Maintainer: Sulixay Vilaiphone (LuoYi)
📧 Contact: Sulixay2001@gmail.com
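
---

For reference, the update steps described in the notes (dictionary-based vocab cleaning with preserved encoding order, plus appending new words) can be sketched roughly as follows. This is a minimal illustration under assumed data structures (a `token -> id` vocab dict and a `dictionary` set of valid Lao words), not the actual cleaning script used for this release:

```python
# Illustrative sketch only: the real update may differ.
# Assumptions: `vocab` maps token -> id, `dictionary` is a set of valid
# Lao words, and special tokens must always be kept.

def clean_vocab(vocab: dict, dictionary: set, special_tokens: set) -> dict:
    # Keep special tokens and dictionary-validated tokens, preserving the
    # original encoding order (sorted by old id), then re-assign contiguous ids.
    kept = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])
            if tok in special_tokens or tok in dictionary]
    return {tok: new_id for new_id, tok in enumerate(kept)}

def add_new_words(vocab: dict, new_words: list) -> dict:
    # Append words not present in the vocab at the end, so that the ids of
    # existing tokens are unchanged.
    next_id = max(vocab.values(), default=-1) + 1
    for word in new_words:
        if word not in vocab:
            vocab[word] = next_id
            next_id += 1
    return vocab
```

Appending new words at the end (rather than re-sorting) is what keeps the encoding of previously existing tokens stable across updates.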