---
library_name: transformers
license: apache-2.0
language:
- zh
- yue
tags:
- text-classification
- zhlid
- modernbert
pipeline_tag: text-classification
---

# ZHLID model card

**Authors**: Lung-Chuan Chen

**GitHub page**: https://github.com/Musubi-ai/ZHLID

## Model information

ZHLID is a classification model specialized in identifying fine-grained Chinese varieties. It adopts the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) architecture and is trained on an in-house dataset composed of Traditional Chinese and Simplified Chinese data.

Unlike general-purpose language identification (LID) tools, ZHLID focuses on distinguishing between closely related Chinese varieties, including:

- **Traditional Chinese (繁體中文)**: written in the traditional character set, used in formal and classical texts.
- **Simplified Chinese (簡體中文)**: written in the simplified character set, designed for easier reading and writing.
- **Cantonese (粵語)**: written form reflecting spoken Cantonese, with its own vocabulary and grammar.
- **Classical Chinese (Traditional) (繁體文言文)**: literary Chinese in traditional characters with concise, classical syntax.
- **Classical Chinese (Simplified) (簡體文言文)**: literary Chinese in simplified characters, used in modern reprints and education.

This makes ZHLID useful for linguistic research, corpus analysis, preprocessing for NLP tasks, or any application requiring accurate recognition of Chinese textual forms.

The following table compares ZHLID with other popular LID tools that support Chinese detection:

| Tool | General Chinese | Traditional Chinese | Simplified Chinese | Classical Chinese | Cantonese |
|------|:----:|:----:|:----:|:----:|:----:|
| ZHLID (ours) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [langdetect](https://github.com/Mimino666/langdetect) | ✅ | ✅ | ✅ | ❌ | ❌ |
| [GlotLID](https://github.com/cisnlp/GlotLID/tree/main) | ✅ | ❌ | ❌ | ❌ | ✅ |
| [langid.py](https://github.com/saffsd/langid.py) | ✅ | ❌ | ❌ | ❌ | ❌ |
| [CLD3](https://github.com/google/cld3?tab=readme-ov-file#supported-languages) | ✅ | ❌ | ❌ | ❌ | ❌ |
| [Lingua](https://github.com/pemistahl/lingua-py) | ✅ | ❌ | ❌ | ❌ | ❌ |

## Installation

To use the ZHLID model, install `transformers` version 4.48.0 or later:

```bash
pip install -U "transformers>=4.48.0"
```

Optionally, you can install [flash-attention](https://github.com/Dao-AILab/flash-attention) to improve inference efficiency:

```bash
pip install flash-attn --no-build-isolation
```

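Once flash-attention is installed, `transformers` models typically pick it up through the `attn_implementation` argument of `from_pretrained`. The following is a minimal sketch of one way to choose a backend at load time. The helper names `choose_attn_implementation` and `load_zhlid` are hypothetical, and falling back to `"sdpa"` when no GPU or no `flash_attn` package is present is an assumption, not a documented ZHLID recommendation.

```python
import importlib.util


def choose_attn_implementation(has_flash_attn: bool, has_cuda: bool) -> str:
    """Return an attention backend name for `from_pretrained`.

    flash-attention requires a CUDA GPU, so fall back to PyTorch SDPA otherwise.
    """
    if has_flash_attn and has_cuda:
        return "flash_attention_2"
    return "sdpa"


def load_zhlid(model_id: str = "MusubiAI/ZHLID"):
    """Hypothetical loader; heavy imports are local so the helper above stays light."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    has_flash = importlib.util.find_spec("flash_attn") is not None
    has_cuda = torch.cuda.is_available()
    attn = choose_attn_implementation(has_flash, has_cuda)

    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        attn_implementation=attn,
        # flash-attention only runs in half precision (fp16 or bf16)
        torch_dtype=torch.bfloat16 if attn == "flash_attention_2" else torch.float32,
    )
    if has_cuda:
        model = model.to("cuda")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer
```

Calling `load_zhlid()` downloads the model, so the backend choice is kept in a separate pure function that can be checked without a GPU.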
## Usage

With the `pipeline` function in `transformers`:

```python
from transformers import pipeline

pipe = pipeline("text-classification", model="MusubiAI/ZHLID")
# Sample input: a Classical Chinese passage about Confucius in simplified characters
text = "孔子\n大成至圣先师孔丘,字仲尼,子姓,孔氏,敬称孔子、孔夫子,生于鲁昌平乡陬邑。"

res = pipe(text)
print(res)
# [{'label': 'zhcn_classical', 'score': 0.9998414516448975}]
```

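For corpus preprocessing, it can help to classify many texts at once and keep only the confident predictions. The sketch below assumes the pipeline output format shown above (one `{'label', 'score'}` dict per input). The helpers `filter_confident` and `classify_corpus`, the 0.9 threshold, and the batch size of 8 are illustrative assumptions and not part of ZHLID.

```python
from typing import List, Tuple


def filter_confident(
    texts: List[str],
    results: List[dict],
    threshold: float = 0.9,
) -> List[Tuple[str, str]]:
    """Pair each text with its label, dropping low-confidence predictions."""
    kept = []
    for text, res in zip(texts, results):
        if res["score"] >= threshold:
            kept.append((text, res["label"]))
    return kept


def classify_corpus(texts: List[str], threshold: float = 0.9) -> List[Tuple[str, str]]:
    """Run the ZHLID pipeline over a list of texts (downloads the model)."""
    from transformers import pipeline

    pipe = pipeline("text-classification", model="MusubiAI/ZHLID")
    # Passing a list returns one result dict per input text
    results = pipe(texts, batch_size=8, truncation=True)
    return filter_confident(texts, results, threshold)
```

Texts that fall below the threshold are simply dropped here; a real pipeline might instead route them to manual review.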
With `AutoModelForSequenceClassification`:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "MusubiAI/ZHLID"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
id2label = model.config.id2label

text = "孔子\n大成至圣先师孔丘,字仲尼,子姓,孔氏,敬称孔子、孔夫子,生于鲁昌平乡陬邑。"
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs)["logits"]

# Convert logits to probabilities and pick the most likely class
scores = nn.functional.softmax(logits, dim=-1)
pred_score, pred_index = torch.max(scores, dim=-1)
pred_score = pred_score.item()
pred_index = pred_index.item()
label = id2label[pred_index]
prediction = {"label": label, "confidence_score": pred_score}
print(prediction)
# {'label': 'zhcn_classical', 'confidence_score': 0.99983811378479}
```

The model can also be run with `vllm`:

```python
from vllm import LLM
import torch

llm = LLM(model="MusubiAI/ZHLID", task="classify")

text = "孔子\n大成至圣先师孔丘,字仲尼,子姓,孔氏,敬称孔子、孔夫子,生于鲁昌平乡陬邑。"

output = llm.classify(text)[0]
probabilities = torch.tensor(output.outputs.probs)

# Get the top predicted class
top_idx = torch.argmax(probabilities).item()
top_prob = probabilities[top_idx].item()

print(f"Confidence: {top_prob:.4f}")

# Mapping from class index to label
id2label = {
    "0": "yue",
    "1": "zhcn_classical",
    "2": "zhtw_classical",
    "3": "zhcn",
    "4": "zhtw"
}

label = id2label[str(top_idx)]
print(label)
```

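The label strings returned by the examples above are short codes. A small lookup table can turn them into readable names. The mapping below is inferred from the label names and the list of varieties in this card, so treat it as an assumption. `LABEL_NAMES` and `describe_label` are hypothetical names and are not shipped with the model.

```python
# Assumed correspondence between ZHLID label codes and variety names
LABEL_NAMES = {
    "yue": "Cantonese",
    "zhcn": "Simplified Chinese",
    "zhtw": "Traditional Chinese",
    "zhcn_classical": "Classical Chinese (Simplified)",
    "zhtw_classical": "Classical Chinese (Traditional)",
}


def describe_label(label: str) -> str:
    """Return a readable name for a label code, or flag it as unknown."""
    return LABEL_NAMES.get(label, f"unknown label: {label}")


print(describe_label("zhcn_classical"))
# Classical Chinese (Simplified)
```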
## Evaluation

We compare the top-1 accuracy of ZHLID with that of [GlotLID](https://github.com/cisnlp/GlotLID/tree/main) and [langdetect](https://github.com/Mimino666/langdetect). Note that because GlotLID provides only a general "cmn_Hani" label for Chinese, its performance on Traditional and Simplified Chinese is measured by whether it outputs this label for inputs in either category.

| Top-1 accuracy | Traditional Chinese | Simplified Chinese | Classical Chinese (Traditional) | Classical Chinese (Simplified) | Cantonese |
|------|:----:|:----:|:----:|:----:|:----:|
| ZHLID (ours) | 1.0 | 1.0 | 0.9 | 1.0 | 0.96 |
| [GlotLID](https://github.com/cisnlp/GlotLID/tree/main) | 0.98 | 0.98 | - | - | 0.9 |
| [langdetect](https://github.com/Mimino666/langdetect) | 0.3 | 0.9 | - | - | - |

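Per-class top-1 accuracy, as reported in the table above, is the fraction of samples of each true class whose predicted label matches it. The sketch below shows that computation on made-up labels. It is not the evaluation script behind the table, and `per_class_accuracy` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List


def per_class_accuracy(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Compute top-1 accuracy separately for each gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    total = defaultdict(int)
    correct = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy labels for illustration only
gold = ["yue", "yue", "zhtw", "zhcn"]
pred = ["yue", "zhtw", "zhtw", "zhcn"]
print(per_class_accuracy(gold, pred))
# {'yue': 0.5, 'zhtw': 1.0, 'zhcn': 1.0}
```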
## License

The ZHLID model is released under the Apache 2.0 license.

## Citation

If you use ZHLID in your research, please cite this repository:

```bibtex
@misc{zhlid2025,
    title = {ZHLID: Fine-grained Chinese Language Identification Package},
    author = {Lung-Chuan Chen},
    year = {2025},
    howpublished = {\url{https://github.com/Musubi-ai/ZHLID}}
}
```