Rubai PII Detection v1 (Latin) - Uzbek Personal Information Detector
A BERT-based Named Entity Recognition model for detecting Personally Identifiable Information (PII) in Uzbek text. Trained on 475K+ samples covering both standard text and speech-normalized (ASR) output.
v1.3 Update (January 2026): Added CARD_NUMBER detection for UzCard, HUMO, Visa, and Mastercard. The model now predicts 7 labels (6 PII types plus TEXT). See Changelog for details.
Quick Start
from transformers import pipeline
# Load the model
ner = pipeline("ner", model="islomov/rubai-PII-detection-v1-latin", aggregation_strategy="simple")
# Detect PII in text
text = "Sardor Rustamov telefon raqami 90 123 45 67, pasport AA1234567, karta 8600 1234 5678 9012"
results = ner(text)
for entity in results:
    print(f"{entity['entity_group']}: {entity['word']}")
Output:
NAME: Sardor Rustamov
PHONE: 90 123 45 67
DOCUMENT_ID: AA1234567
CARD_NUMBER: 8600 1234 5678 9012
Supported Entity Types
| Entity | Description | Examples |
|---|---|---|
| NAME | Person names | Sardor Rustamov, Karimova Nilufar Shavkatovna |
| PHONE | Phone numbers | 90 123 45 67, +998 91 234 56 78 |
| DATE | Dates | 15-yanvar 2025-yil, 25.12.2024 |
| ADDRESS | Physical addresses | Toshkent shahri Chilonzor tumani 5-mavze 12-uy |
| DOCUMENT_ID | Document numbers | AA1234567, AB9876543, HTA159387 |
| CARD_NUMBER | Bank card numbers | 8600 1234 5678 9012, 9860 9876 5432 1098 |
| TEXT | Non-PII text | Everything else |
Card Number Support (New in v1.3)
Detects major card types used in Uzbekistan:
| Card Type | Prefix | Example |
|---|---|---|
| UzCard | 8600 | 8600 1234 5678 9012 |
| HUMO | 9860 | 9860 9876 5432 1098 |
| Visa | 4 | 4012 8888 8888 1881 |
| Mastercard | 51-55 | 5123 4567 8901 2345 |
Works with both digit format (8600 1234 5678 9012) and spoken format (sakkiz olti nol nol bir ikki uch...).
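Once a CARD_NUMBER span has been detected, the prefix table above can be applied to the span to guess the card network. The following is a minimal sketch, not part of the model: `to_digits` and `card_network` are hypothetical helper names, and the spoken-format handling assumes digits are read one at a time (as in the example above), so compound number words are not covered.

```python
from typing import Optional

# Uzbek (Latin) words for single digits.
UZ_DIGITS = {
    "nol": "0", "bir": "1", "ikki": "2", "uch": "3", "to'rt": "4",
    "besh": "5", "olti": "6", "yetti": "7", "sakkiz": "8", "to'qqiz": "9",
}

# Prefixes copied from the card type table above.
CARD_PREFIXES = [("8600", "UzCard"), ("9860", "HUMO"), ("4", "Visa")]
CARD_PREFIXES += [(str(p), "Mastercard") for p in range(51, 56)]


def to_digits(span: str) -> str:
    """Collect the digits in a span, reading digit words one by one."""
    # Normalize common apostrophe variants used in Uzbek Latin spelling.
    for variant in "\u02bb\u2018\u2019`":
        span = span.replace(variant, "'")
    digits = []
    for token in span.lower().split():
        if token in UZ_DIGITS:
            digits.append(UZ_DIGITS[token])
        else:
            digits.extend(ch for ch in token if ch in "0123456789")
    return "".join(digits)


def card_network(span: str) -> Optional[str]:
    """Return the card network whose prefix matches, or None if none does."""
    digits = to_digits(span)
    for prefix, name in CARD_PREFIXES:
        if digits.startswith(prefix):
            return name
    return None


print(card_network("8600 1234 5678 9012"))          # UzCard
print(to_digits("sakkiz olti nol nol bir ikki uch"))  # 8600123
```

A prefix match only suggests a network; it does not validate the number.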
Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
# Load model
model_name = "islomov/rubai-PII-detection-v1-latin"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# Predict
text = "Hurmatli Aziza Karimova! Karta: 8600 1234 5678 9012."
inputs = tokenizer(text.split(), is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)
# Map predictions to labels
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions[0]):
    # Skip special tokens, whatever names this tokenizer uses for them
    if token not in tokenizer.all_special_tokens:
        print(f"{token}: {id2label[pred.item()]}")
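The loop above prints one label per sub-word token, so a single word can appear as several pieces. A minimal sketch of collapsing these to one label per input word follows. It assumes a fast tokenizer, where `inputs.word_ids(0)` returns the word index of each token (`None` for special tokens), and it takes the label of each word's first piece, which is one common convention rather than something this model card specifies. `word_level_labels` is a hypothetical helper name, and the sample values are hand-written for illustration.

```python
from typing import List, Optional, Tuple


def word_level_labels(
    words: List[str],
    word_ids: List[Optional[int]],
    token_labels: List[str],
) -> List[Tuple[str, Optional[str]]]:
    """Pair each input word with the label predicted for its first sub-token."""
    first_label = {}
    for word_id, label in zip(word_ids, token_labels):
        # None marks special tokens; later pieces of the same word are ignored.
        if word_id is not None and word_id not in first_label:
            first_label[word_id] = label
    return [(word, first_label.get(i)) for i, word in enumerate(words)]


# Hand-written stand-ins for tokenizer and model output (illustrative only).
words = ["Aziza", "Karimova", "keldi"]
word_ids = [None, 0, 0, 1, 2, None]
token_labels = ["TEXT", "NAME", "NAME", "NAME", "TEXT", "TEXT"]

print(word_level_labels(words, word_ids, token_labels))
# [('Aziza', 'NAME'), ('Karimova', 'NAME'), ('keldi', 'TEXT')]
```

With the variables from the Basic Usage code, the inputs would be `text.split()`, `inputs.word_ids(0)`, and `[id2label[p.item()] for p in predictions[0]]`.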
Extract Entities as Dictionary
from transformers import pipeline
def extract_pii(text: str) -> dict:
    """Extract all PII entities from text."""
    ner = pipeline("ner", model="islomov/rubai-PII-detection-v1-latin", aggregation_strategy="simple")
    results = ner(text)
    entities = {}
    for r in results:
        label = r["entity_group"]
        if label != "TEXT":
            entities.setdefault(label, []).append(r["word"])
    return entities
# Example
text = """
Mijoz Rustam Aliyev 2024-yil 15-noyabr kuni 93 456 78 90 raqamidan qo'ng'iroq qildi.
Manzili: Samarqand Registon ko'chasi 10-uy. Pasport: AB7654321.
Karta: 8600 1111 2222 3333.
"""
pii = extract_pii(text)
print(pii)
# {'NAME': ['Rustam Aliyev'], 'DATE': ['2024-yil 15-noyabr'], 'PHONE': ['93 456 78 90'],
#  'ADDRESS': ["Samarqand Registon ko'chasi 10-uy"], 'DOCUMENT_ID': ['AB7654321'],
#  'CARD_NUMBER': ['8600 1111 2222 3333']}
PII Masking / Anonymization
from transformers import pipeline
def mask_pii(text: str, mask_char: str = "█") -> str:
    """Replace PII with mask characters."""
    ner = pipeline("ner", model="islomov/rubai-PII-detection-v1-latin", aggregation_strategy="simple")
    results = ner(text)
    # Sort by start position (descending) so earlier offsets stay valid while replacing
    results = sorted(results, key=lambda x: x["start"], reverse=True)
    masked = text
    for r in results:
        if r["entity_group"] != "TEXT":
            # Use the character span, since r["word"] may differ in length from the source text
            mask = mask_char * (r["end"] - r["start"])
            masked = masked[:r["start"]] + mask + masked[r["end"]:]
    return masked
# Example
text = "Sardor Rustamov telefon 90 123 45 67, karta 8600 1234 5678 9012"
print(mask_pii(text))
# ███████████████ telefon ████████████, karta ███████████████████
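Block characters hide the value but also hide what kind of value it was. A variant that substitutes a `[LABEL]` placeholder is sketched below. It works on the list of dicts the pipeline returns, using only the `entity_group`, `start`, and `end` keys, so the sample entities here are hand-written stand-ins rather than real model output. `redact_with_labels` is a hypothetical helper name.

```python
from typing import Dict, List


def redact_with_labels(text: str, entities: List[Dict]) -> str:
    """Replace each non-TEXT span with a [LABEL] placeholder."""
    redacted = text
    # Work from the end of the string so earlier offsets remain valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] == "TEXT":
            continue
        placeholder = f"[{ent['entity_group']}]"
        redacted = redacted[:ent["start"]] + placeholder + redacted[ent["end"]:]
    return redacted


# Hand-written entities with character offsets (illustrative only).
sample_text = "Sardor Rustamov telefon 90 123 45 67"
sample_entities = [
    {"entity_group": "NAME", "start": 0, "end": 15},
    {"entity_group": "TEXT", "start": 16, "end": 23},
    {"entity_group": "PHONE", "start": 24, "end": 36},
]

print(redact_with_labels(sample_text, sample_entities))
# [NAME] telefon [PHONE]
```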
Batch Processing
from transformers import pipeline
ner = pipeline("ner", model="islomov/rubai-PII-detection-v1-latin", aggregation_strategy="simple")
texts = [
"Sardor Rustamov telefon: 90 111 22 33",
"Pasport raqami AA1234567",
"Karta: 8600 1234 5678 9012",
"Toshkent shahri Yunusobod tumani",
]
# Process batch
for text in texts:
    entities = ner(text)
    pii_found = [e for e in entities if e["entity_group"] != "TEXT"]
    print(f"Text: {text[:50]}...")
    print(f"PII: {[(e['entity_group'], e['word']) for e in pii_found]}\n")
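The loop above keeps every non-TEXT entity. Aggregated pipeline results also carry a `score` field, so low-confidence detections can be dropped before further processing. The sketch below shows one way to do that. The helper name `confident_pii` and the 0.80 threshold are assumptions for illustration, not recommended values, and the sample entities are hand-written.

```python
from typing import Dict, List


def confident_pii(entities: List[Dict], min_score: float = 0.80) -> List[Dict]:
    """Keep non-TEXT entities whose score is at least min_score."""
    return [
        e for e in entities
        if e["entity_group"] != "TEXT" and float(e["score"]) >= min_score
    ]


# Hand-written pipeline-style results (illustrative only).
sample = [
    {"entity_group": "NAME", "word": "Sardor Rustamov", "score": 0.99},
    {"entity_group": "TEXT", "word": "telefon", "score": 0.98},
    {"entity_group": "PHONE", "word": "90 111 22 33", "score": 0.55},
]

print([(e["entity_group"], e["word"]) for e in confident_pii(sample)])
# [('NAME', 'Sardor Rustamov')]
```

For redaction a lower threshold is usually the safer choice, since a missed entity costs more than an extra mask.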
Model Performance
Overall Metrics (v1.3)
| Metric | Score |
|---|---|
| Precision | 96.1% |
| Recall | 96.0% |
| F1 Score | 96.1% |
Accuracy by Entity Type
| Entity | Precision | Recall | F1 |
|---|---|---|---|
| NAME | 96.2% | 96.8% | 96.5% |
| PHONE | 97.1% | 97.5% | 97.3% |
| DATE | 95.8% | 96.1% | 95.9% |
| ADDRESS | 94.5% | 95.2% | 94.8% |
| DOCUMENT_ID | 95.3% | 94.8% | 95.0% |
| CARD_NUMBER | 97.8% | 98.2% | 98.0% |
| Overall | 96.1% | 96.0% | 96.1% |
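F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick arithmetic check, the sketch below recomputes F1 from the rounded precision and recall in each per-entity row. It assumes the standard formula applies to these rows; the evaluation code itself is not shown in this card.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from the per-entity rows above.
reported = {
    "NAME": (96.2, 96.8, 96.5),
    "PHONE": (97.1, 97.5, 97.3),
    "DATE": (95.8, 96.1, 95.9),
    "ADDRESS": (94.5, 95.2, 94.8),
    "DOCUMENT_ID": (95.3, 94.8, 95.0),
    "CARD_NUMBER": (97.8, 98.2, 98.0),
}

for entity, (p, r, f1) in reported.items():
    print(f"{entity}: computed {f1_score(p, r):.1f}, reported {f1}")
```

For example, NAME gives 2 × 96.2 × 96.8 / 193.0 ≈ 96.5, matching the table.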
Test Results (v1.3)
| Category | Pass Rate | Notes |
|---|---|---|
| Card Numbers | 100% (8/8) | UzCard, HUMO, Visa all detected |
| False Positives | 100% (24/24) | Won't flag prices, ages, quantities |
| Original Text | 87% | Standard text with digits |
| Denormalized | 76% | Speech/ASR output with number words |
Key Improvements in v1.3:
- ✅ CARD_NUMBER: 100% detection rate (was 0% in v1.2)
- ✅ Zero false positives on non-PII numbers (prices, ages, etc.)
- ✅ Cleaned training data (removed 3,211 misaligned samples)
Training Data
Trained on 475,185 samples from diverse domains:
| Source | Samples | Description |
|---|---|---|
| Original HuggingFace | ~150K | Standard text from rubai-NER-150K-Personal |
| LLM Generated | ~200K | Synthetic data with realistic PII patterns |
| Card Number Samples | ~72K | UzCard, HUMO, Visa in various formats |
| Denormalized | ~50K | Numbers written as words (ASR simulation) |
100+ domains covered: banking, healthcare, education, e-commerce, government, travel, telecom, and more.
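To illustrate what "numbers written as words" means in the Denormalized row, the sketch below spells out each digit of a purely numeric token as an Uzbek word. This is only an illustration of the idea. The procedure actually used to build the training data is not described in this card, `spell_digits` is a hypothetical helper name, and real ASR output may read numbers as compound words instead of digit by digit.

```python
# Uzbek (Latin) words for the digits 0-9, indexed by digit value.
DIGIT_WORDS = [
    "nol", "bir", "ikki", "uch", "to'rt",
    "besh", "olti", "yetti", "sakkiz", "to'qqiz",
]


def spell_digits(text: str) -> str:
    """Spell out purely numeric tokens digit by digit; leave other tokens as-is."""
    out = []
    for token in text.split():
        if token.isascii() and token.isdigit():
            out.append(" ".join(DIGIT_WORDS[int(ch)] for ch in token))
        else:
            out.append(token)
    return " ".join(out)


print(spell_digits("Karta: 8600 1234"))
# Karta: sakkiz olti nol nol bir ikki uch to'rt
```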
Use Cases
- Data Privacy Compliance - Automatically detect PII before data sharing
- Document Redaction - Mask sensitive information in documents
- Customer Support - Flag conversations containing personal data
- Data Anonymization - Prepare datasets for ML training
- Financial Services - Detect card numbers in chat logs
- Audit & Monitoring - Track PII in logs and communications
Limitations
- Optimized for Latin script Uzbek text (Cyrillic support limited)
- Casual addresses with house numbers work (chilonzor 12-uy, yunusobod 5-kvartal 23-uy)
- Landmark-only references without house numbers are correctly NOT detected (chorsu yonida is not a full address)
- Denormalized phone numbers are sometimes confused with card numbers
- Does not detect: email addresses, URLs (coming in v2)
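Since the model does not label email addresses or URLs, a pattern-based pass can be run alongside it as a stopgap. The sketch below is an assumption-laden example and not part of the model. The `EMAIL` and `URL` labels and the `find_emails_and_urls` helper are hypothetical, and the regular expressions are deliberately simple and will miss or over-match some inputs (for instance, trailing punctuation after a URL is included in the match).

```python
import re
from typing import Dict, List

# Simple illustrative patterns, not full RFC-compliant matchers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def find_emails_and_urls(text: str) -> List[Dict]:
    """Return pipeline-style dicts for email addresses and http(s) URLs."""
    found = []
    for label, pattern in (("EMAIL", EMAIL_RE), ("URL", URL_RE)):
        for match in pattern.finditer(text):
            found.append({
                "entity_group": label,
                "word": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    return sorted(found, key=lambda e: e["start"])


sample = "Aloqa: info@example.com yoki https://example.com/help"
print([(e["entity_group"], e["word"]) for e in find_emails_and_urls(sample)])
# [('EMAIL', 'info@example.com'), ('URL', 'https://example.com/help')]
```

The output uses the same keys as the aggregated pipeline results, so the two lists can be concatenated before masking.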
Changelog
v1.3 (January 2026)
- New: Added CARD_NUMBER entity type (7 entities total)
- New: Supports UzCard (8600), HUMO (9860), Visa (4), Mastercard (51-55)
- New: Detects card numbers in spoken format (sakkiz olti nol nol...)
- Improved: Dataset cleaned - removed 3,211 misaligned samples
- Improved: Fixed 2,888 invalid/noise labels
- Training: 475K samples (up from 400K)
- F1 Score: 96.1%
v1.2 (January 2026)
- Added denormalized text support for ASR output
- Improved address detection
- Added more document ID formats (HTA, HAA)
v1.0 (January 2026)
- Initial release with 5 entity types
- 400K training samples
- F1 Score: 95.7%
Model Details
- Base Model: tahrirchi/tahrirchi-bert-base
- Architecture: BERT for Token Classification
- Labels: 7 (TEXT, PHONE, DATE, NAME, ADDRESS, DOCUMENT_ID, CARD_NUMBER)
- Training: 5 epochs, batch size 32, learning rate 2e-5
- Languages: Uzbek (Latin script)
- License: CC BY-NC 4.0 (Non-Commercial)
License
This model is released under Apache license 2.0.
Attribution
When using this model, please cite:
Rubai PII Detection v1.3 by Rubai AI
https://huggingface.co/islomov/rubai-PII-detection-v1-latin
Citation
@misc{rubai-pii-detection-2025,
  author = {Sardor Islomov and {Rubai AI}},
  title = {Rubai PII Detection v1.3 - Uzbek Personal Information Detector},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/islomov/rubai-PII-detection-v1-latin}
}
Contact
- Email: islomov49@gmail.com
Made with ❤️ by Rubai AI