Rubai PII Detection v1 (Latin) - Uzbek Personal Information Detector

A BERT-based Named Entity Recognition model for detecting Personally Identifiable Information (PII) in Uzbek text. Trained on 475K+ samples covering both standard text and speech-normalized (ASR) output.

v1.3 Update (January 2026): Added CARD_NUMBER detection for UzCard, HUMO, Visa, and Mastercard. The model now predicts 7 labels (6 PII entity types plus TEXT). See Changelog for details.

Quick Start

from transformers import pipeline

# Load the model
ner = pipeline("ner", model="islomov/rubai-PII-detection-v1-latin", aggregation_strategy="simple")

# Detect PII in text
text = "Sardor Rustamov telefon raqami 90 123 45 67, pasport AA1234567, karta 8600 1234 5678 9012"
results = ner(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']}")

Output:

NAME: Sardor Rustamov
PHONE: 90 123 45 67
DOCUMENT_ID: AA1234567
CARD_NUMBER: 8600 1234 5678 9012

Supported Entity Types

Entity	Description	Examples
`NAME`	Person names	Sardor Rustamov, Karimova Nilufar Shavkatovna
`PHONE`	Phone numbers	90 123 45 67, +998 91 234 56 78
`DATE`	Dates	15-yanvar 2025-yil, 25.12.2024
`ADDRESS`	Physical addresses	Toshkent shahri Chilonzor tumani 5-mavze 12-uy
`DOCUMENT_ID`	Document numbers	AA1234567, AB9876543, HTA159387
`CARD_NUMBER`	Bank card numbers	8600 1234 5678 9012, 9860 9876 5432 1098
`TEXT`	Non-PII text	Everything else

Card Number Support (New in v1.3)

Detects major card types used in Uzbekistan:

Card Type	Prefix	Example
UzCard	8600	8600 1234 5678 9012
HUMO	9860	9860 9876 5432 1098
Visa	4	4012 8888 8888 1881
Mastercard	51-55	5123 4567 8901 2345

Works with both digit format (8600 1234 5678 9012) and spoken format (sakkiz olti nol nol bir ikki uch...).
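
The prefix table above can be turned into a small post-processing step. The following is a minimal sketch, not part of the model: `CARD_PREFIXES` and `card_type` are hypothetical names, and the function assumes the detected `CARD_NUMBER` span is already in digit format.

```python
import re

# Prefixes copied from the table above; checked in order.
CARD_PREFIXES = [
    ("UzCard", ("8600",)),
    ("HUMO", ("9860",)),
    ("Visa", ("4",)),
    ("Mastercard", ("51", "52", "53", "54", "55")),
]

def card_type(card_number: str) -> str:
    """Return the card type for a digit-format card number, or 'UNKNOWN'."""
    digits = re.sub(r"\D", "", card_number)  # drop spaces, dashes, etc.
    for name, prefixes in CARD_PREFIXES:
        if digits.startswith(prefixes):  # str.startswith accepts a tuple
            return name
    return "UNKNOWN"

print(card_type("8600 1234 5678 9012"))  # UzCard
print(card_type("5123 4567 8901 2345"))  # Mastercard
```

Spoken-format numbers would need to be converted to digits before a lookup like this can be applied.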

Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
model_name = "islomov/rubai-PII-detection-v1-latin"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Predict
text = "Hurmatli Aziza Karimova! Karta: 8600 1234 5678 9012."
inputs = tokenizer(text.split(), is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

# Map predictions to labels
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions[0]):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        print(f"{token}: {id2label[pred.item()]}")
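
The loop above prints one label per sub-word token. If one label per input word is preferred, a common convention is to keep the label of each word's first sub-token. The sketch below assumes a fast tokenizer, so that `inputs.word_ids(0)` is available; `word_level_labels` is a hypothetical helper, demonstrated here on hand-written data so that it runs without downloading the model.

```python
from typing import List, Optional

def word_level_labels(word_ids: List[Optional[int]], token_labels: List[str]) -> List[str]:
    """Keep the label of the first sub-token of each word.

    Special tokens have word id None and are skipped.
    """
    labels = []
    previous = None
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None:
            continue
        if word_id != previous:
            labels.append(label)
            previous = word_id
    return labels

# With the snippet above, the inputs would be built like this (assumption):
#   word_ids = inputs.word_ids(0)
#   token_labels = [id2label[p.item()] for p in predictions[0]]
word_ids = [None, 0, 1, 1, 2, None]
token_labels = ["TEXT", "TEXT", "NAME", "NAME", "NAME", "TEXT"]
print(word_level_labels(word_ids, token_labels))  # ['TEXT', 'NAME', 'NAME']
```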

Extract Entities as Dictionary

from transformers import pipeline

def extract_pii(text: str) -> dict:
    """Extract all PII entities from text."""
    ner = pipeline("ner", model="islomov/rubai-PII-detection-v1-latin", aggregation_strategy="simple")
    results = ner(text)

    entities = {}
    for r in results:
        label = r["entity_group"]
        if label != "TEXT":
            entities.setdefault(label, []).append(r["word"])

    return entities

# Example
text = """
Mijoz Rustam Aliyev 2024-yil 15-noyabr kuni 93 456 78 90 raqamidan qo'ng'iroq qildi.
Manzili: Samarqand Registon ko'chasi 10-uy. Pasport: AB7654321.
Karta: 8600 1111 2222 3333.
"""

pii = extract_pii(text)
print(pii)
# {'NAME': ['Rustam Aliyev'], 'DATE': ['2024-yil 15-noyabr'], 'PHONE': ['93 456 78 90'],
#  'ADDRESS': ["Samarqand Registon ko'chasi 10-uy"], 'DOCUMENT_ID': ['AB7654321'],
#  'CARD_NUMBER': ['8600 1111 2222 3333']}

PII Masking / Anonymization

from transformers import pipeline

def mask_pii(text: str, mask_char: str = "█") -> str:
    """Replace PII with mask characters."""
    ner = pipeline("ner", model="islomov/rubai-PII-detection-v1-latin", aggregation_strategy="simple")
    results = ner(text)

    # Sort by start position (descending) to replace from end
    results = sorted(results, key=lambda x: x["start"], reverse=True)

    masked = text
    for r in results:
        if r["entity_group"] != "TEXT":
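            # Use the character span, since the decoded "word" can differ in length from the source text
            mask = mask_char * (r["end"] - r["start"])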
            masked = masked[:r["start"]] + mask + masked[r["end"]:]

    return masked

# Example
text = "Sardor Rustamov telefon 90 123 45 67, karta 8600 1234 5678 9012"
print(mask_pii(text))
# ███████████████ telefon ████████████, karta ███████████████████
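
An alternative to block characters is to replace each span with its label, which keeps the redacted text readable. This is a minimal sketch: `replace_with_labels` is a hypothetical helper that takes entity dictionaries shaped like the pipeline output (`entity_group`, `start`, `end`). The entities below are written by hand for illustration rather than produced by the model.

```python
from typing import Dict, List

def replace_with_labels(text: str, entities: List[Dict]) -> str:
    """Replace each PII span with a placeholder such as [NAME]."""
    redacted = text
    # Work from the end of the string so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        if e["entity_group"] != "TEXT":
            placeholder = f"[{e['entity_group']}]"
            redacted = redacted[:e["start"]] + placeholder + redacted[e["end"]:]
    return redacted

text = "Sardor Rustamov telefon 90 123 45 67"
entities = [  # hand-written example spans
    {"entity_group": "NAME", "start": 0, "end": 15},
    {"entity_group": "PHONE", "start": 24, "end": 36},
]
print(replace_with_labels(text, entities))  # [NAME] telefon [PHONE]
```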

Batch Processing

from transformers import pipeline

ner = pipeline("ner", model="islomov/rubai-PII-detection-v1-latin", aggregation_strategy="simple")

texts = [
    "Sardor Rustamov telefon: 90 111 22 33",
    "Pasport raqami AA1234567",
    "Karta: 8600 1234 5678 9012",
    "Toshkent shahri Yunusobod tumani",
]

# Pass the whole list so the pipeline can batch the inputs
all_entities = ner(texts, batch_size=8)

for text, entities in zip(texts, all_entities):
    pii_found = [e for e in entities if e["entity_group"] != "TEXT"]
    print(f"Text: {text[:50]}...")
    print(f"PII: {[(e['entity_group'], e['word']) for e in pii_found]}\n")

Model Performance

Overall Metrics (v1.3)

Metric	Score
Precision	96.1%
Recall	96.0%
F1 Score	96.1%

Accuracy by Entity Type

Entity	Precision	Recall	F1
NAME	96.2%	96.8%	96.5%
PHONE	97.1%	97.5%	97.3%
DATE	95.8%	96.1%	95.9%
ADDRESS	94.5%	95.2%	94.8%
DOCUMENT_ID	95.3%	94.8%	95.0%
CARD_NUMBER	97.8%	98.2%	98.0%
Overall	96.1%	96.0%	96.1%
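
The F1 column is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded table values. Because the inputs are already rounded, a recomputed value can differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(97.8, 98.2), 1))  # CARD_NUMBER row: 98.0
print(round(f1_score(96.2, 96.8), 1))  # NAME row: 96.5
```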

Test Results (v1.3)

Category	Pass Rate	Notes
Card Numbers	100% (8/8)	UzCard, HUMO, Visa all detected
False Positives	100% (24/24)	Won't flag prices, ages, quantities
Original Text	87%	Standard text with digits
Denormalized	76%	Speech/ASR output with number words

Key Improvements in v1.3:

✅ CARD_NUMBER: 100% detection rate (was 0% in v1.2)
✅ Zero false positives on non-PII numbers (prices, ages, etc.)
✅ Cleaned training data (removed 3,211 misaligned samples)

Training Data

Trained on 475,185 samples from diverse domains:

Source	Samples	Description
Original HuggingFace	~150K	Standard text from rubai-NER-150K-Personal
LLM Generated	~200K	Synthetic data with realistic PII patterns
Card Number Samples	~72K	UzCard, HUMO, Visa in various formats
Denormalized	~50K	Numbers written as words (ASR simulation)
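
To give a feel for what a denormalized sample looks like, here is a minimal sketch that spells out digits one at a time in Uzbek. It is an illustration only: the actual pipeline used to build the training data is not described here, `spell_digits` is a hypothetical helper, and real ASR output may group numbers differently (for example as tens or hundreds).

```python
UZBEK_DIGITS = {
    "0": "nol", "1": "bir", "2": "ikki", "3": "uch", "4": "to'rt",
    "5": "besh", "6": "olti", "7": "yetti", "8": "sakkiz", "9": "to'qqiz",
}

def spell_digits(text: str) -> str:
    """Replace each all-digit token with Uzbek digit words, digit by digit."""
    words = []
    for token in text.split():
        if token.isascii() and token.isdigit():
            words.extend(UZBEK_DIGITS[d] for d in token)
        else:
            words.append(token)  # leave non-numeric tokens unchanged
    return " ".join(words)

print(spell_digits("karta 8600 123"))
# karta sakkiz olti nol nol bir ikki uch
```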

100+ domains covered: banking, healthcare, education, e-commerce, government, travel, telecom, and more.

Use Cases

Data Privacy Compliance - Automatically detect PII before data sharing
Document Redaction - Mask sensitive information in documents
Customer Support - Flag conversations containing personal data
Data Anonymization - Prepare datasets for ML training
Financial Services - Detect card numbers in chat logs
Audit & Monitoring - Track PII in logs and communications

Limitations

Optimized for Latin script Uzbek text (Cyrillic support limited)
Casual addresses are detected when they include a house number (chilonzor 12-uy, yunusobod 5-kvartal 23-uy)
Landmark-only references without a house number are intentionally not tagged as ADDRESS (chorsu yonida is not a full address)
Denormalized (spoken-format) phone numbers are sometimes confused with card numbers
Does not detect: email addresses, URLs (coming in v2)
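
Until email and URL detection is available in the model, a regular-expression pass can be run alongside it. The sketch below is an assumption-laden stopgap, not part of the model: the patterns are deliberately simple, and `EMAIL` and `URL` are hypothetical label names that the model itself does not produce.

```python
import re
from typing import Dict, List

# Simple patterns; they will miss unusual addresses and may over-match.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s,]+")

def find_emails_and_urls(text: str) -> Dict[str, List[str]]:
    """Return email addresses and http(s) URLs found in text."""
    return {
        "EMAIL": EMAIL_RE.findall(text),
        "URL": URL_RE.findall(text),
    }

sample = "Yozing: ali@example.com yoki https://example.com/aloqa saytiga kiring."
print(find_emails_and_urls(sample))
# {'EMAIL': ['ali@example.com'], 'URL': ['https://example.com/aloqa']}
```

The result can be merged with the dictionary returned by `extract_pii` from the usage examples above.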

Changelog

v1.3 (January 2026)

New: Added CARD_NUMBER entity type (7 labels total, including TEXT)
New: Supports UzCard (8600), HUMO (9860), Visa (4), Mastercard (51-55)
New: Detects card numbers in spoken format (sakkiz olti nol nol...)
Improved: Dataset cleaned - removed 3,211 misaligned samples
Improved: Fixed 2,888 invalid/noise labels
Training: 475K samples (up from 400K)
F1 Score: 96.1%

v1.2 (January 2026)

Added denormalized text support for ASR output
Improved address detection
Added more document ID formats (HTA, HAA)

v1.0 (January 2026)

Initial release with 5 entity types
400K training samples
F1 Score: 95.7%

Model Details

Base Model: tahrirchi/tahrirchi-bert-base
Architecture: BERT for Token Classification
Labels: 7 (TEXT, PHONE, DATE, NAME, ADDRESS, DOCUMENT_ID, CARD_NUMBER)
Training: 5 epochs, batch size 32, learning rate 2e-5
Languages: Uzbek (Latin script)
License: CC BY-NC 4.0 (Non-Commercial)

License

This model is released under Apache license 2.0.

Attribution

When using this model, please cite:

Rubai PII Detection v1.3 by Rubai AI
https://huggingface.co/islomov/rubai-PII-detection-v1-latin

Citation

@misc{rubai-pii-detection-2025,
  author = {Sardor Islomov and {Rubai AI}},
  title = {Rubai PII Detection v1.3 - Uzbek Personal Information Detector},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/islomov/rubai-PII-detection-v1-latin}
}

Contact

Email: islomov49@gmail.com

Made with ❤️ by Rubai AI
