Rubai PII Detection v1 (Latin) - Uzbek Personal Information Detector

A BERT-based Named Entity Recognition model for detecting Personally Identifiable Information (PII) in Uzbek text. Trained on 475K+ samples covering both standard text and speech-normalized (ASR) output.

v1.3 Update (January 2026): Added CARD_NUMBER detection for UzCard, HUMO, Visa, and Mastercard. The model now uses 7 labels: 6 PII types plus TEXT for non-PII tokens. See Changelog for details.

Quick Start

from transformers import pipeline

# Load the model
ner = pipeline("ner", model="islomov/rubai-PII-detection-v1-latin", aggregation_strategy="simple")

# Detect PII in text
text = "Sardor Rustamov telefon raqami 90 123 45 67, pasport AA1234567, karta 8600 1234 5678 9012"
results = ner(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']}")

Output:

NAME: Sardor Rustamov
PHONE: 90 123 45 67
DOCUMENT_ID: AA1234567
CARD_NUMBER: 8600 1234 5678 9012

Supported Entity Types

| Entity | Description | Examples |
| --- | --- | --- |
| NAME | Person names | Sardor Rustamov, Karimova Nilufar Shavkatovna |
| PHONE | Phone numbers | 90 123 45 67, +998 91 234 56 78 |
| DATE | Dates | 15-yanvar 2025-yil, 25.12.2024 |
| ADDRESS | Physical addresses | Toshkent shahri Chilonzor tumani 5-mavze 12-uy |
| DOCUMENT_ID | Document numbers | AA1234567, AB9876543, HTA159387 |
| CARD_NUMBER | Bank card numbers | 8600 1234 5678 9012, 9860 9876 5432 1098 |
| TEXT | Non-PII text | Everything else |

Card Number Support (New in v1.3)

Detects major card types used in Uzbekistan:

| Card Type | Prefix | Example |
| --- | --- | --- |
| UzCard | 8600 | 8600 1234 5678 9012 |
| HUMO | 9860 | 9860 9876 5432 1098 |
| Visa | 4 | 4012 8888 8888 1881 |
| Mastercard | 51-55 | 5123 4567 8901 2345 |

Works with both digit format (8600 1234 5678 9012) and spoken format (sakkiz olti nol nol bir ikki uch...).
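
The prefix table above can be turned into a small post-processing helper. The sketch below is not part of the model: `card_type` and `spoken_to_digits` are hypothetical helper names, the prefix rules are copied from the table, and the digit-word mapping assumes the ASR output spells the number one digit at a time.

```python
import re

# Uzbek (Latin script) words for the digits 0-9, stored without apostrophes.
UZ_DIGITS = {
    "nol": "0", "bir": "1", "ikki": "2", "uch": "3", "tort": "4",
    "besh": "5", "olti": "6", "yetti": "7", "sakkiz": "8", "toqqiz": "9",
}

def spoken_to_digits(span: str) -> str:
    """Convert a digit-by-digit spoken number into a digit string."""
    digits = []
    for word in span.lower().split():
        # Drop apostrophe variants so "to'rt" and "toʻrt" both map to "tort".
        key = re.sub(r"['ʻʼ’‘`]", "", word)
        if key in UZ_DIGITS:
            digits.append(UZ_DIGITS[key])
        elif key.isdigit():
            digits.append(key)  # already written as digits
        else:
            raise ValueError(f"not a digit word: {word!r}")
    return "".join(digits)

def card_type(number: str) -> str:
    """Guess the card scheme from its prefix (rules taken from the table above)."""
    digits = re.sub(r"\D", "", number)
    if digits.startswith("8600"):
        return "UzCard"
    if digits.startswith("9860"):
        return "HUMO"
    if digits.startswith("4"):
        return "Visa"
    if digits[:2] in {"51", "52", "53", "54", "55"}:
        return "Mastercard"
    return "unknown"

print(card_type("8600 1234 5678 9012"))
print(card_type(spoken_to_digits("sakkiz olti nol nol bir ikki uch")))
```

Both helpers operate on the text of a detected CARD_NUMBER span, so they can run after the model without changing it.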

Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
model_name = "islomov/rubai-PII-detection-v1-latin"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Predict
text = "Hurmatli Aziza Karimova! Karta: 8600 1234 5678 9012."
inputs = tokenizer(text.split(), is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

# Map predictions to labels, skipping the tokenizer's special tokens
id2label = model.config.id2label
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
special_tokens = set(tokenizer.all_special_tokens)
for token, pred in zip(tokens, predictions[0]):
    if token not in special_tokens:
        print(f"{token}: {id2label[pred.item()]}")
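
The loop above prints one label per sub-token, so a word split into several pieces appears several times. A minimal sketch of collapsing this to one label per input word follows, assuming a fast tokenizer so that `inputs.word_ids()` is available. `word_level_labels` is a hypothetical helper, and the sample lists are hand-written stand-ins for real model output.

```python
from typing import List, Optional

def word_level_labels(word_ids: List[Optional[int]], token_labels: List[str]) -> List[str]:
    """Keep the label of the first sub-token of each word.

    word_ids has one entry per token: the index of the source word,
    or None for special tokens, which are skipped.
    """
    labels: List[str] = []
    seen = set()
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        labels.append(label)
    return labels

# Stand-in for inputs.word_ids() and the per-token predicted labels.
word_ids = [None, 0, 1, 1, 2, None]
token_labels = ["TEXT", "TEXT", "NAME", "NAME", "NAME", "TEXT"]
print(word_level_labels(word_ids, token_labels))
```

Taking the first sub-token's label is one common convention; a majority vote over the sub-tokens would be an equally reasonable choice.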

Extract Entities as Dictionary

from transformers import pipeline

# Load the pipeline once and reuse it; reloading the model on every call is slow.
ner = pipeline("ner", model="islomov/rubai-PII-detection-v1-latin", aggregation_strategy="simple")

def extract_pii(text: str) -> dict:
    """Extract all PII entities from text, grouped by entity type."""
    entities = {}
    for r in ner(text):
        label = r["entity_group"]
        if label != "TEXT":
            # Slice the original text so casing and spacing are preserved exactly.
            entities.setdefault(label, []).append(text[r["start"]:r["end"]])

    return entities

# Example
text = """
Mijoz Rustam Aliyev 2024-yil 15-noyabr kuni 93 456 78 90 raqamidan qo'ng'iroq qildi.
Manzili: Samarqand Registon ko'chasi 10-uy. Pasport: AB7654321.
Karta: 8600 1111 2222 3333.
"""

pii = extract_pii(text)
print(pii)
# {'NAME': ['Rustam Aliyev'], 'DATE': ['2024-yil 15-noyabr'], 'PHONE': ['93 456 78 90'],
#  'ADDRESS': ["Samarqand Registon ko'chasi 10-uy"], 'DOCUMENT_ID': ['AB7654321'],
#  'CARD_NUMBER': ['8600 1111 2222 3333']}
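
Each aggregated pipeline result also carries a `score` field, which makes it possible to drop low-confidence detections before grouping. The sketch below works on a list of result dictionaries rather than calling the model. `group_pii` is a hypothetical helper, and the 0.5 default threshold is an arbitrary illustration, not a value recommended by the model authors.

```python
from collections import defaultdict
from typing import Dict, List, Optional

def group_pii(results: List[dict], min_score: float = 0.5,
              text: Optional[str] = None) -> Dict[str, List[str]]:
    """Group detected spans by label, skipping TEXT and low-confidence spans."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for r in results:
        label = r["entity_group"]
        if label == "TEXT" or r["score"] < min_score:
            continue
        # Prefer slicing the original text; fall back to the decoded word.
        value = text[r["start"]:r["end"]] if text is not None else r["word"]
        if value not in grouped[label]:  # avoid duplicates within a label
            grouped[label].append(value)
    return dict(grouped)
```

Raising `min_score` trades recall for precision, so for redaction workloads a low threshold is usually the safer starting point.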

PII Masking / Anonymization

from transformers import pipeline

# Load the pipeline once and reuse it across calls.
ner = pipeline("ner", model="islomov/rubai-PII-detection-v1-latin", aggregation_strategy="simple")

def mask_pii(text: str, mask_char: str = "█") -> str:
    """Replace each PII span with mask characters of the same length."""
    # Replace from the end of the string so earlier offsets stay valid.
    results = sorted(ner(text), key=lambda x: x["start"], reverse=True)

    masked = text
    for r in results:
        if r["entity_group"] != "TEXT":
            # Use the character span, not len(r["word"]): the decoded word
            # can differ in length from the original text.
            mask = mask_char * (r["end"] - r["start"])
            masked = masked[:r["start"]] + mask + masked[r["end"]:]

    return masked

# Example
text = "Sardor Rustamov telefon 90 123 45 67, karta 8600 1234 5678 9012"
print(mask_pii(text))
# ███████████████ telefon ████████████, karta ███████████████████
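
When the masked text still needs to be readable, replacing each span with its label is a common alternative to block characters. The following sketch takes the text and a list of pipeline-style results, so it runs without loading the model. `redact_with_labels` is a hypothetical helper name and the sample spans are written by hand.

```python
from typing import List

def redact_with_labels(text: str, results: List[dict]) -> str:
    """Replace every non-TEXT span with a placeholder such as [NAME]."""
    spans = sorted(
        (r for r in results if r["entity_group"] != "TEXT"),
        key=lambda r: r["start"],
        reverse=True,  # edit from the end so earlier offsets stay valid
    )
    for r in spans:
        text = text[:r["start"]] + f"[{r['entity_group']}]" + text[r["end"]:]
    return text

text = "Sardor Rustamov telefon 90 123 45 67"
results = [
    {"entity_group": "NAME", "start": 0, "end": 15},
    {"entity_group": "PHONE", "start": 24, "end": 36},
]
print(redact_with_labels(text, results))
# [NAME] telefon [PHONE]
```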

Batch Processing

from transformers import pipeline

ner = pipeline("ner", model="islomov/rubai-PII-detection-v1-latin", aggregation_strategy="simple")

texts = [
    "Sardor Rustamov telefon: 90 111 22 33",
    "Pasport raqami AA1234567",
    "Karta: 8600 1234 5678 9012",
    "Toshkent shahri Yunusobod tumani",
]

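# Pass the whole list in one call; the pipeline returns one result list per input.
# batch_size controls how many texts are sent through the model at once.
for text, entities in zip(texts, ner(texts, batch_size=8)):
    pii_found = [e for e in entities if e["entity_group"] != "TEXT"]
    print(f"Text: {text[:50]}")
    print(f"PII: {[(e['entity_group'], e['word']) for e in pii_found]}\n")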

Model Performance

Overall Metrics (v1.3)

| Metric | Score |
| --- | --- |
| Precision | 96.1% |
| Recall | 96.0% |
| F1 Score | 96.1% |

Accuracy by Entity Type

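| Entity | Precision | Recall | F1 |
| --- | --- | --- | --- |
| NAME | 96.2% | 96.8% | 96.5% |
| PHONE | 97.1% | 97.5% | 97.3% |
| DATE | 95.8% | 96.1% | 95.9% |
| ADDRESS | 94.5% | 95.2% | 94.8% |
| DOCUMENT_ID | 95.3% | 94.8% | 95.0% |
| CARD_NUMBER | 97.8% | 98.2% | 98.0% |
| Overall | 96.1% | 96.0% | 96.1% |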
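
F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the per-entity precision and recall values in the table. It is only an arithmetic illustration: the model card does not state whether the metrics are token-level or entity-level, and the overall row is left out because its averaging method is not specified.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# (precision, recall, reported F1) in percent, copied from the table above.
reported = {
    "NAME": (96.2, 96.8, 96.5),
    "PHONE": (97.1, 97.5, 97.3),
    "DATE": (95.8, 96.1, 95.9),
    "ADDRESS": (94.5, 95.2, 94.8),
    "DOCUMENT_ID": (95.3, 94.8, 95.0),
    "CARD_NUMBER": (97.8, 98.2, 98.0),
}

for entity, (p, r, f) in reported.items():
    print(f"{entity}: computed F1 = {f1(p, r):.1f}%, reported {f}%")
```

For all six per-entity rows the recomputed value agrees with the reported F1 to one decimal place.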

Test Results (v1.3)

| Category | Pass Rate | Notes |
| --- | --- | --- |
| Card Numbers | 100% (8/8) | UzCard, HUMO, Visa all detected |
| False Positives | 100% (24/24) | Won't flag prices, ages, quantities |
| Original Text | 87% | Standard text with digits |
| Denormalized | 76% | Speech/ASR output with number words |

Key Improvements in v1.3:

  • ✅ CARD_NUMBER: 100% detection on the card-number tests (8/8; was 0% in v1.2)
  • ✅ Zero false positives on the 24 non-PII number tests (prices, ages, quantities)
  • ✅ Cleaned training data (removed 3,211 misaligned samples)

Training Data

Trained on 475,185 samples from diverse domains:

| Source | Samples | Description |
| --- | --- | --- |
| Original HuggingFace | ~150K | Standard text from rubai-NER-150K-Personal |
| LLM Generated | ~200K | Synthetic data with realistic PII patterns |
| Card Number Samples | ~72K | UzCard, HUMO, Visa in various formats |
| Denormalized | ~50K | Numbers written as words (ASR simulation) |

100+ domains covered: banking, healthcare, education, e-commerce, government, travel, telecom, and more.

Use Cases

  1. Data Privacy Compliance - Automatically detect PII before data sharing
  2. Document Redaction - Mask sensitive information in documents
  3. Customer Support - Flag conversations containing personal data
  4. Data Anonymization - Prepare datasets for ML training
  5. Financial Services - Detect card numbers in chat logs
  6. Audit & Monitoring - Track PII in logs and communications
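
For the monitoring and support use cases, a per-type tally over many messages is often enough. The sketch below assumes the model has already been run and that each message has its own list of pipeline-style results. `pii_counts` and `flag_messages` are hypothetical helper names.

```python
from collections import Counter
from typing import List

def pii_counts(results_per_message: List[List[dict]]) -> Counter:
    """Count detected entities by type across all messages, ignoring TEXT."""
    counts: Counter = Counter()
    for results in results_per_message:
        counts.update(r["entity_group"] for r in results if r["entity_group"] != "TEXT")
    return counts

def flag_messages(results_per_message: List[List[dict]]) -> List[int]:
    """Return the indices of messages that contain at least one PII span."""
    return [
        i for i, results in enumerate(results_per_message)
        if any(r["entity_group"] != "TEXT" for r in results)
    ]
```

Only labels and message indices are kept, so the summary itself does not store any of the detected personal data.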

Limitations

  • Optimized for Latin-script Uzbek text; Cyrillic support is limited
  • Casual addresses that include a house number are detected (chilonzor 12-uy, yunusobod 5-kvartal 23-uy)
  • Landmark-only references without a house number are intentionally not flagged (chorsu yonida is not a full address)
  • Denormalized phone numbers are sometimes confused with card numbers
  • Email addresses and URLs are not detected (planned for v2)
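
Until email and URL detection is available in the model, a regular-expression pass can cover those two types. The sketch below is a deliberately simplified fallback, not part of the model: the patterns will not match every valid address, the `EMAIL` and `URL` labels are hypothetical, and the returned dictionaries only mimic the shape of the pipeline output.

```python
import re
from typing import List

# Simplified patterns for illustration; not full RFC-compliant matchers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def find_emails_and_urls(text: str) -> List[dict]:
    """Return email and URL spans as pipeline-style dictionaries."""
    found = []
    for label, pattern in (("EMAIL", EMAIL_RE), ("URL", URL_RE)):
        for match in pattern.finditer(text):
            # Trim sentence punctuation that the pattern may have swallowed.
            word = match.group().rstrip(".,;:!?)")
            found.append({
                "entity_group": label,
                "word": word,
                "start": match.start(),
                "end": match.start() + len(word),
            })
    return sorted(found, key=lambda e: e["start"])

sample = "Yozing: aziza@example.uz yoki https://example.uz/aloqa."
for span in find_emails_and_urls(sample):
    print(span["entity_group"], span["word"])
```

Because the spans use the same `start` and `end` keys, they can be concatenated with the model's results before masking.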

Changelog

v1.3 (January 2026)

  • New: Added CARD_NUMBER entity type (6 PII types, 7 labels total including TEXT)
  • New: Supports UzCard (8600), HUMO (9860), Visa (4), Mastercard (51-55)
  • New: Detects card numbers in spoken format (sakkiz olti nol nol...)
  • Improved: Dataset cleaned - removed 3,211 misaligned samples
  • Improved: Fixed 2,888 invalid/noise labels
  • Training: 475K samples (up from 400K)
  • F1 Score: 96.1%

v1.2 (January 2026)

  • Added denormalized text support for ASR output
  • Improved address detection
  • Added more document ID formats (HTA, HAA)

v1.0 (January 2026)

  • Initial release with 5 entity types
  • 400K training samples
  • F1 Score: 95.7%

Model Details

  • Base Model: tahrirchi/tahrirchi-bert-base
  • Architecture: BERT for Token Classification
  • Labels: 7 (TEXT, PHONE, DATE, NAME, ADDRESS, DOCUMENT_ID, CARD_NUMBER)
  • Training: 5 epochs, batch size 32, learning rate 2e-5
  • Languages: Uzbek (Latin script)
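
For readers who want to reproduce a similar fine-tuning setup, the details above map onto the Transformers API roughly as follows. This is a sketch under assumptions, not the authors' training script: the label order is taken from the list above and may not match the checkpoint's real `id2label`, the batch size of 32 is treated as per-device, and `build_model_and_args` and its `output_dir` default are hypothetical.

```python
# Label order copied from the "Labels" line above (assumed, not confirmed).
LABELS = ["TEXT", "PHONE", "DATE", "NAME", "ADDRESS", "DOCUMENT_ID", "CARD_NUMBER"]
id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

# Hyperparameters stated in the model card, keyed by TrainingArguments names.
HYPERPARAMS = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 32,  # assumption: 32 is the per-device size
    "learning_rate": 2e-5,
}

def build_model_and_args(output_dir: str = "rubai-pii-finetune"):
    """Create a token-classification model and training arguments."""
    # Imported lazily so the label maps above can be used without transformers.
    from transformers import AutoModelForTokenClassification, TrainingArguments

    model = AutoModelForTokenClassification.from_pretrained(
        "tahrirchi/tahrirchi-bert-base",
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    args = TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
    return model, args
```

For inference, the authoritative mapping is always `model.config.id2label` from the published checkpoint.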
  • License: CC BY-NC 4.0 (Non-Commercial)

License

This model is released under Apache license 2.0.

Attribution

When using this model, please cite:

Rubai PII Detection v1.3 by Rubai AI
https://huggingface.co/islomov/rubai-PII-detection-v1-latin

Citation

@misc{rubai-pii-detection-2025,
  author = {Sardor Islomov and {Rubai AI}},
  title = {Rubai PII Detection v1.3 - Uzbek Personal Information Detector},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/islomov/rubai-PII-detection-v1-latin}
}

Made with ❤️ by Rubai AI

Model size: 0.1B params (Safetensors, tensor type F32)