# Model Card for company-name-classifier
A fine-tuned DistilBERT model that classifies whether a given text string is a person name or a company/organization name. Trained on 200,000 balanced examples, it achieves ~99.8% accuracy on a held-out test set.
## Model Details

### Model Description
This model is a binary text classifier built on top of distilbert-base-uncased, fine-tuned to distinguish between human personal names (e.g. "John Smith") and company or organization names (e.g. "Goldman Sachs"). It is intended to be used as a lightweight, high-accuracy name-type classifier for data enrichment, deduplication, or entity resolution pipelines.
- Developed by: John Günerli (@johngunerli)
- Funded by: N/A
- Shared by: John Günerli
- Model type: Text Classification (fine-tuned transformer)
- Language(s) (NLP): English (primarily; may generalize partially to other languages)
- License: Apache 2.0
- Finetuned from model: distilbert-base-uncased

### Model Sources
- Repository: https://github.com/johngunerli/company-name-classifier
- Model on Hub: https://huggingface.co/johngunerli/company-name-classifier
## Uses

### Direct Use
This model can be used out-of-the-box to classify short name strings as either a person name or a company/organization name. It is suitable for tasks such as:
- Cleaning or categorizing contact lists or CRM databases
- Entity type detection in data pipelines
- Preprocessing for downstream NER or entity resolution systems
### Downstream Use
The model can be integrated as a pre-processing step in larger NLP pipelines where distinguishing people from organizations is needed — for example, before applying different enrichment or linking logic to each entity type.
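As a sketch of this kind of integration, the snippet below routes each name to type-specific logic based on the classifier's prediction. The `enrich_person` / `enrich_company` functions are hypothetical placeholders, and the `{"label": ..., "score": ...}` dicts simulate the output format of the `text-classification` pipeline shown later in this card.

```python
# Hypothetical sketch: dispatching classifier predictions to per-type logic.
# enrich_person / enrich_company are placeholder names, not part of this model.

def enrich_person(name):
    return {"name": name, "type": "person"}

def enrich_company(name):
    return {"name": name, "type": "company"}

def route(name, prediction):
    """Send a name string to type-specific enrichment based on the predicted label."""
    if prediction["label"] == "person":
        return enrich_person(name)
    return enrich_company(name)

# Simulated pipeline outputs for two inputs:
preds = [
    ("John Smith", {"label": "person", "score": 0.9997}),
    ("Goldman Sachs", {"label": "company", "score": 0.9995}),
]
records = [route(name, p) for name, p in preds]
```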
### Out-of-Scope Use
- Not intended for full sentences, addresses, or job titles — inputs should be short name strings only.
- Not designed for non-English names — performance may degrade on non-Western naming conventions.
- Should not be used to make high-stakes decisions without human review, particularly in cases involving ambiguous names.
## Bias, Risks, and Limitations
- The training data is primarily English-language names. Non-Western or non-Latin-script names may be classified with lower accuracy.
- Ambiguous cases — such as brand names that are also personal names (e.g. "Victoria", "Jordan") — may be misclassified.
- The model is not robust to inputs that are not name strings (e.g. sentences, titles, abbreviations).
- Training data for person names is drawn from a name dataset skewed toward certain cultures and regions, which may introduce regional bias.
### Recommendations
Users should validate model outputs on their specific name distributions before deploying in production. For non-English or non-Western datasets, consider fine-tuning on domain-specific data. Human review is recommended for edge cases and ambiguous inputs.
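One lightweight way to operationalize the human-review recommendation is to flag low-confidence predictions with a score threshold. This is an illustrative sketch, not part of the model card's tooling; the `0.95` cutoff is an arbitrary example value that should be tuned on your own name distribution.

```python
# Illustrative sketch: queueing low-confidence predictions for human review.
# The threshold value is an example, not a recommendation from the model authors.

REVIEW_THRESHOLD = 0.95

def needs_review(prediction, threshold=REVIEW_THRESHOLD):
    """Return True when a prediction's confidence falls below the threshold."""
    return prediction["score"] < threshold

# Simulated pipeline outputs; "Jordan" is an ambiguous brand/person name.
predictions = [
    {"input": "Maria Garcia", "label": "person", "score": 0.9997},
    {"input": "Jordan", "label": "person", "score": 0.61},
]
review_queue = [p for p in predictions if needs_review(p)]
```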
## How to Get Started with the Model

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="johngunerli/company-name-classifier",
)

names = ["John Smith", "Apple Corporation", "Maria Garcia", "Goldman Sachs"]
results = clf(names, truncation=True, max_length=64)
for name, res in zip(names, results):
    print(f"{name:<25s} → {res['label']} ({res['score']:.2%})")
```
Expected output:

```
John Smith                → person (99.97%)
Apple Corporation         → company (99.96%)
Maria Garcia              → person (99.97%)
Goldman Sachs             → company (99.95%)
```
Labels:

| Label | Meaning |
|---|---|
| `person` | A human personal name (e.g. "John Smith") |
| `company` | A company or organization name (e.g. "Goldman Sachs") |
## Training Details

### Training Data

The model was trained on 200,000 balanced examples (100,000 per class) drawn from two publicly available datasets:

- **Person names:** `philipperemy/name-dataset`, a comprehensive collection of first and last names from 100+ countries (~491 million name records). First- and last-name columns were concatenated into a single "First Last" string.
- **Company names:** the People Data Labs Free Company Dataset, a publicly available dataset of ~35 million company records. Only the `name` column was used.

Neither dataset is included in the repository. To retrain, download them separately and update the file paths in the training notebook.
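The example construction described above can be sketched as follows. The column and field names here are illustrative, not the datasets' actual schemas:

```python
# Hedged sketch of the described preprocessing: concatenate first/last name
# columns into "First Last" strings and pair each text with its class label.
# Row shapes and field names are illustrative, not the real dataset schemas.

person_rows = [("John", "Smith"), ("Maria", "Garcia")]
company_rows = ["Goldman Sachs", "Apple Corporation"]

examples = [
    {"text": f"{first} {last}", "label": "person"} for first, last in person_rows
] + [
    {"text": name, "label": "company"} for name in company_rows
]
```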
### Training Procedure

Fine-tuned on Google Colab using an A100 GPU for 3 epochs on 180,000 training examples (90/10 train/test split, balanced across both classes).

#### Preprocessing

- Person name pairs (first + last) were concatenated into a single string.
- Only the `name` column was used from the company dataset.
- Inputs were tokenized with a maximum sequence length of 64 tokens.
#### Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Base model | distilbert-base-uncased |
| Epochs | 3 |
| Train batch size | 64 |
| Learning rate | 2e-5 |
| Max sequence length | 64 tokens |
| Training examples | 180,000 |
- Training regime: fp32
#### Speeds, Sizes, Times
- Trained on Google Colab (A100 GPU)
- 3 epochs over 180,000 training examples
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
Evaluated on 20,000 held-out examples (10% of the full 200,000-example dataset), balanced equally between person and company name classes.
#### Factors
Evaluation was performed on the overall held-out test set. No disaggregation by subpopulation, language origin, or region was reported.
#### Metrics
- Accuracy — proportion of correctly classified examples overall.
- F1 Score — harmonic mean of precision and recall, reported per class.
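For reference, both metrics can be computed directly from raw predictions without external dependencies; the labels and predictions below are made-up toy data, not this model's test set:

```python
# Minimal sketch of the reported metrics: overall accuracy and per-class F1.
# The toy labels below are illustrative and unrelated to the real evaluation.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of its precision and recall."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = ["person", "company", "person", "company"]
y_pred = ["person", "company", "company", "company"]
acc = accuracy(y_true, y_pred)            # 0.75
f1_person = f1(y_true, y_pred, "person")  # 2/3: precision 1.0, recall 0.5
```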
### Results
| Metric | Score |
|---|---|
| Accuracy | 99.80% |
| F1 — person | 99.80% |
| F1 — company | 99.81% |
#### Summary
The model achieves near-perfect accuracy and F1 on the held-out test set, demonstrating strong performance on English person and company name classification.
## Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: NVIDIA A100 (Google Colab)
- Hours used: Not reported
- Cloud Provider: Google Colab
- Compute Region: Not reported
- Carbon Emitted: Not reported
## Technical Specifications

### Model Architecture and Objective
Fine-tuned distilbert-base-uncased with a sequence classification head (2 output classes: person, company). The model reuses the DistilBERT encoder, pre-trained with masked language modeling and knowledge distillation, and adapts it for binary classification via a linear classification layer on top of the [CLS] token representation.
### Compute Infrastructure
Google Colab (cloud-based)
#### Hardware
NVIDIA A100 GPU (via Google Colab)
#### Software
- Hugging Face Transformers
- PyTorch
- Google Colab
## Citation

If you use this model, please consider citing the base model:

```bibtex
@article{Sanh2019DistilBERT,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal={arXiv preprint arXiv:1910.01108},
  year={2019}
}
```
## Model Card Authors
John Gunerli (@johngunerli)
## Model Card Contact
See the GitHub repository for contact and contribution information.