# Model Card for company-name-classifier
A fine-tuned DistilBERT model that classifies whether a given text string is a person name or a company/organization name. Trained on 200,000 balanced examples, it achieves ~99.8% accuracy on a held-out test set.
## Model Details

### Model Description
This model is a binary text classifier built on top of distilbert-base-uncased, fine-tuned to distinguish between human personal names (e.g. "John Smith") and company or organization names (e.g. "Goldman Sachs"). It is intended to be used as a lightweight, high-accuracy name-type classifier for data enrichment, deduplication, or entity resolution pipelines.
- Developed by: John Günerli (@johngunerli)
- Funded by: N/A
- Shared by: John Günerli
- Model type: Text Classification (fine-tuned transformer)
- Language(s) (NLP): English (primarily; may generalize partially to other languages)
- License: Apache 2.0
- Finetuned from model: distilbert-base-uncased

### Model Sources
- Repository: https://github.com/johngunerli/company-name-classifier
- Model on Hub: https://huggingface.co/johngunerli/company-name-classifier
## Uses

### Direct Use
This model can be used out-of-the-box to classify short name strings as either a person name or a company/organization name. It is suitable for tasks such as:
- Cleaning or categorizing contact lists or CRM databases
- Entity type detection in data pipelines
- Preprocessing for downstream NER or entity resolution systems
### Downstream Use
The model can be integrated as a pre-processing step in larger NLP pipelines where distinguishing people from organizations is needed — for example, before applying different enrichment or linking logic to each entity type.
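As a sketch of this kind of integration, the snippet below routes each name to type-specific logic based on the classifier's prediction. The `enrich_person` / `enrich_company` functions are hypothetical placeholders, and the `{"label": ..., "score": ...}` dicts simulate the output format of the `text-classification` pipeline shown later in this card.

```python
# Hypothetical sketch: dispatching classifier predictions to per-type logic.
# enrich_person / enrich_company are placeholder names, not part of this model.

def enrich_person(name):
    return {"name": name, "type": "person"}

def enrich_company(name):
    return {"name": name, "type": "company"}

def route(name, prediction):
    """Send a name string to type-specific enrichment based on the predicted label."""
    if prediction["label"] == "person":
        return enrich_person(name)
    return enrich_company(name)

# Simulated pipeline outputs for two inputs:
preds = [
    ("John Smith", {"label": "person", "score": 0.9997}),
    ("Goldman Sachs", {"label": "company", "score": 0.9995}),
]
records = [route(name, p) for name, p in preds]
```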
### Out-of-Scope Use
- Not intended for full sentences, addresses, or job titles — inputs should be short name strings only.
- Not designed for non-English names — performance may degrade on non-Western naming conventions.
- Should not be used to make high-stakes decisions without human review, particularly in cases involving ambiguous names.
## Bias, Risks, and Limitations
- The training data is primarily English-language names. Non-Western or non-Latin-script names may be classified with lower accuracy.
- Ambiguous cases — such as brand names that are also personal names (e.g. "Victoria", "Jordan") — may be misclassified.
- The model is not robust to inputs that are not name strings (e.g. sentences, titles, abbreviations).
- Training data for person names is drawn from a name dataset skewed toward certain cultures and regions, which may introduce regional bias.
### Recommendations
Users should validate model outputs on their specific name distributions before deploying in production. For non-English or non-Western datasets, consider fine-tuning on domain-specific data. Human review is recommended for edge cases and ambiguous inputs.
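One lightweight way to operationalize the human-review recommendation is to flag low-confidence predictions with a score threshold. This is an illustrative sketch, not part of the model card's tooling; the `0.95` cutoff is an arbitrary example value that should be tuned on your own name distribution.

```python
# Illustrative sketch: queueing low-confidence predictions for human review.
# The threshold value is an example, not a recommendation from the model authors.

REVIEW_THRESHOLD = 0.95

def needs_review(prediction, threshold=REVIEW_THRESHOLD):
    """Return True when a prediction's confidence falls below the threshold."""
    return prediction["score"] < threshold

# Simulated pipeline outputs; "Jordan" is an ambiguous brand/person name.
predictions = [
    {"input": "Maria Garcia", "label": "person", "score": 0.9997},
    {"input": "Jordan", "label": "person", "score": 0.61},
]
review_queue = [p for p in predictions if needs_review(p)]
```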
## How to Get Started with the Model

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="johngunerli/company-name-classifier",
)

names = ["John Smith", "Apple Corporation", "Maria Garcia", "Goldman Sachs"]
results = clf(names, truncation=True, max_length=64)
for name, res in zip(names, results):
    print(f"{name:<25s} → {res['label']} ({res['score']:.2%})")
```
Expected output:

```
John Smith                → person (99.97%)
Apple Corporation         → company (99.96%)
Maria Garcia              → person (99.97%)
Goldman Sachs             → company (99.95%)
```
Labels:

| Label | Meaning |
|---|---|
| `person` | A human personal name (e.g. "John Smith") |
| `company` | A company or organization name (e.g. "Goldman Sachs") |
## Training Details

### Training Data

The model was trained on 200,000 balanced examples (100,000 per class) drawn from two publicly available datasets:

- **Person names:** `philipperemy/name-dataset`, a comprehensive collection of first and last names from 100+ countries (~491 million name records). First- and last-name columns were concatenated into a single "First Last" string.
- **Company names:** the People Data Labs Free Company Dataset, a publicly available dataset of ~35 million company records. Only the `name` column was used.

Neither dataset is included in the repository. To retrain, download them separately and update the file paths in the training notebook.
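The example construction described above can be sketched as follows. The column and field names here are illustrative, not the datasets' actual schemas:

```python
# Hedged sketch of the described preprocessing: concatenate first/last name
# columns into "First Last" strings and pair each text with its class label.
# Row shapes and field names are illustrative, not the real dataset schemas.

person_rows = [("John", "Smith"), ("Maria", "Garcia")]
company_rows = ["Goldman Sachs", "Apple Corporation"]

examples = [
    {"text": f"{first} {last}", "label": "person"} for first, last in person_rows
] + [
    {"text": name, "label": "company"} for name in company_rows
]
```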
### Training Procedure

Fine-tuned on Google Colab using an A100 GPU for 3 epochs on 180,000 training examples (90/10 train/test split, balanced across both classes).

#### Preprocessing

- Person name pairs (first + last) were concatenated into a single string.
- Only the `name` column was used from the company dataset.
- Inputs were tokenized with a maximum sequence length of 64 tokens.
#### Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Base model | distilbert-base-uncased |
| Epochs | 3 |
| Train batch size | 64 |
| Learning rate | 2e-5 |
| Max sequence length | 64 tokens |
| Training examples | 180,000 |
- Training regime: fp32
#### Speeds, Sizes, Times
- Trained on Google Colab (A100 GPU)
- 3 epochs over 180,000 training examples
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
Evaluated on 20,000 held-out examples (10% of the full 200,000-example dataset), balanced equally between person and company name classes.
#### Factors
Evaluation was performed on the overall held-out test set. No disaggregation by subpopulation, language origin, or region was reported.
#### Metrics
- Accuracy — proportion of correctly classified examples overall.
- F1 Score — harmonic mean of precision and recall, reported per class.
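For reference, both metrics can be computed directly from raw predictions without external dependencies; the labels and predictions below are made-up toy data, not this model's test set:

```python
# Minimal sketch of the reported metrics: overall accuracy and per-class F1.
# The toy labels below are illustrative and unrelated to the real evaluation.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of its precision and recall."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = ["person", "company", "person", "company"]
y_pred = ["person", "company", "company", "company"]
acc = accuracy(y_true, y_pred)            # 0.75
f1_person = f1(y_true, y_pred, "person")  # 2/3: precision 1.0, recall 0.5
```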
### Results
| Metric | Score |
|---|---|
| Accuracy | 99.80% |
| F1 — person | 99.80% |
| F1 — company | 99.81% |
#### Summary
The model achieves near-perfect accuracy and F1 on the held-out test set, demonstrating strong performance on English person and company name classification.
## Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: NVIDIA A100 (Google Colab)
- Hours used: Not reported
- Cloud Provider: Google Colab
- Compute Region: Not reported
- Carbon Emitted: Not reported
## Technical Specifications

### Model Architecture and Objective
Fine-tuned distilbert-base-uncased with a sequence classification head (2 output classes: person, company). The model reuses the DistilBERT encoder, pre-trained with masked language modeling and knowledge distillation, and adapts it for binary classification via a linear classification layer on top of the [CLS] token representation.
### Compute Infrastructure
Google Colab (cloud-based)
#### Hardware
NVIDIA A100 GPU (via Google Colab)
#### Software
- Hugging Face Transformers
- PyTorch
- Google Colab
## Citation

If you use this model, please consider citing the base model:

```bibtex
@article{Sanh2019DistilBERT,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal={arXiv preprint arXiv:1910.01108},
  year={2019}
}
```
## Model Card Authors
John Gunerli (@johngunerli)
## Model Card Contact
See the GitHub repository for contact and contribution information.