# PortBERT: Navigating the Depths of Portuguese Language Models
PortBERT is a family of RoBERTa-based language models pre-trained from scratch on the Portuguese portion of CulturaX, i.e. the deduplicated and filtered mC4 and OSCAR 23 subsets. The models are designed to offer strong downstream performance on Portuguese NLP tasks while providing insights into the cost-performance tradeoffs of training across hardware backends.
We release two variants:
- PortBERT-base: 126M parameters, trained on 8× A40 GPUs (fp32)
- PortBERT-large: 357M parameters, trained on a TPUv4-128 pod (fp32)
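Both variants are released on HuggingFace and can be used as standard RoBERTa masked language models. The snippet below is a minimal usage sketch; the repository ID is a placeholder, so substitute the actual PortBERT model ID from the Hub.

```python
# Minimal fill-mask sketch with HuggingFace Transformers.
# NOTE: "your-org/portbert-base" is a placeholder repository ID.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "your-org/portbert-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# PortBERT was pre-trained with the masked language modeling objective,
# so fill-mask works out of the box.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask(f"Lisboa é a {tokenizer.mask_token} de Portugal."):
    print(prediction["token_str"], round(prediction["score"], 3))
```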
## Model Details
| Detail | PortBERT-base | PortBERT-large |
|---|---|---|
| Architecture | RoBERTa-base | RoBERTa-large |
| Parameters | ~126M | ~357M |
| Tokenizer | Byte-level BPE (GPT-2 style, 52k vocab) | Same |
| Pretraining corpus | deduplicated mC4 and OSCAR 23 from CulturaX | Same |
| Objective | Masked Language Modeling | Same |
| Training time | ~27 days on 8× A40 | ~6.2 days on TPUv4-128 pod |
| Precision | fp32 | fp32 |
| Framework | fairseq | fairseq |
## Downstream Evaluation (ExtraGLUE)
We evaluate PortBERT on ExtraGLUE, a Portuguese adaptation of GLUE and SuperGLUE tasks. Fine-tuning was conducted with HuggingFace Transformers, using an NNI-based grid search over batch size and learning rate (28 configurations per task). Each task was fine-tuned for up to 10 epochs, and metrics were computed on the validation sets, since ExtraGLUE does not provide held-out test sets.
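As a reference point, the sketch below shows what a single fine-tuning run in this setup looks like with the HuggingFace Trainer. The model and dataset IDs, column names, and the chosen batch size and learning rate are illustrative placeholders, not the exact grid-search configuration.

```python
# Illustrative fine-tuning sketch (placeholders marked; actual runs used an NNI grid search).
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_id = "your-org/portbert-base"                  # placeholder repository ID
dataset = load_dataset("your-org/extraglue", "rte")  # placeholder ExtraGLUE task

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

def tokenize(batch):
    # RTE-style sentence-pair classification; column names follow the GLUE convention.
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

encoded = dataset.map(tokenize, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="portbert-rte",
        num_train_epochs=10,             # "up to 10 epochs"
        per_device_train_batch_size=16,  # one point of the batch-size/LR grid
        learning_rate=2e-5,
    ),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())  # metrics on the validation split
```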
The AVG score is the mean of the following six metrics (a worked example follows this list):
- STSB Spearman
- STSB Pearson
- RTE Accuracy
- WNLI Accuracy
- MRPC Accuracy
- MRPC F1
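A small worked example, using the PortBERT_base row from the results table below:

```python
# AVG is the plain mean of the six metrics listed above,
# shown here for the PortBERT_base row of the results table.
portbert_base = {
    "STSB_Spearman": 87.39,
    "STSB_Pearson": 87.65,
    "RTE_Accuracy": 68.95,
    "WNLI_Accuracy": 60.56,
    "MRPC_Accuracy": 87.75,
    "MRPC_F1": 91.13,
}
avg = sum(portbert_base.values()) / len(portbert_base)
print(f"AVG = {avg:.2f}")  # 80.57, matching the AVG column
```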
## 🧪 Evaluation Results
Legend: Bold = best, italic = second-best per model size.
| Model | STSB_Sp | STSB_Pe | STSB_Mean | RTE_Acc | WNLI_Acc | MRPC_Acc | MRPC_F1 | AVG |
|---|---|---|---|---|---|---|---|---|
| **Large models** | | | | | | | | |
| XLM-RoBERTa_large | 90.00 | 90.27 | 90.14 | 82.31 | 57.75 | 90.44 | 93.31 | 84.01 |
| EuroBERT-610m | 88.46 | 88.59 | 88.52 | 78.34 | 59.15 | 91.91 | 94.20 | 83.44 |
| PortBERT_large | 88.53 | 88.68 | 88.60 | 72.56 | 61.97 | 89.46 | 92.39 | 82.26 |
| BERTimbau_large | 89.40 | 89.61 | 89.50 | 75.45 | 59.15 | 88.24 | 91.55 | 82.23 |
| **Base models** | | | | | | | | |
| RoBERTaLexPT_base | 86.68 | 86.86 | 86.77 | 69.31 | 59.15 | 89.46 | 92.34 | 80.63 |
| PortBERT_base | 87.39 | 87.65 | 87.52 | 68.95 | 60.56 | 87.75 | 91.13 | 80.57 |
| RoBERTaCrawlPT_base | 87.34 | 87.45 | 87.39 | 72.56 | 56.34 | 87.99 | 91.20 | 80.48 |
| BERTimbau_base | 88.39 | 88.60 | 88.50 | 70.40 | 56.34 | 87.25 | 90.97 | 80.32 |
| XLM-RoBERTa_base | 85.75 | 86.09 | 85.92 | 68.23 | 60.56 | 87.75 | 91.32 | 79.95 |
| EuroBERT-210m | 86.54 | 86.62 | 86.58 | 65.70 | 57.75 | 87.25 | 91.00 | 79.14 |
| AlBERTina 100M PTPT | 86.52 | 86.51 | 86.52 | 70.04 | 56.34 | 85.05 | 89.57 | 79.01 |
| AlBERTina 100M PTBR | 85.97 | 85.99 | 85.98 | 68.59 | 56.34 | 85.78 | 89.82 | 78.75 |
| AiBERTa | 83.56 | 83.73 | 83.65 | 64.98 | 56.34 | 82.11 | 86.99 | 76.29 |
| roBERTa PT | 48.06 | 48.51 | 48.29 | 56.68 | 59.15 | 72.06 | 81.79 | 61.04 |
## Fairseq Checkpoint
Get the fairseq checkpoint here.
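For use outside of Transformers, the checkpoint can be loaded through fairseq's RoBERTa hub interface. The sketch below assumes the archive has been extracted to a local directory containing model.pt alongside the dictionary and BPE files; the paths are illustrative.

```python
# Loading the fairseq checkpoint directly (paths are illustrative).
from fairseq.models.roberta import RobertaModel

portbert = RobertaModel.from_pretrained(
    "path/to/portbert-base-fairseq",  # extracted checkpoint directory
    checkpoint_file="model.pt",
)
portbert.eval()  # disable dropout

tokens = portbert.encode("Lisboa é a capital de Portugal.")
features = portbert.extract_features(tokens)  # last-layer hidden states
print(features.shape)  # (1, sequence_length, 768) for the base model
```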
## Citations
If you use PortBERT in your research, please cite the following paper:
@inproceedings{scheible-schmitt-etal-2025-portbert,
author = {Scheible-Schmitt, Raphael and He, Henry and Mendes, Armando B.},
title = {PortBERT: Navigating the Depths of Portuguese Language Models},
booktitle = {Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models},
month = {September},
year = {2025},
address = {Varna, Bulgaria},
publisher = {INCOMA Ltd., Shoumen, BULGARIA},
pages = {59--71},
abstract = {Transformer models dominate modern NLP, but efficient, language-specific models remain scarce. In Portuguese, most focus on scale or accuracy, often neglecting training and deployment efficiency. In the present work, we introduce PortBERT, a family of RoBERTa-based language models for Portuguese, designed to balance performance and efficiency. Trained from scratch on over 450 GB of deduplicated and filtered mC4 and OSCAR23 from CulturaX using fairseq, PortBERT leverages byte-level BPE tokenization and stable pre-training routines across both GPU and TPU processors. We release two variants, PortBERT base and PortBERT large, and evaluate them on ExtraGLUE, a suite of translated GLUE and SuperGLUE tasks. Both models perform competitively, matching or surpassing existing monolingual and multilingual models. Beyond accuracy, we report training and inference times as well as fine-tuning throughput, providing practical insights into model efficiency. PortBERT thus complements prior work by addressing the underexplored dimension of compute-performance tradeoffs in Portuguese NLP. We release all models on Huggingface and provide fairseq checkpoints to support further research and applications.},
url = {https://aclanthology.org/2025.globalnlp-1.8},
doi = {10.26615/978-954-452-105-9-008}
}
## License
MIT License