PortBERT: Navigating the Depths of Portuguese Language Models

PortBERT is a family of RoBERTa-based language models pre-trained from scratch on the Portuguese portions of the deduplicated mC4 and OSCAR23 subsets of CulturaX. The models are designed to offer strong downstream performance on Portuguese NLP tasks while providing insight into the cost-performance tradeoffs of training across hardware backends.

We release two variants:

  • PortBERT-base: 126M parameters, trained on 8× A40 GPUs (fp32)
  • PortBERT-large: 357M parameters, trained on a TPUv4-128 pod (fp32)
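
Both variants are published on the HuggingFace Hub. A minimal loading sketch is shown below; the repository ID PortBERT/PortBERT_base is taken from this model page, and the ID for the large variant is only assumed to follow the same naming pattern.

```python
# Minimal loading sketch. "PortBERT/PortBERT_base" is the repo ID of this model card;
# "PortBERT/PortBERT_large" is assumed to follow the same naming pattern.
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "PortBERT/PortBERT_base"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)

inputs = tokenizer("PortBERT é um modelo de linguagem para o português.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```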

Model Details

| Detail | PortBERT-base | PortBERT-large |
|---|---|---|
| Architecture | RoBERTa-base | RoBERTa-large |
| Parameters | ~126M | ~357M |
| Tokenizer | GPT-2-style byte-level BPE (52k vocab) | same |
| Pretraining corpus | deduplicated mC4 and OSCAR23 from CulturaX | same |
| Objective | Masked Language Modeling | same |
| Training time | ~27 days on 8× A40 | ~6.2 days on TPUv4-128 pod |
| Precision | fp32 | fp32 |
| Framework | fairseq | fairseq |
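
Since pre-training uses the masked language modeling objective, the model can be probed directly with a fill-mask pipeline. The sketch below assumes the converted checkpoint exposes the standard RoBERTa mask token (<mask>); the example sentence is purely illustrative.

```python
# Fill-mask sketch for the MLM objective (assumes a RoBERTa-style <mask> token).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="PortBERT/PortBERT_base")
for prediction in fill_mask("Lisboa é a <mask> de Portugal."):
    print(prediction["token_str"], round(prediction["score"], 3))
```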

Downstream Evaluation (ExtraGLUE)

We evaluate PortBERT on ExtraGLUE, a Portuguese adaptation of GLUE and SuperGLUE tasks. Fine-tuning was conducted with HuggingFace Transformers, using an NNI-based grid search over batch size and learning rate (28 configurations per task), with up to 10 epochs per task. Because ExtraGLUE lacks held-out test sets, all metrics were computed on the validation sets.
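
For illustration, a minimal fine-tuning sketch in this spirit is shown below. The dataset identifier is hypothetical, the MRPC-style sentence-pair columns are an assumption, and the single batch-size/learning-rate pair stands in for one point of the grid; the actual ExtraGLUE data loading, NNI search space, and metric computation are not reproduced here.

```python
# Fine-tuning sketch with HuggingFace Transformers (illustrative only).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_id = "PortBERT/PortBERT_base"
dataset = load_dataset("extraglue_mrpc_pt")   # hypothetical ExtraGLUE task identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize(batch):
    # MRPC-style sentence-pair input; the column names are assumptions.
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

dataset = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

args = TrainingArguments(
    output_dir="portbert-mrpc",
    num_train_epochs=10,              # up to 10 epochs, as in the evaluation setup
    per_device_train_batch_size=16,   # one point of the batch-size/learning-rate grid
    learning_rate=2e-5,
)
trainer = Trainer(
    model=model,
    args=args,
    data_collator=DataCollatorWithPadding(tokenizer),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
print(trainer.evaluate())
```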

The AVG score averages the following metrics (a short computation sketch follows the list):

  • STSB Spearman
  • STSB Pearson
  • RTE Accuracy
  • WNLI Accuracy
  • MRPC Accuracy
  • MRPC F1
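
For concreteness, a short sketch of the averaging, using PortBERT_base's scores from the results table below as inputs:

```python
# AVG is the unweighted mean of the six metrics listed above.
# The values are PortBERT_base's scores from the results table.
scores = {
    "STSB_Spearman": 87.39,
    "STSB_Pearson": 87.65,
    "RTE_Accuracy": 68.95,
    "WNLI_Accuracy": 60.56,
    "MRPC_Accuracy": 87.75,
    "MRPC_F1": 91.13,
}
avg = sum(scores.values()) / len(scores)
print(f"AVG = {avg:.2f}")  # 80.57, matching the table
```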

🧪 Evaluation Results

Legend: Bold = best, italic = second-best per model size.

| Model | STSB_Sp | STSB_Pe | STSB_Mean | RTE_Acc | WNLI_Acc | MRPC_Acc | MRPC_F1 | AVG |
|---|---|---|---|---|---|---|---|---|
| **Large models** | | | | | | | | |
| XLM-RoBERTa_large | **90.00** | **90.27** | **90.14** | **82.31** | 57.75 | *90.44* | *93.31* | **84.01** |
| EuroBERT-610m | 88.46 | 88.59 | 88.52 | *78.34* | *59.15* | **91.91** | **94.20** | *83.44* |
| PortBERT_large | 88.53 | 88.68 | 88.60 | 72.56 | **61.97** | 89.46 | 92.39 | 82.26 |
| BERTimbau_large | *89.40* | *89.61* | *89.50* | 75.45 | *59.15* | 88.24 | 91.55 | 82.23 |
| **Base models** | | | | | | | | |
| RoBERTaLexPT_base | 86.68 | 86.86 | 86.77 | 69.31 | *59.15* | **89.46** | **92.34** | **80.63** |
| PortBERT_base | *87.39* | *87.65* | *87.52* | 68.95 | **60.56** | 87.75 | 91.13 | *80.57* |
| RoBERTaCrawlPT_base | 87.34 | 87.45 | 87.39 | **72.56** | 56.34 | *87.99* | 91.20 | 80.48 |
| BERTimbau_base | **88.39** | **88.60** | **88.50** | *70.40* | 56.34 | 87.25 | 90.97 | 80.32 |
| XLM-RoBERTa_base | 85.75 | 86.09 | 85.92 | 68.23 | **60.56** | 87.75 | *91.32* | 79.95 |
| EuroBERT-210m | 86.54 | 86.62 | 86.58 | 65.70 | 57.75 | 87.25 | 91.00 | 79.14 |
| AlBERTina 100M PTPT | 86.52 | 86.51 | 86.52 | 70.04 | 56.34 | 85.05 | 89.57 | 79.01 |
| AlBERTina 100M PTBR | 85.97 | 85.99 | 85.98 | 68.59 | 56.34 | 85.78 | 89.82 | 78.75 |
| AiBERTa | 83.56 | 83.73 | 83.65 | 64.98 | 56.34 | 82.11 | 86.99 | 76.29 |
| roBERTa PT | 48.06 | 48.51 | 48.29 | 56.68 | *59.15* | 72.06 | 81.79 | 61.04 |

Fairseq Checkpoint

Get the fairseq checkpoint here.
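
For users who want to stay in fairseq, the checkpoint can be loaded through fairseq's RoBERTa hub interface. A minimal sketch follows; the directory and checkpoint file name are placeholders for wherever the downloaded checkpoint is stored.

```python
# Sketch of loading the released fairseq checkpoint (paths are placeholders).
from fairseq.models.roberta import RobertaModel

portbert = RobertaModel.from_pretrained(
    "/path/to/portbert_base",        # directory with the checkpoint and dictionary
    checkpoint_file="model.pt",      # placeholder checkpoint file name
)
portbert.eval()  # disable dropout for deterministic feature extraction

tokens = portbert.encode("PortBERT é um modelo de linguagem para o português.")
features = portbert.extract_features(tokens)
print(features.shape)  # (1, sequence_length, hidden_size)
```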

Citations

If you use PortBERT in your research, please cite the following paper:

@inproceedings{scheible-schmitt-etal-2025-portbert,
  author = {Scheible-Schmitt, Raphael and He, Henry and Mendes, Armando B.},
  title = {PortBERT: Navigating the Depths of Portuguese Language Models},
  booktitle = {Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models},
  month = {September},
  year = {2025},
  address = {Varna, Bulgaria},
  publisher = {INCOMA Ltd., Shoumen, BULGARIA},
  pages = {59--71},
  abstract = {Transformer models dominate modern NLP, but efficient, language-specific models remain scarce. In Portuguese, most focus on scale or accuracy, often neglecting training and deployment efficiency. In the present work, we introduce PortBERT, a family of RoBERTa-based language models for Portuguese, designed to balance performance and efficiency. Trained from scratch on over 450 GB of deduplicated and filtered mC4 and OSCAR23 from CulturaX using fairseq, PortBERT leverages byte-level BPE tokenization and stable pre-training routines across both GPU and TPU processors. We release two variants, PortBERT base and PortBERT large, and evaluate them on ExtraGLUE, a suite of translated GLUE and SuperGLUE tasks. Both models perform competitively, matching or surpassing existing monolingual and multilingual models. Beyond accuracy, we report training and inference times as well as fine-tuning throughput, providing practical insights into model efficiency. PortBERT thus complements prior work by addressing the underexplored dimension of compute-performance tradeoffs in Portuguese NLP. We release all models on Huggingface and provide fairseq checkpoints to support further research and applications.},
  url = {https://aclanthology.org/2025.globalnlp-1.8},
  doi = {10.26615/978-954-452-105-9-008}
}

📜 License

MIT License
