# NILC Portuguese Word Embeddings – FastText Skip-Gram 600d
Pretrained **static word embeddings** for **Portuguese** (Brazilian + European), trained by the [NILC group](http://nilc.icmc.usp.br/) on a large multi-genre corpus (~1.39B tokens, 17 sources).
This repository contains the **FastText Skip-Gram 600d** model in safetensors format.
---
## Files
- `embeddings.safetensors` – word vectors (`[vocab_size, 600]`)
- `vocab.txt` – vocabulary (one token per line, aligned with matrix rows)
---
## Usage
```python
from safetensors.numpy import load_file

# Load the embedding matrix: shape [vocab_size, 600].
data = load_file("embeddings.safetensors")
vectors = data["embeddings"]

# vocab.txt is row-aligned: line i holds the token for row i of the matrix.
with open("vocab.txt", encoding="utf-8") as f:
    vocab = [w.strip() for w in f]
word2idx = {w: i for i, w in enumerate(vocab)}

print(vectors[word2idx["rei"]])  # vector for "rei" ("king")
```
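As a quick sanity check, you can look up nearest neighbors by cosine similarity. Below is a minimal sketch that reuses `vectors`, `vocab`, and `word2idx` from the snippet above; the `nearest` helper is illustrative, not part of this release.

```python
import numpy as np

def nearest(word, k=5):
    # Normalize all rows once so cosine similarity becomes a dot product.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-8, None)
    sims = unit @ unit[word2idx[word]]
    # Take k + 1 hits and drop the query word itself.
    top = np.argsort(-sims)[: k + 1]
    return [(vocab[i], float(sims[i])) for i in top if vocab[i] != word][:k]

print(nearest("rei"))  # neighbors of "rei" ("king")
```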
To load the same file in PyTorch:
```python
from safetensors.torch import load_file
tensors = load_file("embeddings.safetensors")
vectors = tensors["embeddings"] # torch.Tensor
```
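If the vectors feed a downstream model, the matrix can be wrapped in an embedding layer. A minimal sketch, assuming the `word2idx` mapping built in the NumPy example:

```python
import torch

# freeze=True keeps the pretrained vectors fixed during training.
embedding = torch.nn.Embedding.from_pretrained(vectors, freeze=True)
ids = torch.tensor([word2idx["rei"]])
print(embedding(ids).shape)  # torch.Size([1, 600])
```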
---
## Reference
```bibtex
@inproceedings{hartmann-etal-2017-portuguese,
  title     = {{P}ortuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks},
  author    = {Hartmann, Nathan and Fonseca, Erick and Shulby, Christopher and Treviso, Marcos and Silva, J{\'e}ssica and Alu{\'i}sio, Sandra},
  year      = 2017,
  month     = oct,
  booktitle = {Proceedings of the 11th {B}razilian Symposium in Information and Human Language Technology},
  publisher = {Sociedade Brasileira de Computa{\c{c}}{\~a}o},
  address   = {Uberl{\^a}ndia, Brazil},
  pages     = {122--131},
  url       = {https://aclanthology.org/W17-6615/},
  editor    = {Paetzold, Gustavo Henrique and Pinheiro, Vl{\'a}dia}
}
```
---
## License
Creative Commons Attribution 4.0 International (CC BY 4.0)