Portuguese Word Embeddings
While working on some projects of mine I came to a point where I needed pre-trained word embeddings for Portuguese. I could have trained my own on some corpus, but I did not want to spend time on cleaning the data and running the training, so instead I searched the web for collections of Portuguese word vectors. Here is a compiled list of what I found.
NILC-Embeddings (2017)
A very comprehensive evaluation of different methods and parameters for generating word embeddings, covering both the Brazilian and the European variants of Portuguese. In total, 31 word embedding models were trained with fastText, GloVe, Wang2Vec and Word2Vec, and evaluated intrinsically on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks.
- Domain: Mixed (News, Wiki, Subtitles, literary works, etc.)
- Methods: fastText, GloVe, Wang2Vec and Word2Vec
- “Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language […]”
- Download
- Code
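The NILC models are distributed as plain-text files in the word2vec format, so gensim can load any of them directly. A minimal sketch, assuming you downloaded one of the Skip-Gram models (the filename is illustrative, adjust it to your download):

```python
from gensim.models import KeyedVectors

# Load a NILC model in plain-text word2vec format
# (the filename is an example; adjust to the file you downloaded).
vectors = KeyedVectors.load_word2vec_format("skip_s300.txt")

# The kind of analogy query the models were evaluated on:
# "homem" is to "rei" as "mulher" is to ...?
print(vectors.most_similar(positive=["rei", "mulher"], negative=["homem"], topn=3))
```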
LX-DSemVectors (2018)
The authors apply the Skip-Gram model to a dataset composed mostly of European Portuguese newspapers. I would say that if you want embeddings for the news domain in European Portuguese this is probably a very good choice.
- Domain: News Articles
- Method: Skip-Gram
- “Finely Tuned, 2 Billion Token Based Word Embeddings for Portuguese”
- Download
- Code
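If the downloaded model is in word2vec format, gensim loads it just like the NILC ones; a sketch, assuming a binary file (the filename is made up, and `binary=` should match whatever format you actually get):

```python
from gensim.models import KeyedVectors

# Hypothetical filename; set binary= according to the file format.
vectors = KeyedVectors.load_word2vec_format("lx-dsemvectors.bin", binary=True)

# Nearest neighbours of a news-domain word:
print(vectors.most_similar("jornal", topn=5))
```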
Facebook fastText (2018)
This is the well-known collection of word embeddings published by Facebook Research, trained on Wikipedia and Common Crawl data. It covers 157 languages in total, Portuguese among them.
- Domain: Wikipedia + Common Crawl
- Method: fastText
- “Learning Word Vectors for 157 Languages”
- Download
- Code
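One advantage of the fastText models is that the `.bin` version carries subword information, so you also get vectors for out-of-vocabulary words. A sketch with gensim, using the official filename of the Portuguese Common Crawl model:

```python
from gensim.models.fasttext import load_facebook_vectors

# cc.pt.300.bin is the Portuguese model from the fastText download page.
vectors = load_facebook_vectors("cc.pt.300.bin")

print(vectors.most_similar("computador", topn=5))

# Subword information yields a vector even for out-of-vocabulary
# (e.g. misspelled) words:
print(vectors["computadorr"][:5])
```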
Wikipedia2Vec (2018)
Unlike other word embedding tools, this software package learns embeddings of entities as well as words: the method jointly maps words and entities into the same continuous vector space. They provide such embeddings for 11 languages, including Portuguese.
- Method: Word2Vec
- Domain: Wikipedia
- “Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation”
- Download
- Code
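The project ships its own Python package, which exposes both word and entity vectors from the same model. A sketch, assuming the `wikipedia2vec` package is installed; the filename follows the project's naming scheme but the exact name is an assumption, so adjust it to your download:

```python
from wikipedia2vec import Wikipedia2Vec

# Filename is an assumption based on the project's naming scheme.
wiki2vec = Wikipedia2Vec.load("ptwiki_20180420_300d.pkl")

# Words and entities live in the same vector space:
print(wiki2vec.get_word_vector("lisboa")[:5])
print(wiki2vec.most_similar(wiki2vec.get_entity("Lisboa"), 5))
```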
NLPL word embeddings repository (2017)
The accompanying paper describes it as "a shared repository of large-text resources for creating word vectors, including pre-processed corpora and pre-trained vectors for a range of frameworks and configurations. This will facilitate reuse, rapid experimentation, and replicability of results". The repository contains different types of embeddings for many languages, including embeddings trained on the Portuguese CoNLL17 corpus.
- Methods: Several
- Domain: Several
- “Word vectors, reuse, and replicability: Towards a community repository […]”
- Download
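The models here are also distributed in word2vec format inside an archive, so once again gensim does the job; a sketch, assuming you extracted a binary model file from the archive (the filename is an assumption):

```python
from gensim.models import KeyedVectors

# Filename is an assumption; adjust binary= to the extracted file's format.
vectors = KeyedVectors.load_word2vec_format("model.bin", binary=True)
print(vectors.most_similar("língua", topn=5))
```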
Tags: embeddings, word-embeddings, gensim, fasttext, word2vec, portuguese