Portuguese Word Embeddings
While working on some projects of mine I came to a point where I needed pre-trained word embeddings for Portuguese. I could have trained my own on some corpus, but I did not want to spend time on cleaning the data and running the training, so instead I searched the web for collections of Portuguese word vectors. Here is a compiled list of what I found.
NILC-Embeddings (2017)
A very comprehensive evaluation of different methods and parameters for generating word embeddings, covering both the Brazilian and the European variants of Portuguese. In total, 31 word embedding models were trained with fastText, GloVe, Wang2Vec and Word2Vec, and evaluated intrinsically on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks.
- Domain: Mixed (News, Wiki, Subtitles, literary works, etc.)
- Methods: fastText, GloVe, Wang2Vec and Word2Vec
- “Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language […]”
- Download
- Code
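The NILC models are distributed as plain-text files in the word2vec format, so gensim can load any of them directly. A minimal sketch, assuming you downloaded one of the Skip-Gram models (the filename is illustrative, adjust it to your download):

```python
from gensim.models import KeyedVectors

# Load a NILC model in plain-text word2vec format
# (the filename is an example; adjust to the file you downloaded).
vectors = KeyedVectors.load_word2vec_format("skip_s300.txt")

# The kind of analogy query the models were evaluated on:
# "homem" is to "rei" as "mulher" is to ...?
print(vectors.most_similar(positive=["rei", "mulher"], negative=["homem"], topn=3))
```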
LX-DSemVectors (2018)
The authors apply the Skip-Gram model to a dataset composed mostly of European Portuguese newspapers. I would say that if you want embeddings for the news domain in European Portuguese this is probably a very good choice.
- Domain: News Articles
- Method: Skip-Gram
- “Finely Tuned, 2 Billion Token Based Word Embeddings for Portuguese”
- Download
- Code
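If the downloaded model is in word2vec format, gensim loads it just like the NILC ones; a sketch, assuming a binary file (the filename is made up, and `binary=` should match whatever format you actually get):

```python
from gensim.models import KeyedVectors

# Hypothetical filename; set binary= according to the file format.
vectors = KeyedVectors.load_word2vec_format("lx-dsemvectors.bin", binary=True)

# Nearest neighbours of a news-domain word:
print(vectors.most_similar("jornal", topn=5))
```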
Facebook fastText (2018)
This is the well-known collection of word embeddings published by Facebook Research, trained on Wikipedia and Common Crawl data. It covers 157 languages in total, Portuguese among them.
- Domain: Wikipedia + Common Crawl
- Method: fastText
- “Learning Word Vectors for 157 Languages”
- Download
- Code
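One advantage of the fastText models is that the `.bin` version carries subword information, so you also get vectors for out-of-vocabulary words. A sketch with gensim, using the official filename of the Portuguese Common Crawl model:

```python
from gensim.models.fasttext import load_facebook_vectors

# cc.pt.300.bin is the Portuguese model from the fastText download page.
vectors = load_facebook_vectors("cc.pt.300.bin")

print(vectors.most_similar("computador", topn=5))

# Subword information yields a vector even for out-of-vocabulary
# (e.g. misspelled) words:
print(vectors["computadorr"][:5])
```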
Wikipedia2Vec (2018)
Unlike other word embedding tools, this software package learns embeddings of entities as well as words: the method jointly maps words and entities into the same continuous vector space. They provide such embeddings for 11 languages, including Portuguese.
- Method: Word2Vec
- Domain: Wikipedia
- “Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation”
- Download
- Code
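The project ships its own Python package, which exposes both word and entity vectors from the same model. A sketch, assuming the `wikipedia2vec` package is installed; the filename follows the project's naming scheme but the exact name is an assumption, so adjust it to your download:

```python
from wikipedia2vec import Wikipedia2Vec

# Filename is an assumption based on the project's naming scheme.
wiki2vec = Wikipedia2Vec.load("ptwiki_20180420_300d.pkl")

# Words and entities live in the same vector space:
print(wiki2vec.get_word_vector("lisboa")[:5])
print(wiki2vec.most_similar(wiki2vec.get_entity("Lisboa"), 5))
```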
NLPL word embeddings repository (2017)
The accompanying paper describes it as "a shared repository of large-text resources for creating word vectors, including pre-processed corpora and pre-trained vectors for a range of frameworks and configurations. This will facilitate reuse, rapid experimentation, and replicability of results". The repository contains different types of embeddings for many languages, including embeddings trained on the Portuguese CoNLL17 corpus.
- Methods: Several
- Domain: Several
- “Word vectors, reuse, and replicability: Towards a community repository […]”
- Download
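The models here are also distributed in word2vec format inside an archive, so once again gensim does the job; a sketch, assuming you extracted a binary model file from the archive (the filename is an assumption):

```python
from gensim.models import KeyedVectors

# Filename is an assumption; adjust binary= to the extracted file's format.
vectors = KeyedVectors.load_word2vec_format("model.bin", binary=True)
print(vectors.most_similar("língua", topn=5))
```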
Tags: embeddings, word-embeddings, gensim, fasttext, word2vec, portuguese