Google's SyntaxNet in Python NLTK
In May 2016 Google released SyntaxNet, a syntactic parser whose performance beat previously proposed approaches.
In this post I will show you how to get SyntaxNet’s syntactic dependencies and other morphological information into Python, and specifically how to load SyntaxNet’s output into NLTK structures such as DependencyGraph and Tree.
In this example I will use the Portuguese model but, as you will see, the approach can easily be adapted to any language for which you have a pre-trained model.
Setup
First you need to install SyntaxNet:
https://github.com/tensorflow/models/tree/master/syntaxnet
Then you need to download a pre-trained model from the list of available models:
http://download.tensorflow.org/models/parsey_universal/<language>.zip
As the authors show in the tutorial, after installing SyntaxNet and downloading a pre-trained model, one can parse a sentence with the following command:
MODEL_DIRECTORY=/where/you/unzipped/the/model/files
cat sentences.txt | syntaxnet/models/parsey_universal/parse.sh \
$MODEL_DIRECTORY > output.conll
Now I will show you how to parse a file with one sentence per line and use the result within Python NLTK.
cat sentences.txt
Quase 900 funcionários do Departamento de Estado assinaram memorando que critica Trump.
Meo, Nos e Vodafone arriscam-se a ter de baixar preços a milhões de clientes.
First, we load all the sentences into a list and join them into a single string, separated by the newline character ‘\n’.
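A minimal sketch of this loading step, assuming the input file is UTF-8 encoded with one sentence per line. The function name `load_sentences` is hypothetical, chosen only for illustration.

```python
from pathlib import Path


def load_sentences(path):
    """Read one sentence per line and return them joined by newlines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Drop empty lines and surrounding whitespace so the parser
    # receives exactly one sentence per line.
    sentences = [line.strip() for line in lines if line.strip()]
    return "\n".join(sentences)
```

Calling `load_sentences("sentences.txt")` would then return the string that is handed to the parser.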
Then we use Python’s subprocess module to call SyntaxNet, feed it the loaded sentences, and capture the parsed sentences from stdout.
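One way this call could look is sketched below. The helper `run_parser` and the two path constants are hypothetical and must be adapted to your installation. The sketch assumes that parse.sh is started from the SyntaxNet root directory, reads sentences from stdin and writes CoNLL lines to stdout, as in the shell command shown earlier.

```python
import subprocess


def run_parser(text, command, cwd=None):
    """Send `text` to `command` on stdin and return its stdout as a string."""
    result = subprocess.run(
        command,
        input=text.encode("utf-8"),  # the sentences are passed on stdin
        stdout=subprocess.PIPE,      # the parsed output is captured from stdout
        stderr=subprocess.PIPE,      # keep log messages out of the parsed output
        cwd=cwd,
        check=True,                  # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode("utf-8")


# Hypothetical paths: adjust both to your own installation.
SYNTAXNET_ROOT = "/where/you/installed/models/syntaxnet"
MODEL_DIRECTORY = "/where/you/unzipped/the/model/files"
PARSE_COMMAND = ["syntaxnet/models/parsey_universal/parse.sh", MODEL_DIRECTORY]

# Example call, commented out because it needs a working SyntaxNet install:
# conll_output = run_parser("Uma frase.\nOutra frase.", PARSE_COMMAND, cwd=SYNTAXNET_ROOT)
```

Passing the command as a list avoids shell quoting problems, and `check=True` makes a failed parser run visible instead of silently returning an empty string.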
We then process the captured stdout to extract, for each token, its dependencies and other morphological information. Each token is represented by a list holding all its syntactic and morphological information, and a list of such lists makes up a sentence.
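The processing step could be sketched as follows, assuming the output uses the CoNLL layout of tab-separated fields, one token per line, with a blank line between sentences. The function name `parse_conll` is hypothetical, and the sample rows are hand-written for illustration rather than captured from SyntaxNet.

```python
def parse_conll(output):
    """Turn CoNLL text into a list of sentences, each a list of token rows."""
    sentences, current = [], []
    for line in output.splitlines():
        if not line.strip():
            # A blank line marks the end of a sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Each token line holds tab-separated fields (ID, FORM, ..., HEAD, DEPREL, ...).
        current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


# Hand-written rows for illustration, not captured SyntaxNet output.
sample = (
    "1\tUma\t_\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tgalinha\t_\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tpode\t_\tVERB\t_\t_\t0\tROOT\t_\t_\n"
    "\n"
)
parsed = parse_conll(sample)
print(parsed[0][1])
```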
We then join the information for each word/token into a single string, with the fields separated by a tab character (‘\t’) and each word/token on its own line.
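A small sketch of this joining step. The name `to_conll_string` is hypothetical, and the token rows are hand-written for illustration.

```python
def to_conll_string(sentence):
    """Join the fields of each token with tabs and the tokens with newlines."""
    return "\n".join("\t".join(token) for token in sentence)


# Hand-written token rows for illustration, not captured SyntaxNet output.
sentence = [
    ["1", "Uma", "_", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "galinha", "_", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
    ["3", "pode", "_", "VERB", "_", "_", "0", "ROOT", "_", "_"],
]
conll_string = to_conll_string(sentence)
print(conll_string)
```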
We then pass this string to NLTK’s DependencyGraph, from which we can get all the dependency triples or an ASCII rendering of the tree.
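The sketch below shows this last step on a hand-written three-token fragment, not on real SyntaxNet output. It assumes 10-column rows and that the root token carries the relation label "ROOT"; if your model emits a different label, pass it as `top_relation_label` so that NLTK can locate the root.

```python
from nltk.parse.dependencygraph import DependencyGraph

# Hand-written 10-column rows for illustration, not captured SyntaxNet output.
conll_string = (
    "1\tUma\t_\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tgalinha\t_\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tpode\t_\tVERB\t_\t_\t0\tROOT\t_\t_"
)

# top_relation_label must match the label used for the root token in the
# input; "ROOT" is an assumption here.
graph = DependencyGraph(conll_string, top_relation_label="ROOT")

# Each triple is ((head word, head tag), relation, (dependent word, dependent tag)).
triples = list(graph.triples())
for triple in triples:
    print(triple)

# tree() returns an nltk Tree, which can draw itself as ASCII art.
graph.tree().pretty_print()
```

With 10-column input, NLTK reads the fourth column as the coarse tag, which is the tag that appears in the triples.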
For the first sentence we have the following triples and tree:
((u'assinaram', u'VERB'), u'nsubj', (u'funcion\xe1rios', u'NOUN'))
((u'funcion\xe1rios', u'NOUN'), u'nummod', (u'900', u'NUM'))
((u'900', u'NUM'), u'advmod', (u'Quase', u'ADV'))
((u'funcion\xe1rios', u'NOUN'), u'name', (u'Departamento', u'PROPN'))
((u'Departamento', u'PROPN'), u'case', (u'do', u'ADP'))
((u'funcion\xe1rios', u'NOUN'), u'name', (u'Estado', u'PROPN'))
((u'Estado', u'PROPN'), u'case', (u'de', u'ADP'))
((u'assinaram', u'VERB'), u'ccomp', (u'memorando', u'VERB'))
((u'memorando', u'VERB'), u'ccomp', (u'critica', u'VERB'))
((u'critica', u'VERB'), u'mark', (u'que', u'SCONJ'))
((u'critica', u'VERB'), u'dobj', (u'Trump.', u'PROPN'))
                   assinaram
            ___________|____________
      funcionários             memorando
  __________|__________            |
 900  Departamento  Estado      critica
  |        |          |      ______|______
Quase      do         de    que        Trump.
And for a second example sentence, which going by the tokens below is not the second line of the sentences.txt shown above:
((u'pode', u'VERB'), u'nsubj', (u'galinha', u'NOUN'))
((u'galinha', u'NOUN'), u'det', (u'Uma', u'DET'))
((u'pode', u'VERB'), u'dobj', (u'ovos', u'NOUN'))
((u'ovos', u'NOUN'), u'case', (u'por', u'ADP'))
((u'ovos', u'NOUN'), u'nummod', (u'250', u'NUM'))
((u'ovos', u'NOUN'), u'nmod', (u'ano.', u'NOUN'))
((u'ano.', u'NOUN'), u'case', (u'por', u'ADP'))
         pode
   _______|_______
   |            ovos
   |       ______|______
galinha    |     |    ano.
   |       |     |     |
  Uma     por   250   por
I’m still trying to figure out how to run SyntaxNet as a daemon or service that takes a sentence and returns, for instance, a JSON object with the syntactic and morphological information.