Using the dependency parse

spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or "chunks".

You can check whether a Doc object has been parsed with the doc.is_parsed attribute, which returns a boolean value. If this attribute is False, the default sentence iterator will raise an exception.
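Here's a minimal sketch of checking that flag and iterating over the sentences and base noun phrases of a parsed Doc; the example text is arbitrary:

import spacy

nlp = spacy.load('en')
doc = nlp(u'The parser also segments sentences. It finds base noun phrases too.')

assert doc.is_parsed  # True, because the parser ran

# Sentence boundaries come from the dependency parse
for sent in doc.sents:
    print(sent.text)

# Base noun phrases, or "chunks", also require the parse
for chunk in doc.noun_chunks:
    print(chunk.text)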

The displaCy visualizer

The best way to understand spaCy's dependency parser is interactively, through the displaCy visualizer. If you want to know how to write rules that hook into some type of syntactic construction, just plug the sentence into the visualizer and see how spaCy annotates it.

Navigating the parse tree

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of token.dep is an integer. You can get the string value with token.dep_.
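For example, printing each token alongside its dependency label and its head makes the arcs easy to see. This is just a sketch; the sentence is an arbitrary example:

import spacy

nlp = spacy.load('en')
doc = nlp(u'Apple is looking at buying a U.K. startup.')

for token in doc:
    # token.dep_ is the arc label; token.head is the parent token
    print(token.text, token.dep_, token.head.text)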

Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest — from below:

import spacy
from spacy.symbols import nsubj, VERB

nlp = spacy.load('en')
doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers.')  # any example text

# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)

If you try to match from above, you'll have to iterate twice: once for the head, and then again through the children:

# Finding a verb with a subject from above — less good
verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break

To iterate through the children, use the token.children attribute, which provides a sequence of Token objects.

A few more convenience attributes are provided for iterating around the local tree from the token. The .lefts and .rights attributes provide sequences of syntactic children that occur before and after the token. Both sequences are in sentence order. There are also two integer-typed attributes, .n_lefts and .n_rights, that give the number of left and right children.
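Here's a short sketch of these local-tree attributes; the sentence and the token index are arbitrary examples:

import spacy

nlp = spacy.load('en')
doc = nlp(u'Credit and mortgage account holders must submit their requests.')
token = doc[4]  # 'holders' in this tokenization, chosen as an example

print([t.text for t in token.children])  # all syntactic children
print([t.text for t in token.lefts])     # children that occur before the token
print([t.text for t in token.rights])    # children that occur after the token
print(token.n_lefts, token.n_rights)     # counts of left and right children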

You can get a whole phrase by its syntactic head using the .subtree attribute. This returns an ordered sequence of tokens. For the default English model, the parse tree is projective, which means that there are no crossing brackets. The tokens returned by .subtree are therefore guaranteed to be contiguous. This is not true for the German model, which has many non-projective dependencies. You can walk up the tree with the .ancestors attribute, and check dominance with the .is_ancestor() method.
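A brief sketch of walking the tree with these attributes, again with an arbitrary example sentence and token:

import spacy

nlp = spacy.load('en')
doc = nlp(u'Credit and mortgage account holders must submit their requests.')
token = doc[4]  # an arbitrary token to inspect

print([t.text for t in token.subtree])    # the token and everything it heads, in order
print([t.text for t in token.ancestors])  # the heads above the token, up to the root
print(doc[5].is_ancestor(token))          # dominance check between two tokens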

Finally, I often find the .left_edge and .right_edge attributes especially useful. They give you the first and last token of the subtree. This is the easiest way to create a Span object for a syntactic phrase, a useful operation.

Note that .right_edge gives a token within the subtree — so if you use it as the end-point of a range, don't forget to +1!
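A minimal sketch of building such a Span; the sentence and token are arbitrary examples:

import spacy

nlp = spacy.load('en')
doc = nlp(u'Credit and mortgage account holders must submit their requests.')
token = doc[4]  # an arbitrary token

# Slice from the first to the last token of the subtree.
# .right_edge is inside the subtree, so add 1 to include it.
span = doc[token.left_edge.i : token.right_edge.i + 1]
print(span.text)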

Disabling the parser

The parser is loaded and enabled by default. If you don't need any of the syntactic information, you should disable the parser. Disabling the parser will make spaCy load and run much faster. Here's how to prevent the parser from being loaded:

import spacy

nlp = spacy.load('en', parser=False)

If you do need the parser loaded, but want to disable it for specific documents, you can control its use with the parse keyword argument:

nlp = spacy.load('en')
doc1 = nlp(u'Text I do want parsed.')
doc2 = nlp(u"Text I don't want parsed", parse=False)