Using the dependency parse
spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or "chunks".
You can check whether a Doc object has been parsed with the doc.is_parsed attribute, which returns a boolean value. If this attribute is False, the default sentence iterator will raise an exception.
The displaCy visualizer
The best way to understand spaCy's dependency parser is interactively, through the displaCy visualizer. If you want to know how to write rules that hook into some type of syntactic construction, just plug the sentence into the visualizer and see how spaCy annotates it.
Navigating the parse tree
spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of token.dep is an integer. You can get the string value with token.dep_.
Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest — from below:
    from spacy.symbols import nsubj, VERB

    # Finding a verb with a subject from below (good)
    verbs = set()
    for possible_subject in doc:
        if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
            verbs.add(possible_subject.head)
If you try to match from above, you'll have to iterate twice: once for the head, and then again through the children:
    from spacy.symbols import nsubj, VERB

    # Finding a verb with a subject from above (less good)
    verbs = []
    for possible_verb in doc:
        if possible_verb.pos == VERB:
            # Second loop: search this verb's children for a subject
            for possible_subject in possible_verb.children:
                if possible_subject.dep == nsubj:
                    verbs.append(possible_verb)
                    break
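To compare the two strategies without loading a model, here is a minimal plain-Python sketch. It is not spaCy code: the sentence, the parallel lists (words, pos, deps, heads) and both function names are hypothetical, hand-written stand-ins for a parsed Doc, and the root is assumed to point at itself.

```python
# Toy parse written by hand (hypothetical data, not spaCy output).
words = ["She", "quickly", "ate", "the", "pizza"]
pos = ["PRON", "ADV", "VERB", "DET", "NOUN"]
deps = ["nsubj", "advmod", "ROOT", "det", "dobj"]
heads = [2, 2, 2, 4, 2]  # index of each word's head; the root points at itself


def verbs_from_below():
    """Single pass: inspect each word's own arc and follow it up to the head."""
    found = set()
    for i in range(len(words)):
        if deps[i] == "nsubj" and pos[heads[i]] == "VERB":
            found.add(heads[i])
    return found


def verbs_from_above():
    """Nested passes: pick a verb, then search for a subject among its children."""
    found = set()
    for v in range(len(words)):
        if pos[v] != "VERB":
            continue
        # In this toy layout, children are found by scanning the heads list.
        for c in range(len(words)):
            if c != v and heads[c] == v and deps[c] == "nsubj":
                found.add(v)
                break
    return found


below = verbs_from_below()
above = verbs_from_above()
print([words[i] for i in sorted(below)])  # ['ate']
print(below == above)  # True
```

Both functions find the same verb. The version that starts from the subject needs only one loop, because every word carries exactly one head.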
To iterate through the children, use the token.children attribute, which provides a sequence of Token objects.
A few more convenience attributes are provided for iterating around the local tree from the token. The .lefts and .rights attributes provide sequences of syntactic children that occur before and after the token. Both sequences are in sentence order. There are also two integer-typed attributes, .n_lefts and .n_rights, that give the number of left and right children.
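The split into left and right children can be sketched in plain Python. This is an illustration, not spaCy's implementation: the heads list is a hand-written toy parse, and the lefts and rights functions are hypothetical helpers that only mirror the attribute names.

```python
# Toy parse written by hand (hypothetical data, not spaCy output).
words = ["The", "big", "dog", "chased", "a", "cat", "yesterday"]
heads = [2, 2, 3, 3, 5, 3, 3]  # the root ("chased") points at itself


def lefts(i):
    """Children of token i that occur before it, in sentence order."""
    return [c for c, h in enumerate(heads) if h == i and c < i]


def rights(i):
    """Children of token i that occur after it, in sentence order."""
    return [c for c, h in enumerate(heads) if h == i and c > i]


verb = words.index("chased")
print([words[c] for c in lefts(verb)])   # ['dog']
print([words[c] for c in rights(verb)])  # ['cat', 'yesterday']

# The counts correspond to the two integer-typed attributes.
n_lefts, n_rights = len(lefts(verb)), len(rights(verb))
print(n_lefts, n_rights)  # 1 2
```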
You can get a whole phrase by its syntactic head using the .subtree attribute. This returns an ordered sequence of tokens. For the default English model, the parse tree is projective, which means that there are no crossing brackets. The tokens returned by .subtree are therefore guaranteed to be contiguous. This is not true for the German model, which has many non-projective dependencies. You can walk up the tree with the .ancestors attribute, and check dominance with the .is_ancestor() method.
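The link between projectivity and contiguous subtrees can be seen on two hand-built parses. The sketch below is plain Python, not spaCy's implementation: the helper functions and both heads lists are hypothetical, and the non-projective example is an English sentence chosen for illustration rather than output from the German model.

```python
def subtree(heads, i):
    """Indices of token i and all its descendants, in sentence order."""
    members = []
    for j in range(len(heads)):
        k = j
        # Climb from j towards the root; the root is its own head here.
        while k != i and heads[k] != k:
            k = heads[k]
        if k == i:
            members.append(j)
    return members


def ancestors(heads, i):
    """Indices of the heads above token i, nearest first."""
    chain = []
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain


def is_contiguous(indices):
    return indices == list(range(indices[0], indices[-1] + 1))


# Projective toy parse: "The big dog chased a cat yesterday"
proj_heads = [2, 2, 3, 3, 5, 3, 3]
# Non-projective toy parse: "A hearing is scheduled on the issue today",
# where "on" (index 4) attaches to "hearing" (index 1) across the verb.
nonproj_heads = [1, 3, 3, 3, 1, 6, 4, 3]

print(subtree(proj_heads, 2), is_contiguous(subtree(proj_heads, 2)))
# [0, 1, 2] True
print(subtree(nonproj_heads, 1), is_contiguous(subtree(nonproj_heads, 1)))
# [0, 1, 4, 5, 6] False

# Dominance check: is the verb (index 3) above "The" (index 0)?
print(3 in ancestors(proj_heads, 0))  # True
```

In the second parse the subtree of "hearing" skips over "is scheduled", so slicing the sentence from its first to its last token would pull in words that do not belong to the phrase.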
Finally, I often find the .left_edge and .right_edge attributes especially useful. They give you the first and last token of the subtree. This is the easiest way to create a Span object for a syntactic phrase, which is a useful operation. Note that .right_edge gives a token within the subtree, so if you use it as the end-point of a range, don't forget to add 1.
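The off-by-one issue is easy to reproduce with ordinary list slicing. The following is a plain-Python sketch on a hand-written toy parse, not spaCy code; in_subtree, left_edge and right_edge are hypothetical names that stand in for the token attributes. With a real Doc, the equivalent slice would be along the lines of doc[token.left_edge.i : token.right_edge.i + 1].

```python
# Toy parse written by hand (hypothetical data, not spaCy output).
words = ["The", "big", "dog", "chased", "a", "cat", "yesterday"]
heads = [2, 2, 3, 3, 5, 3, 3]  # the root ("chased") points at itself


def in_subtree(head, j):
    """True if token j is head itself or one of its descendants."""
    while j != head and heads[j] != j:
        j = heads[j]
    return j == head


noun = words.index("cat")
members = [j for j in range(len(words)) if in_subtree(noun, j)]

# First and last token of the subtree; both lie inside the phrase.
left_edge, right_edge = members[0], members[-1]

too_short = words[left_edge:right_edge]   # end index is exclusive
phrase = words[left_edge:right_edge + 1]  # add 1 to include the last token

print(too_short)  # ['a']
print(phrase)     # ['a', 'cat']
```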
Disabling the parser
The parser is loaded and enabled by default. If you don't need any of the syntactic information, you should disable the parser. Disabling the parser will make spaCy load and run much faster. Here's how to prevent the parser from being loaded:
    import spacy

    nlp = spacy.load('en', parser=False)
If you need to load the parser, but need to disable it for specific documents, you can control its use with the
parse keyword argument:
    nlp = spacy.load('en')
    doc1 = nlp(u'Text I do want parsed.')
    doc2 = nlp(u"Text I don't want parsed", parse=False)