Training the tagger, parser and entity recognizer

This tutorial describes how to train new statistical models for spaCy's part-of-speech tagger, named entity recognizer and dependency parser.

I'll start with some quick code examples that show how to train each model. I'll then provide some background on the algorithms and explain how the data and feature templates work.

Training the part-of-speech tagger

from spacy.vocab import Vocab
from spacy.tagger import Tagger
from spacy.tokens import Doc
from spacy.gold import GoldParse


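# The tag map translates each fine-grained tag to a coarse-grained part of speech.
vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
tagger = Tagger(vocab)

# One training example: the words of a sentence and their gold-standard tags.
doc = Doc(vocab, words=['I', 'like', 'stuff'])
gold = GoldParse(doc, tags=['N', 'V', 'N'])
tagger.update(doc, gold)

# Call once, after the last update, to finalize the model.
tagger.model.end_training()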

Full example

Training the named entity recognizer

from spacy.vocab import Vocab
from spacy.pipeline import EntityRecognizer
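from spacy.tokens import Doc
from spacy.gold import GoldParse

vocab = Vocab()
entity = EntityRecognizer(vocab, entity_types=['PERSON', 'LOC'])

doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
# As with the tagger, update() takes a GoldParse. The entities are given as
# one BILUO tag per token.
gold = GoldParse(doc, entities=['O', 'O', 'B-PERSON', 'L-PERSON', 'O'])
entity.update(doc, gold)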

entity.model.end_training()

Full example
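The entity tags in the example above follow the BILUO scheme: B- begins a multi-token entity, I- continues it, L- ends it, U- marks a single-token entity, and O marks a token outside any entity. If your annotations are stored as token spans instead, a small helper can produce the tags. This is only a sketch: spans_to_biluo is a hypothetical name, not part of spaCy, and it assumes spans are (start, end, label) triples over token indices with an exclusive end.

```python
def spans_to_biluo(n_tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) to BILUO tags."""
    tags = ['O'] * n_tokens
    for start, end, label in spans:
        if not (0 <= start < end <= n_tokens):
            raise ValueError('span out of range: %r' % ((start, end, label),))
        if any(tag != 'O' for tag in tags[start:end]):
            raise ValueError('overlapping span: %r' % ((start, end, label),))
        if end - start == 1:
            # A single-token entity gets a U- tag.
            tags[start] = 'U-' + label
        else:
            tags[start] = 'B-' + label
            for i in range(start + 1, end - 1):
                tags[i] = 'I-' + label
            tags[end - 1] = 'L-' + label
    return tags


words = ['Who', 'is', 'Shaka', 'Khan', '?']
tags = spans_to_biluo(len(words), [(2, 4, 'PERSON')])
print(tags)  # ['O', 'O', 'B-PERSON', 'L-PERSON', 'O']
```

The helper rejects overlapping spans, because a flat tag sequence has room for only one tag per token.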

Training the dependency parser

from spacy.vocab import Vocab
from spacy.pipeline import DependencyParser
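from spacy.tokens import Doc
from spacy.gold import GoldParse

vocab = Vocab()
parser = DependencyParser(vocab, labels=['nsubj', 'compound', 'dobj', 'punct'])

doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
# One head index and one dependency label per token. The root is its own head.
gold = GoldParse(doc, heads=[1, 1, 3, 1, 1],
                 deps=['nsubj', 'ROOT', 'compound', 'dobj', 'punct'])
parser.update(doc, gold)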

parser.model.end_training()

Full example
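The parser example above gives every token the index of its head, and the root token points at itself. Hand-written head lists are easy to get wrong, so it can be worth checking them before training. The following is a minimal sketch of such a check. check_heads is a hypothetical helper, not a spaCy function, and it assumes the self-pointing root convention used in the example.

```python
def check_heads(heads):
    """Return the root index if `heads` describes a single-rooted tree.

    Raises ValueError for a bad index, a missing or duplicate root, or a cycle.
    """
    n = len(heads)
    for i, head in enumerate(heads):
        if not 0 <= head < n:
            raise ValueError('token %d has head %d, out of range' % (i, head))
    roots = [i for i, head in enumerate(heads) if head == i]
    if len(roots) != 1:
        raise ValueError('expected exactly one root, found %d' % len(roots))
    for i in range(n):
        # Walk from token i towards the root. Revisiting a token means a cycle.
        seen = set()
        node = i
        while heads[node] != node:
            if node in seen:
                raise ValueError('cycle involving token %d' % node)
            seen.add(node)
            node = heads[node]
    return roots[0]


heads = [1, 1, 3, 1, 1]  # Who -> is, is -> is (root), Shaka -> Khan, ...
root = check_heads(heads)
print(root)  # 1
```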

Customizing the feature extraction

spaCy currently uses linear models for the tagger, parser and entity recognizer, with weights learned using the Averaged Perceptron algorithm.
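To make the idea concrete, here is a small pure-Python averaged perceptron over sparse, string-valued features. It is a sketch for illustration only, not the implementation spaCy uses. The class and its internals are my own assumptions, and the end_training name is borrowed from the calls in the examples above. The model adds 1 to the weights of the correct class and subtracts 1 from the weights of the wrongly predicted class on every mistake, then replaces each weight with its average over all update steps.

```python
from collections import defaultdict


class AveragedPerceptron(object):
    def __init__(self, classes):
        self.classes = list(classes)
        self.weights = defaultdict(float)  # (feature, class) -> current weight
        self._totals = defaultdict(float)  # (feature, class) -> running sum of the weight
        self._stamps = defaultdict(int)    # (feature, class) -> step of the last change
        self._step = 0

    def predict(self, features):
        scores = dict((c, 0.0) for c in self.classes)
        for f in features:
            for c in self.classes:
                scores[c] += self.weights.get((f, c), 0.0)
        # On a tie, max() returns the class listed first.
        return max(self.classes, key=lambda c: scores[c])

    def update(self, features, truth):
        self._step += 1
        guess = self.predict(features)
        if guess != truth:
            for f in features:
                self._adjust((f, truth), 1.0)
                self._adjust((f, guess), -1.0)
        return guess

    def _adjust(self, key, delta):
        # Credit the old value for every step it stayed unchanged, then change it.
        self._totals[key] += (self._step - self._stamps[key]) * self.weights[key]
        self._stamps[key] = self._step
        self.weights[key] += delta

    def end_training(self):
        # Replace each weight by its average over all update steps so far.
        if self._step == 0:
            return
        for key, weight in list(self.weights.items()):
            steps_held = self._step - self._stamps[key] + 1
            total = self._totals[key] + steps_held * weight
            self.weights[key] = total / float(self._step)


examples = [(['word=I'], 'N'), (['word=like'], 'V'), (['word=stuff'], 'N')]
model = AveragedPerceptron(['N', 'V'])
for epoch in range(5):
    for features, tag in examples:
        model.update(features, tag)
model.end_training()
print(model.predict(['word=like']))  # V
```

Averaging damps the effect of late updates, which tends to make the final weights less sensitive to the order of the training examples.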

Because the model is linear, building conjunction features out of the atomic predictors is important for accuracy. Let's say you have two atomic predictors asking, "What is the part-of-speech of the previous token?", and "What is the part-of-speech of the token before that?". On their own, these predictors introduce features such as Prev-pos=NN, Prev-pos=VBZ, etc. A conjunction template pairs the two predictors, introducing features such as Prev-pos=NN&Prev-prev-pos=VBZ.
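A few lines of Python show the difference between atomic and conjunction features. This is an illustration only: the function and the template list are hypothetical, and I use the name Prev-prev-pos for the second predictor so the two can be told apart.

```python
def extract_features(prev_pos, prev_prev_pos):
    """Build one feature string per template from two atomic predictors."""
    atomic = {'Prev-pos': prev_pos, 'Prev-prev-pos': prev_prev_pos}
    templates = [
        ('Prev-pos',),                  # atomic feature
        ('Prev-prev-pos',),             # atomic feature
        ('Prev-pos', 'Prev-prev-pos'),  # conjunction of the two
    ]
    features = []
    for template in templates:
        parts = ['%s=%s' % (name, atomic[name]) for name in template]
        features.append('&'.join(parts))
    return features


features = extract_features('NN', 'VBZ')
print(features)
# ['Prev-pos=NN', 'Prev-prev-pos=VBZ', 'Prev-pos=NN&Prev-prev-pos=VBZ']
```

The conjunction feature gets its own weight, so the model can score the pair NN, VBZ differently from what the two atomic weights would add up to.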

The feature extraction proceeds in two passes. In the first pass, we fill an array with the values of all of the atomic predictors. In the second pass, we iterate over the feature templates, and fill a small temporary array with the predictors that will be combined into a conjunction feature. Finally, we hash this array into a 64-bit integer, using the MurmurHash algorithm. You can see this at work in the thinc.linear.features module.
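The two passes can be sketched in pure Python as follows. Several things here are assumptions made for the sake of a runnable example: the slot names and templates are hypothetical, the real code is not written in Python, and because the standard library has no MurmurHash, I substitute 8 bytes of BLAKE2b as a stand-in 64-bit hash. Mixing the template index into the hash, so that two templates with equal values get different features, is also my own choice.

```python
import hashlib
import struct

# Hypothetical slots in the array of atomic predictors.
P2_POS, P1_POS, W_ORTH, N_ATOMS = 0, 1, 2, 3

# Each template is a tuple of slots to combine into one feature.
TEMPLATES = [(P1_POS,), (P2_POS,), (P1_POS, P2_POS), (W_ORTH, P1_POS)]


def fill_atoms(word_ids, tag_ids, i):
    """Pass 1: one integer per atomic predictor, 0 where no token exists."""
    atoms = [0] * N_ATOMS
    atoms[P2_POS] = tag_ids[i - 2] if i >= 2 else 0
    atoms[P1_POS] = tag_ids[i - 1] if i >= 1 else 0
    atoms[W_ORTH] = word_ids[i]
    return atoms


def hash64(template_index, values):
    """Stand-in for MurmurHash: pack the integers, keep 8 bytes of BLAKE2b."""
    payload = struct.pack('<%dq' % (len(values) + 1), template_index, *values)
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return struct.unpack('<Q', digest)[0]


def extract(word_ids, tag_ids, i):
    """Pass 2: gather each template's values and hash them into one integer."""
    atoms = fill_atoms(word_ids, tag_ids, i)
    features = []
    for t, template in enumerate(TEMPLATES):
        values = [atoms[slot] for slot in template]  # small temporary array
        features.append(hash64(t, values))
    return features


word_ids = [11, 12, 13]
tag_ids = [1, 2, 1]
feats = extract(word_ids, tag_ids, 2)
print(len(feats))  # 4, one hashed feature per template
```

Hashing means the model never needs a table of feature names: any combination of values maps directly to an integer key for the weights.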

It's easy to change the feature templates to create novel combinations of the existing atomic predictors. There's currently no API for adding new atomic predictors, though: you'll have to create a subclass of the model and write your own set_featuresC method.

The feature templates are passed in using the features keyword argument to the constructors of the Tagger, DependencyParser and EntityRecognizer:

from spacy.vocab import Vocab
from spacy.pipeline import Tagger
from spacy.tagger import P2_orth, P1_orth
from spacy.tagger import P2_cluster, P1_cluster, W_orth, N1_orth, N2_orth

vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
tagger = Tagger(vocab, features=[(P2_orth, P2_cluster), (P1_orth, P1_cluster),
                                 (P2_orth,), (P1_orth,), (W_orth,),
                                 (N1_orth,), (N2_orth,)])

The DependencyParser and EntityRecognizer constructors accept custom feature templates in the same way, through the features keyword argument.