Training spaCy's statistical models

This workflow describes how to train new statistical models for spaCy's part-of-speech tagger, named entity recognizer and dependency parser. Once a model is trained, you can save it to disk and load it back into spaCy.

Training the part-of-speech tagger

from spacy.vocab import Vocab
from spacy.tagger import Tagger
from spacy.tokens import Doc
from spacy.gold import GoldParse


# The tag map tells the vocabulary how the model's tags map onto
# universal part-of-speech tags.
vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
tagger = Tagger(vocab)

# Each training example pairs a Doc with a GoldParse holding the true tags.
doc = Doc(vocab, words=['I', 'like', 'stuff'])
gold = GoldParse(doc, tags=['N', 'V', 'N'])
tagger.update(doc, gold)

# Finalize the averaged weights once all updates have been made.
tagger.model.end_training()

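The example above stops at end_training(), which finalizes the averaged weights but doesn't persist anything. A minimal save sketch, assuming the spaCy 1.x model API (model.dump() and StringStore.dump(), as used in the bundled training examples); the output path is illustrative:

import pathlib

# Hypothetical output location; any writable directory will do.
output_dir = pathlib.Path('/tmp/my_tagger')
(output_dir / 'pos').mkdir(parents=True, exist_ok=True)
(output_dir / 'vocab').mkdir(parents=True, exist_ok=True)

# Write the trained weights, plus the string store the model depends on.
tagger.model.dump(str(output_dir / 'pos' / 'model'))
with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
    tagger.vocab.strings.dump(file_)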

Training the named entity recognizer

from spacy.vocab import Vocab
from spacy.pipeline import EntityRecognizer
from spacy.tokens import Doc

vocab = Vocab()
entity = EntityRecognizer(vocab, entity_types=['PERSON', 'LOC'])

# The annotations use the BILUO scheme: B(egin), I(nside) and L(ast) mark
# multi-token entities, U(nit) single-token entities, and O non-entity tokens.
doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
entity.update(doc, ['O', 'O', 'B-PERSON', 'L-PERSON', 'O'])

# Finalize the averaged weights once all updates have been made.
entity.model.end_training()

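If your annotations are stored as character offsets into the original text rather than per-token BILUO tags, a GoldParse can hold them instead. A minimal sketch, following the pattern of spaCy 1.x's bundled train_ner.py example; the offsets are chosen to match the toy sentence:

from spacy.gold import GoldParse

doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
# (start_char, end_char, label) offsets: "Shaka Khan" spans characters 7-17
gold = GoldParse(doc, entities=[(7, 17, 'PERSON')])
entity.update(doc, gold)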

Extending the named entity recognizer

All spaCy models support online learning, so you can update a pre-trained model with new examples. You can even add new classes to an existing model, to recognise a new entity type, part-of-speech tag or syntactic relation. Updating an existing model is particularly useful as a quick-and-dirty solution when you only have a few corrections or annotations; see the sketch below.

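A minimal sketch of updating a pre-trained model with a new entity type, following the pattern of spaCy 1.x's train_new_entity_type.py example. It assumes the 'en' model is installed; the 'ANIMAL' label and the training sentence are invented for illustration:

import random

import spacy
from spacy.gold import GoldParse

nlp = spacy.load('en')
nlp.entity.add_label('ANIMAL')  # add a new class to the existing model

train_data = [('Horses are too tall.', [(0, 6, 'ANIMAL')])]

for itn in range(20):  # make several passes over the data
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.tagger(doc)  # the entity recognizer uses the tagger's output
        nlp.entity.update(doc, gold)

nlp.entity.model.end_training()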

Training the dependency parser

from spacy.vocab import Vocab
from spacy.pipeline import DependencyParser
from spacy.tokens import Doc

vocab = Vocab()
parser = DependencyParser(vocab, labels=['nsubj', 'compound', 'dobj', 'punct'])

# Each annotation is a (head index, dependency label) pair per token.
# 'is' (index 1) heads the sentence, so it attaches to itself as ROOT.
doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
parser.update(doc, [(1, 'nsubj'), (1, 'ROOT'), (3, 'compound'), (1, 'dobj'),
                    (1, 'punct')])

# Finalize the averaged weights once all updates have been made.
parser.model.end_training()

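Once trained, the parser can be applied to a new Doc by calling it, like any other pipeline component. A quick usage sketch, assuming the spaCy 1.x callable-component API; after only the single toy update above, the predictions won't be meaningful:

doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
parser(doc)  # fills in token.head and token.dep_
for token in doc:
    print(token.orth_, token.dep_, token.head.orth_)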

Customizing the feature extraction

spaCy currently uses linear models for the tagger, parser and entity recognizer, with weights learned using the Averaged Perceptron algorithm.

Because it's a linear model, it's important for accuracy to build conjunction features out of the atomic predictors. Let's say you have two atomic predictors asking, "What is the part-of-speech of the previous token?" and "What is the part-of-speech of the previous-previous token?". Each predictor introduces a number of features, e.g. Prev-pos=NN, Prev-pos=VBZ, Prev-prev-pos=NN, etc. A conjunction template combines the two, introducing features such as Prev-pos=NN&Prev-prev-pos=VBZ.

The feature extraction proceeds in two passes. In the first pass, we fill an array with the values of all of the atomic predictors. In the second pass, we iterate over the feature templates, and fill a small temporary array with the predictors that will be combined into a conjunction feature. Finally, we hash this array into a 64-bit integer, using the MurmurHash algorithm. You can see this at work in the thinc.linear.features module.
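
To make the two passes concrete, here's an illustrative plain-Python sketch; the predictor names and values are invented, and Python's built-in hash stands in for MurmurHash:

# Pass 1: fill a table with the values of the atomic predictors.
atomic = {
    'P1_pos': 'NN',    # part-of-speech of the previous token
    'P2_pos': 'VBZ',   # part-of-speech of the previous-previous token
}

# Pass 2: for each template, gather the predictors it combines, then hash
# the combination into a single integer-valued conjunction feature.
templates = [('P1_pos',), ('P2_pos',), ('P1_pos', 'P2_pos')]
features = [hash((template, tuple(atomic[name] for name in template)))
            for template in templates]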

It's easy to change the feature templates to create novel combinations of the existing atomic predictors. There's currently no API for adding new atomic predictors, though: you'll have to subclass the model and write your own set_featuresC method.

The feature templates are passed in using the features keyword argument to the constructors of the Tagger, DependencyParser and EntityRecognizer:

from spacy.vocab import Vocab
from spacy.pipeline import Tagger
from spacy.tagger import P2_orth, P1_orth
from spacy.tagger import P2_cluster, P1_cluster, W_orth, N1_orth, N2_orth

vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
# P2/P1 refer to the two previous tokens, W to the current token, and
# N1/N2 to the two following tokens; orth is the word form and cluster
# its Brown cluster ID.
tagger = Tagger(vocab, features=[(P2_orth, P2_cluster), (P1_orth, P1_cluster),
                                 (P2_orth,), (P1_orth,), (W_orth,),
                                 (N1_orth,), (N2_orth,)])

Custom feature templates can also be passed to the DependencyParser and EntityRecognizer, using the same features keyword argument of the constructor.