Linear Model Feature Scheme

There are two popular strategies for putting together machine learning models for NLP: sparse linear models and neural networks. To solve NLP problems with linear models, you need to assemble feature templates that combine multiple atomic predictors. This page documents the atomic predictors used in the spaCy 1.0 Parser, Tagger and EntityRecognizer.
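As a rough illustration of how a feature template combines atomic predictors, the sketch below treats each atomic predictor as a named value and each template as a tuple of names. The names atoms, TEMPLATES and extract_features are hypothetical and are not part of spaCy's API, and the example values are made up. Only the identifiers such as N0w follow the scheme documented on this page.

```python
# Hypothetical sketch, not spaCy's implementation.
# Each atomic predictor is a named value read off the current state.
atoms = {"S0w": "New", "S0p": "NNP", "N0w": "York", "N0p": "NNP"}

# A feature template names the atomic predictors to conjoin.
TEMPLATES = [
    ("N0w",),        # the current word form on its own
    ("S0p", "N0p"),  # tags of the stack top and the current word
    ("S0w", "N0w"),  # word forms of the stack top and the current word
]


def extract_features(atoms, templates):
    """Return one feature per template: the template paired with its values."""
    features = []
    for template in templates:
        values = tuple(atoms[name] for name in template)
        features.append((template, values))
    return features


features = extract_features(atoms, TEMPLATES)
```

A sparse linear model would then look up a weight for each (template, values) pair and sum the weights to score each candidate action.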

To understand the scheme, recall that spaCy's Parser and EntityRecognizer are implemented as push-down automata. They maintain a "stack" that holds the current entity, and a "buffer" consisting of the words to be processed.

Each state consists of the words on the stack (if any), which constitute the current entity being constructed. We also have the current word and the two subsequent words. Finally, we have the entities previously built.
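The parts of a state described above can be pictured with a small container. This is a minimal sketch for the entity recognition case only. The State class, its field names and the example sentence are assumptions made for illustration and do not reflect spaCy's internal data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class State:
    """Hypothetical container for the parts of a state described above."""
    words: List[str]
    # Indices of the words in the entity currently being built.
    stack: List[int] = field(default_factory=list)
    # Index of the current word, i.e. the first word of the buffer.
    buffer_start: int = 0
    # (start, end) spans of the entities built so far.
    entities: List[Tuple[int, int]] = field(default_factory=list)

    def current_and_lookahead(self) -> List[str]:
        """The current word and the two subsequent words, where they exist."""
        return self.words[self.buffer_start:self.buffer_start + 3]


state = State(
    words=["She", "moved", "to", "New", "York", "City", "last", "year"],
    stack=[3],        # "New" has been added to the current entity
    buffer_start=4,   # "York" is the current word
)
```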

This gives us a set of context tokens to ask questions about when building features. For each of these tokens, we can ask about a number of different properties. Each feature identifier asks about a specific property of a specific token of the context.

Context tokens

S0  The first word on the stack, i.e. the token most recently added to the current entity.
S1  The second word on the stack, i.e. the second most recently added.
S2  The third word on the stack, i.e. the third most recently added.
N0  The first word of the buffer, i.e. the current word being tagged.
N1  The second word of the buffer.
N2  The third word of the buffer.
P1  The word immediately before N0.
P2  The second word before N0.
E0  The first word of the previously constructed entity.
E1  The first word of the second previously constructed entity.
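The slots listed above can be resolved to token positions with a few index lookups. The sketch below is one possible reading of the definitions, assuming the stack and the entity list are ordered with the most recent item last, and using None for a slot that has no token. The function context_tokens and its parameters are hypothetical names, not spaCy's API.

```python
def context_tokens(stack, n0, n_words, entities):
    """Map each context slot to a token index, or None if the slot is empty.

    stack:    indices of the words in the current entity, most recent last
    n0:       index of the first word of the buffer
    n_words:  number of words in the sentence
    entities: (start, end) spans of finished entities, most recent last
    """
    def stack_at(k):
        return stack[-1 - k] if k < len(stack) else None

    def word_at(i):
        return i if 0 <= i < n_words else None

    def entity_start(k):
        return entities[-1 - k][0] if k < len(entities) else None

    return {
        "S0": stack_at(0), "S1": stack_at(1), "S2": stack_at(2),
        "N0": word_at(n0), "N1": word_at(n0 + 1), "N2": word_at(n0 + 2),
        "P1": word_at(n0 - 1), "P2": word_at(n0 - 2),
        "E0": entity_start(0), "E1": entity_start(1),
    }


# Example: words 3 and 4 are on the stack, word 5 is the current word,
# and one entity spanning words 0..1 has already been built.
slots = context_tokens(stack=[3, 4], n0=5, n_words=8, entities=[(0, 1)])
```

Returning None for an empty slot lets the feature extractor emit a distinct "missing" value instead of failing at sentence boundaries.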

About each of these tokens, we can ask:

N0w         token.orth      The word form.
N0W         token.lemma     The word's lemma.
N0p         token.tag       The word's (full) POS tag.
N0c         token.cluster   The word's (full) Brown cluster.
N0c4        -               First four digit prefix of the word's Brown cluster.
N0c6        -               First six digit prefix of the word's Brown cluster.
N0L         -               The word's dependency label. Not used as a feature in the NER.
N0_prefix   token.prefix    The first three characters of the word.
N0_suffix   token.suffix    The last three characters of the word.
N0_shape    token.shape     The word's shape, i.e. is it alphabetic, numeric, etc.
N0_ne_iob   token.ent_iob   The Inside/Outside/Begin code of the word's NER tag.
N0_ne_type  token.ent_type  The word's NER type.
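To show how a slot name and a property suffix combine into a feature identifier, the sketch below computes the properties that can be derived from a word's text, tag and Brown cluster. It assumes the cluster is available as a string of binary digits, and the shape function is a simplified stand-in rather than spaCy's own. The names word_shape and token_properties, and the example cluster value, are invented for illustration.

```python
def word_shape(text):
    """Simplified shape: X for uppercase, x for lowercase, d for digits."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append("d")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)  # keep punctuation and other characters as they are
    return "".join(out)


def token_properties(slot, text, tag, cluster_bits):
    """Build identifier -> value pairs for one context slot.

    cluster_bits is assumed to be the Brown cluster as a bit string.
    """
    return {
        slot + "w": text,
        slot + "p": tag,
        slot + "c": cluster_bits,
        slot + "c4": cluster_bits[:4],   # four-digit cluster prefix
        slot + "c6": cluster_bits[:6],   # six-digit cluster prefix
        slot + "_prefix": text[:3],      # first three characters
        slot + "_suffix": text[-3:],     # last three characters
        slot + "_shape": word_shape(text),
    }


props = token_properties("N0", "York", "NNP", "1011010110")
```

The same function applies to any slot, so calling it with "S0" or "P1" yields identifiers such as S0w or P1_shape.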