Linear Model Feature Scheme
There are two popular strategies for putting together machine learning
models for NLP: sparse linear models, and neural networks. To solve NLP
problems with linear models, feature templates need to be assembled that
combine multiple atomic predictors. This page documents the atomic predictors used in spaCy 1.0.
To understand the scheme, recall that spaCy's
EntityRecognizer is implemented as a push-down automaton. It maintains a "stack" that holds the current entity, and a "buffer"
consisting of the words to be processed.
Each state consists of the words on the stack (if any), which constitute the current entity being constructed. We also have the current word and the two subsequent words. Finally, we have the entities previously built.
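To make the state description concrete, here is a minimal sketch of a stack-and-buffer state in Python. The class name `EntityState` and the `push`, `skip` and `close` methods are hypothetical names chosen for illustration; they are not spaCy's internal data structure or its actual transition system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EntityState:
    """Hypothetical state holding token indices; not spaCy's internal struct."""
    words: List[str]
    stack: List[int] = field(default_factory=list)   # tokens of the entity being built
    next_word: int = 0                               # index of the first buffer word
    entities: List[Tuple[int, int]] = field(default_factory=list)  # finished spans

    @property
    def buffer(self) -> List[int]:
        # Words still to be processed, current word first.
        return list(range(self.next_word, len(self.words)))

    def push(self) -> None:
        # Move the current word onto the stack, extending the current entity.
        self.stack.append(self.next_word)
        self.next_word += 1

    def skip(self) -> None:
        # Advance past a word that is outside any entity.
        self.next_word += 1

    def close(self) -> None:
        # Finish the entity on the stack and record its (start, end) span,
        # with an exclusive end index.
        if self.stack:
            self.entities.append((self.stack[0], self.stack[-1] + 1))
            self.stack.clear()


state = EntityState("I flew to New York yesterday".split())
for _ in range(3):
    state.skip()   # "I", "flew", "to"
state.push()       # "New"
state.push()       # "York"
state.close()      # records the span covering "New York"
```

After these steps the stack is empty again, the buffer holds only the last word, and `state.entities` contains one span.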
This gives us a number of context tokens to ask questions about when building features, and for each of these tokens there are several different properties we can ask about. Each feature identifier asks about a specific property of a specific token of the context.
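As a rough sketch of the idea, each atomic predictor can be thought of as a (token slot, property) pair, and a feature template as a tuple of such pairs. The slot and property names below (`stack0`, `buffer0`, `form`, and so on) and the helper functions are hypothetical, chosen for illustration only; they are not the identifiers spaCy uses.

```python
from itertools import product

# Hypothetical names; only a subset of slots and properties is shown.
SLOTS = ["stack0", "stack1", "buffer0", "buffer1"]
PROPS = ["form", "lemma", "pos"]

# One atomic predictor per (slot, property) pair, each with a fixed index.
ATOMS = {pair: i for i, pair in enumerate(product(SLOTS, PROPS))}


def fill_context(tokens_by_slot):
    """Return a flat list of atomic values; missing slots get an empty string."""
    context = [""] * len(ATOMS)
    for (slot, prop), i in ATOMS.items():
        token = tokens_by_slot.get(slot)
        if token is not None:
            context[i] = token.get(prop, "")
    return context


def apply_template(template, context):
    """Combine several atomic predictors into one sparse feature key."""
    return tuple((slot, prop, context[ATOMS[(slot, prop)]]) for slot, prop in template)


tokens = {
    "stack0": {"form": "New", "lemma": "new", "pos": "NNP"},
    "buffer0": {"form": "York", "lemma": "york", "pos": "NNP"},
}
context = fill_context(tokens)
feature = apply_template([("stack0", "form"), ("buffer0", "pos")], context)
```

A sparse linear model would then look up a weight for each such feature key. Keeping the atomic values in a flat array means the context is filled once per state, and every template reads from it.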
|The first word on the stack, i.e. the token most recently added to the current entity.|
|The second word on the stack, i.e. the second most recently added.|
|The third word on the stack, i.e. the third most recently added.|
|The first word of the buffer, i.e. the current word being tagged.|
|The second word of the buffer.|
|The third word of the buffer.|
|The word immediately before |
|The second word before |
|The first word of the previously constructed entity.|
|The first word of the second previously constructed entity.|
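The ten token slots above could be collected from a state roughly as follows. This is a sketch under stated assumptions: the slot names are hypothetical, the state is represented by plain token indices, and the "word before" slots are assumed to be counted back from the current word (the first word of the buffer), which the table does not state explicitly.

```python
def context_slots(n_words, stack, current, entities):
    """Map ten hypothetical slot names to token indices (None if absent).

    stack    -- token indices of the entity being built, oldest first
    current  -- index of the first word of the buffer
    entities -- completed (start, end) spans, oldest first
    """
    def stack_item(k):
        # k = 0 is the token most recently added to the current entity.
        return stack[-1 - k] if k < len(stack) else None

    def word(i):
        # Return the index only if it falls inside the sentence.
        return i if 0 <= i < n_words else None

    def entity_start(k):
        # k = 0 is the most recently completed entity.
        return entities[-1 - k][0] if k < len(entities) else None

    return {
        "stack0": stack_item(0),
        "stack1": stack_item(1),
        "stack2": stack_item(2),
        "buffer0": word(current),
        "buffer1": word(current + 1),
        "buffer2": word(current + 2),
        # Assumption: "before" is measured from the current word.
        "prev1": word(current - 1),
        "prev2": word(current - 2),
        "entity0": entity_start(0),
        "entity1": entity_start(1),
    }


words = "Apple hired Tim Cook in New York".split()
# "Apple" and "Tim Cook" are finished entities, "New" is on the stack,
# and "York" is the current word.
slots = context_slots(len(words), stack=[5], current=6, entities=[(0, 1), (2, 4)])
```

Slots that do not exist in a given state, such as the second stack word here, come back as `None`, so downstream feature code needs a placeholder value for them.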
About each of these tokens, we can ask:
|The word form.|
|The word's lemma.|
|The word's (full) POS tag.|
|The word's (full) Brown cluster.|
|First four-digit prefix of the word's Brown cluster.|
|First six-digit prefix of the word's Brown cluster.|
|The word's dependency label. Not used as a feature in the NER.|
|The first three characters of the word.|
|The last three characters of the word.|
|The word's shape, i.e. whether it is alphabetic, numeric, etc.|
|The Inside/Outside/Begin code of the word's NER tag.|
|The word's NER type.|
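The properties listed above can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the token is a plain dict with hypothetical keys, the Brown cluster is assumed to be stored as a string of binary digits, and `word_shape` is a simplified shape function rather than spaCy's own.

```python
def word_shape(text):
    """Simplified shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = []
    for ch in text:
        if ch.isalpha():
            shape.append("X" if ch.isupper() else "x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # keep punctuation as-is
    return "".join(shape)


def token_properties(token):
    """Compute the per-token properties from a dict with hypothetical keys."""
    form = token["form"]
    cluster = token.get("cluster", "")  # assumed to be a bit string
    return {
        "form": form,
        "lemma": token.get("lemma", ""),
        "tag": token.get("tag", ""),
        "cluster": cluster,
        "cluster4": cluster[:4],   # first four digits of the cluster
        "cluster6": cluster[:6],   # first six digits of the cluster
        "dep": token.get("dep", ""),
        "prefix": form[:3],        # first three characters
        "suffix": form[-3:],       # last three characters
        "shape": word_shape(form),
        "iob": token.get("iob", ""),
        "ent_type": token.get("ent_type", ""),
    }


props = token_properties({
    "form": "York", "lemma": "york", "tag": "NNP",
    "cluster": "1101011010", "dep": "pobj",
    "iob": "I", "ent_type": "GPE",
})
```

For the token "York" this gives the prefix "Yor", the suffix "ork" and the shape "Xxxx". Shorter cluster prefixes group the word with a broader set of distributionally similar words, which is why several prefix lengths are offered.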