
Annotation Specifications
Schemes used for labels, tags and training data.

Text Processing

Tokenization standards are based on the OntoNotes 5 corpus. The tokenizer differs from most by including tokens for significant whitespace. Any sequence of whitespace characters beyond a single space (' ') is included as a token. The whitespace tokens are useful for much the same reason punctuation is – it's often an important delimiter in the text. By preserving it in the token output, we are able to maintain a simple alignment between the tokens and the original string, and we ensure that no information is lost during processing.
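
A minimal sketch of this behaviour, assuming the en_core_web_sm model is installed (any English pipeline would do):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello   world. This\tis  spaCy.")
# Runs of whitespace beyond a single space become their own tokens, so
# joining each token's text with its trailing whitespace reproduces the
# original string exactly.
for token in doc:
    print(repr(token.text), repr(token.whitespace_))
assert "".join(token.text_with_ws for token in doc) == doc.text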

Lemmatization

A lemma is the uninflected form of a word. The English lemmatization data is taken from WordNet. Lookup tables are taken from Lexiconista. spaCy also adds a special case for pronouns: all pronouns are lemmatized to the special token -PRON-.
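
For example (a sketch assuming an English model such as en_core_web_sm is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I was reading the papers")
for token in doc:
    # lemma_ holds the uninflected form: "was" -> "be", "papers" -> "paper",
    # and the pronoun "I" -> the special token "-PRON-"
    print(token.text, token.lemma_)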

Sentence boundary detection

Sentence boundaries are calculated from the syntactic parse tree, so features such as punctuation and capitalisation play an important but non-decisive role in determining the sentence boundaries. Usually this means that the sentence boundaries will at least coincide with clause boundaries, even given poorly punctuated text.
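
For example (a sketch assuming en_core_web_sm; the exact splits depend on the parser's analysis):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("this is a sentence this is another one")
# Boundaries come from the dependency parse rather than a punctuation rule,
# so even this unpunctuated text is usually split at the clause boundary.
for sent in doc.sents:
    print(sent.text)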

Part-of-speech Tagging

This section lists the fine-grained and coarse-grained part-of-speech tags assigned by spaCy's models. The individual mapping is specific to the training corpus and can be defined in the respective language data's tag_map.py.
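
Both tag sets are available on each token via pos_ (coarse-grained) and tag_ (fine-grained); a minimal sketch assuming en_core_web_sm:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup.")
for token in doc:
    # pos_ is the coarse-grained tag, tag_ the fine-grained, corpus-specific tag;
    # spacy.explain() returns a short description of either label.
    print(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))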

Syntactic Dependency Parsing

This section lists the syntactic dependency labels assigned by spaCy's models. The individual labels are language-specific and depend on the training corpus.
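
The labels can be inspected via each token's dep_ attribute and its syntactic head; a minimal sketch assuming en_core_web_sm:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    # Each token stores its dependency label and a reference to its head token.
    print(token.text, token.dep_, token.head.text)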

Named Entity Recognition

Models trained on the OntoNotes 5 corpus support the following entity types:

Type          Description
PERSON        People, including fictional.
NORP          Nationalities or religious or political groups.
FAC           Buildings, airports, highways, bridges, etc.
ORG           Companies, agencies, institutions, etc.
GPE           Countries, cities, states.
LOC           Non-GPE locations, mountain ranges, bodies of water.
PRODUCT       Objects, vehicles, foods, etc. (Not services.)
EVENT         Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART   Titles of books, songs, etc.
LAW           Named documents made into laws.
LANGUAGE      Any named language.
DATE          Absolute or relative dates or periods.
TIME          Times smaller than a day.
PERCENT       Percentage, including "%".
MONEY         Monetary values, including unit.
QUANTITY      Measurements, as of weight or distance.
ORDINAL       "first", "second", etc.
CARDINAL      Numerals that do not fall under another type.
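
For example, entity spans and their labels can be read from doc.ents (a minimal sketch, assuming a model trained on OntoNotes 5 such as en_core_web_sm is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is hiring 50 engineers in London before June 2019.")
for ent in doc.ents:
    # label_ is one of the OntoNotes types listed above, e.g. ORG, GPE, DATE
    print(ent.text, ent.start_char, ent.end_char, ent.label_)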

Wikipedia scheme

Models trained on the Wikipedia corpus (Nothman et al., 2013) use a less fine-grained NER annotation scheme and recognise the following entities:

Type   Description
PER    Named person or family.
LOC    Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains).
ORG    Named corporate, governmental, or other organizational entity.
MISC   Miscellaneous entities, e.g. events, nationalities, products or works of art.

BILUO Scheme

Tag         Description
B (BEGIN)   The first token of a multi-token entity.
I (IN)      An inner token of a multi-token entity.
L (LAST)    The final token of a multi-token entity.
U (UNIT)    A single-token entity.
O (OUT)     A non-entity token.

spaCy translates the character offsets into this scheme, in order to decide the cost of each action given the current state of the entity recogniser. The costs are then used to calculate the gradient of the loss, to train the model. The exact algorithm is a pastiche of well-known methods, and is not currently described in any single publication. The model is a greedy transition-based parser guided by a linear model whose weights are learned using the averaged perceptron loss, via the dynamic oracle imitation learning strategy. The transition system is equivalent to the BILUO tagging scheme.
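
The same conversion from character offsets to BILUO tags is exposed as a helper in spaCy v2.x, spacy.gold.biluo_tags_from_offsets; a minimal sketch:

import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp("I like London and New York City.")
# Character offset annotations: (start, end, label)
entities = [(7, 13, "GPE"), (18, 31, "GPE")]
tags = biluo_tags_from_offsets(doc, entities)
print(tags)
# With the default English tokenization this yields:
# ['O', 'O', 'U-GPE', 'O', 'B-GPE', 'I-GPE', 'L-GPE', 'O']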

Models and training data

JSON input format for training

spaCy takes training data in JSON format. The built-in convert command helps you convert the .conllu format used by the Universal Dependencies corpora to spaCy's training format.

Example structure

[{ "id": int, # ID of the document within the corpus "paragraphs": [{ # list of paragraphs in the corpus "raw": string, # raw text of the paragraph "sentences": [{ # list of sentences in the paragraph "tokens": [{ # list of tokens in the sentence "id": int, # index of the token in the document "dep": string, # dependency label "head": int, # offset of token head relative to token index "tag": string, # part-of-speech tag "orth": string, # verbatim text of the token "ner": string # BILUO label, e.g. "O" or "B-ORG" }], "brackets": [{ # phrase structure (NOT USED by current models) "first": int, # index of first token "last": int, # index of last token "label": string # phrase label }] }] }] }]

An example of dependencies, part-of-speech tags and named entities, taken from the English Wall Street Journal portion of the Penn Treebank, is available as a code example in the spaCy repository on GitHub.
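
As a stand-in, here is a minimal hand-written document that follows the schema above (the sentence and its annotations are invented for illustration, not taken from the Penn Treebank), written out with Python's json module:

import json

# One document with a single paragraph and sentence, annotated by hand.
# "head" is the offset of the token's head relative to the token's own index.
doc = {
    "id": 0,
    "paragraphs": [{
        "raw": "I like London.",
        "sentences": [{
            "tokens": [
                {"id": 0, "orth": "I",      "tag": "PRP", "head": 1,  "dep": "nsubj", "ner": "O"},
                {"id": 1, "orth": "like",   "tag": "VBP", "head": 0,  "dep": "ROOT",  "ner": "O"},
                {"id": 2, "orth": "London", "tag": "NNP", "head": -1, "dep": "dobj",  "ner": "U-GPE"},
                {"id": 3, "orth": ".",      "tag": ".",   "head": -2, "dep": "punct", "ner": "O"}
            ],
            "brackets": []
        }]
    }]
}

with open("train.json", "w") as f:
    json.dump([doc], f, indent=2)   # the training file is a list of documents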

Lexical data for vocabulary
This feature is new and was introduced in spaCy v2.0.

To populate a model's vocabulary, you can use the spacy vocab command and load in a newline-delimited JSON (JSONL) file containing one lexical entry per line. The first line defines the language and vocabulary settings. All other lines are expected to be JSON objects describing an individual lexeme. The lexical attributes will then be set as attributes on spaCy's Lexeme object. The vocab command outputs a ready-to-use spaCy model with a Vocab containing the lexical data.

First line

{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}

Entry structure

{ "orth": string, "id": int, "lower": string, "norm": string, "shape": string "prefix": string, "suffix": string, "length": int, "cluster": string, "prob": float, "is_alpha": bool, "is_ascii": bool, "is_digit": bool, "is_lower": bool, "is_punct": bool, "is_space": bool, "is_title": bool, "is_upper": bool, "like_url": bool, "like_num": bool, "like_email": bool, "is_stop": bool, "is_oov": bool, "is_quote": bool, "is_left_punct": bool, "is_right_punct": bool }

An example of the 20 most frequent lexemes in the English training data is available as a code example in the spaCy repository on GitHub.