Annotation Specifications

This document describes the target annotations spaCy is trained to predict.

Tokenization

Tokenization standards are based on the OntoNotes 5 corpus. The tokenizer differs from most by including tokens for significant whitespace. Any sequence of whitespace characters beyond a single space (' ') is included as a token.

Whitespace tokens are useful for much the same reason punctuation tokens are: whitespace is often an important delimiter in the text. By preserving it in the token output, we maintain a simple alignment between the tokens and the original string, and we ensure that no information is lost during processing.
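The alignment property can be illustrated with a small sketch. This is a toy, not spaCy's tokenizer: the helper names toy_tokenize and detokenize are hypothetical, and the rule that each word may own one trailing space while any further whitespace becomes its own token is an assumption based on the description above.

```python
import re


def toy_tokenize(text):
    """Return (token_text, trailing_space) pairs for `text`.

    Hypothetical helper: a word may carry one trailing space; any
    whitespace beyond that single space becomes a token of its own.
    """
    tokens = []
    for match in re.finditer(r"\S+|\s+", text):
        chunk = match.group()
        if not chunk.isspace():
            tokens.append((chunk, ""))
            continue
        # A leading single space is attached to the preceding word.
        if chunk[0] == " " and tokens:
            word, _ = tokens[-1]
            tokens[-1] = (word, " ")
            chunk = chunk[1:]
        # Whatever whitespace remains is kept as a whitespace token.
        if chunk:
            tokens.append((chunk, ""))
    return tokens


def detokenize(tokens):
    """Rebuild the original string from the token pairs."""
    return "".join(text + space for text, space in tokens)


text = "Hello  world\n\nBye now"
tokens = toy_tokenize(text)
print([t for t, _ in tokens])
# ['Hello', ' ', 'world', '\n\n', 'Bye', 'now']
print(detokenize(tokens) == text)  # True
```

Because every character of the input ends up either in a token or in a trailing-space flag, joining the pairs reproduces the input exactly.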

Sentence boundary detection

Sentence boundaries are calculated from the syntactic parse tree, so features such as punctuation and capitalisation play an important but non-decisive role in determining the sentence boundaries. Usually this means that the sentence boundaries will at least coincide with clause boundaries, even given poorly punctuated text.
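As a rough sketch of how boundaries can fall out of a parse, the snippet below groups tokens by the root of their dependency tree. The representation is an assumption made for illustration: heads is a list of absolute token indices, a root points at itself, and each tree covers a contiguous run of tokens. The function names are hypothetical and this is not spaCy's implementation.

```python
def find_root(heads, i):
    """Follow head links from token i until a self-headed root is reached."""
    while heads[i] != i:
        i = heads[i]
    return i


def sentence_spans(heads):
    """Return (start, end) token spans, one per dependency tree."""
    spans = []
    start = 0
    for i in range(1, len(heads) + 1):
        # Close the current span at the end of the input, or when
        # token i belongs to a different tree than the span's first token.
        if i == len(heads) or find_root(heads, i) != find_root(heads, start):
            spans.append((start, i))
            start = i
    return spans


# "I like tea it is hot": no punctuation, two trees rooted at "like" and "is".
heads = [1, 1, 1, 4, 4, 4]
print(sentence_spans(heads))  # [(0, 3), (3, 6)]
```

The unpunctuated input is still split at the clause boundary, because the split depends only on which root each token attaches to.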

Part-of-speech Tagging

The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS tag set.

English part-of-speech tag scheme

Tag POS Morphology
-LRB- PUNCT PunctType=brck PunctSide=ini
-RRB- PUNCT PunctType=brck PunctSide=fin
, PUNCT PunctType=comm
: PUNCT
. PUNCT PunctType=peri
'' PUNCT PunctType=quot PunctSide=fin
"" PUNCT PunctType=quot PunctSide=fin
# SYM SymType=numbersign
`` PUNCT PunctType=quot PunctSide=ini
$ SYM SymType=currency
ADD X
AFX ADJ Hyph=yes
BES VERB
CC CONJ ConjType=coor
CD NUM NumType=card
DT DET
EX ADV AdvType=ex
FW X Foreign=yes
GW X
HVS VERB
HYPH PUNCT PunctType=dash
IN ADP
JJ ADJ Degree=pos
JJR ADJ Degree=comp
JJS ADJ Degree=sup
LS PUNCT NumType=ord
MD VERB VerbType=mod
NFP PUNCT
NIL
NN NOUN Number=sing
NNP PROPN NounType=prop Number=sing
NNPS PROPN NounType=prop Number=plur
NNS NOUN Number=plur
PDT ADJ AdjType=pdt PronType=prn
POS PART Poss=yes
PRP PRON PronType=prs
PRP$ ADJ PronType=prs Poss=yes
RB ADV Degree=pos
RBR ADV Degree=comp
RBS ADV Degree=sup
RP PART
SP SPACE
SYM SYM
TO PART PartType=inf VerbForm=inf
UH INTJ
VB VERB VerbForm=inf
VBD VERB VerbForm=fin Tense=past
VBG VERB VerbForm=part Tense=pres Aspect=prog
VBN VERB VerbForm=part Tense=past Aspect=perf
VBP VERB VerbForm=fin Tense=pres
VBZ VERB VerbForm=fin Tense=pres Number=sing Person=3
WDT ADJ PronType=int|rel
WP NOUN PronType=int|rel
WP$ ADJ Poss=yes PronType=int|rel
WRB ADV PronType=int|rel
XX X
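Each row of the scheme pairs a fine-grained tag with a universal POS tag and a set of Feature=value morphology entries. A minimal sketch of reading such a row into a structure follows. The helper parse_row is hypothetical, and it assumes the whitespace-separated row layout used in the table above.

```python
def parse_row(row):
    """Split a 'TAG POS Feat=val ...' row into (tag, pos, features).

    Hypothetical helper. Rows without a POS column (such as NIL)
    yield an empty POS string and an empty feature dict.
    """
    parts = row.split()
    tag = parts[0]
    pos = parts[1] if len(parts) > 1 else ""
    features = dict(item.split("=", 1) for item in parts[2:])
    return tag, pos, features


tag, pos, features = parse_row(
    "VBZ VERB VerbForm=fin Tense=pres Number=sing Person=3"
)
print(tag, pos)
# VBZ VERB
print(features)
# {'VerbForm': 'fin', 'Tense': 'pres', 'Number': 'sing', 'Person': '3'}
```

Multi-valued features such as PronType=int|rel are kept as a single string here; splitting them on "|" would be a natural extension.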

German part-of-speech tag scheme

Tag POS Morphology
$( PUNCT PunctType=brck
$, PUNCT PunctType=comm
$. PUNCT PunctType=peri
ADJA ADJ
ADJD ADJ Variant=short
ADV ADV
APPO ADP AdpType=post
APPR ADP AdpType=prep
APPRART ADP AdpType=prep PronType=art
APZR ADP AdpType=circ
ART DET PronType=art
CARD NUM NumType=card
FM X Foreign=yes
ITJ INTJ
KOKOM CONJ ConjType=comp
KON CONJ
KOUI SCONJ
KOUS SCONJ
NE PROPN
NNE PROPN
NN NOUN
PAV ADV PronType=dem
PROAV ADV PronType=dem
PDAT DET PronType=dem
PDS PRON PronType=dem
PIAT DET PronType=ind|neg|tot
PIDAT DET AdjType=pdt PronType=ind|neg|tot
PIS PRON PronType=ind|neg|tot
PPER PRON PronType=prs
PPOSAT DET Poss=yes PronType=prs
PPOSS PRON Poss=yes PronType=prs
PRELAT DET PronType=rel
PRELS PRON PronType=rel
PRF PRON PronType=prs Reflex=yes
PTKA PART
PTKANT PART PartType=res
PTKNEG PART Negative=yes
PTKVZ PART PartType=vbp
PTKZU PART PartType=inf
PWAT DET PronType=int
PWAV ADV PronType=int
PWS PRON PronType=int
TRUNC X Hyph=yes
VAFIN AUX Mood=ind VerbForm=fin
VAIMP AUX Mood=imp VerbForm=fin
VAINF AUX VerbForm=inf
VAPP AUX Aspect=perf VerbForm=part
VMFIN VERB Mood=ind VerbForm=fin VerbType=mod
VMINF VERB VerbForm=inf VerbType=mod
VMPP VERB Aspect=perf VerbForm=part VerbType=mod
VVFIN VERB Mood=ind VerbForm=fin
VVIMP VERB Mood=imp VerbForm=fin
VVINF VERB VerbForm=inf
VVIZU VERB VerbForm=inf
VVPP VERB Aspect=perf VerbForm=part
XY X
SP SPACE

Lemmatization

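A "lemma" is the uninflected form of a word. In English, this means, for example, the singular "dog" rather than "dogs", or the base form "write" rather than "writes", "wrote" or "written".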

The lemmatization data is taken from WordNet. However, we also add a special case for pronouns: all pronouns are lemmatized to the special token -PRON-.
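The pronoun special case can be sketched as a check that runs before the ordinary lookup. Everything below is illustrative: TOY_LEMMAS is a tiny hand-written stand-in for the WordNet data, lemmatize is a hypothetical helper, and treating the coarse POS value PRON as the trigger is an assumption.

```python
# Toy stand-in for the WordNet-derived lemmatization data (assumption).
TOY_LEMMAS = {
    "dogs": "dog",
    "children": "child",
    "wrote": "write",
    "written": "write",
}


def lemmatize(word, pos):
    """Return a lemma for `word`, given its coarse part-of-speech tag."""
    # Special case: every pronoun shares one placeholder lemma.
    if pos == "PRON":
        return "-PRON-"
    # Otherwise look the word up, falling back to the word itself.
    return TOY_LEMMAS.get(word.lower(), word)


print(lemmatize("wrote", "VERB"))  # write
print(lemmatize("She", "PRON"))    # -PRON-
```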

Syntactic Dependency Parsing

Language  Converter  Scheme
English   ClearNLP   CLEAR Style
German    TIGER      TIGER

Named Entity Recognition

Type         Description
PERSON       People, including fictional.
NORP         Nationalities or religious or political groups.
FACILITY     Buildings, airports, highways, bridges, etc.
ORG          Companies, agencies, institutions, etc.
GPE          Countries, cities, states.
LOC          Non-GPE locations, mountain ranges, bodies of water.
PRODUCT      Objects, vehicles, foods, etc. (Not services.)
EVENT        Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART  Titles of books, songs, etc.
LANGUAGE     Any named language.

The following values are also annotated in a style similar to names:

Type      Description
DATE      Absolute or relative dates or periods.
TIME      Times smaller than a day.
PERCENT   Percentage, including "%".
MONEY     Monetary values, including unit.
QUANTITY  Measurements, as of weight or distance.
ORDINAL   "first", "second", etc.
CARDINAL  Numerals that do not fall under another type.

JSON input format for training

spaCy takes training data in the following format:

Example structure

    doc: {
        id: string,
        paragraphs: [{
            raw: string,
            sents: [int],
            tokens: [{
                start: int,
                tag: string,
                head: int,
                dep: string
            }],
            ner: [{
                start: int,
                end: int,
                label: string
            }],
            brackets: [{
                start: int,
                end: int,
                label: string
            }]
        }]
    }
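To make the structure concrete, here is a sketch that builds one document as a Python dict and checks its field names and types against the structure above. The function validate_doc and the sample values are hypothetical. In particular, the reading of start and end as character offsets and the head values are assumptions, since the structure only states the types.

```python
import json

# Field names and types taken from the structure above.
TOKEN_FIELDS = {"start": int, "tag": str, "head": int, "dep": str}
SPAN_FIELDS = {"start": int, "end": int, "label": str}


def check_fields(obj, fields):
    """True if obj is a dict holding every field with the expected type."""
    return isinstance(obj, dict) and all(
        isinstance(obj.get(name), kind) for name, kind in fields.items()
    )


def validate_doc(doc):
    """Check a doc dict against the field names and types listed above."""
    if not isinstance(doc.get("id"), str):
        return False
    for para in doc.get("paragraphs", []):
        if not isinstance(para.get("raw"), str):
            return False
        if not all(isinstance(i, int) for i in para.get("sents", [])):
            return False
        if not all(check_fields(t, TOKEN_FIELDS) for t in para.get("tokens", [])):
            return False
        for key in ("ner", "brackets"):
            if not all(check_fields(s, SPAN_FIELDS) for s in para.get(key, [])):
                return False
    return True


# Illustrative values only; offsets and heads are placeholders (assumption).
EXAMPLE_DOC = {
    "id": "example-0",
    "paragraphs": [{
        "raw": "Berlin is big.",
        "sents": [0],
        "tokens": [
            {"start": 0, "tag": "NNP", "head": 1, "dep": "nsubj"},
            {"start": 7, "tag": "VBZ", "head": 0, "dep": "ROOT"},
            {"start": 10, "tag": "JJ", "head": -1, "dep": "acomp"},
            {"start": 13, "tag": ".", "head": -2, "dep": "punct"},
        ],
        "ner": [{"start": 0, "end": 6, "label": "GPE"}],
        "brackets": [],
    }],
}

print(validate_doc(EXAMPLE_DOC))  # True
serialized = json.dumps(EXAMPLE_DOC, indent=2)
```

A type check like this only catches structural mistakes; it says nothing about whether the offsets or labels are meaningful for a given text.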