Annotation Specifications

This document describes the target annotations spaCy is trained to predict.


Tokenization

Tokenization standards are based on the OntoNotes 5 corpus. Unlike most tokenizers, spaCy's also emits tokens for significant whitespace: any sequence of whitespace characters beyond a single space (' ') is included as a token.

Whitespace tokens are useful for much the same reason punctuation tokens are: whitespace is often an important delimiter in the text. Preserving it in the token output maintains a simple alignment between the tokens and the original string, and ensures that no information is lost during processing.
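
The alignment property can be illustrated with a minimal sketch in plain Python. This is not spaCy's tokenizer: `tokenize_with_whitespace` and `detokenize` are hypothetical helpers, the sketch splits on whitespace only (no punctuation handling), and it assumes each token records whether a single space follows it.

```python
import re


def tokenize_with_whitespace(text):
    """Return (token_text, followed_by_space) pairs.

    A single space after a word is stored as a flag on that word.
    Any whitespace beyond that single space becomes its own token.
    """
    tokens = []
    # Alternate between runs of non-whitespace and runs of whitespace.
    for match in re.finditer(r"\S+|\s+", text):
        chunk = match.group()
        if not chunk.isspace():
            tokens.append([chunk, False])
        elif chunk[0] == " " and tokens:
            # The first space attaches to the preceding word.
            tokens[-1][1] = True
            if len(chunk) > 1:
                tokens.append([chunk[1:], False])
        else:
            # Leading whitespace, or a run starting with a tab or newline.
            tokens.append([chunk, False])
    return [tuple(token) for token in tokens]


def detokenize(tokens):
    """Rebuild the original string from the token pairs."""
    return "".join(text + (" " if space else "") for text, space in tokens)
```

Because every whitespace character ends up either in a flag or in a token, `detokenize(tokenize_with_whitespace(text))` returns the input string unchanged.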

Sentence boundary detection

Sentence boundaries are calculated from the syntactic parse tree, so features such as punctuation and capitalisation play an important but non-decisive role in determining them. Usually this means that sentence boundaries will at least coincide with clause boundaries, even in poorly punctuated text.
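
As a rough illustration of deriving boundaries from a parse, the sketch below groups tokens by the root of their dependency tree. It makes several assumptions that the text does not state: heads are given as absolute token indices, a root token is its own head, and each tree covers a contiguous run of tokens. The function names are hypothetical.

```python
def find_root(heads, i):
    """Follow head pointers from token i until reaching a root."""
    while heads[i] != i:
        i = heads[i]
    return i


def sentence_spans(heads):
    """Return (start, end) token spans, one per dependency tree."""
    spans = []
    start = 0
    for i in range(1, len(heads) + 1):
        # Close the current span at the end of the input, or when the
        # next token belongs to a different tree.
        if i == len(heads) or find_root(heads, i) != find_root(heads, start):
            spans.append((start, i))
            start = i
    return spans


# "it rained we left": two trees rooted at "rained" (1) and "left" (3),
# with no punctuation or capitalisation to mark the boundary.
print(sentence_spans([1, 1, 3, 3]))
```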

Part-of-speech Tagging

English part-of-speech tag scheme

The English part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS tag set.

Tag    POS    Morphology                                    Description
-LRB-  PUNCT  PunctType=brck PunctSide=ini                  left round bracket
-RRB-  PUNCT  PunctType=brck PunctSide=fin                  right round bracket
,      PUNCT  PunctType=comm                                punctuation mark, comma
:      PUNCT                                                punctuation mark, colon or ellipsis
.      PUNCT  PunctType=peri                                punctuation mark, sentence closer
''     PUNCT  PunctType=quot PunctSide=fin                  closing quotation mark
""     PUNCT  PunctType=quot PunctSide=fin                  closing quotation mark
#      SYM    SymType=numbersign                            symbol, number sign
``     PUNCT  PunctType=quot PunctSide=ini                  opening quotation mark
$      SYM    SymType=currency                              symbol, currency
AFX    ADJ    Hyph=yes                                      affix
BES    VERB                                                 auxiliary "be"
CC     CONJ   ConjType=coor                                 conjunction, coordinating
CD     NUM    NumType=card                                  cardinal number
DT     DET                                                  determiner
EX     ADV    AdvType=ex                                    existential there
FW     X      Foreign=yes                                   foreign word
GW     X                                                    additional word in multi-word expression
HVS    VERB                                                 forms of "have"
HYPH   PUNCT  PunctType=dash                                punctuation mark, hyphen
IN     ADP                                                  conjunction, subordinating or preposition
JJ     ADJ    Degree=pos                                    adjective
JJR    ADJ    Degree=comp                                   adjective, comparative
JJS    ADJ    Degree=sup                                    adjective, superlative
LS     PUNCT  NumType=ord                                   list item marker
MD     VERB   VerbType=mod                                  verb, modal auxiliary
NFP    PUNCT                                                superfluous punctuation
NIL                                                         missing tag
NN     NOUN   Number=sing                                   noun, singular or mass
NNP    PROPN  NounType=prop Number=sing                     noun, proper singular
NNPS   PROPN  NounType=prop Number=plur                     noun, proper plural
NNS    NOUN   Number=plur                                   noun, plural
PDT    ADJ    AdjType=pdt PronType=prn                      predeterminer
POS    PART   Poss=yes                                      possessive ending
PRP    PRON   PronType=prs                                  pronoun, personal
PRP$   ADJ    PronType=prs Poss=yes                         pronoun, possessive
RB     ADV    Degree=pos                                    adverb
RBR    ADV    Degree=comp                                   adverb, comparative
RBS    ADV    Degree=sup                                    adverb, superlative
RP     PART                                                 adverb, particle
TO     PART   PartType=inf VerbForm=inf                     infinitival to
VB     VERB   VerbForm=inf                                  verb, base form
VBD    VERB   VerbForm=fin Tense=past                       verb, past tense
VBG    VERB   VerbForm=part Tense=pres Aspect=prog          verb, gerund or present participle
VBN    VERB   VerbForm=part Tense=past Aspect=perf          verb, past participle
VBP    VERB   VerbForm=fin Tense=pres                       verb, non-3rd person singular present
VBZ    VERB   VerbForm=fin Tense=pres Number=sing Person=3  verb, 3rd person singular present
WDT    ADJ    PronType=int|rel                              wh-determiner
WP     NOUN   PronType=int|rel                              wh-pronoun, personal
WP$    ADJ    Poss=yes PronType=int|rel                     wh-pronoun, possessive
WRB    ADV    PronType=int|rel                              wh-adverb
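
The mapping from fine-grained tags to universal POS tags and morphological features can be pictured as a lookup table. The sketch below copies a handful of rows from the table above into a dictionary; `TAG_MAP`, `parse_morphology` and `to_universal` are illustrative names, not spaCy's API.

```python
# A few rows from the English tag table: tag -> (universal POS, features).
TAG_MAP = {
    "NN": ("NOUN", "Number=sing"),
    "NNS": ("NOUN", "Number=plur"),
    "VBD": ("VERB", "VerbForm=fin Tense=past"),
    "VBZ": ("VERB", "VerbForm=fin Tense=pres Number=sing Person=3"),
    "DT": ("DET", ""),
}


def parse_morphology(features):
    """Turn 'Key=value Key=value' into a dictionary."""
    return dict(pair.split("=", 1) for pair in features.split())


def to_universal(tag):
    """Look up the universal POS tag and parsed features for a tag."""
    pos, features = TAG_MAP[tag]
    return pos, parse_morphology(features)


print(to_universal("VBZ"))
```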

German part-of-speech tag scheme

The German part-of-speech tagger uses the TIGER Treebank annotation scheme. We also map the tags to the simpler Google Universal POS tag set.

Tag      POS    Morphology                              Description
$(       PUNCT  PunctType=brck                          other sentence-internal punctuation mark
$,       PUNCT  PunctType=comm                          comma
$.       PUNCT  PunctType=peri                          sentence-final punctuation mark
ADJA     ADJ                                            adjective, attributive
ADJD     ADJ    Variant=short                           adjective, adverbial or predicative
APPO     ADP    AdpType=post                            postposition
APPR     ADP    AdpType=prep                            preposition; circumposition left
APPRART  ADP    AdpType=prep PronType=art               preposition with article
APZR     ADP    AdpType=circ                            circumposition right
ART      DET    PronType=art                            definite or indefinite article
CARD     NUM    NumType=card                            cardinal number
FM       X      Foreign=yes                             foreign language material
KOKOM    CONJ   ConjType=comp                           comparative conjunction
KON      CONJ                                           coordinate conjunction
KOUI     SCONJ                                          subordinate conjunction with "zu" and infinitive
KOUS     SCONJ                                          subordinate conjunction with sentence
NE       PROPN                                          proper noun
NNE      PROPN                                          proper noun
NN       NOUN                                           noun, singular or mass
PAV      ADV    PronType=dem                            pronominal adverb
PROAV    ADV    PronType=dem                            pronominal adverb
PDAT     DET    PronType=dem                            attributive demonstrative pronoun
PDS      PRON   PronType=dem                            substituting demonstrative pronoun
PIAT     DET    PronType=ind|neg|tot                    attributive indefinite pronoun without determiner
PIDAT    DET    AdjType=pdt PronType=ind|neg|tot        attributive indefinite pronoun with determiner
PIS      PRON   PronType=ind|neg|tot                    substituting indefinite pronoun
PPER     PRON   PronType=prs                            non-reflexive personal pronoun
PPOSAT   DET    Poss=yes PronType=prs                   attributive possessive pronoun
PPOSS    PRON   PronType=rel                            substituting possessive pronoun
PRELAT   DET    PronType=rel                            attributive relative pronoun
PRELS    PRON   PronType=rel                            substituting relative pronoun
PRF      PRON   PronType=prs Reflex=yes                 reflexive personal pronoun
PTKA     PART                                           particle with adjective or adverb
PTKANT   PART   PartType=res                            answer particle
PTKNEG   PART   Negative=yes                            negative particle
PTKVZ    PART   PartType=vbp                            separable verbal particle
PTKZU    PART   PartType=inf                            "zu" before infinitive
PWAT     DET    PronType=int                            attributive interrogative pronoun
PWAV     ADV    PronType=int                            adverbial interrogative or relative pronoun
PWS      PRON   PronType=int                            substituting interrogative pronoun
TRUNC    X      Hyph=yes                                word remnant
VAFIN    AUX    Mood=ind VerbForm=fin                   finite verb, auxiliary
VAIMP    AUX    Mood=imp VerbForm=fin                   imperative, auxiliary
VAINF    AUX    VerbForm=inf                            infinitive, auxiliary
VAPP     AUX    Aspect=perf VerbForm=part               perfect participle, auxiliary
VMFIN    VERB   Mood=ind VerbForm=fin VerbType=mod      finite verb, modal
VMINF    VERB   VerbForm=inf VerbType=mod               infinitive, modal
VMPP     VERB   Aspect=perf VerbForm=part VerbType=mod  perfect participle, modal
VVFIN    VERB   Mood=ind VerbForm=fin                   finite verb, full
VVIMP    VERB   Mood=imp VerbForm=fin                   imperative, full
VVINF    VERB   VerbForm=inf                            infinitive, full
VVIZU    VERB   VerbForm=inf                            infinitive with "zu", full
VVPP     VERB   Aspect=perf VerbForm=part               perfect participle, full
XY       X                                              non-word containing non-letter


Lemmatization

A "lemma" is the uninflected form of a word. In English, this means:

The lemmatization data is taken from WordNet. However, we also add a special case for pronouns: all pronouns are lemmatized to the special token -PRON-.
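
The pronoun special case can be sketched as a check that runs before the ordinary lookup. Everything here is illustrative: `TOY_LEMMAS` is a tiny hypothetical stand-in for the WordNet data, `lemmatize` is not spaCy's function, and falling back to the lowercased word for unknown entries is an assumption.

```python
# Hypothetical stand-in for the WordNet-derived lemmatization data.
TOY_LEMMAS = {
    ("dogs", "NOUN"): "dog",
    ("wrote", "VERB"): "write",
    ("happier", "ADJ"): "happy",
}


def lemmatize(word, pos, table=TOY_LEMMAS):
    """Return the lemma for a word, given its universal POS tag."""
    # Special case: every pronoun maps to the same placeholder lemma.
    if pos == "PRON":
        return "-PRON-"
    key = (word.lower(), pos)
    # Assumption: words missing from the table are returned lowercased.
    return table.get(key, word.lower())


print(lemmatize("She", "PRON"), lemmatize("wrote", "VERB"))
```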

Syntactic Dependency Parsing

English dependency labels

The English dependency labels use the ClearNLP CLEAR Style.

Label       Description
acomp       adjectival complement
advcl       adverbial clause modifier
advmod      adverbial modifier
amod        adjectival modifier
appos       appositional modifier
auxpass     auxiliary (passive)
cc          coordinating conjunction
ccomp       clausal complement
csubj       clausal subject
csubjpass   clausal subject (passive)
dep         unclassified dependent
dobj        direct object
hmod        modifier in hyphenation
infmod      infinitival modifier
iobj        indirect object
meta        meta modifier
neg         negation modifier
nmod        modifier of nominal
nn          noun compound modifier
npadvmod    noun phrase as adverbial modifier
nsubj       nominal subject
nsubjpass   nominal subject (passive)
num         number modifier
number      number compound modifier
oprd        object predicate
obl         oblique nominal
partmod     participial modifier
pcomp       complement of preposition
pobj        object of preposition
poss        possession modifier
possessive  possessive modifier
preconj     pre-correlative conjunction
prep        prepositional modifier
quantmod    modifier of quantifier
rcmod       relative clause modifier
xcomp       open clausal complement

German dependency labels

The German dependency labels use the TIGER Treebank annotation scheme.

Label  Description
ac     adpositional case marker
adc    adjective component
ag     genitive attribute
ams    measure argument of adjective
avc    adverbial phrase component
cc     comparative complement
cd     coordinating conjunction
cm     comparative conjunction
cvc    collocational verb construction
dh     discourse-level head
dm     discourse marker
ep     expletive es
mnr    postnominal modifier
nk     noun kernel element
nmc    numerical component
oa     accusative object
oa2    second accusative object
oc     clausal object
og     genitive object
op     prepositional object
par    parenthetical element
pg     phrasal genitive
pm     morphological particle
pnc    proper noun component
rc     relative clause
re     repeated element
rs     reported speech

Named Entity Recognition

Type         Description
PERSON       People, including fictional.
NORP         Nationalities or religious or political groups.
FACILITY     Buildings, airports, highways, bridges, etc.
ORG          Companies, agencies, institutions, etc.
GPE          Countries, cities, states.
LOC          Non-GPE locations, mountain ranges, bodies of water.
PRODUCT      Objects, vehicles, foods, etc. (Not services.)
EVENT        Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART  Titles of books, songs, etc.
LANGUAGE     Any named language.

The following values are also annotated in a style similar to names:

Type         Description
DATE         Absolute or relative dates or periods.
TIME         Times smaller than a day.
PERCENT      Percentage, including "%".
MONEY        Monetary values, including unit.
QUANTITY     Measurements, as of weight or distance.
ORDINAL      "first", "second", etc.
CARDINAL     Numerals that do not fall under another type.
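
One way to work with these types is to store each entity as a labelled span over the tokens. The sketch below is illustrative only: `ENTITY_TYPES` collects the labels listed above, while `entity_texts` and the end-exclusive token offsets are assumptions made for the example.

```python
# The entity types listed in the two tables above.
ENTITY_TYPES = {
    "PERSON", "NORP", "FACILITY", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LANGUAGE", "DATE", "TIME", "PERCENT", "MONEY",
    "QUANTITY", "ORDINAL", "CARDINAL",
}


def entity_texts(tokens, spans):
    """Return (text, label) pairs for (start, end, label) token spans.

    Assumption: `end` is exclusive, as in Python slicing.
    """
    results = []
    for start, end, label in spans:
        if label not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {label}")
        results.append((" ".join(tokens[start:end]), label))
    return results


tokens = ["Apple", "paid", "$", "1", "billion", "in", "2015"]
spans = [(0, 1, "ORG"), (2, 5, "MONEY"), (6, 7, "DATE")]
print(entity_texts(tokens, spans))
```

Note that the MONEY span includes the currency symbol, matching the "including unit" description.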

JSON input format for training

spaCy takes training data in the following format:

Example structure

doc: {
    id: string,
    paragraphs: [{
        raw: string,
        sents: [int],
        tokens: [{
            start: int,
            tag: string,
            head: int,
            dep: string
        }],
        ner: [{
            start: int,
            end: int,
            label: string
        }],
        brackets: [{
            start: int,
            end: int,
            label: string
        }]
    }]
}
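
A document in this shape can be built and checked in plain Python before serializing it. The sketch below is a hedged illustration, not spaCy's loader: `is_valid_doc` and `check_fields` are hypothetical helpers that only check field names and types. The example values rest on assumptions the structure does not spell out: `start` is treated as a character offset, `head` as an absolute token index, `sents` as the token indices where sentences begin, and `ROOT` as the label of a token that is its own head.

```python
import json

# Field names and types taken from the structure above.
TOKEN_FIELDS = {"start": int, "tag": str, "head": int, "dep": str}
SPAN_FIELDS = {"start": int, "end": int, "label": str}


def check_fields(record, spec):
    """True if record has exactly the fields in spec, with matching types."""
    return set(record) == set(spec) and all(
        isinstance(record[key], kind) for key, kind in spec.items()
    )


def is_valid_doc(doc):
    """Shallow structural check of one training document."""
    if not isinstance(doc.get("id"), str):
        return False
    for para in doc.get("paragraphs", []):
        if not isinstance(para.get("raw"), str):
            return False
        if not all(isinstance(i, int) for i in para.get("sents", [])):
            return False
        if not all(check_fields(t, TOKEN_FIELDS) for t in para.get("tokens", [])):
            return False
        for key in ("ner", "brackets"):
            if not all(check_fields(s, SPAN_FIELDS) for s in para.get(key, [])):
                return False
    return True


# Example values are illustrative; see the assumptions in the lead-in.
example = {
    "id": "example-0",
    "paragraphs": [{
        "raw": "Dogs chase cats",
        "sents": [0],
        "tokens": [
            {"start": 0, "tag": "NNS", "head": 1, "dep": "nsubj"},
            {"start": 5, "tag": "VBP", "head": 1, "dep": "ROOT"},
            {"start": 11, "tag": "NNS", "head": 1, "dep": "dobj"},
        ],
        "ner": [],
        "brackets": [],
    }],
}

if is_valid_doc(example):
    print(json.dumps(example, indent=4))
```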