Annotation Specifications

This document describes the target annotations spaCy is trained to predict.

Tokenization

Tokenization standards are based on the OntoNotes 5 corpus. The tokenizer differs from most by including tokens for significant whitespace. Any sequence of whitespace characters beyond a single space (' ') is included as a token.

Whitespace tokens are useful for much the same reason punctuation tokens are: whitespace is often an important delimiter in the text. Preserving it in the token output keeps a simple alignment between the tokens and the original string, and ensures that no information is lost during processing.
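
The whitespace rule can be illustrated with a minimal sketch. This is not spaCy's tokenizer: it only splits on whitespace (punctuation stays attached to words), and the names tokenize and detokenize are hypothetical. It assumes that a single space after a word is stored as a flag on that word, while any further whitespace becomes a token of its own.

```python
import re

def tokenize(text):
    """Split text into [token, has_trailing_space] pairs.

    Whitespace beyond a single space after a word becomes its own token.
    """
    tokens = []
    for match in re.finditer(r"\s+|\S+", text):
        chunk = match.group()
        if chunk.isspace():
            # One ordinary space after a word is kept as a flag on that word.
            if tokens and chunk.startswith(" "):
                tokens[-1][1] = True
                chunk = chunk[1:]
            # Whatever whitespace is left over is significant: keep it as a token.
            if chunk:
                tokens.append([chunk, False])
        else:
            tokens.append([chunk, False])
    return tokens

def detokenize(tokens):
    """Rebuild the original string from the token list."""
    return "".join(word + (" " if space else "") for word, space in tokens)

text = "Hello  world.\n\nBye now"
tokens = tokenize(text)
```

Because nothing is discarded, detokenize(tokens) returns the input string unchanged, which is the alignment property described above.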

Sentence boundary detection

Sentence boundaries are calculated from the syntactic parse tree, so features such as punctuation and capitalisation play an important but non-decisive role in determining the sentence boundaries. Usually this means that the sentence boundaries will at least coincide with clause boundaries, even given poorly punctuated text.
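
One way to picture how boundaries fall out of a parse is the sketch below. It is an illustration under stated assumptions, not spaCy's implementation: heads are given as absolute token indices, a root token points to itself, and each root's subtree covers a contiguous run of tokens. The function name sentence_spans is hypothetical.

```python
def sentence_spans(heads):
    """Return (start, end) token spans, one per root subtree.

    heads[i] is the index of token i's head; a root is its own head.
    """
    def root_of(i):
        # Follow head links upwards until reaching a token that heads itself.
        while heads[i] != i:
            i = heads[i]
        return i

    spans = []
    start = 0
    for i in range(1, len(heads) + 1):
        # A new sentence starts wherever the governing root changes.
        if i == len(heads) or root_of(i) != root_of(start):
            spans.append((start, i))
            start = i
    return spans

# "I like cats they like me": no punctuation, but two roots (both "like").
heads = [1, 1, 1, 4, 4, 4]
spans = sentence_spans(heads)
```

Here the two clauses are separated even though the text contains no punctuation or capitalisation cues, because each clause hangs off its own root.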

Part-of-speech Tagging

English part-of-speech tag scheme

The English part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS tag set.

Tag    POS    Morphology                                    Description
-LRB-  PUNCT  PunctType=brck PunctSide=ini                  left round bracket
-RRB-  PUNCT  PunctType=brck PunctSide=fin                  right round bracket
,      PUNCT  PunctType=comm                                punctuation mark, comma
:      PUNCT                                                punctuation mark, colon or ellipsis
.      PUNCT  PunctType=peri                                punctuation mark, sentence closer
''     PUNCT  PunctType=quot PunctSide=fin                  closing quotation mark
""     PUNCT  PunctType=quot PunctSide=fin                  closing quotation mark
#      SYM    SymType=numbersign                            symbol, number sign
``     PUNCT  PunctType=quot PunctSide=ini                  opening quotation mark
$      SYM    SymType=currency                              symbol, currency
ADD    X                                                    email
AFX    ADJ    Hyph=yes                                      affix
BES    VERB                                                 auxiliary "be"
CC     CONJ   ConjType=coor                                 conjunction, coordinating
CD     NUM    NumType=card                                  cardinal number
DT     DET                                                  determiner
EX     ADV    AdvType=ex                                    existential there
FW     X      Foreign=yes                                   foreign word
GW     X                                                    additional word in multi-word expression
HVS    VERB                                                 forms of "have"
HYPH   PUNCT  PunctType=dash                                punctuation mark, hyphen
IN     ADP                                                  conjunction, subordinating or preposition
JJ     ADJ    Degree=pos                                    adjective
JJR    ADJ    Degree=comp                                   adjective, comparative
JJS    ADJ    Degree=sup                                    adjective, superlative
LS     PUNCT  NumType=ord                                   list item marker
MD     VERB   VerbType=mod                                  verb, modal auxiliary
NFP    PUNCT                                                superfluous punctuation
NIL                                                         missing tag
NN     NOUN   Number=sing                                   noun, singular or mass
NNP    PROPN  NounType=prop Number=sing                     noun, proper singular
NNPS   PROPN  NounType=prop Number=plur                     noun, proper plural
NNS    NOUN   Number=plur                                   noun, plural
PDT    ADJ    AdjType=pdt PronType=prn                      predeterminer
POS    PART   Poss=yes                                      possessive ending
PRP    PRON   PronType=prs                                  pronoun, personal
PRP$   ADJ    PronType=prs Poss=yes                         pronoun, possessive
RB     ADV    Degree=pos                                    adverb
RBR    ADV    Degree=comp                                   adverb, comparative
RBS    ADV    Degree=sup                                    adverb, superlative
RP     PART                                                 adverb, particle
SP     SPACE                                                space
SYM    SYM                                                  symbol
TO     PART   PartType=inf VerbForm=inf                     infinitival to
UH     INTJ                                                 interjection
VB     VERB   VerbForm=inf                                  verb, base form
VBD    VERB   VerbForm=fin Tense=past                       verb, past tense
VBG    VERB   VerbForm=part Tense=pres Aspect=prog          verb, gerund or present participle
VBN    VERB   VerbForm=part Tense=past Aspect=perf          verb, past participle
VBP    VERB   VerbForm=fin Tense=pres                       verb, non-3rd person singular present
VBZ    VERB   VerbForm=fin Tense=pres Number=sing Person=3  verb, 3rd person singular present
WDT    ADJ    PronType=int|rel                              wh-determiner
WP     NOUN   PronType=int|rel                              wh-pronoun, personal
WP$    ADJ    Poss=yes PronType=int|rel                     wh-pronoun, possessive
WRB    ADV    PronType=int|rel                              wh-adverb
XX     X                                                    unknown
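
The mapping from fine-grained tags to the universal tag set amounts to a table lookup. The sketch below copies a handful of rows from the table above; the names TAG_MAP, parse_morphology and coarse_tag are hypothetical, and falling back to X for tags missing from the excerpt is an assumption made for the example.

```python
# Excerpt of the tag table: fine-grained tag -> (universal POS, morphology string).
TAG_MAP = {
    "NN": ("NOUN", "Number=sing"),
    "NNS": ("NOUN", "Number=plur"),
    "JJR": ("ADJ", "Degree=comp"),
    "VBZ": ("VERB", "VerbForm=fin Tense=pres Number=sing Person=3"),
    "UH": ("INTJ", ""),
}

def parse_morphology(features):
    """Turn 'Key=value Key=value' into a dict; an empty string gives {}."""
    return dict(pair.split("=", 1) for pair in features.split())

def coarse_tag(tag):
    """Look up the universal POS and morphology for a fine-grained tag."""
    # Assumption for this sketch: tags outside the excerpt map to X.
    pos, features = TAG_MAP.get(tag, ("X", ""))
    return pos, parse_morphology(features)

pos, morph = coarse_tag("VBZ")
```

Several fine-grained tags share one coarse tag (NN and NNS both map to NOUN), which is why the universal set is the simpler of the two.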

German part-of-speech tag scheme

The German part-of-speech tagger uses the TIGER Treebank annotation scheme. We also map the tags to the simpler Google Universal POS tag set.

Tag      POS    Morphology                              Description
$(       PUNCT  PunctType=brck                          other sentence-internal punctuation mark
$,       PUNCT  PunctType=comm                          comma
$.       PUNCT  PunctType=peri                          sentence-final punctuation mark
ADJA     ADJ                                            adjective, attributive
ADJD     ADJ    Variant=short                           adjective, adverbial or predicative
ADV      ADV                                            adverb
APPO     ADP    AdpType=post                            postposition
APPR     ADP    AdpType=prep                            preposition; circumposition left
APPRART  ADP    AdpType=prep PronType=art               preposition with article
APZR     ADP    AdpType=circ                            circumposition right
ART      DET    PronType=art                            definite or indefinite article
CARD     NUM    NumType=card                            cardinal number
FM       X      Foreign=yes                             foreign language material
ITJ      INTJ                                           interjection
KOKOM    CONJ   ConjType=comp                           comparative conjunction
KON      CONJ                                           coordinate conjunction
KOUI     SCONJ                                          subordinate conjunction with "zu" and infinitive
KOUS     SCONJ                                          subordinate conjunction with sentence
NE       PROPN                                          proper noun
NNE      PROPN                                          proper noun
NN       NOUN                                           noun, singular or mass
PAV      ADV    PronType=dem                            pronominal adverb
PROAV    ADV    PronType=dem                            pronominal adverb
PDAT     DET    PronType=dem                            attributive demonstrative pronoun
PDS      PRON   PronType=dem                            substituting demonstrative pronoun
PIAT     DET    PronType=ind|neg|tot                    attributive indefinite pronoun without determiner
PIDAT    DET    AdjType=pdt PronType=ind|neg|tot        attributive indefinite pronoun with determiner
PIS      PRON   PronType=ind|neg|tot                    substituting indefinite pronoun
PPER     PRON   PronType=prs                            non-reflexive personal pronoun
PPOSAT   DET    Poss=yes PronType=prs                   attributive possessive pronoun
PPOSS    PRON   PronType=rel                            substituting possessive pronoun
PRELAT   DET    PronType=rel                            attributive relative pronoun
PRELS    PRON   PronType=rel                            substituting relative pronoun
PRF      PRON   PronType=prs Reflex=yes                 reflexive personal pronoun
PTKA     PART                                           particle with adjective or adverb
PTKANT   PART   PartType=res                            answer particle
PTKNEG   PART   Negative=yes                            negative particle
PTKVZ    PART   PartType=vbp                            separable verbal particle
PTKZU    PART   PartType=inf                            "zu" before infinitive
PWAT     DET    PronType=int                            attributive interrogative pronoun
PWAV     ADV    PronType=int                            adverbial interrogative or relative pronoun
PWS      PRON   PronType=int                            substituting interrogative pronoun
TRUNC    X      Hyph=yes                                word remnant
VAFIN    AUX    Mood=ind VerbForm=fin                   finite verb, auxiliary
VAIMP    AUX    Mood=imp VerbForm=fin                   imperative, auxiliary
VAINF    AUX    VerbForm=inf                            infinitive, auxiliary
VAPP     AUX    Aspect=perf VerbForm=fin                perfect participle, auxiliary
VMFIN    VERB   Mood=ind VerbForm=fin VerbType=mod      finite verb, modal
VMINF    VERB   VerbForm=fin VerbType=mod               infinitive, modal
VMPP     VERB   Aspect=perf VerbForm=part VerbType=mod  perfect participle, modal
VVFIN    VERB   Mood=ind VerbForm=fin                   finite verb, full
VVIMP    VERB   Mood=imp VerbForm=fin                   imperative, full
VVINF    VERB   VerbForm=inf                            infinitive, full
VVIZU    VERB   VerbForm=inf                            infinitive with "zu", full
VVPP     VERB   Aspect=perf VerbForm=part               perfect participle, full
XY       X                                              non-word containing non-letter
SP       SPACE                                          space

Lemmatization

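A "lemma" is the uninflected form of a word. In English, this means, for example, the singular form of a noun ("dog", not "dogs") and the base form of a verb ("write", not "writes", "writing", "wrote" or "written").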

The lemmatization data is taken from WordNet. However, we also add a special case for pronouns: all pronouns are lemmatized to the special token -PRON-.
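
The pronoun special case can be sketched as a check that runs before the ordinary lookup. Everything below is illustrative: TOY_LEMMAS is a three-entry stand-in for the WordNet-derived data, the function name lemmatize is hypothetical, and returning unknown words unchanged is an assumption made for the example.

```python
PRON_LEMMA = "-PRON-"

# Toy stand-in for the WordNet-derived data; the real tables are far larger.
TOY_LEMMAS = {
    ("dogs", "NOUN"): "dog",
    ("wrote", "VERB"): "write",
    ("happier", "ADJ"): "happy",
}

def lemmatize(word, pos):
    """Return the lemma for a word given its coarse part-of-speech tag."""
    # Special case: every pronoun gets the same placeholder lemma.
    if pos == "PRON":
        return PRON_LEMMA
    # Otherwise look the form up; unknown words are returned unchanged.
    return TOY_LEMMAS.get((word.lower(), pos), word)

lemmas = [lemmatize(w, p) for w, p in [("She", "PRON"), ("wrote", "VERB"), ("letters", "NOUN")]]
```

With this toy table, "She" becomes -PRON-, "wrote" becomes "write", and "letters" passes through unchanged because it is not in the excerpt.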

Syntactic Dependency Parsing

English dependency labels

The English dependency labels use the ClearNLP CLEAR Style.

Label       Description
acomp       adjectival complement
advcl       adverbial clause modifier
advmod      adverbial modifier
agent       agent
amod        adjectival modifier
appos       appositional modifier
attr        attribute
aux         auxiliary
auxpass     auxiliary (passive)
cc          coordinating conjunction
ccomp       clausal complement
complm      complementizer
conj        conjunct
cop         copula
csubj       clausal subject
csubjpass   clausal subject (passive)
dep         unclassified dependent
det         determiner
dobj        direct object
expl        expletive
hmod        modifier in hyphenation
hyph        hyphen
infmod      infinitival modifier
intj        interjection
iobj        indirect object
mark        marker
meta        meta modifier
neg         negation modifier
nmod        modifier of nominal
nn          noun compound modifier
npadvmod    noun phrase as adverbial modifier
nsubj       nominal subject
nsubjpass   nominal subject (passive)
num         number modifier
number      number compound modifier
oprd        object predicate
obj         object
obl         oblique nominal
parataxis   parataxis
partmod     participial modifier
pcomp       complement of preposition
pobj        object of preposition
poss        possession modifier
possessive  possessive modifier
preconj     pre-correlative conjunction
prep        prepositional modifier
prt         particle
punct       punctuation
quantmod    modifier of quantifier
rcmod       relative clause modifier
root        root
xcomp       open clausal complement

German dependency labels

The German dependency labels use the TIGER Treebank annotation scheme.

Label  Description
ac     adpositional case marker
adc    adjective component
ag     genitive attribute
ams    measure argument of adjective
app    apposition
avc    adverbial phrase component
cc     comparative complement
cd     coordinating conjunction
cj     conjunct
cm     comparative conjunction
cp     complementizer
cvc    collocational verb construction
da     dative
dh     discourse-level head
dm     discourse marker
ep     expletive es
hd     head
ju     junctor
mnr    postnominal modifier
mo     modifier
ng     negation
nk     noun kernel element
nmc    numerical component
oa     accusative object
oa2    second accusative object
oc     clausal object
og     genitive object
op     prepositional object
par    parenthetical element
pd     predicate
pg     phrasal genitive
ph     placeholder
pm     morphological particle
pnc    proper noun component
rc     relative clause
re     repeated element
rs     reported speech
sb     subject

Named Entity Recognition

Type         Description
PERSON       People, including fictional.
NORP         Nationalities or religious or political groups.
FACILITY     Buildings, airports, highways, bridges, etc.
ORG          Companies, agencies, institutions, etc.
GPE          Countries, cities, states.
LOC          Non-GPE locations, mountain ranges, bodies of water.
PRODUCT      Objects, vehicles, foods, etc. (Not services.)
EVENT        Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART  Titles of books, songs, etc.
LANGUAGE     Any named language.

The following values are also annotated in a style similar to names:

Type      Description
DATE      Absolute or relative dates or periods.
TIME      Times smaller than a day.
PERCENT   Percentage, including "%".
MONEY     Monetary values, including unit.
QUANTITY  Measurements, as of weight or distance.
ORDINAL   "first", "second", etc.
CARDINAL  Numerals that do not fall under another type.

JSON input format for training

spaCy takes training data in the following format:

Example structure

doc: {
    id: string,
    paragraphs: [{
        raw: string,
        sents: [int],
        tokens: [{
            start: int,
            tag: string,
            head: int,
            dep: string
        }],
        ner: [{
            start: int,
            end: int,
            label: string
        }],
        brackets: [{
            start: int,
            end: int,
            label: string
        }]
    }]
}
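
As an illustration, a small document with this shape could be built and type-checked as follows. The structure above only fixes field names and types, so several things here are assumptions made for the example: the sentence itself, reading "start" and "end" as character offsets into "raw", reading "head" as an offset relative to the token, and leaving "sents" and "brackets" empty. SCHEMA and conforms are hypothetical names, not part of spaCy.

```python
import json

# The structure above, written as nested Python types.
SCHEMA = {
    "id": str,
    "paragraphs": [{
        "raw": str,
        "sents": [int],
        "tokens": [{"start": int, "tag": str, "head": int, "dep": str}],
        "ner": [{"start": int, "end": int, "label": str}],
        "brackets": [{"start": int, "end": int, "label": str}],
    }],
}

def conforms(value, schema):
    """Check that value has exactly the keys and value types named in schema."""
    if isinstance(schema, dict):
        return (isinstance(value, dict)
                and value.keys() == schema.keys()
                and all(conforms(value[key], sub) for key, sub in schema.items()))
    if isinstance(schema, list):
        return isinstance(value, list) and all(conforms(item, schema[0]) for item in value)
    return type(value) is schema

doc = {
    "id": "example-0",
    "paragraphs": [{
        "raw": "I like London.",
        "sents": [],  # left empty: the meaning of these ints is not specified above
        "tokens": [
            {"start": 0, "tag": "PRP", "head": 1, "dep": "nsubj"},   # I
            {"start": 2, "tag": "VBP", "head": 0, "dep": "root"},    # like
            {"start": 7, "tag": "NNP", "head": -1, "dep": "dobj"},   # London
            {"start": 13, "tag": ".", "head": -2, "dep": "punct"},   # .
        ],
        "ner": [{"start": 7, "end": 13, "label": "GPE"}],
        "brackets": [],
    }],
}

serialized = json.dumps(doc, indent=2)
```

The tags, dependency labels and entity type are taken from the tables in this document; the checker only verifies shape and types, not whether the offsets or labels are meaningful.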