
Annotation Specifications

Schemes used for labels, tags and training data

Text processing

Tokenization standards are based on the OntoNotes 5 corpus. The tokenizer differs from most by including tokens for significant whitespace. Any sequence of whitespace characters beyond a single space (' ') is included as a token. The whitespace tokens are useful for much the same reason punctuation is – it’s often an important delimiter in the text. By preserving it in the token output, we are able to maintain a simple alignment between the tokens and the original string, and we ensure that no information is lost during processing.
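As a minimal sketch of this behaviour (a blank English pipeline is enough here, since tokenization needs no statistical model):

import spacy

nlp = spacy.blank("en")
doc = nlp("Hello    world!  How are you?")

# Runs of whitespace beyond a single space become tokens in their own right,
# so the original text can always be reconstructed from the token sequence.
for token in doc:
    print(repr(token.text), token.is_space)
assert "".join(token.text_with_ws for token in doc) == doc.text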

Lemmatization

A lemma is the uninflected form of a word. The English lemmatization data is taken from WordNet. Lookup tables are taken from Lexiconista. spaCy also adds a special case for pronouns: all pronouns are lemmatized to the special token -PRON-.
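For example (a minimal sketch, assuming the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She was reading the papers")

# Pronouns lemmatize to the special -PRON- token; other words come back
# in their uninflected form.
for token in doc:
    print(token.text, token.lemma_)
# e.g. She -> -PRON-, was -> be, reading -> read, papers -> paper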

Sentence boundary detection

Sentence boundaries are calculated from the syntactic parse tree, so features such as punctuation and capitalization play an important but non-decisive role in determining the sentence boundaries. Usually this means that the sentence boundaries will at least coincide with clause boundaries, even given poorly punctuated text.
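A minimal sketch (again assuming en_core_web_sm; the exact split depends on the parse the model produces):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence This is another sentence")  # no sentence-final punctuation

# Sentence boundaries come from the dependency parse, so doc.sents is
# only available once the parser has run.
for sent in doc.sents:
    print(sent.text)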

Part-of-speech tagging

This section lists the fine-grained and coarse-grained part-of-speech tags assigned by spaCy’s models. The individual mapping is specific to the training corpus and can be defined in the respective language data’s tag_map.py.

spaCy also maps all language-specific part-of-speech tags to a small, fixed set of word type tags following the Universal Dependencies scheme. The universal tags don’t code for any morphological features and only cover the word type. They’re available as the Token.pos and Token.pos_ attributes.
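Both levels can be read off each token, as in this minimal sketch (assuming the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup")

# pos_ is the coarse Universal Dependencies tag, tag_ the fine-grained
# treebank tag; spacy.explain() gives a human-readable description of either.
for token in doc:
    print(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))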

POS | Description | Examples
ADJ | adjective | big, old, green, incomprehensible, first
ADP | adposition | in, to, during
ADV | adverb | very, tomorrow, down, where, there
AUX | auxiliary | is, has (done), will (do), should (do)
CONJ | conjunction | and, or, but
CCONJ | coordinating conjunction | and, or, but
DET | determiner | a, an, the
INTJ | interjection | psst, ouch, bravo, hello
NOUN | noun | girl, cat, tree, air, beauty
NUM | numeral | 1, 2017, one, seventy-seven, IV, MMXIV
PART | particle | ’s, not
PRON | pronoun | I, you, he, she, myself, themselves, somebody
PROPN | proper noun | Mary, John, London, NATO, HBO
PUNCT | punctuation | ., (, ), ?
SCONJ | subordinating conjunction | if, while, that
SYM | symbol | $, %, §, ©, +, −, ×, ÷, =, :), 😝
VERB | verb | run, runs, running, eat, ate, eating
X | other | sfpksdpsxmsa
SPACE | space |

The English part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS tag set.

Tag | POS | Morphology | Description
-LRB- | PUNCT | PunctType=brck PunctSide=ini | left round bracket
-RRB- | PUNCT | PunctType=brck PunctSide=fin | right round bracket
, | PUNCT | PunctType=comm | punctuation mark, comma
: | PUNCT | | punctuation mark, colon or ellipsis
. | PUNCT | PunctType=peri | punctuation mark, sentence closer
'' | PUNCT | PunctType=quot PunctSide=fin | closing quotation mark
"" | PUNCT | PunctType=quot PunctSide=fin | closing quotation mark
`` | PUNCT | PunctType=quot PunctSide=ini | opening quotation mark
# | SYM | SymType=numbersign | symbol, number sign
$ | SYM | SymType=currency | symbol, currency
ADD | X | | email
AFX | ADJ | Hyph=yes | affix
BES | VERB | | auxiliary “be”
CC | CONJ | ConjType=coor | conjunction, coordinating
CD | NUM | NumType=card | cardinal number
DT | DET | | determiner
EX | ADV | AdvType=ex | existential there
FW | X | Foreign=yes | foreign word
GW | X | | additional word in multi-word expression
HVS | VERB | | forms of “have”
HYPH | PUNCT | PunctType=dash | punctuation mark, hyphen
IN | ADP | | conjunction, subordinating or preposition
JJ | ADJ | Degree=pos | adjective
JJR | ADJ | Degree=comp | adjective, comparative
JJS | ADJ | Degree=sup | adjective, superlative
LS | PUNCT | NumType=ord | list item marker
MD | VERB | VerbType=mod | verb, modal auxiliary
NFP | PUNCT | | superfluous punctuation
NIL | | | missing tag
NN | NOUN | Number=sing | noun, singular or mass
NNP | PROPN | NounType=prop Number=sing | noun, proper singular
NNPS | PROPN | NounType=prop Number=plur | noun, proper plural
NNS | NOUN | Number=plur | noun, plural
PDT | ADJ | AdjType=pdt PronType=prn | predeterminer
POS | PART | Poss=yes | possessive ending
PRP | PRON | PronType=prs | pronoun, personal
PRP$ | ADJ | PronType=prs Poss=yes | pronoun, possessive
RB | ADV | Degree=pos | adverb
RBR | ADV | Degree=comp | adverb, comparative
RBS | ADV | Degree=sup | adverb, superlative
RP | PART | | adverb, particle
_SP | SPACE | | space
SYM | SYM | | symbol
TO | PART | PartType=inf VerbForm=inf | infinitival “to”
UH | INTJ | | interjection
VB | VERB | VerbForm=inf | verb, base form
VBD | VERB | VerbForm=fin Tense=past | verb, past tense
VBG | VERB | VerbForm=part Tense=pres Aspect=prog | verb, gerund or present participle
VBN | VERB | VerbForm=part Tense=past Aspect=perf | verb, past participle
VBP | VERB | VerbForm=fin Tense=pres | verb, non-3rd person singular present
VBZ | VERB | VerbForm=fin Tense=pres Number=sing Person=3 | verb, 3rd person singular present
WDT | ADJ | PronType=int|rel | wh-determiner
WP | NOUN | PronType=int|rel | wh-pronoun, personal
WP$ | ADJ | Poss=yes PronType=int|rel | wh-pronoun, possessive
WRB | ADV | PronType=int|rel | wh-adverb
XX | X | | unknown

The German part-of-speech tagger uses the TIGER Treebank annotation scheme. We also map the tags to the simpler Google Universal POS tag set.

Tag | POS | Morphology | Description
$( | PUNCT | PunctType=brck | other sentence-internal punctuation mark
$, | PUNCT | PunctType=comm | comma
$. | PUNCT | PunctType=peri | sentence-final punctuation mark
ADJA | ADJ | | adjective, attributive
ADJD | ADJ | Variant=short | adjective, adverbial or predicative
ADV | ADV | | adverb
APPO | ADP | AdpType=post | postposition
APPR | ADP | AdpType=prep | preposition; circumposition left
APPRART | ADP | AdpType=prep PronType=art | preposition with article
APZR | ADP | AdpType=circ | circumposition right
ART | DET | PronType=art | definite or indefinite article
CARD | NUM | NumType=card | cardinal number
FM | X | Foreign=yes | foreign language material
ITJ | INTJ | | interjection
KOKOM | CONJ | ConjType=comp | comparative conjunction
KON | CONJ | | coordinate conjunction
KOUI | SCONJ | | subordinate conjunction with “zu” and infinitive
KOUS | SCONJ | | subordinate conjunction with sentence
NE | PROPN | | proper noun
NNE | PROPN | | proper noun
NN | NOUN | | noun, singular or mass
PAV | ADV | PronType=dem | pronominal adverb
PROAV | ADV | PronType=dem | pronominal adverb
PDAT | DET | PronType=dem | attributive demonstrative pronoun
PDS | PRON | PronType=dem | substituting demonstrative pronoun
PIAT | DET | PronType=ind|neg|tot | attributive indefinite pronoun without determiner
PIDAT | DET | AdjType=pdt PronType=ind|neg|tot | attributive indefinite pronoun with determiner
PIS | PRON | PronType=ind|neg|tot | substituting indefinite pronoun
PPER | PRON | PronType=prs | non-reflexive personal pronoun
PPOSAT | DET | Poss=yes PronType=prs | attributive possessive pronoun
PPOSS | PRON | PronType=rel | substituting possessive pronoun
PRELAT | DET | PronType=rel | attributive relative pronoun
PRELS | PRON | PronType=rel | substituting relative pronoun
PRF | PRON | PronType=prs Reflex=yes | reflexive personal pronoun
PTKA | PART | | particle with adjective or adverb
PTKANT | PART | PartType=res | answer particle
PTKNEG | PART | Negative=yes | negative particle
PTKVZ | PART | PartType=vbp | separable verbal particle
PTKZU | PART | PartType=inf | “zu” before infinitive
PWAT | DET | PronType=int | attributive interrogative pronoun
PWAV | ADV | PronType=int | adverbial interrogative or relative pronoun
PWS | PRON | PronType=int | substituting interrogative pronoun
TRUNC | X | Hyph=yes | word remnant
VAFIN | AUX | Mood=ind VerbForm=fin | finite verb, auxiliary
VAIMP | AUX | Mood=imp VerbForm=fin | imperative, auxiliary
VAINF | AUX | VerbForm=inf | infinitive, auxiliary
VAPP | AUX | Aspect=perf VerbForm=fin | perfect participle, auxiliary
VMFIN | VERB | Mood=ind VerbForm=fin VerbType=mod | finite verb, modal
VMINF | VERB | VerbForm=fin VerbType=mod | infinitive, modal
VMPP | VERB | Aspect=perf VerbForm=part VerbType=mod | perfect participle, modal
VVFIN | VERB | Mood=ind VerbForm=fin | finite verb, full
VVIMP | VERB | Mood=imp VerbForm=fin | imperative, full
VVINF | VERB | VerbForm=inf | infinitive, full
VVIZU | VERB | VerbForm=inf | infinitive with “zu”, full
VVPP | VERB | Aspect=perf VerbForm=part | perfect participle, full
XY | X | | non-word containing non-letter
SP | SPACE | | space

Syntactic Dependency Parsing

This section lists the syntactic dependency labels assigned by spaCy’s models. The individual labels are language-specific and depend on the training corpus.

The Universal Dependencies scheme is used in all languages trained on Universal Dependency Corpora.
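A minimal sketch of reading the dependency labels off a parsed Doc (assuming the en_core_web_sm model is installed; spacy.explain returns a short description for most labels):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# dep_ holds the dependency label, head the syntactic parent of the token.
for token in doc:
    print(token.text, token.dep_, token.head.text, spacy.explain(token.dep_))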

Label | Description
acl | clausal modifier of noun (adjectival clause)
advcl | adverbial clause modifier
advmod | adverbial modifier
amod | adjectival modifier
appos | appositional modifier
aux | auxiliary
case | case marking
cc | coordinating conjunction
ccomp | clausal complement
clf | classifier
compound | compound
conj | conjunct
cop | copula
csubj | clausal subject
dep | unspecified dependency
det | determiner
discourse | discourse element
dislocated | dislocated elements
expl | expletive
fixed | fixed multiword expression
flat | flat multiword expression
goeswith | goes with
iobj | indirect object
list | list
mark | marker
nmod | nominal modifier
nsubj | nominal subject
nummod | numeric modifier
obj | object
obl | oblique nominal
orphan | orphan
parataxis | parataxis
punct | punctuation
reparandum | overridden disfluency
root | root
vocative | vocative
xcomp | open clausal complement

The English dependency labels use the CLEAR Style by ClearNLP.

Label | Description
acl | clausal modifier of noun (adjectival clause)
acomp | adjectival complement
advcl | adverbial clause modifier
advmod | adverbial modifier
agent | agent
amod | adjectival modifier
appos | appositional modifier
attr | attribute
aux | auxiliary
auxpass | auxiliary (passive)
case | case marking
cc | coordinating conjunction
ccomp | clausal complement
compound | compound
conj | conjunct
cop | copula
csubj | clausal subject
csubjpass | clausal subject (passive)
dative | dative
dep | unclassified dependent
det | determiner
dobj | direct object
expl | expletive
intj | interjection
mark | marker
meta | meta modifier
neg | negation modifier
nn | noun compound modifier
nounmod | modifier of nominal
npmod | noun phrase as adverbial modifier
nsubj | nominal subject
nsubjpass | nominal subject (passive)
nummod | numeric modifier
oprd | object predicate
obj | object
obl | oblique nominal
parataxis | parataxis
pcomp | complement of preposition
pobj | object of preposition
poss | possession modifier
preconj | pre-correlative conjunction
prep | prepositional modifier
prt | particle
punct | punctuation
quantmod | modifier of quantifier
relcl | relative clause modifier
root | root
xcomp | open clausal complement

The German dependency labels use the TIGER Treebank annotation scheme.

Label | Description
ac | adpositional case marker
adc | adjective component
ag | genitive attribute
ams | measure argument of adjective
app | apposition
avc | adverbial phrase component
cc | comparative complement
cd | coordinating conjunction
cj | conjunct
cm | comparative conjunction
cp | complementizer
cvc | collocational verb construction
da | dative
dh | discourse-level head
dm | discourse marker
ep | expletive es
hd | head
ju | junctor
mnr | postnominal modifier
mo | modifier
ng | negation
nk | noun kernel element
nmc | numerical component
oa | accusative object
oa | second accusative object
oc | clausal object
og | genitive object
op | prepositional object
par | parenthetical element
pd | predicate
pg | phrasal genitive
ph | placeholder
pm | morphological particle
pnc | proper noun component
rc | relative clause
re | repeated element
rs | reported speech
sb | subject
sp | subject or predicate
svp | separable verb prefix
uc | unit component
vo | vocative
ROOT | root

Named Entity Recognition

Models trained on the OntoNotes 5 corpus support the following entity types:

Type | Description
PERSON | People, including fictional.
NORP | Nationalities or religious or political groups.
FAC | Buildings, airports, highways, bridges, etc.
ORG | Companies, agencies, institutions, etc.
GPE | Countries, cities, states.
LOC | Non-GPE locations, mountain ranges, bodies of water.
PRODUCT | Objects, vehicles, foods, etc. (Not services.)
EVENT | Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART | Titles of books, songs, etc.
LAW | Named documents made into laws.
LANGUAGE | Any named language.
DATE | Absolute or relative dates or periods.
TIME | Times smaller than a day.
PERCENT | Percentage, including “%”.
MONEY | Monetary values, including unit.
QUANTITY | Measurements, as of weight or distance.
ORDINAL | “first”, “second”, etc.
CARDINAL | Numerals that do not fall under another type.
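As a minimal sketch of reading these annotations off a processed Doc (assuming the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Named entities are available as Span objects on doc.ents.
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
# e.g. "Apple" ORG, "U.K." GPE, "$1 billion" MONEY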

Wikipedia scheme

Models trained on the Wikipedia corpus (Nothman et al., 2013) use a less fine-grained NER annotation scheme and recognise the following entities:

Type | Description
PER | Named person or family.
LOC | Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains).
ORG | Named corporate, governmental, or other organizational entity.
MISC | Miscellaneous entities, e.g. events, nationalities, products or works of art.

IOB Scheme

Tag | ID | Description
"I" | 1 | Token is inside an entity.
"O" | 2 | Token is outside an entity.
"B" | 3 | Token begins an entity.
"" | 0 | No entity tag is set (missing value).

BILUO Scheme

Tag | Description
B (begin) | The first token of a multi-token entity.
I (in) | An inner token of a multi-token entity.
L (last) | The final token of a multi-token entity.
U (unit) | A single-token entity.
O (out) | A non-entity token.

spaCy translates the character offsets into this scheme, in order to decide the cost of each action given the current state of the entity recogniser. The costs are then used to calculate the gradient of the loss, to train the model. The exact algorithm is a pastiche of well-known methods, and is not currently described in any single publication. The model is a greedy transition-based parser guided by a linear model whose weights are learned using the averaged perceptron loss, via the dynamic oracle imitation learning strategy. The transition system is equivalent to the BILUO tagging scheme.
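As an illustration of the offset-to-tag conversion, here is a minimal sketch using the helper available in spacy.gold in spaCy v2 (the sentence and offsets are made up for the example):

import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")  # a tokenizer is enough for the conversion
doc = nlp("I like London.")
entities = [(7, 13, "GPE")]  # character offsets of "London"

# Character offsets are translated into one BILUO tag per token.
tags = biluo_tags_from_offsets(doc, entities)
print(tags)  # expected: ['O', 'O', 'U-GPE', 'O']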

Models and training data

JSON input format for training

spaCy takes training data in JSON format. The built-in convert command helps you convert the .conllu format used by the Universal Dependencies corpora to spaCy’s training format.
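For example, a command roughly like the following (the file and directory names here are placeholders) reads a Universal Dependencies training file and writes spaCy-style JSON into the output directory:

python -m spacy convert ./en-ud-train.conllu ./converted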

Example structure

[{ "id": int, # ID of the document within the corpus "paragraphs": [{ # list of paragraphs in the corpus "raw": string, # raw text of the paragraph "sentences": [{ # list of sentences in the paragraph "tokens": [{ # list of tokens in the sentence "id": int, # index of the token in the document "dep": string, # dependency label "head": int, # offset of token head relative to token index "tag": string, # part-of-speech tag "orth": string, # verbatim text of the token "ner": string # BILUO label, e.g. "O" or "B-ORG" }], "brackets": [{ # phrase structure (NOT USED by current models) "first": int, # index of first token "last": int, # index of last token "label": string # phrase label }] }] }] }]

Here’s an example of dependencies, part-of-speech tags and named entities, taken from the English Wall Street Journal portion of the Penn Treebank:

The full example can be found in explosion/spaCy/master/examples/training/training-data.json in the spaCy repository.

Lexical data for vocabulary v2.0

To populate a model’s vocabulary, you can use the spacy init-model command and load in a newline-delimited JSON (JSONL) file containing one lexical entry per line via the --jsonl-loc option. The first line defines the language and vocabulary settings. All other lines are expected to be JSON objects describing an individual lexeme. The lexical attributes will then be set as attributes on spaCy’s Lexeme object. The init-model command outputs a ready-to-use spaCy model with a Vocab containing the lexical data.
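For example, a command roughly like the following (file and directory names are placeholders) builds such a model from a JSONL file:

python -m spacy init-model en ./my_model --jsonl-loc ./vocab.jsonl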

First line

{"lang": "en", "settings": {"oov_prob": -20.502029418945312}}

Entry structure

{ "orth": string, "id": int, "lower": string, "norm": string, "shape": string "prefix": string, "suffix": string, "length": int, "cluster": string, "prob": float, "is_alpha": bool, "is_ascii": bool, "is_digit": bool, "is_lower": bool, "is_punct": bool, "is_space": bool, "is_title": bool, "is_upper": bool, "like_url": bool, "like_num": bool, "like_email": bool, "is_stop": bool, "is_oov": bool, "is_quote": bool, "is_left_punct": bool, "is_right_punct": bool }

Here’s an example of the 20 most frequent lexemes in the English training data:

The full example can be found in explosion/spaCy/master/examples/training/vocab-data.jsonl in the spaCy repository.