Part-of-speech tagging

Part-of-speech tags are labels such as noun, verb and adjective that are assigned to each token in the document. They're useful in rule-based processes, and they can also serve as features in some statistical models.

To use spaCy's tagger, you need to have a data pack installed that includes a tagging model. Tagging models are included in the data downloads for English and German. After you load the model, the tagger is applied automatically, as part of the default pipeline. You can then access the tags using the Token.tag and Token.pos attributes. For English, the tagger also triggers some simple rule-based morphological processing, which gives you the lemma as well.

Usage

    import spacy

    nlp = spacy.load('en')
    doc = nlp(u'They told us to duck.')
    for word in doc:
        # Attributes ending in an underscore are strings; the others are integer IDs.
        print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
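
Once tokens carry tags, a rule-based process can filter on them. The sketch below does not call spaCy: the (text, tag, POS) triples are hand-assigned for illustration and are not verified tagger output, and the helper names `infinitives` and `words_with_pos` are hypothetical.

```python
# Hand-written (text, fine-grained tag, coarse POS) triples standing in for
# tagger output. With a loaded model you would read token.text, token.tag_
# and token.pos_ instead. The tags here are assumptions, not real output.
TAGGED = [
    ("They", "PRP", "PRON"),
    ("told", "VBD", "VERB"),
    ("us", "PRP", "PRON"),
    ("to", "TO", "PART"),
    ("duck", "VB", "VERB"),
    (".", ".", "PUNCT"),
]


def infinitives(tagged):
    """Return words tagged VB that directly follow a token tagged TO."""
    found = []
    for prev, cur in zip(tagged, tagged[1:]):
        if prev[1] == "TO" and cur[1] == "VB":
            found.append(cur[0])
    return found


def words_with_pos(tagged, pos):
    """Return the text of every token whose coarse POS equals `pos`."""
    return [text for text, _tag, token_pos in tagged if token_pos == pos]


print(infinitives(TAGGED))             # ['duck']
print(words_with_pos(TAGGED, "VERB"))  # ['told', 'duck']
```

The first rule uses the fine-grained tag because it distinguishes the infinitive (VB) from other verb forms; the second uses the coarse POS, which is enough when any verb will do.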

Rule-based morphology

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:

Context | Surface | Lemma | POS | Morphological Features
I was reading the paper | reading | read | verb | VerbForm=Ger
I don't watch the news, I read the paper. | read | read | verb | VerbForm=Fin, Mood=Ind, Tense=Pres
I read the paper yesterday | read | read | verb | VerbForm=Fin, Mood=Ind, Tense=Past
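
To make the feature notation concrete, here is a minimal sketch that turns a feature string from the examples into a dictionary and compares two readings of the same surface form. The function name `parse_features` is hypothetical and is not part of spaCy.

```python
def parse_features(spec):
    """Split 'Name=Value, Name=Value' into a dict; an empty string gives {}."""
    features = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        features[name] = value
    return features


# The two readings of the surface form "read" from the examples.
present = parse_features("VerbForm=Fin, Mood=Ind, Tense=Pres")
past = parse_features("VerbForm=Fin, Mood=Ind, Tense=Past")

# Same surface form and lemma; only one feature tells the readings apart.
differing = sorted(name for name in present if present[name] != past.get(name))
print(differing)  # ['Tense']
```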

English has a relatively simple morphological system, which spaCy handles using rules that can be keyed by the token, the part-of-speech tag, or the combination of the two. The system works as follows:

  1. The tokenizer consults a mapping table TOKENIZER_EXCEPTIONS, which allows sequences of characters to be mapped to multiple tokens. Each token may be assigned a part of speech and one or more morphological features.
  2. The part-of-speech tagger then assigns each token an extended POS tag. In the API, these tags are known as Token.tag. They express the part-of-speech (e.g. VERB) and some amount of morphological information, e.g. that the verb is past tense.
  3. For words whose POS is not set by a prior process, a mapping table TAG_MAP maps the tags to a part-of-speech and a set of morphological features.
  4. Finally, a rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned extended part-of-speech and morphological information, without consulting the context of the token. The lemmatizer also accepts list-based exception files, acquired from WordNet.
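
The four steps above can be sketched in plain Python. This is a toy model under stated assumptions, not spaCy's implementation: the names TOKENIZER_EXCEPTIONS and TAG_MAP come from the text, but their contents and layout here are invented for illustration, the statistical tagger is replaced by a fixed lookup (`STUB_TAGGER`), and the lemmatizer rules and exceptions are a handful of hand-picked entries.

```python
# Step 1 table: a character sequence mapped to several tokens, some of which
# arrive with a POS or lemma already set. Contents are illustrative only.
TOKENIZER_EXCEPTIONS = {
    "didn't": [{"text": "did"}, {"text": "n't", "lemma": "not", "pos": "ADV"}],
}
# Step 2 stand-in: a fixed lookup instead of a statistical tagger (assumption).
STUB_TAGGER = {"She": "PRP", "did": "VBD", "n't": "RB", "tell": "VB", "stories": "NNS"}
# Step 3 table: fine-grained tag -> coarse POS plus morphological features.
TAG_MAP = {
    "PRP": {"pos": "PRON", "PronType": "prs"},
    "VBD": {"pos": "VERB", "VerbForm": "fin", "Tense": "past"},
    "VB": {"pos": "VERB", "VerbForm": "inf"},
    "RB": {"pos": "ADV", "Degree": "pos"},
    "NNS": {"pos": "NOUN", "Number": "plur"},
}
# Step 4 tables: exceptions are checked first, then suffix rules keyed by POS.
LEMMA_EXCEPTIONS = {("VERB", "did"): "do"}
LEMMA_RULES = {
    "VERB": [("ing", ""), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}


def tokenize(text):
    tokens = []
    for chunk in text.split():
        pieces = TOKENIZER_EXCEPTIONS.get(chunk, [{"text": chunk}])
        tokens.extend(dict(piece) for piece in pieces)  # copy, keep table intact
    return tokens


def tag(tokens):
    for token in tokens:
        token["tag"] = STUB_TAGGER[token["text"]]
    return tokens


def apply_tag_map(tokens):
    for token in tokens:
        entry = TAG_MAP[token["tag"]]
        token.setdefault("pos", entry["pos"])  # a POS set earlier wins
        token["morph"] = {k: v for k, v in entry.items() if k != "pos"}
    return tokens


def lemmatize(tokens):
    for token in tokens:
        if "lemma" in token:  # already set by a tokenizer exception
            continue
        word, pos = token["text"].lower(), token["pos"]
        lemma = LEMMA_EXCEPTIONS.get((pos, word))
        if lemma is None:
            lemma = word
            for suffix, replacement in LEMMA_RULES.get(pos, []):
                if word.endswith(suffix):
                    lemma = word[: len(word) - len(suffix)] + replacement
                    break
        token["lemma"] = lemma
    return tokens


tokens = lemmatize(apply_tag_map(tag(tokenize("She didn't tell stories"))))
print([(t["text"], t["pos"], t["lemma"]) for t in tokens])
```

Note that the lemmatizer looks only at the token's own text and POS, never at its neighbours, which mirrors the "without consulting the context" point in step 4.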

Part-of-speech tag schemes

The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS tag set.

English part-of-speech tag scheme

Tag POS Morphology
-LRB- PUNCT PunctType=brck PunctSide=ini
-RRB- PUNCT PunctType=brck PunctSide=fin
, PUNCT PunctType=comm
: PUNCT
. PUNCT PunctType=peri
'' PUNCT PunctType=quot PunctSide=fin
"" PUNCT PunctType=quot PunctSide=fin
# SYM SymType=numbersign
`` PUNCT PunctType=quot PunctSide=ini
$ SYM SymType=currency
ADD X
AFX ADJ Hyph=yes
BES VERB
CC CONJ ConjType=coor
CD NUM NumType=card
DT DET
EX ADV AdvType=ex
FW X Foreign=yes
GW X
HVS VERB
HYPH PUNCT PunctType=dash
IN ADP
JJ ADJ Degree=pos
JJR ADJ Degree=comp
JJS ADJ Degree=sup
LS PUNCT NumType=ord
MD VERB VerbType=mod
NFP PUNCT
NIL
NN NOUN Number=sing
NNP PROPN NounType=prop Number=sing
NNPS PROPN NounType=prop Number=plur
NNS NOUN Number=plur
PDT ADJ AdjType=pdt PronType=prn
POS PART Poss=yes
PRP PRON PronType=prs
PRP$ ADJ PronType=prs Poss=yes
RB ADV Degree=pos
RBR ADV Degree=comp
RBS ADV Degree=sup
RP PART
SP SPACE
SYM SYM
TO PART PartType=inf VerbForm=inf
UH INTJ
VB VERB VerbForm=inf
VBD VERB VerbForm=fin Tense=past
VBG VERB VerbForm=part Tense=pres Aspect=prog
VBN VERB VerbForm=part Tense=past Aspect=perf
VBP VERB VerbForm=fin Tense=pres
VBZ VERB VerbForm=fin Tense=pres Number=sing Person=3
WDT ADJ PronType=int|rel
WP NOUN PronType=int|rel
WP$ ADJ Poss=yes PronType=int|rel
WRB ADV PronType=int|rel
XX X
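
Rows in this table can be read as "tag, coarse POS, then zero or more Feature=Value pairs". The sketch below parses a few rows copied from the table into a lookup and uses it to collapse fine-grained tags to the simpler tag set. The helper names `parse_row` and `coarse_pos` are hypothetical, and the dictionary layout is an assumption rather than spaCy's internal format.

```python
# A few rows copied from the English table above.
ROWS = """\
NN NOUN Number=sing
VBD VERB VerbForm=fin Tense=past
VBZ VERB VerbForm=fin Tense=pres Number=sing Person=3
WDT ADJ PronType=int|rel
NIL
"""


def parse_row(line):
    """Split one table row into (tag, {'pos': ..., 'morph': {...}})."""
    fields = line.split()
    tag = fields[0]
    pos = fields[1] if len(fields) > 1 else None  # NIL has no POS listed
    morph = dict(field.split("=", 1) for field in fields[2:])
    return tag, {"pos": pos, "morph": morph}


tag_map = dict(parse_row(line) for line in ROWS.splitlines() if line.strip())


def coarse_pos(tag):
    """Map a fine-grained tag to its coarse POS, or None if it has none."""
    return tag_map[tag]["pos"]


print([coarse_pos(t) for t in ["NN", "VBD", "VBZ"]])  # ['NOUN', 'VERB', 'VERB']
print(tag_map["VBZ"]["morph"]["Person"])              # 3
```

Values such as `int|rel` are kept as a single string here; splitting them on `|` would be a reasonable extension if you need to test for one value at a time.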

German part-of-speech tag scheme

Tag POS Morphology
$( PUNCT PunctType=brck
$, PUNCT PunctType=comm
$. PUNCT PunctType=peri
ADJA ADJ
ADJD ADJ Variant=short
ADV ADV
APPO ADP AdpType=post
APPR ADP AdpType=prep
APPRART ADP AdpType=prep PronType=art
APZR ADP AdpType=circ
ART DET PronType=art
CARD NUM NumType=card
FM X Foreign=yes
ITJ INTJ
KOKOM CONJ ConjType=comp
KON CONJ
KOUI SCONJ
KOUS SCONJ
NE PROPN
NNE PROPN
NN NOUN
PAV ADV PronType=dem
PROAV ADV PronType=dem
PDAT DET PronType=dem
PDS PRON PronType=dem
PIAT DET PronType=ind|neg|tot
PIDAT DET AdjType=pdt PronType=ind|neg|tot
PIS PRON PronType=ind|neg|tot
PPER PRON PronType=prs
PPOSAT DET Poss=yes PronType=prs
PPOSS PRON PronType=rel
PRELAT DET PronType=rel
PRELS PRON PronType=rel
PRF PRON PronType=prs Reflex=yes
PTKA PART
PTKANT PART PartType=res
PTKNEG PART Negative=yes
PTKVZ PART PartType=vbp
PTKZU PART PartType=inf
PWAT DET PronType=int
PWAV ADV PronType=int
PWS PRON PronType=int
TRUNC X Hyph=yes
VAFIN AUX Mood=ind VerbForm=fin
VAIMP AUX Mood=imp VerbForm=fin
VAINF AUX VerbForm=inf
VAPP AUX Aspect=perf VerbForm=fin
VMFIN VERB Mood=ind VerbForm=fin VerbType=mod
VMINF VERB VerbForm=fin VerbType=mod
VMPP VERB Aspect=perf VerbForm=part VerbType=mod
VVFIN VERB Mood=ind VerbForm=fin
VVIMP VERB Mood=imp VerbForm=fin
VVINF VERB VerbForm=inf
VVIZU VERB VerbForm=inf
VVPP VERB Aspect=perf VerbForm=part
XY X
SP SPACE