
Linguistic Features
Using spaCy to extract linguistic features like part-of-speech tags, dependency labels and named entities, customising the tokenizer and working with the rule-based matcher.

Processing raw text intelligently is difficult: most words are rare, and it's common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information. That's exactly what spaCy is designed to do: you put in raw text, and get back a Doc object, that comes with a variety of annotations.

Part-of-speech tagging
Needs model To use this functionality, spaCy needs a model to be installed that supports the following capabilities: tagger, dependency parse.

After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalise across the language – for example, a word following "the" in English is most likely a noun.

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
Text | Lemma | POS | Tag | Dep | Shape | alpha | stop
Apple | apple | PROPN | NNP | nsubj | Xxxxx | True | False
is | be | VERB | VBZ | aux | xx | True | True
looking | look | VERB | VBG | ROOT | xxxx | True | False
at | at | ADP | IN | prep | xx | True | True
buying | buy | VERB | VBG | pcomp | xxxx | True | False
U.K. | u.k. | PROPN | NNP | compound | X.X. | False | False
startup | startup | NOUN | NN | dobj | xxxx | True | False
for | for | ADP | IN | prep | xxx | True | True
$ | $ | SYM | $ | quantmod | $ | False | False
1 | 1 | NUM | CD | compound | d | False | False
billion | billion | NUM | CD | pobj | xxxx | True | False

Using spaCy's built-in displaCy visualizer, here's what our example sentence and its dependencies look like:

Rule-based morphology

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:

Context | Surface | Lemma | POS | Morphological Features
I was reading the paper | reading | read | verb | VerbForm=Ger
I don't watch the news, I read the paper. | read | read | verb | VerbForm=Fin, Mood=Ind, Tense=Pres
I read the paper yesterday | read | read | verb | VerbForm=Fin, Mood=Ind, Tense=Past

English has a relatively simple morphological system, which spaCy handles using rules that can be keyed by the token, the part-of-speech tag, or the combination of the two. The system works as follows:

  1. The tokenizer consults a mapping table TOKENIZER_EXCEPTIONS, which allows sequences of characters to be mapped to multiple tokens. Each token may be assigned a part of speech and one or more morphological features.
  2. The part-of-speech tagger then assigns each token an extended POS tag. In the API, these tags are known as Token.tag. They express the part-of-speech (e.g. VERB) and some amount of morphological information, e.g. that the verb is past tense.
  3. For words whose POS is not set by a prior process, a mapping table TAG_MAP maps the tags to a part-of-speech and a set of morphological features.
  4. Finally, a rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned extended part-of-speech and morphological information, without consulting the context of the token. The lemmatizer also accepts list-based exception files, acquired from WordNet.
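
To see how these pieces surface in the API, here's a minimal sketch (using the small English model) that inspects a token's extended tag, coarse-grained part-of-speech and lemma:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"I was reading the paper")
token = doc[2]  # "reading"
# the extended tag VBG encodes the part-of-speech VERB plus the
# morphological feature VerbForm=Ger; the rule-based lemmatizer maps
# the surface form back to the lemma "read"
print(token.text, token.tag_, token.pos_, token.lemma_)
# should print something like: reading VBG VERB read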

Dependency parsing
Needs model To use this functionality, spaCy needs a model to be installed that supports the following capabilities: dependency parse.

spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or "chunks". You can check whether a Doc object has been parsed with the doc.is_parsed attribute, which returns a boolean value. If this attribute is False, the default sentence iterator will raise an exception.
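
For example, a minimal check that a document produced by the default pipeline has in fact been parsed:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"This is a sentence.")
print(doc.is_parsed)  # True – the parser ran as part of the pipeline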

Noun chunks

Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, "the lavish green grass" or "the world’s largest tech fund". To get the noun chunks in a document, simply iterate over Doc.noun_chunks.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)
Text | root.text | root.dep_ | root.head.text
Autonomous cars | cars | nsubj | shift
insurance liability | liability | dobj | shift
manufacturers | manufacturers | pobj | toward

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of .dep is a hash value. You can get the string value with .dep_.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
Text | Dep | Head text | Head POS | Children
Autonomous | amod | cars | NOUN |
cars | nsubj | shift | VERB | Autonomous
shift | ROOT | shift | VERB | cars, liability, toward
insurance | compound | liability | NOUN |
liability | dobj | shift | VERB | insurance
toward | prep | liability | NOUN | manufacturers
manufacturers | pobj | toward | ADP |

Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest — from below:

import spacy
from spacy.symbols import nsubj, VERB

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")

# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
print(verbs)

If you try to match from above, you'll have to iterate twice: once for the head, and then again through the children:

# Finding a verb with a subject from above — less good
verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break

To iterate through the children, use the token.children attribute, which provides a sequence of Token objects.

A few more convenience attributes are provided for iterating around the local tree from the token. The Token.lefts and Token.rights attributes provide sequences of syntactic children that occur before and after the token. Both sequences are in sentence order. There are also two integer-typed attributes, Token.n_lefts and Token.n_rights, that give the number of left and right children.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"bright red apples on the tree")
print([token.text for token in doc[2].lefts])  # ['bright', 'red']
print([token.text for token in doc[2].rights])  # ['on']
print(doc[2].n_lefts)  # 2
print(doc[2].n_rights)  # 1

The same attributes work for other languages as well – for example, using the German model:

import spacy

nlp = spacy.load('de_core_news_sm')
doc = nlp(u"schöne rote Äpfel auf dem Baum")
print([token.text for token in doc[2].lefts])  # ['schöne', 'rote']
print([token.text for token in doc[2].rights])  # ['auf']

You can get a whole phrase by its syntactic head using the Token.subtree attribute. This returns an ordered sequence of tokens. You can walk up the tree with the Token.ancestors attribute, and check dominance with Token.is_ancestor().

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Credit and mortgage account holders must submit their requests")

root = [token for token in doc if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts,
          descendant.n_rights,
          [ancestor.text for ancestor in descendant.ancestors])
Text | Dep | n_lefts | n_rights | Ancestors
Credit | nmod | 0 | 2 | holders, submit
and | cc | 0 | 0 | Credit, holders, submit
mortgage | compound | 0 | 0 | account, Credit, holders, submit
account | conj | 1 | 0 | Credit, holders, submit
holders | nsubj | 1 | 0 | submit

Finally, the .left_edge and .right_edge attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a Span object for a syntactic phrase. Note that .right_edge gives a token within the subtree — so if you use it as the end-point of a range, don't forget to +1!

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
span.merge()
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
Text | POS | Dep | Head text
Credit and mortgage account holders | NOUN | nsubj | submit
must | VERB | aux | submit
submit | VERB | ROOT | submit
their | ADJ | poss | requests
requests | NOUN | dobj | submit

Visualizing dependencies

The best way to understand spaCy's dependency parser is interactively. To make this easier, spaCy v2.0+ comes with a visualization module. You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup. If you want to know how to write rules that hook into some type of syntactic construction, just plug the sentence into the visualizer and see how spaCy annotates it.

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
displacy.render(doc, style='dep', jupyter=True)

Disabling the parser

In the default models, the parser is loaded and enabled as part of the standard processing pipeline. If you don't need any of the syntactic information, you should disable the parser. Disabling the parser will make spaCy load and run much faster. If you want to load the parser, but need to disable it for specific documents, you can also control its use on the nlp object.

nlp = spacy.load('en', disable=['parser'])               # don't load the parser at all
nlp = English().from_disk('/model', disable=['parser'])  # same, when loading a model from a path
doc = nlp(u"I don't want parsed", disable=['parser'])     # skip the parser for this document only

Named Entities
Needs model To use this functionality, spaCy needs a model to be installed that supports the following capabilities: named entities.

spaCy features an extremely fast statistical entity recognition system, that assigns labels to contiguous spans of tokens. The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

Named Entity Recognition 101

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Text | Start | End | Label | Description
Apple | 0 | 5 | ORG | Companies, agencies, institutions.
U.K. | 27 | 31 | GPE | Geopolitical entity, i.e. countries, cities, states.
$1 billion | 44 | 54 | MONEY | Monetary values, including unit.

Using spaCy's built-in displaCy visualizer, here's what our example sentence and its named entities look like:

Accessing entity annotations

The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.

You can also access token entity annotations using the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether an entity starts, continues or ends on the token. If no entity type is set on a token, it will return an empty string.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'San Francisco considers banning sidewalk delivery robots')

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)  # [u'San', u'B', u'GPE']
print(ent_francisco)  # [u'Francisco', u'I', u'GPE']
Text | ent_iob | ent_iob_ | ent_type_ | Description
San | 3 | B | GPE | beginning of an entity
Francisco | 1 | I | GPE | inside an entity
considers | 2 | O | "" | outside an entity
banning | 2 | O | "" | outside an entity
sidewalk | 2 | O | "" | outside an entity
delivery | 2 | O | "" | outside an entity
robots | 2 | O | "" | outside an entity

Setting entity annotations

To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level. However, you can't write directly to the token.ent_iob or token.ent_type attributes, so the easiest way to set entities is to assign to the doc.ents attribute and create the new entity as a Span.

import spacy
from spacy.tokens import Span

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"FB is hiring a new Vice President of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "FB" as an entity :(

ORG = doc.vocab.strings[u'ORG']  # get hash value of entity label
fb_ent = Span(doc, 0, 1, label=ORG) # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# [(u'FB', 0, 2, 'ORG')] 🎉

Keep in mind that you need to create a Span with the start and end index of the token, not the start and end index of the entity in the document. In this case, "FB" is token (0, 1) – but at the document level, the entity will have the start and end indices (0, 2).
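
If you're starting from character offsets instead, one alternative (a small sketch) is the Doc.char_span method, which creates a Span from character indices and returns None if they don't map onto valid token boundaries:

fb_ent = doc.char_span(0, 2, label=ORG)  # equivalent to Span(doc, 0, 1, label=ORG)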

Setting entity annotations from array

You can also assign entity annotations using the doc.from_array() method. To do this, you should include both the ENT_TYPE and the ENT_IOB attributes in the array you're importing from.

import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load('en_core_web_sm')
doc = nlp.make_doc(u'London is a big city in the United Kingdom.')
print('Before', list(doc.ents))  # []

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)))
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings[u'GPE']
doc.from_array(header, attr_array)
print('After', list(doc.ents))  # [London]

Setting entity annotations in Cython

Finally, you can always write to the underlying struct, if you compile a Cython function. This is easy to do, and allows you to write efficient native code.

# cython: infer_types=True
from spacy.tokens.doc cimport Doc

cpdef set_entity(Doc doc, int start, int end, int ent_type):
    for i in range(start, end):
        doc.c[i].ent_type = ent_type
    doc.c[start].ent_iob = 3
    for i in range(start+1, end):
        doc.c[i].ent_iob = 2

Obviously, if you write directly to the array of TokenC* structs, you'll have responsibility for ensuring that the data is left in a consistent state.

Built-in entity types

Models trained on the OntoNotes 5 corpus support the following entity types:

Type | Description
PERSON | People, including fictional.
NORP | Nationalities or religious or political groups.
FAC | Buildings, airports, highways, bridges, etc.
ORG | Companies, agencies, institutions, etc.
GPE | Countries, cities, states.
LOC | Non-GPE locations, mountain ranges, bodies of water.
PRODUCT | Objects, vehicles, foods, etc. (Not services.)
EVENT | Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART | Titles of books, songs, etc.
LAW | Named documents made into laws.
LANGUAGE | Any named language.
DATE | Absolute or relative dates or periods.
TIME | Times smaller than a day.
PERCENT | Percentage, including "%".
MONEY | Monetary values, including unit.
QUANTITY | Measurements, as of weight or distance.
ORDINAL | "first", "second", etc.
CARDINAL | Numerals that do not fall under another type.

Wikipedia scheme

Models trained on the Wikipedia corpus (Nothman et al., 2013) use a less fine-grained NER annotation scheme and recognise the following entities:

Type | Description
PER | Named person or family.
LOC | Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains).
ORG | Named corporate, governmental, or other organizational entity.
MISC | Miscellaneous entities, e.g. events, nationalities, products or works of art.

Training and updating

To provide training examples to the entity recogniser, you'll first need to create an instance of the GoldParse class. You can specify your annotations in a stand-off format or as token tags. If a character offset in your entity annotations doesn't fall on a token boundary, the GoldParse class will treat that annotation as a missing value. This allows for more realistic training, because the entity recogniser is allowed to learn from examples that may feature tokenizer errors.

from spacy.tokens import Doc
from spacy.gold import GoldParse

# stand-off format: (text, [(start_char, end_char, label)])
train_data = [('Who is Chaka Khan?', [(7, 17, 'PERSON')]),
              ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])]
# token tags: one BILUO tag per pre-tokenized word
doc = Doc(nlp.vocab, [u'rats', u'make', u'good', u'pets'])
gold = GoldParse(doc, entities=[u'U-ANIMAL', u'O', u'O', u'O'])

The BILUO Scheme

You can also provide token-level entity annotation, using the following tagging scheme to describe the entity boundaries:

Tag | Description
B (BEGIN) | The first token of a multi-token entity.
I (IN) | An inner token of a multi-token entity.
L (LAST) | The final token of a multi-token entity.
U (UNIT) | A single-token entity.
O (OUT) | A non-entity token.

spaCy translates the character offsets into this scheme, in order to decide the cost of each action given the current state of the entity recogniser. The costs are then used to calculate the gradient of the loss, to train the model. The exact algorithm is a pastiche of well-known methods, and is not currently described in any single publication. The model is a greedy transition-based parser guided by a linear model whose weights are learned using the averaged perceptron loss, via the dynamic oracle imitation learning strategy. The transition system is equivalent to the BILUO tagging scheme.
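
To inspect the BILUO tags spaCy derives from character offsets, you can use the biluo_tags_from_offsets helper in spacy.gold – a minimal sketch:

import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"I like London and Berlin.")
entities = [(7, 13, 'LOC'), (18, 24, 'LOC')]
# convert the character offsets into one BILUO tag per token
print(biluo_tags_from_offsets(doc, entities))
# ['O', 'O', 'U-LOC', 'O', 'U-LOC', 'O']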

Visualizing named entities

The displaCy ENT visualizer lets you explore an entity recognition model's behaviour interactively. If you're training a model, it's very useful to run the visualization yourself. To help you do that, spaCy v2.0+ comes with a visualization module. You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup.

For more details and examples, see the usage guide on visualizing spaCy.

Named Entity example

import spacy
from spacy import displacy

text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa
software, which runs on its Echo and Dot devices, have clear leads in
consumer adoption."""

nlp = spacy.load('custom_ner_model')
doc = nlp(text)
displacy.serve(doc, style='ent')

Tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is a unicode text, and the output is a Doc object. To construct a Doc object, you need a Vocab instance, a sequence of word strings, and optionally a sequence of spaces booleans, which allow you to maintain alignment of the tokens into the original string.

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token. Each Doc consists of individual tokens, and we can iterate over them:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text)
Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Token | Apple | is | looking | at | buying | U.K. | startup | for | $ | 1 | billion

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

  1. Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
  2. Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.

If there's a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.

[Diagram: step-by-step tokenization of “Let’s go to N.Y.!” – the opening quote is split off as a prefix, the closing quote and “!” as suffixes, “Let’s” is split into “Let” and “’s” by an exception rule, and “N.Y.” is kept intact by an exception rule.]

While punctuation rules are usually pretty general, tokenizer exceptions strongly depend on the specifics of the individual language. This is why each available language has its own subclass like English or German, that loads in lists of hard-coded data and exception rules.

Tokenizer data

Global and language-specific tokenizer data is supplied via the language data in spacy/lang. The tokenizer exceptions define special cases like "don't" in English, which needs to be split into two tokens: {ORTH: "do"} and {ORTH: "n't", LEMMA: "not"}. The prefixes, suffixes and infixes mostly define punctuation rules – for example, when to split off periods (at the end of a sentence), and when to leave tokens containing periods intact (abbreviations like "U.S.").

[Diagram: tokenizer data – the Tokenizer, Lemmatizer and Morphology draw on shared base data (char classes, lexical attributes) and language-specific data (stop words, tokenizer exceptions, prefixes/suffixes/infixes, lemma data, morph rules, tag map).]

Adding special case tokenization rules

Most domains have at least some idiosyncrasies that require custom tokenization rules. This could be very specific expressions, or abbreviations only used in that particular field.

Here's how to add a special case rule to an existing Tokenizer instance:

import spacy
from spacy.symbols import ORTH, LEMMA, POS, TAG

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'gimme that')  # phrase to tokenize
print([w.text for w in doc])  # ['gimme', 'that']

# add special case rule
special_case = [{ORTH: u'gim', LEMMA: u'give', POS: u'VERB'}, {ORTH: u'me'}]
nlp.tokenizer.add_special_case(u'gimme', special_case)

# check new tokenization
print([w.text for w in nlp(u'gimme that')])  # ['gim', 'me', 'that']

# Pronoun lemma is returned as -PRON-!
print([w.lemma_ for w in nlp(u'gimme that')])  # ['give', '-PRON-', 'that']

For details on spaCy's custom pronoun lemma -PRON-, see here. The special case doesn't have to match an entire whitespace-delimited substring. The tokenizer will incrementally split off punctuation, and keep looking up the remaining substring:

assert 'gimme' not in [w.text for w in nlp(u'gimme!')]
assert 'gimme' not in [w.text for w in nlp(u'("...gimme...?")')]

The special case rules have precedence over the punctuation splitting:

special_case = [{ORTH: u'...gimme...?', LEMMA: u'give', TAG: u'VB'}]
nlp.tokenizer.add_special_case(u'...gimme...?', special_case)
assert len(nlp(u'...gimme...?')) == 1

Because the special-case rules allow you to set arbitrary token attributes, such as the part-of-speech, lemma, etc, they make a good mechanism for arbitrary fix-up rules. Having this logic live in the tokenizer isn't very satisfying from a design perspective, however, so the API may eventually be exposed on the Language class itself.

How spaCy's tokenizer works

spaCy introduces a novel tokenization algorithm that gives a better balance between performance, ease of definition and ease of alignment into the original string.

After consuming a prefix or infix, we consult the special cases again. We want the special cases to handle things like "don't" in English, and we want the same rule to work for "(don't)!". We do this by splitting off the open bracket, then the exclamation, then the close bracket, and finally matching the special-case. Here's an implementation of the algorithm in Python, optimized for readability rather than performance:

def tokenizer_pseudo_code(text, special_cases,
                          find_prefix, find_suffix, find_infixes):
    tokens = []
    for substring in text.split(' '):
        suffixes = []
        while substring:
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ''
            elif find_prefix(substring) is not None:
                split = find_prefix(substring)
                tokens.append(substring[:split])
                substring = substring[split:]
            elif find_suffix(substring) is not None:
                split = find_suffix(substring)
                suffixes.append(substring[-split:])
                substring = substring[:-split]
            elif find_infixes(substring):
                infixes = find_infixes(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                substring = substring[offset:]
            else:
                tokens.append(substring)
                substring = ''
        tokens.extend(reversed(suffixes))
    return tokens

The algorithm can be summarized as follows:

  1. Iterate over space-separated substrings
  2. Check whether we have an explicitly defined rule for this substring. If we do, use it.
  3. Otherwise, try to consume a prefix.
  4. If we consumed a prefix, go back to the beginning of the loop, so that special-cases always get priority.
  5. If we didn't consume a prefix, try to consume a suffix.
  6. If we can't consume a prefix or suffix, look for "infixes" — stuff like hyphens etc.
  7. Once we can't consume any more of the string, handle it as a single token.

Customizing spaCy's Tokenizer class

Let's imagine you wanted to create a tokenizer for a new language or specific domain. There are five things you would need to define:

  1. A dictionary of special cases. This handles things like contractions, units of measurement, emoticons, certain abbreviations, etc.
  2. A function prefix_search, to handle preceding punctuation, such as open quotes, open brackets, etc
  3. A function suffix_search, to handle succeeding punctuation, such as commas, periods, close quotes, etc.
  4. A function infixes_finditer, to handle non-whitespace separators, such as hyphens etc.
  5. An optional boolean function token_match matching strings that should never be split, overriding the previous rules. Useful for things like URLs or numbers.

You shouldn't usually need to create a Tokenizer subclass. Standard usage is to use re.compile() to build a regular expression object, and pass its .search() and .finditer() methods:

import re
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=simple_url_re.match)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"hello-world.")
print([t.text for t in doc])

If you need to subclass the tokenizer instead, the relevant methods to specialize are find_prefix, find_suffix and find_infix.

Hooking an arbitrary tokenizer into the pipeline

The tokenizer is the first component of the processing pipeline and the only one that can't be replaced by writing to nlp.pipeline. This is because it has a different signature from all the other components: it takes a text and returns a Doc, whereas all other components expect to already receive a tokenized Doc.

[Diagram: processing pipeline – the text is passed to the nlp object, whose tokenizer produces a Doc that is then processed by the tagger, parser, ner and any further pipeline components.]

To overwrite the existing tokenizer, you need to replace nlp.tokenizer with a custom function that takes a text, and returns a Doc.

nlp = spacy.load('en')
nlp.tokenizer = my_tokenizer
Argument | Type | Description
text | unicode | The raw text to tokenize.
returns | Doc | The tokenized document.

Example: A custom whitespace tokenizer

To construct the tokenizer, we usually want attributes of the nlp pipeline. Specifically, we want the tokenizer to hold a reference to the vocabulary object. Let's say we have the following class as our tokenizer:

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # All tokens 'own' a subsequent space character in this tokenizer
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(u"What's happened to me? he thought. It wasn't a dream.")
print([t.text for t in doc])

As you can see, we need a Vocab instance to construct this — but we won't have it until we get back the loaded nlp object. The simplest solution is to build the tokenizer in two steps. This also means that you can reuse the "tokenizer factory" and initialise it with different instances of Vocab.
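
Here's a minimal sketch of such a two-step setup, reusing the WhitespaceTokenizer class from above (the factory function name is just an illustration):

def create_tokenizer(vocab):
    # step 1: a reusable "tokenizer factory" that only needs a Vocab
    return WhitespaceTokenizer(vocab)

# step 2: once the pipeline is loaded, build the tokenizer with its vocab
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = create_tokenizer(nlp.vocab)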

Bringing your own annotations

spaCy generally assumes by default that your data is raw text. However, sometimes your data is partially annotated, e.g. with pre-existing tokenization, part-of-speech tags, etc. The most common situation is that you have pre-defined tokenization. If you have a list of strings, you can create a Doc object directly. Optionally, you can also specify a list of boolean values, indicating whether each word has a subsequent space.

import spacy
from spacy.tokens import Doc
from spacy.lang.en import English

nlp = English()
doc = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'],
          spaces=[False, True, False, False])
print([(t.text, t.text_with_ws, t.whitespace_) for t in doc])

If provided, the spaces list must be the same length as the words list. The spaces list affects the doc.text, span.text, token.idx, span.start_char and span.end_char attributes. If you don't provide a spaces sequence, spaCy will assume that all words are whitespace delimited.

import spacy
from spacy.tokens import Doc
from spacy.lang.en import English

nlp = English()
bad_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'])
good_spaces = Doc(nlp.vocab, words=[u'Hello', u',', u'world', u'!'],
                  spaces=[False, True, False, False])

print(bad_spaces.text)   # 'Hello , world !'
print(good_spaces.text)  # 'Hello, world!'

Once you have a Doc object, you can write to its attributes to set the part-of-speech tags, syntactic dependencies, named entities and other attributes. For details, see the respective usage pages.
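
For example, here's a small sketch of assigning your own part-of-speech tags to a manually created Doc – this assumes, as in spaCy v2.0, that Token.tag_ is writable:

from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()
doc = Doc(nlp.vocab, words=[u'Hello', u'world', u'!'])
doc[0].tag_ = u'UH'  # interjection
doc[1].tag_ = u'NN'  # noun
doc[2].tag_ = u'.'   # punctuation
print([(t.text, t.tag_) for t in doc])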

Sentence Segmentation

A Doc object's sentences are available via the Doc.sents property. Unlike other libraries, spaCy uses the dependency parse to determine sentence boundaries. This is usually more accurate than a rule-based approach, but it also means you'll need a statistical model and accurate predictions. If your texts are closer to general-purpose news or web text, this should work well out-of-the-box. For social media or conversational text that doesn't follow the same rules, your application may benefit from a custom rule-based implementation. You can either plug a rule-based component into your processing pipeline or use the SentenceSegmenter component with a custom strategy.

Default: Using the dependency parse
Needs model To use this functionality, spaCy needs a model to be installed that supports the following capabilities: dependency parser.

To view a Doc's sentences, you can iterate over the Doc.sents, a generator that yields Span objects.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)

Setting boundaries manually

spaCy's dependency parser respects already set boundaries, so you can preprocess your Doc using custom rules before it's parsed. This can be done by adding a custom pipeline component. Depending on your text, this may also improve accuracy, since the parser is constrained to predict parses consistent with the sentence boundaries.

Here's an example of a component that implements a pre-processing rule for splitting on '...' tokens. The component is added before the parser, which is then used to further segment the text. This approach can be useful if you want to implement additional rules specific to your data, while still being able to take advantage of dependency-based sentence segmentation.

import spacy

text = u"this is a sentence...hello...and another sentence."

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print('Before:', [sent.text for sent in doc.sents])

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i+1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before='parser')
doc = nlp(text)
print('After:', [sent.text for sent in doc.sents])

Rule-based pipeline component

The sentencizer component is a pipeline component that splits sentences on punctuation like ., ! or ?. You can plug it into your pipeline if you only need sentence boundaries without the dependency parse. Note that Doc.sents will raise an error if no sentence boundaries are set.

import spacy
from spacy.lang.en import English

nlp = English()  # just the language with no model
sbd = nlp.create_pipe('sentencizer')   # or: nlp.create_pipe('sbd')
nlp.add_pipe(sbd)
doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)

Custom rule-based strategy

If you want to implement your own strategy that differs from the default rule-based approach of splitting on sentences, you can also instantiate the SentenceSegmenter directly and pass in your own strategy. The strategy should be a function that takes a Doc object and yields a Span for each sentence. Here's an example of a custom segmentation strategy for splitting on newlines only:

from spacy.lang.en import English
from spacy.pipeline import SentenceSegmenter

def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline and not word.is_space:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text == '\n':
            seen_newline = True
    if start < len(doc):
        yield doc[start:len(doc)]

nlp = English()  # just the language with no model
sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)
doc = nlp(u"This is a sentence\n\nThis is another sentence\nAnd more")
for sent in doc.sents:
    print([token.text for token in sent])

Rule-based matching

spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_) and flags (e.g. IS_PUNCT). The rule matcher also lets you pass in a custom callback to act on matches – for example, to merge entities and apply custom labels. You can also associate patterns with entity IDs, to allow some basic entity linking or disambiguation. To match large terminology lists, you can use the PhraseMatcher, which accepts Doc objects as match patterns.

Adding patterns

Let's say we want to enable spaCy to find a combination of three tokens:

  1. A token whose lowercase form matches "hello", e.g. "Hello" or "HELLO".
  2. A token whose is_punct flag is set to True, i.e. any punctuation.
  3. A token whose lowercase form matches "world", e.g. "World" or "WORLD".
[{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]

First, we initialise the Matcher with a vocab. The matcher must always share the same vocab with the documents it will operate on. We can now call matcher.add() with an ID and our custom pattern. The second argument lets you pass in an optional callback function to invoke on a successful match. For now, we set it to None.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# add match ID "HelloWorld" with no callback and one pattern
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
matcher.add('HelloWorld', None, pattern)

doc = nlp(u'Hello, world! Hello world!')
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]  # the matched span
    print(match_id, string_id, start, end, span.text)

The matcher returns a list of (match_id, start, end) tuples – in this case, [('15578876784678163569', 0, 2)], which maps to the span doc[0:2] of our original document. The match_id is the hash value of the string ID "HelloWorld". To get the string value, you can look up the ID in the StringStore .

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # 'HelloWorld'
    span = doc[start:end]                    # the matched span

Optionally, we could also choose to add more than one pattern, for example to also match sequences without punctuation between "hello" and "world":

matcher.add('HelloWorld', None,
            [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}],
            [{'LOWER': 'hello'}, {'LOWER': 'world'}])

By default, the matcher will only return the matches and not do anything else, like merge entities or assign labels. This is all up to you and can be defined individually for each pattern, by passing in a callback function as the on_match argument on add(). This is useful, because it lets you write entirely custom and pattern-specific logic. For example, you might want to merge some patterns into one token, while adding entity labels for other pattern types. You shouldn't have to create different matchers for each of those processes.

Available token attributes

The available token pattern keys are uppercase versions of the Token attributes. The most relevant ones for rule-based matching are:

Attribute | Description
ORTH | The exact verbatim text of a token.
LOWER | The lowercase form of the token text.
LENGTH | The length of the token text.
IS_ALPHA, IS_ASCII, IS_DIGIT | Token text consists of alphabetic characters, ASCII characters, digits.
IS_LOWER, IS_UPPER, IS_TITLE | Token text is in lowercase, uppercase, titlecase.
IS_PUNCT, IS_SPACE, IS_STOP | Token is punctuation, whitespace, a stop word.
LIKE_NUM, LIKE_URL, LIKE_EMAIL | Token text resembles a number, URL, email address.
POS, TAG, DEP, LEMMA, SHAPE | The token's simple and extended part-of-speech tag, dependency label, lemma, shape.
ENT_TYPE | The token's entity label.
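
For example, a sketch of a pattern combining a few of these attributes – a verbal form of "love" followed by a noun:

[{'LEMMA': 'love', 'POS': 'VERB'}, {'POS': 'NOUN'}]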

Using wildcard token patterns
v2.0 This feature is new and was introduced in spaCy v2.0

While the token attributes offer many options to write highly specific patterns, you can also use an empty dictionary, {} as a wildcard representing any token. This is useful if you know the context of what you're trying to match, but very little about the specific token and its characters. For example, let's say you're trying to extract people's user names from your data. All you know is that they are listed as "User name: {username}". The name itself may contain any character, but no whitespace – so you'll know it will be handled as one token.

[{'ORTH': 'User'}, {'ORTH': 'name'}, {'ORTH': ':'}, {}]

Using operators and quantifiers

The matcher also lets you use quantifiers, specified as the 'OP' key. Quantifiers let you define sequences of tokens to be matched, e.g. one or more punctuation marks, or specify optional tokens. Note that there are no nested or scoped quantifiers – instead, you can build those behaviours with on_match callbacks.

OP | Description
! | Negate the pattern, by requiring it to match exactly 0 times.
? | Make the pattern optional, by allowing it to match 0 or 1 times.
+ | Require the pattern to match 1 or more times.
* | Allow the pattern to match 0 or more times.

In versions before v2.1.0, the + and * operators behaved inconsistently. They were usually interpreted "greedily", i.e. longer matches were returned where possible. However, if you specified two + or * patterns in a row and their matches overlapped, the first operator would behave non-greedily. This quirk in the semantics is corrected in spaCy v2.1.0.
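
For example, you could extend the "hello world" pattern from above so that it also matches one or more punctuation tokens between the two words – a small sketch:

pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '+'}, {'LOWER': 'world'}]
matcher.add('HelloWorld', None, pattern)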

Adding phrase patterns

If you need to match large terminology lists, you can also use the PhraseMatcher and create Doc objects instead of token patterns, which is much more efficient overall. The Doc patterns can contain single or multiple tokens.

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
terminology_list = ['Barack Obama', 'Angela Merkel', 'Washington, D.C.']
patterns = [nlp(text) for text in terminology_list]
matcher.add('TerminologyList', None, *patterns)

doc = nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
          u"converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

Since spaCy is used for processing both the patterns and the text to be matched, you won't have to worry about specific tokenization – for example, you can simply pass in nlp(u"Washington, D.C.") and won't have to write a complex token pattern covering the exact tokenization of the term.

Adding on_match rules

To move on to a more realistic example, let's say you're working with a large corpus of blog articles, and you want to match all mentions of "Google I/O" (which spaCy tokenizes as ['Google', 'I', '/', 'O']). To be safe, you only match on the uppercase versions, in case someone has written it as "Google i/o". You also add a second pattern with an added {IS_DIGIT: True} token – this will make sure you also match on "Google I/O 2017". If your pattern matches, spaCy should execute your custom callback function add_event_ent.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# Get the ID of the 'EVENT' entity type. This is required to set an entity.
EVENT = nlp.vocab.strings['EVENT']

def add_event_ent(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = (EVENT, start, end)
    doc.ents += (entity,)
    print(doc[start:end].text, entity)

matcher.add('GoogleIO', add_event_ent,
            [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}],
            [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}, {'IS_DIGIT': True}])

doc = nlp(u"This is a text about Google I/O 2015.")
matches = matcher(doc)

We can now call the matcher on our documents. The patterns will be matched in the order they occur in the text. The matcher will then iterate over the matches, look up the callback for the match ID that was matched, and invoke it.

doc = nlp(YOUR_TEXT_HERE)
matcher(doc)

When the callback is invoked, it is passed four arguments: the matcher itself, the document, the position of the current match, and the total list of matches. This allows you to write callbacks that consider the entire set of matched phrases, so that you can resolve overlaps and other conflicts in whatever way you prefer.

Argument | Type | Description
matcher | Matcher | The matcher instance.
doc | Doc | The document the matcher was used on.
i | int | Index of the current match (matches[i]).
matches | list | A list of (match_id, start, end) tuples describing the matches. A match tuple describes a span doc[start:end].

Using custom pipeline components

Let's say your data also contains some annoying pre-processing artefacts, like leftover HTML line breaks (e.g. <br> or <BR/>). To make your text easier to analyse, you want to merge those into one token and flag them, to make sure you can ignore them later. Ideally, this should all be done automatically as you process the text. You can achieve this by adding a custom pipeline component that's called on each Doc object, merges the leftover HTML spans and sets an attribute bad_html on the token.

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

# we're using a class because the component needs to be initialised with
# the shared vocab via the nlp object
class BadHTMLMerger(object):
    def __init__(self, nlp):
        # register a new token extension to flag bad HTML
        Token.set_extension('bad_html', default=False)
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('BAD_HTML', None,
            [{'ORTH': '<'}, {'LOWER': 'br'}, {'ORTH': '>'}],
            [{'ORTH': '<'}, {'LOWER': 'br/'}, {'ORTH': '>'}])

    def __call__(self, doc):
        # this method is invoked when the component is called on a Doc
        matches = self.matcher(doc)
        spans = []  # collect the matched spans here
        for match_id, start, end in matches:
            spans.append(doc[start:end])
        for span in spans:
            span.merge()   # merge
            for token in span:
                token._.bad_html = True  # mark token as bad HTML
        return doc

nlp = spacy.load('en_core_web_sm')
html_merger = BadHTMLMerger(nlp)
nlp.add_pipe(html_merger, last=True)  # add component to the pipeline
doc = nlp(u"Hello<br>world! <br/> This is a test.")
for token in doc:
    print(token.text, token._.bad_html)

Instead of hard-coding the patterns into the component, you could also make it take a path to a JSON file containing the patterns. This lets you reuse the component with different patterns, depending on your application:

html_merger = BadHTMLMerger(nlp, path='/path/to/patterns.json')
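
One way to do this is to load the file in the component's __init__ – a hedged sketch, where the path argument and the JSON layout are assumptions rather than part of spaCy's API:

import json
from spacy.matcher import Matcher
from spacy.tokens import Token

class BadHTMLMerger(object):
    def __init__(self, nlp, path):
        Token.set_extension('bad_html', default=False)
        self.matcher = Matcher(nlp.vocab)
        with open(path) as f:
            # e.g. {"BAD_HTML": [[{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}], ...]}
            patterns = json.load(f)
        for label, rules in patterns.items():
            self.matcher.add(label, None, *rules)
    # __call__ stays the same as in the example above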

Using regular expressions

In some cases, only matching tokens and token attributes isn't enough – for example, you might want to match different spellings of a word, without having to add a new pattern for each spelling. A simple solution is to match a regular expression on the Doc's text and use the Doc.char_span method to create a Span from the character indices of the match:

import spacy
import re

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')

DEFINITELY_PATTERN = re.compile(r'deff?in[ia]tely')

for match in re.finditer(DEFINITELY_PATTERN, doc.text):
    start, end = match.span()         # get matched indices
    span = doc.char_span(start, end)  # create Span from indices
    print(span.text)

You can also use the regular expression with spaCy's Matcher by converting it to a token flag. To ensure efficiency, the Matcher can only access the C-level data. This means that it can either use built-in token attributes or binary flags. Vocab.add_flag returns a flag ID which you can use as a key of a token match pattern. Tokens that match the regular expression will return True for the IS_DEFINITELY flag.

import spacy
from spacy.matcher import Matcher
import re

nlp = spacy.load('en_core_web_sm')
definitely_flag = lambda text: bool(re.compile(r'deff?in[ia]tely').match(text))
IS_DEFINITELY = nlp.vocab.add_flag(definitely_flag)

matcher = Matcher(nlp.vocab)
matcher.add('DEFINITELY', None, [{IS_DEFINITELY: True}])

doc = nlp(u'The spelling is "definitely", not "definately" or "deffinitely".')
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

Providing the regular expressions as binary flags also lets you use them in combination with other token patterns – for example, to match the word "definitely" in various spellings, followed by a case-insensitive "not" and an adjective:

[{IS_DEFINITELY: True}, {'LOWER': 'not'}, {'POS': 'ADJ'}]

Example: Using linguistic annotations

Let's say you're analysing user comments and you want to find out what people are saying about Facebook. You want to start off by finding adjectives following "Facebook is" or "Facebook was". This is obviously a very rudimentary solution, but it'll be fast, and a great way to get an idea of what's in your data. Your pattern could look like this:

[{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'ADJ'}]

This translates to a token whose lowercase form matches "facebook" (like Facebook, facebook or FACEBOOK), followed by a token with the lemma "be" (for example, is, was, or 's), followed by an optional adverb, followed by an adjective. Using the linguistic annotations here is especially useful, because you can tell spaCy to match "Facebook's annoying", but not "Facebook's annoying ads". The optional adverb makes sure you won't miss adjectives with intensifiers, like "pretty awful" or "very nice".

To get a quick overview of the results, you could collect all sentences containing a match and render them with the displaCy visualizer. In the callback function, you'll have access to the start and end of each match, as well as the parent Doc. This lets you determine the sentence containing the match, doc[start : end].sent, and calculate the start and end of the matched span within the sentence. Using displaCy in "manual" mode lets you pass in a list of dictionaries containing the text and entities to render.

import spacy
from spacy import displacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
matched_sents = [] # collect data of matched sentences to be visualized

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end]  # matched span
    sent = span.sent  # sentence containing matched span
    # append mock entity for match in displaCy style to matched_sents
    # get the match span by offsetting the start and end of the span with the
    # start and end of the sentence in the doc
    match_ents = [{'start': span.start_char - sent.start_char,
                   'end': span.end_char - sent.start_char,
                   'label': 'MATCH'}]
    matched_sents.append({'text': sent.text, 'ents': match_ents })

pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'}]
matcher.add('FacebookIs', collect_sents, pattern)  # add pattern
doc = nlp(u"I'd say that Facebook is evil. – Facebook is pretty cool, right?")
matches = matcher(doc)

# serve visualization of sentences containing match with displaCy
# set manual=True to make displaCy render straight from a dictionary
# (if you're not running the code within a Jupyter environment, you can
# remove jupyter=True and use displacy.serve instead)
displacy.render(matched_sents, style='ent', manual=True, jupyter=True)

Example: Phone numbers

Phone numbers can have many different formats and matching them is often tricky. During tokenization, spaCy will leave sequences of numbers intact and only split on whitespace and punctuation. This means that your match pattern will have to look out for number sequences of a certain length, surrounded by specific punctuation – depending on the national conventions.

The IS_DIGIT flag is not very helpful here, because it doesn't tell us anything about the length. However, you can use the SHAPE flag, with each d representing a digit:

[{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'dddd'},
 {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'dddd'}]

This will match phone numbers of the format (123) 4567 8901 or (123) 4567-8901. To also match formats like (123) 456 789, you can add a second pattern using 'ddd' in place of 'dddd'. By hard-coding some values, you can match only certain, country-specific numbers. For example, here's a pattern to match the most common formats of international German numbers:

[{'ORTH': '+'}, {'ORTH': '49'}, {'ORTH': '(', 'OP': '?'}, {'SHAPE': 'dddd'},
 {'ORTH': ')', 'OP': '?'}, {'SHAPE': 'dddddd'}]

Depending on the formats your application needs to match, creating an extensive set of rules like this is often better than training a model. It'll produce more predictable results, is much easier to modify and extend, and doesn't require any training data – only a set of test cases.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': '('}, {'SHAPE': 'ddd'}, {'ORTH': ')'}, {'SHAPE': 'ddd'},
           {'ORTH': '-', 'OP': '?'}, {'SHAPE': 'ddd'}]
matcher.add('PHONE_NUMBER', None, pattern)

doc = nlp(u"Call me at (123) 456 789 or (123) 456 789!")
print([t.text for t in doc])
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

Example: Hashtags and emoji on social media

Social media posts, especially tweets, can be difficult to work with. They're very short and often contain various emoji and hashtags. By only looking at the plain text, you'll lose a lot of valuable semantic information.

Let's say you've extracted a large sample of social media posts on a specific topic, for example posts mentioning a brand name or product. As the first step of your data exploration, you want to filter out posts containing certain emoji and use them to assign a general sentiment score, based on whether the expressed emotion is positive or negative, e.g. 😀 or 😞. You also want to find, merge and label hashtags like #MondayMotivation, to be able to ignore or analyse them later.

By default, spaCy's tokenizer will split emoji into separate tokens. This means that you can create a pattern for one or more emoji tokens. Valid hashtags usually consist of a #, plus a sequence of ASCII characters with no whitespace, making them easy to match as well.

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()  # we only want the tokenizer, so no need to load a model
matcher = Matcher(nlp.vocab)

pos_emoji = [u'😀', u'😃', u'😂', u'🤣', u'😊', u'😍']  # positive emoji
neg_emoji = [u'😞', u'😠', u'😩', u'😢', u'😭', u'😒']  # negative emoji

# add patterns to match one or more emoji tokens
pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
neg_patterns = [[{'ORTH': emoji}] for emoji in neg_emoji]

# function to label the sentiment
def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == 'HAPPY':  # don't forget to get string!
        doc.sentiment += 0.1  # add 0.1 for positive sentiment
    elif doc.vocab.strings[match_id] == 'SAD':
        doc.sentiment -= 0.1  # subtract 0.1 for negative sentiment

matcher.add('HAPPY', label_sentiment, *pos_patterns)  # add positive pattern
matcher.add('SAD', label_sentiment, *neg_patterns)  # add negative pattern

# add pattern for valid hashtag, i.e. '#' plus any ASCII token
matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

doc = nlp(u"Hello world 😀 #MondayMotivation")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = doc.vocab.strings[match_id]  # look up string ID
    span = doc[start:end]
    print(string_id, span.text)

Because the on_match callback receives the ID of each match, you can use the same function to handle the sentiment assignment for both the positive and negative pattern. To keep it simple, we'll either add or subtract 0.1 points – this way, the score will also reflect combinations of emoji, even positive and negative ones.

With a library like Emojipedia, we can also retrieve a short description for each emoji – for example, 😍's official title is "Smiling Face With Heart-Eyes". Assigning it to a custom attribute on the emoji span will make it available as span._.emoji_desc.

from emojipedia import Emojipedia  # installation: pip install emojipedia
from spacy.tokens import Span  # get the global Span object

Span.set_extension('emoji_desc', default=None)  # register the custom attribute

def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == 'HAPPY':  # don't forget to get string!
        doc.sentiment += 0.1  # add 0.1 for positive sentiment
    elif doc.vocab.strings[match_id] == 'SAD':
        doc.sentiment -= 0.1  # subtract 0.1 for negative sentiment
    span = doc[start : end]
    emoji = Emojipedia.search(span[0].text) # get data for emoji
    span._.emoji_desc = emoji.title  # assign emoji description

To label the hashtags, we can use a custom attribute set on the respective token:

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# add pattern for valid hashtag, i.e. '#' plus any ASCII token
matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

# register token extension
Token.set_extension('is_hashtag', default=False)

doc = nlp(u"Hello world 😀 #MondayMotivation")
matches = matcher(doc)
hashtags = []
for match_id, start, end in matches:
    if doc.vocab.strings[match_id] == 'HASHTAG':
        hashtags.append(doc[start:end])
for span in hashtags:
    span.merge()
    for token in span:
        token._.is_hashtag = True

for token in doc:
    print(token.text, token._.is_hashtag)

To process a stream of social media posts, we can use Language.pipe() , which will return a stream of Doc objects that we can pass to Matcher.pipe() .

docs = nlp.pipe(LOTS_OF_TWEETS)
matches = matcher.pipe(docs)