
Linguistic Features

Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it’s possible to solve some problems starting from only the raw characters, it’s usually better to use linguistic knowledge to add useful information. That’s exactly what spaCy is designed to do: you put in raw text and get back a Doc object that comes with a variety of annotations.

Part-of-speech tagging Needs model

After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)
Text      Lemma     POS    Tag   Dep       Shape   alpha   stop
Apple     apple     PROPN  NNP   nsubj     Xxxxx   True    False
is        be        VERB   VBZ   aux       xx      True    True
looking   look      VERB   VBG   ROOT      xxxx    True    False
at        at        ADP    IN    prep      xx      True    True
buying    buy       VERB   VBG   pcomp     xxxx    True    False
U.K.      u.k.      PROPN  NNP   compound  X.X.    False   False
startup   startup   NOUN   NN    dobj      xxxx    True    False
for       for       ADP    IN    prep      xxx     True    True
$         $         SYM    $     quantmod  $       False   False
1         1         NUM    CD    compound  d       False   False
billion   billion   NUM    CD    pobj      xxxx    True    False
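
The attributes without a trailing underscore hold the hash values, which you can resolve through the shared vocab. Here’s a minimal sketch to illustrate the mapping (the exact hash numbers will vary):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

apple = doc[0]
print(apple.lemma, apple.lemma_)         # integer hash vs. readable string "apple"
print(nlp.vocab.strings[apple.lemma])    # resolving the hash gives back "apple"
print(nlp.vocab.strings[u"apple"])       # and the string maps back to the same hash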

Using spaCy’s built-in displaCy visualizer, you can render the example sentence and its dependencies as a tree diagram – see “Visualizing dependencies” below for the code.

Rule-based morphology

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:

Context                                    Surface   Lemma   POS    Morphological Features
I was reading the paper                    reading   read    verb   VerbForm=Ger
I don’t watch the news, I read the paper   read      read    verb   VerbForm=Fin, Mood=Ind, Tense=Pres
I read the paper yesterday                 read      read    verb   VerbForm=Fin, Mood=Ind, Tense=Past

English has a relatively simple morphological system, which spaCy handles using rules that can be keyed by the token, the part-of-speech tag, or the combination of the two. The system works as follows (a short example follows the list):

  1. The tokenizer consults a mapping table, TOKENIZER_EXCEPTIONS, which allows sequences of characters to be mapped to multiple tokens. Each token may be assigned a part of speech and one or more morphological features.
  2. The part-of-speech tagger then assigns each token an extended POS tag. In the API, these tags are known as Token.tag. They express the part-of-speech (e.g. VERB) and some amount of morphological information, e.g. that the verb is past tense.
  3. For words whose POS is not set by a prior process, a mapping table TAG_MAP maps the tags to a part-of-speech and a set of morphological features.
  4. Finally, a rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned extended part-of-speech and morphological information, without consulting the context of the token. The lemmatizer also accepts list-based exception files, acquired from WordNet.
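
Putting this together, the extended tag and the rule-based lemma can be inspected directly on the tokens – a minimal sketch (the exact tags depend on the model you load):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"I was reading the paper")
for token in doc:
    # tag_ is the extended POS tag, pos_ the coarse-grained part-of-speech,
    # lemma_ the rule-based lemma, e.g. "reading" -> "read"
    print(token.text, token.tag_, token.pos_, token.lemma_)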

Dependency Parsing Needs model

spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or “chunks”. You can check whether a Doc object has been parsed with the doc.is_parsed attribute, which returns a boolean value. If this attribute is False, the default sentence iterator will raise an exception.
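
For instance, loading the model without the parser leaves that attribute set to False – a minimal sketch:

import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser"])
doc = nlp(u"This is a sentence. This is another sentence.")
print(doc.is_parsed)  # False – the parser was disabled, so no dependency parse is set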

Noun chunks

Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”. To get the noun chunks in a document, simply iterate over Doc.noun_chunks.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)
Text                 root.text      root.dep_   root.head.text
Autonomous cars      cars           nsubj       shift
insurance liability  liability      dobj        shift
manufacturers        manufacturers  pobj        toward

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of .dep is a hash value. You can get the string value with .dep_.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])
Text           Dep       Head text  Head POS  Children
Autonomous     amod      cars       NOUN
cars           nsubj     shift      VERB      Autonomous
shift          ROOT      shift      VERB      cars, liability, toward
insurance      compound  liability  NOUN
liability      dobj      shift      VERB      insurance
toward         prep      shift      VERB      manufacturers
manufacturers  pobj      toward     ADP

Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest — from below:

import spacy
from spacy.symbols import nsubj, VERB

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")

# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
print(verbs)

If you try to match from above, you’ll have to iterate twice: once for the head, and then again through the children:

# Finding a verb with a subject from above — less good
verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break

To iterate through the children, use the token.children attribute, which provides a sequence of Token objects.

A few more convenience attributes are provided for iterating around the local tree from the token. The Token.lefts and Token.rights attributes provide sequences of syntactic children that occur before and after the token. Both sequences are in sentence order. There are also two integer-typed attributes, Token.n_lefts and Token.n_rights, which give the number of left and right children.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"bright red apples on the tree")
print([token.text for token in doc[2].lefts])  # ['bright', 'red']
print([token.text for token in doc[2].rights])  # ['on']
print(doc[2].n_lefts)  # 2
print(doc[2].n_rights)  # 1

import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp(u"schöne rote Äpfel auf dem Baum")
print([token.text for token in doc[2].lefts])  # ['schöne', 'rote']
print([token.text for token in doc[2].rights])  # ['auf']

You can get a whole phrase by its syntactic head using the Token.subtree attribute. This returns an ordered sequence of tokens. You can walk up the tree with the Token.ancestors attribute, and check dominance with Token.is_ancestor.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Credit and mortgage account holders must submit their requests")

root = [token for token in doc if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts,
            descendant.n_rights,
            [ancestor.text for ancestor in descendant.ancestors])
Text      Dep       n_lefts  n_rights  ancestors
Credit    nmod      0        2         holders, submit
and       cc        0        0         holders, submit
mortgage  compound  0        0         account, Credit, holders, submit
account   conj      1        0         Credit, holders, submit
holders   nsubj     1        0         submit

Finally, the .left_edge and .right_edge attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a Span object for a syntactic phrase. Note that .right_edge gives a token within the subtree — so if you use it as the end-point of a range, don’t forget to +1!

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
Text                                 POS   Dep    Head text
Credit and mortgage account holders  NOUN  nsubj  submit
must                                 VERB  aux    submit
submit                               VERB  ROOT   submit
their                                ADJ   poss   requests
requests                             NOUN  dobj   submit

Visualizing dependencies

The best way to understand spaCy’s dependency parser is interactively. To make this easier, spaCy v2.0+ comes with a visualization module. You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup. If you want to know how to write rules that hook into some type of syntactic construction, just plug the sentence into the visualizer and see how spaCy annotates it.

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, style='dep')

Disabling the parser

In the default models, the parser is loaded and enabled as part of the standard processing pipeline. If you don’t need any of the syntactic information, you should disable the parser. Disabling the parser will make spaCy load and run much faster. If you want to load the parser, but need to disable it for specific documents, you can also control its use on the nlp object.

import spacy
from spacy.lang.en import English

nlp = spacy.load("en_core_web_sm", disable=["parser"])
nlp = English().from_disk("/model", disable=["parser"])
doc = nlp(u"I don't want parsed", disable=["parser"])

Named Entity Recognition

spaCy features an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens. The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

Named Entity Recognition 101

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Text        Start  End  Label  Description
Apple       0      5    ORG    Companies, agencies, institutions.
U.K.        27     31   GPE    Geopolitical entity, i.e. countries, cities, states.
$1 billion  44     54   MONEY  Monetary values, including unit.

Using spaCy’s built-in displaCy visualizer, here’s what our example sentence and its named entities look like:
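
A minimal sketch to reproduce that visualization, reusing the doc from the snippet above (displacy.render works in notebooks; use displacy.serve from a script):

from spacy import displacy

displacy.render(doc, style="ent")  # or: displacy.serve(doc, style="ent")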

Accessing entity annotations

The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.

You can also access token entity annotations using the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether an entity starts, continues or ends on the tag. If no entity type is set on a token, it will return an empty string.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"San Francisco considers banning sidewalk delivery robots")

# document level
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# token level
ent_san = [doc[0].text, doc[0].ent_iob_, doc[0].ent_type_]
ent_francisco = [doc[1].text, doc[1].ent_iob_, doc[1].ent_type_]
print(ent_san)  # [u'San', u'B', u'GPE']
print(ent_francisco)  # [u'Francisco', u'I', u'GPE']
Text       ent_iob  ent_iob_  ent_type_  Description
San        3        B         "GPE"      beginning of an entity
Francisco  1        I         "GPE"      inside an entity
considers  2        O         ""         outside an entity
banning    2        O         ""         outside an entity
sidewalk   2        O         ""         outside an entity
delivery   2        O         ""         outside an entity
robots     2        O         ""         outside an entity

Setting entity annotations

To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level. However, you can’t write directly to the token.ent_iob or token.ent_type attributes, so the easiest way to set entities is to assign to the doc.ents attribute and create the new entity as a Span.

import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"FB is hiring a new Vice President of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "FB" as an entity :(

ORG = doc.vocab.strings[u"ORG"]  # get hash value of entity label
fb_ent = Span(doc, 0, 1, label=ORG) # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]

ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('After', ents)
# [(u'FB', 0, 2, 'ORG')] 🎉

Keep in mind that you need to create a Span with the start and end index of the token, not the start and end index of the entity in the document. In this case, “FB” is token (0, 1) – but at the document level, the entity will have the start and end indices (0, 2).
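
If you’re starting from character offsets instead, you can let Doc.char_span translate them into a token-based Span – a small sketch, reusing the doc from above (it returns None if the offsets don’t map cleanly onto token boundaries):

fb_char_ent = doc.char_span(0, 2, label=u"ORG")
print(fb_char_ent, fb_char_ent.start, fb_char_ent.end)  # FB 0 1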

Setting entity annotations from array

You can also assign entity annotations using the doc.from_array method. To do this, you should include both the ENT_TYPE and the ENT_IOB attributes in the array you’re importing from.

import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc(u"London is a big city in the United Kingdom.")
print("Before", doc.ents)  # []

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)))
attr_array[0, 0] = 3  # B
attr_array[0, 1] = doc.vocab.strings[u"GPE"]
doc.from_array(header, attr_array)
print("After", doc.ents)  # [London]

Setting entity annotations in Cython

Finally, you can always write to the underlying struct, if you compile a Cython function. This is easy to do, and allows you to write efficient native code.

# cython: infer_types=True
from spacy.tokens.doc cimport Doc

cpdef set_entity(Doc doc, int start, int end, int ent_type):
    for i in range(start, end):
        doc.c[i].ent_type = ent_type
    doc.c[start].ent_iob = 3
    for i in range(start+1, end):
        doc.c[i].ent_iob = 2

Obviously, if you write directly to the array of TokenC* structs, you’ll have responsibility for ensuring that the data is left in a consistent state.

Built-in entity types
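
Which entity types are available depends on the model you’ve loaded; one way to inspect them is via the entity recognizer’s labels property – a minimal sketch:

import spacy

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
print(ner.labels)  # entity labels this model was trained to predict, e.g. ORG, GPE, MONEY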

Training and updating

To provide training examples to the entity recognizer, you’ll first need to create an instance of the GoldParse class. You can specify your annotations in a stand-off format or as token tags. If a character offset in your entity annotations doesn’t fall on a token boundary, the GoldParse class will treat that annotation as a missing value. This allows for more realistic training, because the entity recognizer is allowed to learn from examples that may feature tokenizer errors.

from spacy.gold import GoldParse
from spacy.tokens import Doc

# Stand-off format: (text, [(start_char, end_char, label)]) annotations
train_data = [
    ("Who is Chaka Khan?", [(7, 17, "PERSON")]),
    ("I like London and Berlin.", [(7, 13, "LOC"), (18, 24, "LOC")]),
]

# Token-tag (BILUO) format on a pre-tokenized Doc (assumes an existing `nlp`)
doc = Doc(nlp.vocab, words=[u"rats", u"make", u"good", u"pets"])
gold = GoldParse(doc, entities=[u"U-ANIMAL", u"O", u"O", u"O"])

Visualizing named entities

The displaCy ENT visualizer lets you explore an entity recognition model’s behavior interactively. If you’re training a model, it’s very useful to run the visualization yourself. To help you do that, spaCy v2.0+ comes with a visualization module. You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup.

For more details and examples, see the usage guide on visualizing spaCy.

Named Entity example

import spacy
from spacy import displacy

text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa
software, which runs on its Echo and Dot devices, have clear leads in
consumer adoption."""

nlp = spacy.load("custom_ner_model")
doc = nlp(text)
displacy.serve(doc, style="ent")

Tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is a unicode text, and the output is a Doc object. To construct a Doc object, you need a Vocab instance, a sequence of word strings, and optionally a sequence of spaces booleans, which allow you to maintain alignment of the tokens into the original string.

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. Each Doc consists of individual tokens, and we can iterate over them:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)
0      1   2        3   4       5     6        7    8  9  10
Apple  is  looking  at  buying  U.K.  startup  for  $  1  billion

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

  1. Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.

  2. Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.

If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.

Example of the tokenization process
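
A minimal sketch that walks through the checks described above, using the string “Let’s go to N.Y.!” (the expected output assumes the default English rules):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Let's go to N.Y.!")
# "Let's" is split by an exception rule, "N.Y." stays one token,
# and the trailing "!" is split off as a suffix
print([token.text for token in doc])  # ['Let', "'s", 'go', 'to', 'N.Y.', '!']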

While punctuation rules are usually pretty general, tokenizer exceptions strongly depend on the specifics of the individual language. This is why each available language has its own subclass like English or German, that loads in lists of hard-coded data and exception rules.

Tokenizer data

Global and language-specific tokenizer data is supplied via the language data in spacy/lang. The tokenizer exceptions define special cases like “don’t” in English, which needs to be split into two tokens: {ORTH: "do"} and {ORTH: "n't", LEMMA: "not"}. The prefixes, suffixes and infixes mostly define punctuation rules – for example, when to split off periods (at the end of a sentence), and when to leave tokens containing periods intact (abbreviations like “U.S.”).

Language data architecture

Tokenization rules that are specific to one language, but can be generalized across that language, should ideally live in the language data in spacy/lang – we always appreciate pull requests! Anything that’s specific to a domain or text type – like financial trading abbreviations, or Bavarian youth slang – should be added as a special case rule to your tokenizer instance. If you’re dealing with a lot of customizations, it might make sense to create an entirely custom subclass.


Adding special case tokenization rules

Most domains have at least some idiosyncrasies that require custom tokenization rules. These could be very specific expressions, or abbreviations only used in this particular field. Here’s how to add a special case rule to an existing Tokenizer instance:

import spacy
from spacy.symbols import ORTH, LEMMA, POS, TAG

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"gimme that")  # phrase to tokenize
print([w.text for w in doc])  # ['gimme', 'that']

# add special case rule
special_case = [{ORTH: u"gim", LEMMA: u"give", POS: u"VERB"}, {ORTH: u"me"}]
nlp.tokenizer.add_special_case(u"gimme", special_case)

# check new tokenization
print([w.text for w in nlp(u"gimme that")])  # ['gim', 'me', 'that']

# Pronoun lemma is returned as -PRON-!
print([w.lemma_ for w in nlp(u"gimme that")])  # ['give', '-PRON-', 'that']

The special case doesn’t have to match an entire whitespace-delimited substring. The tokenizer will incrementally split off punctuation, and keep looking up the remaining substring:

assert "gimme" not in [w.text for w in nlp(u"gimme!")]
assert "gimme" not in [w.text for w in nlp(u'("...gimme...?")')]

The special case rules have precedence over the punctuation splitting:

special_case = [{ORTH: u"...gimme...?", LEMMA: u"give", TAG: u"VB"}]
nlp.tokenizer.add_special_case(u"...gimme...?", special_case)
assert len(nlp(u"...gimme...?")) == 1

Because the special-case rules allow you to set arbitrary token attributes, such as the part-of-speech, lemma, etc, they make a good mechanism for arbitrary fix-up rules. Having this logic live in the tokenizer isn’t very satisfying from a design perspective, however, so the API may eventually be exposed on the Language class itself.

How spaCy’s tokenizer works

spaCy introduces a novel tokenization algorithm that gives a better balance between performance, ease of definition, and ease of alignment into the original string.

After consuming a prefix or infix, we consult the special cases again. We want the special cases to handle things like “don’t” in English, and we want the same rule to work for “(don’t)!“. We do this by splitting off the open bracket, then the exclamation, then the close bracket, and finally matching the special-case. Here’s an implementation of the algorithm in Python, optimized for readability rather than performance:

def tokenizer_pseudo_code(text, special_cases,
                          find_prefix, find_suffix, find_infixes):
    tokens = []
    for substring in text.split(' '):
        suffixes = []
        while substring:
            if substring in special_cases:
                # Exact match against the special cases: emit the predefined
                # tokens and stop processing this substring.
                tokens.extend(special_cases[substring])
                substring = ''
            elif find_prefix(substring) is not None:
                # Split off a prefix, e.g. an open bracket or quote.
                split = find_prefix(substring)
                tokens.append(substring[:split])
                substring = substring[split:]
            elif find_suffix(substring) is not None:
                # Split off a suffix and set it aside; suffixes are added
                # back in reverse order once the substring is consumed.
                split = find_suffix(substring)
                suffixes.append(substring[-split:])
                substring = substring[:-split]
            elif find_infixes(substring):
                # Split on infixes like hyphens, keeping the separators.
                infixes = find_infixes(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                substring = substring[offset:]
            else:
                # Nothing else matched: the remainder is a single token.
                tokens.append(substring)
                substring = ''
        tokens.extend(reversed(suffixes))
    return tokens
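
To see the pseudo-code in action, here’s a hedged usage sketch with toy prefix, suffix and infix finders – the regexes below are illustrative stand-ins, not spaCy’s actual punctuation rules:

import re

def find_prefix(string):
    match = re.match(r"""[\["(']""", string)
    return match.end() if match else None

def find_suffix(string):
    match = re.search(r"""[\])"'!.,?]$""", string)
    return len(string) - match.start() if match else None

def find_infixes(string):
    return list(re.finditer(r"-", string))

special_cases = {"don't": ["do", "n't"]}
print(tokenizer_pseudo_code("(don't) stop-go!", special_cases,
                            find_prefix, find_suffix, find_infixes))
# ['(', 'do', "n't", ')', 'stop', '-', 'go', '!']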

The algorithm can be summarized as follows:

  1. Iterate over space-separated substrings
  2. Check whether we have an explicitly defined rule for this substring. If we do, use it.
  3. Otherwise, try to consume a prefix.
  4. If we consumed a prefix, go back to the beginning of the loop, so that special-cases always get priority.
  5. If we didn’t consume a prefix, try to consume a suffix.
  6. If we can’t consume a prefix or suffix, look for “infixes” — stuff like hyphens etc.
  7. Once we can’t consume any more of the string, handle it as a single token.

Customizing spaCy’s Tokenizer class

Let’s imagine you wanted to create a tokenizer for a new language or specific domain. There are five things you would need to define:

  1. A dictionary of special cases. This handles things like contractions, units of measurement, emoticons, certain abbreviations, etc.
  2. A function prefix_search, to handle preceding punctuation, such as open quotes, open brackets, etc.
  3. A function suffix_search, to handle succeeding punctuation, such as commas, periods, close quotes, etc.
  4. A function infixes_finditer, to handle non-whitespace separators, such as hyphens etc.
  5. An optional boolean function token_match matching strings that should never be split, overriding the previous rules. Useful for things like URLs or numbers.

You shouldn’t usually need to create a Tokenizer subclass. Standard usage is to use re.compile() to build a regular expression object, and pass its .search() and .finditer() methods:

import re
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[[("']''')
suffix_re = re.compile(r'''[])"']$''')
infix_re = re.compile(r'''[-~]''')
simple_url_re = re.compile(r'''^https?://''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=simple_url_re.match)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"hello-world.")
print([t.text for t in doc])

If you need to subclass the tokenizer instead, the relevant methods to specialize are find_prefix, find_suffix and find_infix.

Adding to existing rule sets

In many situations, you don’t necessarily need entirely custom rules. Sometimes you just want to add another character to the prefixes, suffixes or infixes. The default prefix, suffix and infix rules are available via the nlp object’s Defaults and the Tokenizer.suffix_search attribute is writable, so you can overwrite it with a compiled regular expression object built from the modified default rules. spaCy ships with utility functions to help you compile the regular expressions – for example, compile_suffix_regex:

suffixes = nlp.Defaults.suffixes + (r'''-+$''',)
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

For an overview of the default regular expressions, see lang/punctuation.py. The Tokenizer.suffix_search attribute should be a function which takes a unicode string and returns a regex match object or None. Usually we use the .search attribute of a compiled regex object, but you can use some other function that behaves the same way.
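
The same pattern works for the other rule sets. For example, here’s a hedged sketch that adds “+” as an extra infix character (purely as an illustration) using compile_infix_regex:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
infixes = list(nlp.Defaults.infixes) + [r"\+"]  # add "+" to the default infix patterns
infix_regex = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
print([t.text for t in nlp(u"two+two")])  # ['two', '+', 'two']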

Hooking an arbitrary tokenizer into the pipeline

The tokenizer is the first component of the processing pipeline and the only one that can’t be replaced by writing to nlp.pipeline. This is because it has a different signature from all the other components: it takes a text and returns a Doc, whereas all other components expect to already receive a tokenized Doc.

The processing pipeline

To overwrite the existing tokenizer, you need to replace nlp.tokenizer with a custom function that takes a text, and returns a Doc.

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = my_tokenizer
Argument  Type     Description
text      unicode  The raw text to tokenize.

Example: A custom whitespace tokenizer

To construct the tokenizer, we usually want attributes of the nlp pipeline. Specifically, we want the tokenizer to hold a reference to the vocabulary object. Let’s say we have the following class as our tokenizer:

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # All tokens 'own' a subsequent space character in this tokenizer
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(u"What's happened to me? he thought. It wasn't a dream.")
print([t.text for t in doc])

As you can see, we need a Vocab instance to construct this — but we won’t have it until we get back the loaded nlp object. The simplest solution is to build the tokenizer in two steps. This also means that you can reuse the “tokenizer factory” and initialize it with different instances of Vocab.
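
A minimal sketch of that two-step pattern, reusing the WhitespaceTokenizer class defined above (the factory function name is just an illustration):

def whitespace_tokenizer_factory(vocab):
    # Step 2: bind the tokenizer to whichever Vocab the loaded pipeline provides
    return WhitespaceTokenizer(vocab)

nlp = spacy.load("en_core_web_sm")                       # Step 1: load the pipeline
nlp.tokenizer = whitespace_tokenizer_factory(nlp.vocab)  # reuse the factory with its vocab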

Bringing your own annotations

spaCy generally assumes by default that your data is raw text. However, sometimes your data is partially annotated, e.g. with pre-existing tokenization, part-of-speech tags, etc. The most common situation is that you have pre-defined tokenization. If you have a list of strings, you can create a Doc object directly. Optionally, you can also specify a list of boolean values, indicating whether each word has a subsequent space.

import spacy
from spacy.tokens import Doc
from spacy.lang.en import English

nlp = English()
doc = Doc(nlp.vocab, words=[u"Hello", u",", u"world", u"!"],
          spaces=[False, True, False, False])
print([(t.text, t.text_with_ws, t.whitespace_) for t in doc])

If provided, the spaces list must be the same length as the words list. The spaces list affects the doc.text, span.text, token.idx, span.start_char and span.end_char attributes. If you don’t provide a spaces sequence, spaCy will assume that all words are whitespace delimited.

import spacy
from spacy.tokens import Doc
from spacy.lang.en import English

nlp = English()
bad_spaces = Doc(nlp.vocab, words=[u"Hello", u",", u"world", u"!"])
good_spaces = Doc(nlp.vocab, words=[u"Hello", u",", u"world", u"!"],
                  spaces=[False, True, False, False])

print(bad_spaces.text)   # 'Hello , world !'
print(good_spaces.text)  # 'Hello, world!'

Once you have a Doc object, you can write to its attributes to set the part-of-speech tags, syntactic dependencies, named entities and other attributes. For details, see the respective usage pages.
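
As a small hedged sketch of that idea, here’s how you might set a tag and an entity span on a manually constructed Doc (a real workflow would usually fill in many more attributes):

from spacy.tokens import Doc, Span
from spacy.lang.en import English

nlp = English()
doc = Doc(nlp.vocab, words=[u"Berlin", u"is", u"nice"], spaces=[True, True, False])

doc[0].tag_ = u"NNP"                      # write a fine-grained tag directly to the token
gpe = nlp.vocab.strings.add(u"GPE")       # make sure the label string is in the store
doc.ents = [Span(doc, 0, 1, label=gpe)]   # mark "Berlin" as an entity

print([(t.text, t.tag_) for t in doc])
print([(ent.text, ent.label_) for ent in doc.ents])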

Merging and splitting v2.1

The Doc.retokenize context manager lets you merge and split tokens. Modifications to the tokenization are stored and performed all at once when the context manager exits. To merge several tokens into one single token, pass a Span to retokenizer.merge. An optional dictionary of attrs lets you set attributes that will be assigned to the merged token – for example, the lemma, part-of-speech tag or entity type. By default, the merged token will receive the same attributes as the merged span’s root.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in New York")
print("Before:", [token.text for token in doc])

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new york"})
print("After:", [token.text for token in doc])

If an attribute in the attrs is a context-dependent token attribute, it will be applied to the underlying Token. For example LEMMA, POS or DEP only apply to a word in context, so they’re token attributes. If an attribute is a context-independent lexical attribute, it will be applied to the underlying Lexeme, the entry in the vocabulary. For example, LOWER or IS_STOP apply to all words of the same spelling, regardless of the context.

The retokenizer.split method allows splitting one token into two or more tokens. This can be useful for cases where tokenization rules alone aren’t sufficient. For example, you might want to split “its” into the tokens “it” and “is” — but not the possessive pronoun “its”. You can write rule-based logic that can find only the correct “its” to split, but by that time, the Doc will already be tokenized.

This process of splitting a token requires more settings, because you need to specify the text of the individual tokens, optional per-token attributes and how they should be attached to the existing syntax tree. This can be done by supplying a list of heads – either the token to attach the newly split token to, or a (token, subtoken) tuple if the newly split token should be attached to another subtoken. In this case, “New” should be attached to “York” (the second split subtoken) and “York” should be attached to “in”.

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in NewYork")
print("Before:", [token.text for token in doc])
displacy.render(doc)  # displacy.serve if you're not in a Jupyter environment

with doc.retokenize() as retokenizer:
    heads = [(doc[3], 1), doc[2]]
    attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
    retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
print("After:", [token.text for token in doc])
displacy.render(doc)  # displacy.serve if you're not in a Jupyter environment

Specifying the heads as a list of token or (token, subtoken) tuples allows attaching split subtokens to other subtokens, without having to keep track of the token indices after splitting.

Token   Head         Description
"New"   (doc[3], 1)  Attach this token to the second subtoken (index 1) that doc[3] will be split into, i.e. “York”.
"York"  doc[2]       Attach this token to doc[2] in the original Doc, i.e. “in”.

If you don’t care about the heads (for example, if you’re only running the tokenizer and not the parser), you can attach each subtoken to itself:

doc = nlp("I live in NewYorkCity")
with doc.retokenize() as retokenizer:
    heads = [(doc[3], 0), (doc[3], 1), (doc[3], 2)]
    retokenizer.split(doc[3], ["New", "York", "City"], heads=heads)

Overwriting custom extension attributes

If you’ve registered custom extension attributes, you can overwrite them during tokenization by providing a dictionary of attribute names mapped to new values as the "_" key in the attrs. For merging, you need to provide one dictionary of attributes for the resulting merged token. For splitting, you need to provide a list of dictionaries with custom attributes, one per split subtoken.

import spacy
from spacy.tokens import Token

# Register a custom token attribute, token._.is_musician
Token.set_extension("is_musician", default=False)

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like David Bowie")
print("Before:", [(token.text, token._.is_musician) for token in doc])

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:4], attrs={"_": {"is_musician": True}})
print("After:", [(token.text, token._.is_musician) for token in doc])

Sentence Segmentation

A Doc object’s sentences are available via the Doc.sents property. Unlike other libraries, spaCy uses the dependency parse to determine sentence boundaries. This is usually more accurate than a rule-based approach, but it also means you’ll need a statistical model and accurate predictions. If your texts are closer to general-purpose news or web text, this should work well out-of-the-box. For social media or conversational text that doesn’t follow the same rules, your application may benefit from a custom rule-based implementation. You can either plug a rule-based component into your processing pipeline or use the SentenceSegmenter component with a custom strategy.

Default: Using the dependency parse Needs model

To view a Doc’s sentences, you can iterate over the Doc.sents, a generator that yields Span objects.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)

Setting boundaries manually

spaCy’s dependency parser respects already set boundaries, so you can preprocess your Doc using custom rules before it’s parsed. This can be done by adding a custom pipeline component. Depending on your text, this may also improve accuracy, since the parser is constrained to predict parses consistent with the sentence boundaries.

Here’s an example of a component that implements a pre-processing rule for splitting on '...' tokens. The component is added before the parser, which is then used to further segment the text. This approach can be useful if you want to implement additional rules specific to your data, while still being able to take advantage of dependency-based sentence segmentation.

import spacy

text = u"this is a sentence...hello...and another sentence."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("Before:", [sent.text for sent in doc.sents])

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i+1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)
print("After:", [sent.text for sent in doc.sents])

Rule-based pipeline component

The sentencizer component is a pipeline component that splits sentences on punctuation like ., ! or ?. You can plug it into your pipeline if you only need sentence boundaries without the dependency parse. Note that Doc.sents will raise an error if no sentence boundaries are set.

import spacy
from spacy.lang.en import English

nlp = English()  # just the language with no model
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)
doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)

Custom rule-based strategy

If you want to implement your own strategy that differs from the default rule-based approach of splitting on punctuation, you can also instantiate the SentenceSegmenter directly and pass in your own strategy. The strategy should be a function that takes a Doc object and yields a Span for each sentence. Here’s an example of a custom segmentation strategy for splitting on newlines only:

from spacy.lang.en import English
from spacy.pipeline import SentenceSegmenter

def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline and not word.is_space:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text == '\n':
            seen_newline = True
    if start < len(doc):
        yield doc[start:len(doc)]

nlp = English()  # Just the language with no model
sentencizer = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sentencizer)
doc = nlp(u"This is a sentence\n\nThis is another sentence\nAnd more")
for sent in doc.sents:
    print([token.text for token in sent])