Customizing the tokenizer

Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is a unicode text, and the output is a Doc object. To construct a Doc object, you need a Vocab instance, a sequence of word strings, and optionally a sequence of spaces booleans, which allow you to maintain alignment of the tokens into the original string.
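
For example, here's a minimal sketch of constructing a Doc directly, using an empty Vocab for illustration. The spaces sequence records whether each token is followed by a space, so the original text can be reproduced exactly:

from spacy.vocab import Vocab
from spacy.tokens import Doc

vocab = Vocab()
doc = Doc(vocab, words=[u'Hello', u',', u'world', u'!'],
          spaces=[False, True, False, False])
assert [w.text for w in doc] == [u'Hello', u',', u'world', u'!']
assert doc.text == u'Hello, world!'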

Adding special case tokenization rules

Most domains have at least some idiosyncrasies that require custom tokenization rules. Here's how to add a special case rule to an existing Tokenizer instance:

import spacy
from spacy.symbols import ORTH, LEMMA, POS, TAG

nlp = spacy.load('en')
assert [w.text for w in nlp(u'gimme that')] == [u'gimme', u'that']
nlp.tokenizer.add_special_case(u'gimme',
    [
        {ORTH: u'gim', LEMMA: u'give', POS: u'VERB'},
        {ORTH: u'me'}])
assert [w.text for w in nlp(u'gimme that')] == [u'gim', u'me', u'that']
assert [w.lemma_ for w in nlp(u'gimme that')] == [u'give', u'me', u'that']

The special case doesn't have to match an entire whitespace-delimited substring. The tokenizer will incrementally split off punctuation, and keep looking up the remaining substring:

assert 'gimme' not in [w.text for w in nlp(u'gimme!')]
assert 'gimme' not in [w.text for w in nlp(u'("...gimme...?")')]
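
For example, continuing the session above, the exclamation mark is split off as a suffix and the special case then applies to the remaining "gimme":

assert [w.text for w in nlp(u'gimme!')] == [u'gim', u'me', u'!']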

The special case rules have precedence over the punctuation splitting:

nlp.tokenizer.add_special_case(u'...gimme...?',
    [{ORTH: u'...gimme...?', LEMMA: u'give', TAG: u'VB'}])
assert len(nlp(u'...gimme...?')) == 1

Because the special-case rules allow you to set arbitrary token attributes, such as the part-of-speech, lemma, etc., they make a good mechanism for arbitrary fix-up rules. Having this logic live in the tokenizer isn't very satisfying from a design perspective, however, so the API may eventually be exposed on the Language class itself.

How spaCy's tokenizer works

spaCy introduces a novel tokenization algorithm that gives a better balance between performance, ease of definition, and ease of alignment into the original string.

After consuming a prefix or suffix, we consult the special cases again. We want the special cases to handle things like "don't" in English, and we want the same rule to work for "(don't)!". We do this by splitting off the open bracket, then the exclamation mark, then the close bracket, and finally matching the special case. Here's an implementation of the algorithm in Python, optimized for readability rather than performance:

def tokenizer_pseudo_code(text, find_prefix, find_suffix,
                          find_infixes, special_cases):
    tokens = []
    for substring in text.split(' '):
        suffixes = []
        while substring:
            if substring in special_cases:
                # Special cases take priority over the punctuation rules
                tokens.extend(special_cases[substring])
                substring = ''
            elif find_prefix(substring) is not None:
                split = find_prefix(substring)
                tokens.append(substring[:split])
                substring = substring[split:]
            elif find_suffix(substring) is not None:
                split = find_suffix(substring)
                suffixes.append(substring[split:])
                substring = substring[:split]
            elif find_infixes(substring):
                infixes = find_infixes(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                substring = substring[offset:]
            else:
                tokens.append(substring)
                substring = ''
        # Suffixes were split off back-to-front, so restore their order
        tokens.extend(reversed(suffixes))
    return tokens

The algorithm can be summarized as follows:

  1. Iterate over space-separated substrings
  2. Check whether we have an explicitly defined rule for this substring. If we do, use it.
  3. Otherwise, try to consume a prefix.
  4. If we consumed a prefix, go back to the beginning of the loop, so that special-cases always get priority.
  5. If we didn't consume a prefix, try to consume a suffix.
  6. If we can't consume a prefix or suffix, look for "infixes" — stuff like hyphens etc.
  7. Once we can't consume any more of the string, handle it as a single token.
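
To see that control flow end to end, here's a hypothetical driver for the tokenizer_pseudo_code function above. It uses toy regular-expression rules, and plain strings as the special-case values (in spaCy itself the values are attribute dictionaries):

import re

# Toy rules, for illustration only
prefix_re = re.compile(r'''^[\("']''')
suffix_re = re.compile(r'''[\)"'!?.,]$''')
infix_re = re.compile(r'''[-~]''')

def find_prefix(substring):
    # Return the offset where the prefix ends, or None
    match = prefix_re.search(substring)
    return match.end() if match else None

def find_suffix(substring):
    # Return the offset where the suffix starts, as the pseudo-code expects
    match = suffix_re.search(substring)
    return match.start() if match else None

def find_infixes(substring):
    return list(infix_re.finditer(substring))

special_cases = {u"don't": [u'do', u"n't"]}

tokens = tokenizer_pseudo_code(u"(don't)!", find_prefix, find_suffix,
                               find_infixes, special_cases)
assert tokens == [u'(', u'do', u"n't", u')', u'!']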

Customizing spaCy's Tokenizer class

Let's imagine you wanted to create a tokenizer for a new language. There are four things you would need to define:

  1. A dictionary of special cases. This handles things like contractions, units of measurement, emoticons, certain abbreviations, etc.
  2. A function prefix_search, to handle preceding punctuation, such as open quotes, open brackets, etc.
  3. A function suffix_search, to handle succeeding punctuation, such as commas, periods, close quotes, etc.
  4. A function infix_finditer, to handle non-whitespace separators, such as hyphens etc.

You shouldn't usually need to create a Tokenizer subclass. Standard usage is to use re.compile() to build a regular expression object, and pass its .search() and .finditer() methods:

import re
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']''')
def create_tokenizer(nlp):
    return Tokenizer(nlp.vocab,
            prefix_search=prefix_re.search,
            suffix_search=suffix_re.search)

nlp = spacy.load('en', create_make_doc=create_tokenizer)
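
A slightly fuller sketch covers all four components at once. The rules and infix_finditer arguments are assumptions here, so check the Tokenizer API documentation for your version; the patterns themselves are only illustrative:

import re
import spacy
from spacy.symbols import ORTH, LEMMA
from spacy.tokenizer import Tokenizer

# Anchored, illustrative patterns
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[-~]''')

def create_full_tokenizer(nlp):
    # Special cases, prefixes, suffixes and infixes in one place
    special_cases = {u"don't": [{ORTH: u'do'}, {ORTH: u"n't", LEMMA: u'not'}]}
    return Tokenizer(nlp.vocab,
                     rules=special_cases,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer)

nlp = spacy.load('en', create_make_doc=create_full_tokenizer)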

If you need to subclass the tokenizer instead, the relevant methods to specialize are find_prefix, find_suffix and find_infix.

Hooking an arbitrary tokenizer into the pipeline

You can pass a custom tokenizer using the make_doc keyword when you're creating the pipeline:

import spacy

nlp = spacy.load('en', make_doc=my_tokenizer)

However, this approach often leaves us with a chicken-and-egg problem. To construct the tokenizer, we usually want attributes of the nlp pipeline. Specifically, we want the tokenizer to hold a reference to the pipeline's vocabulary object. Let's say we have the following class as our tokenizer:

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, nlp):
        self.vocab = nlp.vocab

    def __call__(self, text):
        words = text.split(' ')
        # All tokens 'own' a subsequent space character in this tokenizer
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

As you can see, we need a vocab instance to construct this — but we won't get the vocab instance until we get back the nlp object from spacy.load(). The simplest solution is to build the object in two steps:

nlp = spacy.load('en')
nlp.make_doc = WhitespaceTokenizer(nlp)
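
Continuing this example, every token now corresponds to a whitespace-delimited chunk:

doc = nlp(u"What's happened to me? he thought.")
assert [w.text for w in doc] == [u"What's", u'happened', u'to',
                                 u'me?', u'he', u'thought.']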

You can instead pass the class to the create_make_doc keyword, which is invoked as a callback once the nlp object is ready:

nlp = spacy.load('en', create_make_doc=WhitespaceTokenizer)

Finally, you can of course create your own subclass and define a bound make_doc method. The disadvantage of this approach is that spaCy uses inheritance to give each language-specific pipeline its own class. If you're working with multiple languages, a naive solution will therefore require one custom class per language. That quickly gets annoying, although you may be able to do something more generic with metaclasses or mixins, if that's the sort of thing you're into.
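
For a single language, a minimal sketch of that subclassing approach might look like this, assuming the English class from spacy.en and reusing the whitespace logic from above:

from spacy.en import English
from spacy.tokens import Doc

class WhitespaceEnglish(English):
    # Hypothetical subclass: make_doc is now a bound method, so no
    # callback or post-hoc assignment is needed
    def make_doc(self, text):
        words = text.split(' ')
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = WhitespaceEnglish()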

Read next: Adding languages