Rule-based matching

spaCy features a rule-matching engine that operates over tokens, similar to regular expressions. The rules can refer to token annotations and flags, and matches support callbacks to accept, modify and/or act on the match. The rule matcher also allows you to associate patterns with entity IDs, to allow some basic entity linking or disambiguation.

Here's a minimal example. We first add a pattern that specifies three tokens:

  1. A token whose lower-case form matches "hello"
  2. A token whose is_punct flag is set to True
  3. A token whose lower-case form matches "world"

Once we've added the pattern, we can call the matcher on a document to receive a list of (ent_id, label, start, end) tuples.

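import spacy
from spacy.matcher import Matcher
from spacy.attrs import IS_PUNCT, LOWER

# Assumes the English model is installed; any loaded pipeline provides a vocab.
nlp = spacy.load('en')

matcher = Matcher(nlp.vocab)
# Three token specifiers: "hello", then any punctuation, then "world".
matcher.add_pattern("HelloWorld", [{LOWER: "hello"}, {IS_PUNCT: True}, {LOWER: "world"}])

doc = nlp(u'Hello, world!')
matches = matcher(doc)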

The returned matches include the ID, to let you associate the matches with the patterns. You can also group multiple patterns together, which is useful when you have a knowledge base of entities you want to match, and you want to write multiple patterns for each entity.

Entities and patterns

from spacy.attrs import ORTH

matcher.add_entity(
    "GoogleNow", # Entity ID -- Helps you act on the match.
    {"ent_type": "PRODUCT", "wiki_en": "Google_Now"}, # Arbitrary attributes (optional)
)

matcher.add_pattern(
    "GoogleNow", # Entity ID -- Created if doesn't exist.
    [ # The pattern is a list of *Token Specifiers*.
        { # This Token Specifier matches tokens whose orth field is "Google"
          ORTH: "Google"
        },
        { # This Token Specifier matches tokens whose orth field is "Now"
          ORTH: "Now"
        }
    ],
    label=None # Can associate a label to the pattern-match, to handle it better.
)
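
To make the entity-linking idea concrete, here is a minimal pure-Python sketch of how the attributes stored for an entity ID could be used once matches come back. It does not call spaCy: the entities dict, the link_matches helper, the token list and the match tuples are all hypothetical stand-ins, and the IDs are plain strings for readability.

```python
# Hypothetical knowledge base: entity ID -> arbitrary attributes.
entities = {
    "GoogleNow": {"ent_type": "PRODUCT", "wiki_en": "Google_Now"},
}

def link_matches(doc, matches, entities):
    """Pair each matched span's text with the wiki_en attribute of its entity ID."""
    linked = []
    for ent_id, label, start, end in matches:
        attrs = entities.get(ent_id, {})
        linked.append((" ".join(doc[start:end]), attrs.get("wiki_en")))
    return linked

# A plain list of strings stands in for a tokenized document.
doc = ["I", "asked", "Google", "Now", "and", "google", "now"]
# Two hand-written matches, as if two patterns for the same entity had fired.
matches = [("GoogleNow", None, 2, 4), ("GoogleNow", None, 5, 7)]

linked = link_matches(doc, matches, entities)
print(linked)
```

Both surface forms resolve to the same entity because they share one entity ID, which is the point of grouping several patterns under it.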

Using quantifiers

Quantifier  Description            Example
!           match exactly 0 times  negation
*           match 0 or more times  optional, variable number
+           match 1 or more times  mandatory, variable number
?           match 0 or 1 times     optional, max one

There are no nested or scoped quantifiers. You can build those behaviours with acceptors and on_match callbacks.
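
To illustrate how the four quantifiers behave over a token sequence, here is a small pure-Python sketch. It is not spaCy's implementation: match_ends, lower_is and is_punct are hypothetical helpers, tokens are plain strings, and the sketch assumes that ! consumes one token for which the predicate fails, which is one reading of "match exactly 0 times".

```python
def match_ends(tokens, pattern, start=0):
    """Return the set of end indices where `pattern` can finish when begun at `start`.

    `pattern` is a list of (predicate, op) pairs; op is None, "!", "*", "+" or "?".
    """
    if not pattern:
        return {start}
    (pred, op), rest = pattern[0], pattern[1:]
    hit = start < len(tokens) and pred(tokens[start])
    ends = set()
    if op == "!":
        # Assumed reading: consume one token that does NOT satisfy the predicate.
        if start < len(tokens) and not hit:
            ends |= match_ends(tokens, rest, start + 1)
        return ends
    if op in ("*", "?"):
        # Zero occurrences are allowed: skip this specifier entirely.
        ends |= match_ends(tokens, rest, start)
    if hit:
        if op in ("*", "+"):
            # Variable number: after one token, behave like "*" on the next token.
            ends |= match_ends(tokens, [(pred, "*")] + rest, start + 1)
        else:
            # No quantifier or "?": consume exactly one token and move on.
            ends |= match_ends(tokens, rest, start + 1)
    return ends

def lower_is(word):
    """Build a predicate that compares a token's lower-case form to `word`."""
    return lambda tok: tok.lower() == word

def is_punct(tok):
    """Treat a token as punctuation if it has no alphanumeric characters."""
    return all(not ch.isalnum() for ch in tok)

tokens = ["Hello", ",", "!", "world"]
one_or_more = [(lower_is("hello"), None), (is_punct, "+"), (lower_is("world"), None)]
at_most_one = [(lower_is("hello"), None), (is_punct, "?"), (lower_is("world"), None)]

print(match_ends(tokens, one_or_more))  # matches: both punctuation tokens are absorbed
print(match_ends(tokens, at_most_one))  # no match: "?" allows at most one
```

Because each specifier carries its own quantifier and there is no grouping construct, anything like "repeat this two-token sequence" has to be handled outside the pattern.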

Acceptor functions

The acceptor keyword of matcher.add_entity() allows you to pass a function to reject or modify matches. The function you pass should take five arguments: doc, ent_id, label, start, and end. Return a falsy value to reject the match, or return a 4-tuple (ent_id, label, start, end) to accept it, with the values changed if you want to modify the match.

from spacy.attrs import LOWER
from spacy.tokens.doc import Doc

# Reuses the `matcher` object created in the first example.
titles = set(title.lower() for title in [u'Mr.', u'Dr.', u'Ms.', u'Admiral'])
# Register a custom lexical flag that is True for title terms.
IS_TITLE_TERM = matcher.vocab.add_flag(lambda string: string.lower() in titles)

def trim_title(doc, ent_id, label, start, end):
    # Drop a leading title from the match; otherwise accept the match unchanged.
    if doc[start].check_flag(IS_TITLE_TERM):
        return (ent_id, label, start+1, end)
    else:
        return (ent_id, label, start, end)

matcher.add_entity('PersonName', acceptor=trim_title)
matcher.add_pattern('PersonName', [{LOWER: u'mr.'}, {LOWER: u'cruise'}])
matcher.add_pattern('PersonName', [{LOWER: u'dr.'}, {LOWER: u'seuss'}])
doc = Doc(matcher.vocab, words=[u'Mr.', u'Cruise', u'likes', u'Dr.', u'Seuss'])
for ent_id, label, start, end in matcher(doc):
    print(doc[start:end].text)
    # Cruise
    # Seuss

Passing an acceptor function allows you to match patterns with arbitrary logic that can't easily be expressed by a finite-state machine. You can look at the entirety of the matched phrase, and its context in the document, and decide to move the boundaries or reject the match entirely.
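
An acceptor can also reject a match based on what surrounds it. The following pure-Python sketch shows that case under simplifying assumptions: a list of strings stands in for the document, the candidate matches are written by hand, and apply_acceptor and reject_before_inc are hypothetical names rather than part of spaCy.

```python
def apply_acceptor(doc, candidates, acceptor):
    """Keep only the candidates the acceptor approves, using the boundaries it returns."""
    accepted = []
    for ent_id, label, start, end in candidates:
        result = acceptor(doc, ent_id, label, start, end)
        if result:  # a falsy return value rejects the match
            accepted.append(result)
    return accepted

def reject_before_inc(doc, ent_id, label, start, end):
    """Hypothetical acceptor: reject a match that is directly followed by 'Inc.'."""
    if end < len(doc) and doc[end] == "Inc.":
        return None
    return (ent_id, label, start, end)

doc = ["Mr.", "Cruise", "joined", "Cruise", "Inc."]
# Hand-written candidate matches for the token "Cruise" at positions 1 and 3.
candidates = [("PersonName", None, 1, 2), ("PersonName", None, 3, 4)]

accepted = apply_acceptor(doc, candidates, reject_before_inc)
print(accepted)
```

The decision depends on the token after the match, which a pattern over the matched tokens alone cannot see.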

Callback functions

In spaCy <1.0, the Matcher automatically tagged matched phrases with entity types. Since spaCy 1.0, the matcher no longer acts on matches automatically. By default, the match list is returned for the user to act on. However, it's often more convenient to register the required actions as a callback. You can do this by passing a function to the on_match keyword argument of matcher.add_entity().

The matcher first collects all matches over the document. It then iterates over the matches, looks up the callback for the entity ID that was matched, and invokes it. The callback is passed four arguments: the matcher itself, the document, the position of the current match, and the total list of matches. This allows you to write callbacks that consider the entire set of matched phrases, so that you can resolve overlaps and other conflicts in whatever way you prefer.
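
The two-phase flow can be sketched in pure Python as follows. This is an illustration of the calling convention only, not spaCy's code: run_callbacks, keep_if_not_contained, the callbacks dict and the hand-written matches are hypothetical, a list of strings stands in for the document, and None stands in for the matcher.

```python
def run_callbacks(matcher, doc, matches, callbacks):
    """Invoke the callback registered for each match's entity ID, once all matches are known."""
    for i, (ent_id, label, start, end) in enumerate(matches):
        on_match = callbacks.get(ent_id)
        if on_match is not None:
            # Same four arguments as described above.
            on_match(matcher, doc, i, matches)

kept = []

def keep_if_not_contained(matcher, doc, i, matches):
    """Hypothetical callback: record match i unless a longer match contains it."""
    _, _, start, end = matches[i]
    for j, (_, _, other_start, other_end) in enumerate(matches):
        contains = other_start <= start and end <= other_end
        longer = (other_end - other_start) > (end - start)
        if j != i and contains and longer:
            return
    kept.append(doc[start:end])

doc = ["Google", "Now", "is", "a", "product"]
# Two overlapping matches: "Google" and "Google Now".
matches = [("Google", None, 0, 1), ("GoogleNow", None, 0, 2)]
callbacks = {"Google": keep_if_not_contained, "GoogleNow": keep_if_not_contained}

run_callbacks(None, doc, matches, callbacks)
print(kept)
```

Because every callback receives the full match list, the shorter "Google" match can be dropped in favour of the longer one without any extra bookkeeping in the matcher.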