Library Architecture

The central data structures in spaCy are the Language class, the Vocab and the Doc object. The Language class is used to process a text and turn it into a Doc object. It’s typically stored as a variable called nlp. The Doc object owns the sequence of tokens and all their annotations. By centralizing strings, word vectors and lexical attributes in the Vocab, we avoid storing multiple copies of this data. This saves memory, and ensures there’s a single source of truth.
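The relationship between these three objects can be sketched in a few lines. This is a minimal illustration, assuming spaCy v3 is installed; it uses a blank English pipeline so that no trained model has to be downloaded, and the example sentence is arbitrary.

```python
import spacy

# A blank pipeline is a Language subclass that only contains a tokenizer.
nlp = spacy.blank("en")

# Calling the Language object on a text returns a Doc.
doc = nlp("I love coffee")

# The Doc does not copy the vocabulary; it shares the pipeline's Vocab.
shares_vocab = doc.vocab is nlp.vocab

# Strings are stored once in the Vocab's StringStore and referenced by hash.
coffee_hash = nlp.vocab.strings["coffee"]
coffee_text = nlp.vocab.strings[coffee_hash]

# Tokens only hold the hash; the text is looked up through the shared Vocab.
same_hash = doc[2].orth == coffee_hash
```

Because every Doc created by `nlp` points at the same Vocab, the string "coffee" is stored once no matter how many documents contain it.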

Text annotations are also designed to allow a single source of truth: the Doc object owns the data, and Span and Token are views that point into it. The Doc object is constructed by the Tokenizer, and then modified in place by the components of the pipeline. The Language object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.
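To illustrate what "views" means here, the sketch below sets an annotation through a Token and reads it back through a Span over the same Doc. It assumes spaCy v3 and a blank English pipeline; the tag value is set by hand purely for demonstration, where a trained component would normally predict it.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")

span = doc[0:3]   # a Span is a slice of the Doc
token = doc[1]    # a Token is a single position in the Doc

# Both views keep a reference to the Doc that owns the data.
views_point_to_doc = span.doc is doc and token.doc is doc

# Writing through one view modifies the Doc in place ...
token.pos_ = "PRON"

# ... so every other view of the same position sees the change.
pos_seen_via_span = span[1].pos_
```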

[Figure: Library architecture]

Container objects

| Name | Description |
| --- | --- |
| Doc | A container for accessing linguistic annotations. |
| DocBin | A collection of Doc objects for efficient binary serialization. Also used for training data. |
| Example | A collection of training annotations, containing two Doc objects: the reference data and the predictions. |
| Language | Processing class that turns text into Doc objects. Different languages implement their own subclasses of it. The variable is typically called nlp. |
| Lexeme | An entry in the vocabulary. It’s a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. |
| Span | A slice from a Doc object. |
| SpanGroup | A named collection of spans belonging to a Doc. |
| Token | An individual token, i.e. a word, punctuation symbol, whitespace, etc. |
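As a rough sketch of how several of these containers fit together, the example below builds a Doc by hand, attaches a labelled Span in a SpanGroup, looks up a Lexeme, and round-trips the Doc through a DocBin. It assumes spaCy v3; the words, the label "GREETING" and the group name "greetings" are made up for illustration.

```python
import spacy
from spacy.tokens import Doc, DocBin, Span, SpanGroup

nlp = spacy.blank("en")

# A Doc can be constructed directly from words and trailing-space flags.
doc = Doc(nlp.vocab, words=["Hello", "world", "!"], spaces=[True, False, False])

# A Span is a slice of the Doc, optionally carrying a label.
span = Span(doc, 0, 2, label="GREETING")

# A SpanGroup is a named collection of spans stored on the Doc.
doc.spans["greetings"] = SpanGroup(doc, name="greetings", spans=[span])

# A Lexeme is the context-free vocabulary entry for a word type.
lexeme = nlp.vocab["world"]

# A DocBin serializes one or more Docs to bytes and restores them again.
doc_bin = DocBin()
doc_bin.add(doc)
data = doc_bin.to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
```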

Processing pipeline

The processing pipeline consists of one or more pipeline components that are called on the Doc in order. The tokenizer runs before the components. Pipeline components can be added using Language.add_pipe. They can contain a statistical model and trained weights, or only make rule-based modifications to the Doc. spaCy provides a range of built-in components for different language processing tasks and also allows adding custom components.
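A minimal sketch of this flow, assuming spaCy v3: a built-in rule-based component and a custom component are added with `Language.add_pipe` and then run in order after the tokenizer. The component name `arch_demo_token_counter` and the `n_tokens` key are hypothetical names chosen for this example.

```python
import spacy
from spacy.language import Language


# Register a custom, stateless component under a (hypothetical) name.
@Language.component("arch_demo_token_counter")
def token_counter(doc):
    # A component receives the Doc, modifies it in place and returns it.
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")                    # tokenizer only, no components yet
nlp.add_pipe("sentencizer")                # built-in rule-based component
nlp.add_pipe("arch_demo_token_counter")    # custom component, added last

doc = nlp("This is a sentence. This is another one.")

pipe_names = nlp.pipe_names
n_sentences = len(list(doc.sents))
n_tokens = doc.user_data["n_tokens"]
```

The tokenizer does not appear in `nlp.pipe_names` because it always runs first and is not a regular pipeline component.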

[Figure: The processing pipeline]

| Name | Description |
| --- | --- |
| AttributeRuler | Set token attributes using matcher rules. |
| DependencyParser | Predict syntactic dependencies. |
| EditTreeLemmatizer | Predict base forms of words. |
| EntityLinker | Disambiguate named entities to nodes in a knowledge base. |
| EntityRecognizer | Predict named entities, e.g. persons or products. |
| EntityRuler | Add entity spans to the Doc using token-based rules or exact phrase matches. |
| Lemmatizer | Determine the base forms of words using rules and lookups. |
| Morphologizer | Predict morphological features and coarse-grained part-of-speech tags. |
| SentenceRecognizer | Predict sentence boundaries. |
| Sentencizer | Implement rule-based sentence boundary detection that doesn’t require the dependency parse. |
| Tagger | Predict part-of-speech tags. |
| TextCategorizer | Predict categories or labels over the whole document. |
| Tok2Vec | Apply a “token-to-vector” model and set its outputs. |
| Tokenizer | Segment raw text and create Doc objects from the words. |
| TrainablePipe | Class that all trainable pipeline components inherit from. |
| Transformer | Use a transformer model and set its outputs. |
| Other functions | Automatically apply something to the Doc, e.g. to merge spans of tokens. |
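As one example of a rule-based component from this list, the sketch below uses the EntityRuler to add entity spans from an exact phrase and from a token pattern. It assumes spaCy v3, where the component is registered under the factory name `entity_ruler`; the patterns and the sentence are illustrative only.

```python
import spacy

nlp = spacy.blank("en")

# add_pipe returns the component instance so it can be configured.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        # Exact phrase match.
        {"label": "ORG", "pattern": "Apple"},
        # Token-based rule: one dict per token.
        {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    ]
)

doc = nlp("Apple is opening its first big office in San Francisco.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
```

No trained weights are involved here; the component only applies the rules it was given.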

Matchers

Matchers help you find and extract information from Doc objects based on match patterns describing the sequences you’re looking for. A matcher operates on a Doc and gives you access to the matched tokens in context.

| Name | Description |
| --- | --- |
| DependencyMatcher | Match sequences of tokens based on dependency trees using Semgrex operators. |
| Matcher | Match sequences of tokens, based on pattern rules, similar to regular expressions. |
| PhraseMatcher | Match sequences of tokens based on phrases. |
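The following sketch shows the Matcher and PhraseMatcher on a tokenized Doc, assuming spaCy v3. The rule names "HELLO_WORLD" and "PERSON_NAME" and the example text are invented for illustration. The DependencyMatcher is left out because it needs a dependency parse, which a blank pipeline does not produce.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.blank("en")
doc = nlp("Hello, world! Barack Obama says hello world.")

# Token-based rules: "hello", an optional punctuation token, then "world".
matcher = Matcher(nlp.vocab)
matcher.add(
    "HELLO_WORLD",
    [[{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]],
)

# Each match is a (match_id, start, end) triple of token offsets into the Doc.
token_matches = sorted(doc[start:end].text for _, start, end in matcher(doc))

# Phrase-based matching: patterns are Doc objects rather than token rules.
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("PERSON_NAME", [nlp.make_doc("Barack Obama")])
phrase_matches = [doc[start:end].text for _, start, end in phrase_matcher(doc)]
```

Because matches are returned as offsets into the Doc, slicing with them yields Span objects that give access to the matched tokens in context.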

Other classes

| Name | Description |
| --- | --- |
| Corpus | Class for managing annotated corpora for training and evaluation data. |
| KnowledgeBase | Abstract base class for storage and retrieval of data for entity linking. |
| InMemoryLookupKB | Implementation of KnowledgeBase storing all data in memory. |
| Candidate | Object associating a textual mention with a specific entity contained in a KnowledgeBase. |
| Lookups | Container for convenient access to large lookup tables and dictionaries. |
| MorphAnalysis | A morphological analysis. |
| Morphology | Store morphological analyses and map them to and from hash values. |
| Scorer | Compute evaluation scores. |
| StringStore | Map strings to and from hash values. |
| Vectors | Container class for vector data keyed by string. |
| Vocab | The shared vocabulary that stores strings and gives you access to Lexeme objects. |
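Two of these helper classes can be used on their own, without a pipeline. The sketch below, assuming spaCy v3, maps strings to and from hashes with a StringStore and stores a small dictionary in a Lookups container. The table name "lemma_lookup" and its entries are hypothetical.

```python
from spacy.lookups import Lookups
from spacy.strings import StringStore

# StringStore: bidirectional mapping between strings and 64-bit hashes.
store = StringStore(["apple"])
apple_hash = store["apple"]        # string -> hash
apple_text = store[apple_hash]     # hash -> string
orange_hash = store.add("orange")  # add() returns the hash of the new string

# Lookups: a container of named tables, each behaving like a dictionary.
lookups = Lookups()
lookups.add_table("lemma_lookup", {"going": "go", "went": "go"})
table = lookups.get_table("lemma_lookup")
lemma = table["went"]
```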