Library architecture

The central data structures in spaCy are the Doc and the Vocab. The Doc object owns the sequence of tokens and all their annotations. The Vocab object owns a set of look-up tables that make common information available across documents. By centralizing strings, word vectors and lexical attributes, we avoid storing multiple copies of this data. This saves memory, and ensures there’s a single source of truth.
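As a minimal sketch of this sharing, assuming a blank English pipeline created with `spacy.blank("en")` (tokenizer only, no trained model), two documents can be checked to point at the same Vocab and to refer to the same string by a single hash. The example sentences and variable names are illustrative only.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model needed.
nlp = spacy.blank("en")

doc1 = nlp("I like coffee")
doc2 = nlp("She makes coffee")

# Both documents reference the same Vocab instance as the pipeline.
shared_vocab = doc1.vocab is doc2.vocab is nlp.vocab

# The string "coffee" is stored once and referred to by its hash.
coffee_hash = nlp.vocab.strings["coffee"]
same_hash = doc1[2].orth == doc2[2].orth == coffee_hash

# The hash can be resolved back to the original string.
round_trip = nlp.vocab.strings[coffee_hash]

print(shared_vocab, same_hash, round_trip)
```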

Text annotations are also designed to allow a single source of truth: the Doc object owns the data, and Span and Token are views that point into it. The Doc object is constructed by the Tokenizer, and then modified in place by the components of the pipeline. The Language object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.
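The view relationship can be sketched as follows, again assuming a blank English pipeline. The sentence and the lemma value are made up for illustration; the point is that an annotation written through a Span or Token ends up in the Doc that owns the data.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a startup")

token = doc[0]   # Token view onto position 0
span = doc[4:7]  # Span view onto "buying a startup"

# Both views keep a reference to the Doc that owns the data.
views_point_to_doc = token.doc is doc and span.doc is doc

# Indexing into a span still yields a token of the parent Doc.
first_in_span = span[0]
parent_index = first_in_span.i

# Writing an annotation through the view changes the Doc itself.
first_in_span.lemma_ = "buy"
lemma_seen_from_doc = doc[4].lemma_

print(span.text, parent_index, lemma_seen_from_doc)
```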

Container objects

Doc: A container for accessing linguistic annotations.
Span: A slice from a Doc object.
Token: An individual token, i.e. a word, punctuation symbol, whitespace, etc.
Lexeme: An entry in the vocabulary. It’s a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc.
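The difference between a word token and a word type can be illustrated with a short sketch, assuming a blank English pipeline. The sentence is chosen so the same word form appears twice: each occurrence is its own Token, while both map to a single context-free Lexeme.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I saw a saw")

first_saw, second_saw = doc[1], doc[3]

# Two tokens: same text, different positions in the document.
positions = (first_saw.i, second_saw.i)

# One lexeme: the vocabulary entry for the word type "saw".
lexeme = nlp.vocab["saw"]
same_type = first_saw.orth == second_saw.orth == lexeme.orth

# Lexemes carry only context-independent attributes.
print(positions, same_type, lexeme.orth_, lexeme.is_alpha)
```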

Processing pipeline

Language: A text-processing pipeline. Usually you’ll load this once per process as nlp and pass the instance around your application.
Tokenizer: Segment text, and create Doc objects with the discovered segment boundaries.
Lemmatizer: Determine the base forms of words.
Morphology: Assign linguistic features like lemmas, noun case, verb tense etc. based on the word and its part-of-speech tag.
Tagger: Annotate part-of-speech tags on Doc objects.
DependencyParser: Annotate syntactic dependencies on Doc objects.
EntityRecognizer: Annotate named entities, e.g. persons or products, on Doc objects.
TextCategorizer: Assign categories or labels to Doc objects.
Matcher: Match sequences of tokens, based on pattern rules, similar to regular expressions.
PhraseMatcher: Match sequences of tokens based on phrases.
EntityRuler: Add entity spans to the Doc using token-based rules or exact phrase matches.
SentenceSegmenter: Implement custom sentence boundary detection logic that doesn’t require the dependency parse.
Other functions: Automatically apply something to the Doc, e.g. to merge spans of tokens.
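A pipeline component is essentially a callable that receives a Doc and returns it. The sketch below shows that flow by hand: the Tokenizer creates the Doc, then a function modifies it in place by merging a span of tokens. `merge_leading_pair` is a hypothetical function written for this example, not part of the library, and the sketch assumes a spaCy version that provides `Doc.retokenize` (2.1 or later). How a function like this is registered on the Language object differs between major versions, so it is simply called directly here.

```python
import spacy

nlp = spacy.blank("en")


def merge_leading_pair(doc):
    """Hypothetical component: merge the first two tokens into one."""
    if len(doc) >= 2:
        with doc.retokenize() as retokenizer:
            retokenizer.merge(doc[0:2])
    return doc


# Step 1: the Tokenizer turns raw text into a Doc.
doc = nlp.tokenizer("New York is big")
before = [token.text for token in doc]

# Step 2: a component receives the Doc and modifies it in place.
doc = merge_leading_pair(doc)
after = [token.text for token in doc]

print(before)
print(after)
```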

Other classes

Vocab: A lookup table for the vocabulary that allows you to access Lexeme objects.
StringStore: Map strings to and from hash values.
Vectors: Container class for vector data keyed by string.
GoldParse: Collection for training annotations.
GoldCorpus: An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER.
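These classes can also be used on their own, without loading a pipeline. The following sketch, with made-up words, builds a StringStore and a Vocab directly and constructs a Doc from a list of words. It assumes the import paths `spacy.strings`, `spacy.vocab` and `spacy.tokens`.

```python
from spacy.strings import StringStore
from spacy.tokens import Doc
from spacy.vocab import Vocab

# StringStore: a two-way mapping between strings and hash values.
store = StringStore(["apple", "orange"])
apple_hash = store["apple"]
apple_text = store[apple_hash]

# Vocab: creating a Doc from words adds their strings and lexemes.
vocab = Vocab()
doc = Doc(vocab, words=["hello", "world"])
world_is_known = "world" in vocab.strings

# Looking up a string in the Vocab returns its Lexeme.
lexeme = vocab["hello"]
same_entry = lexeme.orth == doc[0].orth

print(apple_text, world_is_known, same_entry)
```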