Whether you’re new to spaCy, or just want to brush up on some NLP basics and implementation details – this page should have you covered. Each section will explain one of spaCy’s features in simple terms and with examples or illustrations. Some sections will also reappear across the usage guides as a quick introduction.
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.
If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?
spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.
In the documentation, you’ll come across mentions of spaCy’s features and capabilities. Some of them refer to linguistic concepts, while others are related to more general machine learning functionality.
|Tokenization||Segmenting text into words, punctuation marks etc.|
|Part-of-speech (POS) Tagging||Assigning word types to tokens, like verb or noun.|
|Dependency Parsing||Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.|
|Lemmatization||Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.|
|Sentence Boundary Detection (SBD)||Finding and segmenting individual sentences.|
|Named Entity Recognition (NER)||Labelling named “real-world” objects, like persons, companies or locations.|
|Entity Linking (EL)||Disambiguating textual entities to unique identifiers in a knowledge base.|
|Similarity||Comparing words, text spans and documents and how similar they are to each other.|
|Text Classification||Assigning categories or labels to a whole document, or parts of a document.|
|Rule-based Matching||Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.|
|Training||Updating and improving a statistical model’s predictions.|
|Serialization||Saving objects to files or byte strings.|
While some of spaCy’s features work independently, others require trained pipelines to be loaded, which enable spaCy to predict linguistic annotations – for example, whether a word is a verb or a noun. A trained pipeline can consist of multiple components that use a statistical model trained on labeled data. spaCy currently offers trained pipelines for a variety of languages, which can be installed as individual Python modules. Pipeline packages can differ in size, speed, memory usage, accuracy and the data they include. The package you choose always depends on your use case and the texts you’re working with. For a general-purpose use case, the small, default packages are always a good start. They typically include the following components:
- Binary weights for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.
- Lexical entries in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
- Data files like lemmatization rules and lookup tables.
- Word vectors, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
- Configuration options, like the language and processing pipeline settings and model implementations to use, to put spaCy in the correct state when you load the pipeline.
spaCy provides a variety of linguistic annotations to give you insights into a text’s grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you’re analyzing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether “google” is used as a verb, or refers to the website or company in a specific context.
Once you’ve downloaded and installed a trained pipeline, you can load it via
spacy.load. This will return a Language object containing all components and
data needed to process text. We usually call it nlp. Calling the nlp object on
a string of text will return a processed Doc.
Even though a
Doc is processed – e.g. split into individual words and
annotated – it still holds all information of the original text, like
whitespace characters. You can always get the offset of a token into the
original string, or reconstruct the original by joining the tokens and their
trailing whitespace. This way, you’ll never lose any information when processing
text with spaCy.
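A minimal sketch of this non-destructive behaviour follows. It assumes a blank English pipeline (tokenizer only, so no trained package is needed) and an example sentence chosen for illustration. token.idx is the character offset of a token, and token.text_with_ws is its text plus any trailing whitespace.

```python
import spacy

# Blank pipeline: only the rule-based tokenizer, no trained components.
nlp = spacy.blank("en")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

# Character offset of every token in the original string.
offsets = [(token.text, token.idx) for token in doc]

# Joining each token with its trailing whitespace restores the input.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```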
During processing, spaCy first tokenizes the text, i.e. segments it into
words, punctuation and so on. This is done by applying rules specific to each
language. For example, punctuation at the end of a sentence should be split off
– whereas “U.K.” should remain one token. Each
Doc consists of individual
tokens, and we can iterate over them:
First, the raw text is split on whitespace characters, similar to
text.split(' '). Then, the tokenizer processes the text from left to right. On
each substring, it performs two checks:
Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.
Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.
If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
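The loop described above can be sketched in plain Python. This is a simplified illustration, not spaCy's implementation: the rule tables EXCEPTIONS, PREFIXES and SUFFIXES are hypothetical and tiny, and infix handling is left out.

```python
# Hypothetical rule tables for illustration; real language data is far larger.
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIXES = ('"', "(", "$")
SUFFIXES = ('"', ")", ".", ",", "!", "?")


def tokenize(text):
    """Split on whitespace, then peel exceptions, prefixes and suffixes."""
    tokens = []
    for substring in text.split():
        trailing = []  # suffixes split off so far, kept in original order
        while substring:
            if substring in EXCEPTIONS:
                # Check 1: a tokenizer exception decides the split.
                tokens.extend(EXCEPTIONS[substring])
                break
            if substring[0] in PREFIXES:
                # Check 2a: split off a prefix and continue with the rest.
                tokens.append(substring[0])
                substring = substring[1:]
            elif substring[-1] in SUFFIXES:
                # Check 2b: split off a suffix and continue with the rest.
                trailing.insert(0, substring[-1])
                substring = substring[:-1]
            else:
                # Nothing left to split: keep the substring as one token.
                tokens.append(substring)
                break
        tokens.extend(trailing)
    return tokens


print(tokenize("(don't visit the U.K.!)"))
```

Because the exception check runs again after every split, "U.K." survives as one token even when it is wrapped in further punctuation.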
While punctuation rules are usually pretty general, tokenizer exceptions
strongly depend on the specifics of the individual language. This is why each
available language has its own subclass, like German, that loads in lists of
hard-coded data and exception rules.
Part-of-speech tags and dependencies Needs model
After tokenization, spaCy can parse and tag a given
Doc. This is where
the trained pipeline and its statistical models come in, which enable spaCy to
make predictions of which tag or label most likely applies in this context.
A trained component includes binary data that is produced by showing a system
enough examples for it to make predictions that generalize across the language –
for example, a word following “the” in English is most likely a noun.
Linguistic annotations are available as
Token attributes. Like many NLP libraries, spaCy
encodes all strings to hash values to reduce memory usage and improve
efficiency. So to get the readable string representation of an attribute, we
need to add an underscore
_ to its name:
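As a small illustration of the underscore convention, the sketch below uses a blank English pipeline and lexical attributes that need no trained components. Attributes such as token.pos_ or token.dep_ follow the same pattern, but stay empty unless a trained tagger and parser have run.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup")
token = doc[0]

# Without the underscore you get the integer hash value ...
print(token.lower, token.shape)
# ... and with the underscore you get the readable string.
print(token.lower_, token.shape_)
```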
Using spaCy’s built-in displaCy visualizer, here’s what our example sentence and its dependencies look like:
Named Entities Needs model
A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.
Named entities are available as the ents property of a Doc.
|Text||Start||End||Label||Description|
|Apple||0||5||ORG||Companies, agencies, institutions.|
|U.K.||27||31||GPE||Geopolitical entity, i.e. countries, cities, states.|
|$1 billion||44||54||MONEY||Monetary values, including unit.|
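The sketch below shows how the values in this table map onto entity attributes. It makes two assumptions: the example sentence is reconstructed from the character offsets in the table, and the labels ORG, GPE and MONEY are inferred from the descriptions. The entities are set by hand on a blank pipeline; with a trained pipeline loaded, the entity recognizer would predict them instead.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Entities are assigned manually here; a trained model would predict them.
doc.ents = [
    doc.char_span(0, 5, label="ORG"),
    doc.char_span(27, 31, label="GPE"),
    doc.char_span(44, 54, label="MONEY"),
]

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```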
Using spaCy’s built-in displaCy visualizer, here’s what our example sentence and its named entities look like:
Word vectors and similarity Needs model
Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec and usually look like this:
Pipeline packages that come with built-in word vectors make them available as
the Token.vector attribute. Doc.vector and Span.vector will default to an
average of their token vectors. You can also check if a token has a vector
assigned, and get the L2 norm, which can be used to normalize vectors.
The words “dog”, “cat” and “banana” are all pretty common in English, so they’re
part of the pipeline’s vocabulary, and come with a vector. The word “afskfsd” on
the other hand is a lot less common and out-of-vocabulary – so its vector
representation consists of 300 dimensions of
0, which means it’s practically
nonexistent. If your application will benefit from a large vocabulary with
more vectors, you should consider using one of the larger pipeline packages or
loading in a full vector package, for example,
en_core_web_lg, which includes 685k unique vectors.
spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that’s similar to what they’re currently looking at, or label a support ticket as a duplicate if it’s very similar to an already existing one.
Each Doc, Span, Token and Lexeme comes with a similarity method that lets you
compare it with another object, and determine the similarity. Of course
similarity is always subjective – whether two words, spans or documents are
similar really depends on how you’re looking at it. spaCy’s similarity
implementation usually assumes a pretty general-purpose definition of
similarity.
Computing similarity scores can be helpful in many situations, but it’s also important to maintain realistic expectations about what information it can provide. Words can be related to each other in many ways, so a single “similarity” score will always be a mix of different signals, and vectors trained on different data can produce very different results that may not be useful for your purpose. Here are some important considerations to keep in mind:
- There’s no objective definition of similarity. Whether “I like burgers” and “I like pasta” are similar depends on your application. Both talk about food preferences, which makes them very similar – but if you’re analyzing mentions of food, those sentences are pretty dissimilar, because they talk about very different foods.
- The similarity of Doc and Span objects defaults to the average of the token vectors. This means that the vector for “fast food” is the average of the vectors for “fast” and “food”, which isn’t necessarily representative of the phrase “fast food”.
- Vector averaging means that the vector of multiple tokens is insensitive to the order of the words. Two documents expressing the same meaning with dissimilar wording will return a lower similarity score than two documents that happen to contain the same words while expressing different meanings.
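The effect of vector averaging can be seen in a few lines of plain Python. This is only a sketch: TOY_VECTORS holds hypothetical 3-dimensional vectors made up for the example, whereas real word vectors are learned from data and much larger.

```python
import math

# Hypothetical toy vectors, invented for this illustration.
TOY_VECTORS = {
    "dog": [1.0, 0.0, 0.5],
    "bites": [0.0, 1.0, 0.0],
    "man": [0.5, 0.0, 1.0],
}


def average_vector(words):
    """Average the vectors of the given words, dimension by dimension."""
    vectors = [TOY_VECTORS[word] for word in words]
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


doc_a = average_vector("dog bites man".split())
doc_b = average_vector("man bites dog".split())
print(cosine(doc_a, doc_b))
```

Both word orders produce the same averaged vector, so the two texts score as practically identical even though they describe different events.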
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc
object. The Doc is then processed in several different steps – this is also
referred to as the processing pipeline. The pipeline used by the trained
pipelines typically includes a tagger, a lemmatizer, a parser and an entity
recognizer. Each pipeline component returns the processed Doc, which is then
passed on to the next component.
|tokenizer||Segment text into tokens.|
|tagger||Assign part-of-speech tags.|
|parser||Assign dependency labels.|
|ner||Detect and label named entities.|
|lemmatizer||Assign base forms.|
|textcat||Assign document labels.|
|custom components||Assign custom attributes, methods or properties.|
The capabilities of a processing pipeline always depend on the components, their models and how they were trained. For example, a pipeline for named entity recognition needs to include a trained named entity recognizer component with a statistical model and weights that enable it to make predictions of entity labels. This is why each pipeline specifies its components and their settings in the config:
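As a small sketch of how components and config relate, the example below builds a blank English pipeline and adds the rule-based sentencizer, so no trained package is assumed. nlp.pipe_names lists the components in processing order, and the same list appears under the nlp section of the config.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Component names in the order they are applied to the Doc.
print(nlp.pipe_names)
# The same list as recorded in the pipeline's config.
print(nlp.config["nlp"]["pipeline"])

doc = nlp("This is a sentence. This is another one.")
print([sent.text for sent in doc.sents])
```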
The statistical components like the tagger or parser are typically independent
and don’t share any data between each other. For example, the named entity
recognizer doesn’t use any features set by the tagger and parser, and so on.
This means that you can swap them, or remove single components from the pipeline
without affecting the others. However, components may share a “token-to-vector”
component like Tok2Vec or Transformer. You can read more about this in the
usage documentation.
Custom components may also depend on annotations set by other components. For
example, a custom lemmatizer may need the part-of-speech tags assigned, so it’ll
only work if it’s added after the tagger. The parser will respect pre-defined
sentence boundaries, so if a previous component in the pipeline sets them, its
dependency predictions may be different. Similarly, it matters if you add the
EntityRuler before or after the statistical entity
recognizer: if it’s added before, the entity recognizer will take the existing
entities into account when making predictions. The
EntityLinker, which resolves named entities to knowledge
base IDs, should be preceded by a pipeline component that recognizes entities
such as the EntityRecognizer.
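To illustrate why component order matters, here is a hedged sketch with a hypothetical custom component, sentence_counter, that relies on sentence boundaries set by an earlier component. The component name and the n_sents extension attribute are invented for this example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute that stores the number of sentences.
Doc.set_extension("n_sents", default=None, force=True)


@Language.component("sentence_counter")
def sentence_counter(doc):
    # Only count if an earlier component has set sentence boundaries.
    if doc.has_annotation("SENT_START"):
        doc._.n_sents = len(list(doc.sents))
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# The counter depends on the sentencizer, so it is added after it.
nlp.add_pipe("sentence_counter", after="sentencizer")

doc = nlp("This is a sentence. This is another one.")
print(nlp.pipe_names, doc._.n_sents)
```

If the counter ran before the sentencizer, no boundaries would be available yet and the attribute would keep its default value.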
The tokenizer is a “special” component and isn’t part of the regular pipeline.
It also doesn’t show up in nlp.pipe_names. The reason is that there can only
really be one tokenizer, and while all other pipeline components take a Doc
and return it, the tokenizer takes a string of text and turns it into a Doc.
You can still customize the tokenizer, though. nlp.tokenizer is writable, so
you can either create your own Tokenizer class from scratch, or even replace
it with an entirely custom function.
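A minimal sketch of replacing the tokenizer with a custom callable follows. WhitespaceTokenizer is a hypothetical class written for this example: it takes a string and returns a Doc, which is all a tokenizer needs to do. Note that this simple version normalizes runs of whitespace to single spaces.

```python
import spacy
from spacy.tokens import Doc


class WhitespaceTokenizer:
    """Hypothetical tokenizer that only splits on whitespace."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()
        # Every word except the last is followed by a single space.
        spaces = ([True] * (len(words) - 1) + [False]) if words else []
        return Doc(self.vocab, words=words, spaces=spaces)


nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

doc = nlp("What's happened to me? he thought.")
print([token.text for token in doc])
```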
The central data structures in spaCy are the Language class, the Vocab and the
Doc object. The Language class is used to process a text and turn it into a
Doc object. It’s typically stored as a variable called nlp. The Doc object
owns the sequence of tokens and
all their annotations. By centralizing strings, word vectors and lexical
attributes in the
Vocab, we avoid storing multiple copies of this data. This
saves memory, and ensures there’s a single source of truth.
Text annotations are also designed to allow a single source of truth: the Doc
object owns the data, and Span and Token are views that point into it. The
Doc object is constructed by the
Tokenizer, and then modified in place by the components
of the pipeline. The
Language object coordinates these components. It takes
raw text and sends it through the pipeline, returning an annotated document.
It also orchestrates training and serialization.
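The ownership relations described above can be checked directly. The sketch assumes a blank English pipeline and an arbitrary example sentence.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I love coffee")

token = doc[2]   # a Token is a view of one position in the Doc
span = doc[1:3]  # a Span is a view of a slice of the Doc

# Both views point back to the Doc that owns the data.
print(token.doc is doc, span.doc is doc)
# The Doc shares the vocabulary of the pipeline that created it.
print(doc.vocab is nlp.vocab)
```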
|Doc||A container for accessing linguistic annotations.|
|DocBin||A collection of Doc objects.|
|Example||A collection of training annotations, containing two Doc objects.|
|Language||Processing class that turns text into Doc objects.|
|Lexeme||An entry in the vocabulary. It’s a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc.|
|Span||A slice from a Doc object.|
|SpanGroup||A named collection of spans belonging to a Doc.|
|Token||An individual token, i.e. a word, punctuation symbol, whitespace, etc.|
The processing pipeline consists of one or more pipeline components that are
called on the Doc in order. The tokenizer runs before the components. Pipeline
components can be added using Language.add_pipe. They can contain a
statistical model and trained weights, or only make rule-based modifications
to the Doc. spaCy provides a range of built-in components for different
language processing tasks and also allows adding custom components.
|AttributeRuler||Set token attributes using matcher rules.|
|DependencyParser||Predict syntactic dependencies.|
|EditTreeLemmatizer||Predict base forms of words.|
|EntityLinker||Disambiguate named entities to nodes in a knowledge base.|
|EntityRecognizer||Predict named entities, e.g. persons or products.|
|EntityRuler||Add entity spans to the Doc.|
|Lemmatizer||Determine the base forms of words using rules and lookups.|
|Morphologizer||Predict morphological features and coarse-grained part-of-speech tags.|
|SentenceRecognizer||Predict sentence boundaries.|
|Sentencizer||Implement rule-based sentence boundary detection that doesn’t require the dependency parse.|
|Tagger||Predict part-of-speech tags.|
|TextCategorizer||Predict categories or labels over the whole document.|
|Tok2Vec||Apply a “token-to-vector” model and set its outputs.|
|Tokenizer||Segment raw text and create Doc objects.|
|TrainablePipe||Class that all trainable pipeline components inherit from.|
|Transformer||Use a transformer model and set its outputs.|
|Other functions||Automatically apply something to the Doc.|
Matchers help you find and extract information from Doc objects based on match
patterns describing the sequences you’re looking for. A matcher operates on a
Doc and gives you access to the matched tokens in context.
|DependencyMatcher||Match sequences of tokens based on dependency trees using Semgrex operators.|
|Matcher||Match sequences of tokens, based on pattern rules, similar to regular expressions.|
|PhraseMatcher||Match sequences of tokens based on phrases.|
|Corpus||Class for managing annotated corpora for training and evaluation data.|
|KnowledgeBase||Abstract base class for storage and retrieval of data for entity linking.|
|InMemoryLookupKB||Implementation of KnowledgeBase that stores all data in memory.|
|Candidate||Object associating a textual mention with a specific entity contained in a KnowledgeBase.|
|Lookups||Container for convenient access to large lookup tables and dictionaries.|
|MorphAnalysis||A morphological analysis.|
|Morphology||Store morphological analyses and map them to and from hash values.|
|Scorer||Compute evaluation scores.|
|StringStore||Map strings to and from hash values.|
|Vectors||Container class for vector data keyed by string.|
|Vocab||The shared vocabulary that stores strings and gives you access to Lexeme objects.|
Whenever possible, spaCy tries to store data in a vocabulary, the
Vocab, that will be shared by multiple documents. To save
memory, spaCy also encodes all strings to hash values – in this case for
example, “coffee” has the hash
3197928453018144401. Entity labels like “ORG”
and part-of-speech tags like “VERB” are also encoded. Internally, spaCy only
“speaks” in hash values.
If you process lots of documents containing the word “coffee” in all kinds of
different contexts, storing the exact string “coffee” every time would take up
way too much space. So instead, spaCy hashes the string and stores it in the
StringStore. You can think of the
StringStore as a
lookup table that works in both directions – you can look up a string to get
its hash, or a hash to get its string:
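The two-way lookup can be sketched as follows, assuming a blank English pipeline. The hash value is the one quoted above for “coffee”.

```python
import spacy

nlp = spacy.blank("en")
# Processing the text adds its strings to the shared StringStore.
doc = nlp("I love coffee")

coffee_hash = nlp.vocab.strings["coffee"]     # string -> hash
coffee_text = nlp.vocab.strings[coffee_hash]  # hash -> string
print(coffee_hash, coffee_text)
```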
Now that all strings are encoded, the entries in the vocabulary don’t need to
include the word text themselves. Instead, they can look it up in the
StringStore via its hash value. Each entry in the vocabulary, also called
Lexeme, contains the context-independent information about
a word. For example, no matter if “love” is used as a verb or a noun in some
context, its spelling and whether it consists of alphabetic characters won’t
ever change. Its hash value will also always be the same.
The mapping of words to hashes doesn’t depend on any state. To make sure each value is unique, spaCy uses a hash function to calculate the hash based on the word string. This also means that the hash for “coffee” will always be the same, no matter which pipeline you’re using or how you’ve configured spaCy.
However, hashes cannot be reversed and there’s no way to resolve
3197928453018144401 back to “coffee”. All spaCy can do is look it up in the
vocabulary. That’s why you always need to make sure all objects you create have
access to the same vocabulary. If they don’t, spaCy might not be able to find
the strings it needs.
If the vocabulary doesn’t contain a string for
3197928453018144401, spaCy will
raise an error. You can re-add “coffee” manually, but this only works if you
actually know that the document contains that word. To prevent this problem,
spaCy will also export the Vocab when you save a Doc or nlp object. This will
give you the object and its encoded annotations, plus the “key” to decode it.
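The failure mode described above can be reproduced with an empty vocabulary. This is a sketch under two assumptions: a blank English pipeline is used, and the failed lookup is taken to surface as a KeyError.

```python
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.blank("en")
doc = nlp("I love coffee")
coffee_hash = doc.vocab.strings["coffee"]

# A fresh vocabulary has never seen "coffee", so the hash can't be resolved.
empty_vocab = Vocab()
try:
    empty_vocab.strings[coffee_hash]
    found = True
except KeyError:
    found = False

# Re-adding the string manually makes the hash resolvable again.
empty_vocab.strings.add("coffee")
readded = empty_vocab.strings[coffee_hash]

# Sharing the original vocabulary avoids the problem altogether.
new_doc = Doc(doc.vocab)
shared = new_doc.vocab.strings[coffee_hash]
print(found, readded, shared)
```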
If you’ve been modifying the pipeline, vocabulary, vectors and entities, or made
updates to the component models, you’ll eventually want to save your
progress – for example, everything that’s in your
nlp object. This means
you’ll have to translate its contents and structure into a format that can be
saved, like a file or a byte string. This process is called serialization. spaCy
comes with built-in serialization methods and supports the Pickle protocol.
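As a brief sketch of both targets mentioned above, the example below serializes a Doc to a byte string and an nlp object to a directory. It assumes a blank English pipeline and uses a temporary directory so nothing is left on disk.

```python
import tempfile

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("I love coffee")

# Byte string: serialize the Doc and restore it with the same vocabulary.
doc_bytes = doc.to_bytes()
restored_doc = Doc(nlp.vocab).from_bytes(doc_bytes)

# Files: save the whole pipeline to a directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)
    reloaded_nlp = spacy.load(tmp_dir)

print(restored_doc.text, reloaded_nlp.lang)
```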
spaCy’s tagger, parser, text categorizer and many other components are powered by statistical models. Every “decision” these components make – for example, which part-of-speech tag to assign, or whether a word is a named entity – is a prediction based on the model’s current weight values. The weight values are estimated based on examples the model has seen during training. To train a model, you first need training data – examples of text, and the labels you want the model to predict. This could be a part-of-speech tag, a named entity or any other information.
Training is an iterative process in which the model’s predictions are compared against the reference annotations in order to estimate the gradient of the loss. The gradient of the loss is then used to calculate the gradient of the weights through backpropagation. The gradients indicate how the weight values should be changed so that the model’s predictions become more similar to the reference labels over time.
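The idea of iteratively comparing predictions with reference values and following the gradient can be shown with a toy model in plain Python. This is not how spaCy trains its components: the sketch assumes a single weight and bias, a squared-error loss and made-up data, so the gradients can be written out by hand without backpropagation through multiple layers.

```python
def train(examples, epochs=200, learning_rate=0.1):
    """Fit y = weight * x + bias with plain gradient descent."""
    weight, bias = 0.0, 0.0
    losses = []
    n = len(examples)
    for _ in range(epochs):
        grad_w = grad_b = loss = 0.0
        for x, reference in examples:
            prediction = weight * x + bias
            error = prediction - reference  # compare with the reference
            loss += error ** 2
            grad_w += 2 * error * x         # d(loss)/d(weight)
            grad_b += 2 * error             # d(loss)/d(bias)
        # Move the weights a small step against the gradient.
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
        losses.append(loss / n)
    return weight, bias, losses


# Made-up reference data generated from y = 2x + 1.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
weight, bias, losses = train(examples)
print(round(weight, 2), round(bias, 2))
```

The recorded losses shrink over the epochs, which is the signal that the predictions are moving closer to the reference values.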
When training a model, we don’t just want it to memorize our examples – we want it to come up with a theory that can be generalized across unseen data. After all, we don’t just want the model to learn that this one instance of “Amazon” right here is a company – we want it to learn that “Amazon”, in contexts like this, is most likely a company. That’s why the training data should always be representative of the data we want to process. A model trained on Wikipedia, where sentences in the first person are extremely rare, will likely perform badly on Twitter. Similarly, a model trained on romantic novels will likely perform badly on legal text.
This also means that in order to know how the model is performing, and whether it’s learning the right things, you don’t only need training data – you’ll also need evaluation data. If you only test the model with the data it was trained on, you’ll have no idea how well it’s generalizing. If you want to train a model from scratch, you usually need at least a few hundred examples for both training and evaluation.
Training config files include all settings and hyperparameters for training
your pipeline. Instead of providing lots of arguments on the command line, you
only need to pass your config.cfg file to spacy train.
This also makes it easy to integrate custom models and architectures, written in
your framework of choice. A pipeline’s
config.cfg is considered the “single
source of truth”, both at training and runtime.
spaCy’s Pipe class helps you implement your own trainable components that have
their own model instance, make predictions over Doc objects and can be updated
using spacy train. This lets you
plug fully custom machine learning components into your pipeline that can be
configured via a single training config.
Every language is different – and usually full of exceptions and special
cases, especially amongst the most common words. Some of these exceptions are
shared across languages, while others are entirely specific – usually so
specific that they need to be hard-coded. The
lang module contains all language-specific data,
organized in simple Python files. This makes the data easy to update and extend.
The shared language data in the directory root includes rules that can be
generalized across languages – for example, rules for basic punctuation, emoji,
emoticons and single-letter abbreviations. The individual language data in a
submodule contains rules that are only relevant to a particular language. It
also takes care of putting together all components and creating the
Language subclass – for example, English or German. The values are defined in
the Language.Defaults.
|Stop words||List of most common words of a language that are often useful to filter out, for example “and” or “I”. Matching tokens will return True for is_stop.|
|Tokenizer exceptions||Special-case rules for the tokenizer, for example, contractions like “can’t” and abbreviations with punctuation, like “U.K.”.|
|Punctuation rules||Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes.|
|Character classes||Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons.|
|Lexical attributes||Custom functions for setting lexical attributes on tokens, e.g. like_num.|
|Syntax iterators||Functions that compute views of a Doc object based on its syntax.|
|Lemmatizer||Custom lemmatizer implementation and lemmatization tables.|
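A short sketch of how this language data surfaces at runtime, assuming the English subclass and an arbitrary three-word text: stop words and lexical attributes are available on every token without any trained components.

```python
from spacy.lang.en import English

# The English subclass bundles the language data listed above.
nlp = English()
doc = nlp("coffee and ten")

# Stop words come from the language defaults.
print("and" in nlp.Defaults.stop_words)

for token in doc:
    # is_stop and like_num are driven by the language-specific data.
    print(token.text, token.is_stop, token.like_num)
```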
We’re very happy to see the spaCy community grow and include a mix of people from all kinds of different backgrounds – computational linguistics, data science, deep learning, research and more. If you’d like to get involved, below are some answers to the most important questions and resources for further reading.
Bugs suck, and we’re doing our best to continuously improve the tests and fix bugs as soon as possible. Before you submit an issue, do a quick search and check if the problem has already been reported. If you’re having installation or loading problems, make sure to also check out the troubleshooting guide. Help with spaCy is available via the following platforms:
- Stack Overflow: Usage questions and everything related to problems with your specific code. The Stack Overflow community is much larger than ours, so if your problem can be solved by others, you’ll receive help much quicker.
- GitHub discussions: General discussion, project ideas and usage questions. Meet other community members to get help with a specific code implementation, discuss ideas for new projects/plugins, support more languages, and share best practices.
- GitHub issue tracker: Bug reports and improvement suggestions, i.e. everything that’s likely spaCy’s fault. This also includes problems with the trained pipelines beyond statistical imprecisions, like patterns that point to a bug.
You don’t have to be an NLP expert or Python pro to contribute, and we’re happy
to help you get started. If you’re new to spaCy, a good place to start is the
help wanted (easy) label
on GitHub, which we use to tag bugs and feature requests that are easy and
self-contained. We also appreciate contributions to the docs – whether it’s
fixing a typo, improving an example or adding additional explanations. You’ll
find a “Suggest edits” link at the bottom of each page that points you to the
source.
Another way of getting involved is to help us improve the language data – especially if you happen to speak one of the languages currently in alpha support. Even adding simple tokenizer exceptions, stop words or lemmatizer data can make a big difference. It will also make it easier for us to provide a trained pipeline for the language in the future. Submitting a test that documents a bug or performance issue, or covers functionality that’s especially important for your application is also very helpful. This way, you’ll also make sure we never accidentally introduce regressions to the parts of the library that you care about the most.
First, congrats – we’d love to check it out! When you share your project on Twitter, don’t forget to tag @spacy_io so we don’t miss it. If you think your project would be a good fit for the spaCy Universe, feel free to submit it! Tutorials are also incredibly valuable to other users and a great way to get exposure. So we strongly encourage writing up your experiences, or sharing your code and some tips and tricks on your blog. Since our website is open-source, you can add your project or tutorial by making a pull request on GitHub.
If you would like to use the spaCy logo on your site, please get in touch and ask us first. However, if you want to show support and tell others that your project is using spaCy, you can grab one of our spaCy badges here: