spaCy’s trained pipelines can be installed as Python packages. This means
that they’re a component of your application, just like any other module.
They’re versioned and can be defined as a dependency in your
requirements.txt.
Trained pipelines can be installed from a download URL or a local directory,
manually or via pip. Their data can be
located anywhere on your file system.
Install a default trained pipeline package, get the code to load it from within spaCy and an example to test it. For more options, see the section on available packages below.
```shell
python -m spacy download en_core_web_sm
```

```python
import spacy
nlp = spacy.load("en_core_web_sm")

# or import the package directly
import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])
```
If a trained pipeline is available for a language, you can download it using the
spacy download command as shown above. In order to use
languages that don’t yet come with a trained pipeline, you have to import them
directly, or use spacy.blank().
A blank pipeline is typically just a tokenizer. You might want to create a blank
pipeline when you only need a tokenizer, when you want to add more components
from scratch, or for testing purposes. Initializing the language object directly
yields the same result as generating it using
spacy.blank(). In both cases the
default configuration for the chosen language is loaded, and no pretrained
components will be available.
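For example, a blank English pipeline contains only the tokenizer and the language defaults, with no trained components:

```python
import spacy

# A blank pipeline: tokenizer plus language defaults, nothing else
nlp = spacy.blank("en")
doc = nlp("This is a sentence.")
print([token.text for token in doc])
print(nlp.pipe_names)  # no trained components: []
```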
spaCy currently provides support for the following languages. You can help by improving the existing language data and extending the tokenization patterns. See the contribution guidelines for details on how to contribute to development. Also see the training documentation for how to train your own pipelines on your data.
| Language | Trained pipelines |
| --- | --- |
| Norwegian Bokmål | 3 packages |
| Ancient Greek | none yet |
| Lower Sorbian | none yet |
| Upper Sorbian | none yet |
spaCy also supports pipelines trained on more than one language. This is
especially useful for named entity recognition. The language ID used for
multi-language or language-neutral pipelines is
xx. The language class, a
generic subclass containing only the base language data, can be found in
lang/xx.
To train a pipeline using the neutral multi-language class, you can set
lang = "xx" in your training config. You can also import the
MultiLanguage class directly, or call
spacy.blank("xx") for lazy-loading.
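For instance, lazy-loading the multi-language class looks like this:

```python
import spacy

# "xx" lazy-loads the generic MultiLanguage class
nlp = spacy.blank("xx")
print(nlp.lang)
```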
The Chinese language class supports three word segmentation options: char,
jieba and pkuseg. The
initialize method for the Chinese tokenizer class supports the following
config settings for loading pkuseg models:
| Name | Description |
| --- | --- |
| pkuseg_model | Name of a model provided by spacy-pkuseg, or the path to a local model directory. |
| pkuseg_user_dict | Optional path to a file with one word per line which overrides the default user dictionary. |
The initialization settings are typically provided in the training config and the data is loaded in before training and serialized with the model. This allows you to load the data from a local path and save out your pipeline and config, without requiring the same local path at runtime. See the usage guide on the config lifecycle for more background on this.
You can also initialize the tokenizer for a blank language class by calling its
initialize method:
You can also modify the user dictionary on-the-fly:
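A minimal sketch, assuming the spacy-pkuseg package is installed and the Chinese tokenizer is configured with the pkuseg segmenter:

```python
from spacy.lang.zh import Chinese

# Configure the Chinese tokenizer to use the pkuseg segmenter
cfg = {"nlp": {"tokenizer": {"segmenter": "pkuseg"}}}
nlp = Chinese.from_config(cfg)
# Load the "mixed" general-domain model (downloaded via spacy-pkuseg)
nlp.tokenizer.initialize(pkuseg_model="mixed")

# Add words to the user dictionary on the fly
nlp.tokenizer.pkuseg_update_user_dict(["中国"])
# Reset the user dictionary to the default
nlp.tokenizer.pkuseg_update_user_dict([], reset=True)
```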
The Chinese pipelines provided by spaCy include a custom pkuseg
model trained only on
Chinese OntoNotes 5.0, since the
models provided by
pkuseg include data restricted to research use. For research use,
pkuseg provides models for several different domains ("mixed", "news", "web",
"medicine" and "tourism") and for other uses,
pkuseg provides a simple training API:
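A sketch of that training API, assuming the spacy-pkuseg package; the file paths are placeholders for whitespace-segmented training and test data:

```python
import spacy_pkuseg as pkuseg

# Train a new pkuseg segmentation model from scratch
pkuseg.train("train.utf8", "test.utf8", "/path/to/model")
```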
The Japanese language class uses SudachiPy for word
segmentation and part-of-speech tagging. The default Japanese language class and
the provided Japanese pipelines use SudachiPy split mode
A. The tokenizer
config can be used to configure the split mode to A, B or C.
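For example, a sketch of overriding the split mode, assuming the Japanese extras (SudachiPy and its dictionary) are installed:

```python
from spacy.lang.ja import Japanese

# Default: SudachiPy split mode A
nlp = Japanese()

# Configure split mode B instead via the tokenizer config
cfg = {"split_mode": "B"}
nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
```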
Extra information, such as reading, inflection form, and the SudachiPy
normalized form, is available in Token.morph. For B or
C split modes,
subtokens are stored in doc.user_data["sub_tokens"].
The default MeCab-based Korean tokenizer requires the mecab-ko, mecab-ko-dic
and natto-py packages.
For some Korean datasets and tasks, the rule-based tokenizer is better-suited than MeCab. To configure a Korean pipeline with the rule-based tokenizer:
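A sketch of such a config, swapping in spaCy's generic rule-based tokenizer via the @tokenizers registry:

```python
import spacy

# Use the rule-based tokenizer instead of the MeCab-based default,
# so no MeCab installation is required
config = {"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}}
nlp = spacy.blank("ko", config=config)
```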
The easiest way to download a trained pipeline is via spaCy’s
download command. It takes care of finding the
best-matching package compatible with your spaCy installation.
The download command will install the package via
pip and place the package in your site-packages directory.
If you’re in a Jupyter notebook or similar environment, you can use the
! prefix to execute commands.
Make sure to restart your kernel or runtime after installation (just like
you would when installing other Python packages) to make sure that the installed
pipeline package can be found.
To download a trained pipeline directly using pip, point
pip install to the URL or local
path of the wheel file or archive. Installing the wheel is usually more
efficient.
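For example (the package version shown is illustrative; pick the release matching your spaCy version):

```shell
# Install a pipeline wheel directly from its release URL
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl
```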
By default, this will install the pipeline package into your site-packages
directory. You can then use
spacy.load to load it via its package name or
import it explicitly as a module. If you need to download
pipeline packages as part of an automated process, we recommend using pip with a
direct link, instead of relying on spaCy’s download command.
You can also add the direct download link to your application’s
requirements.txt. For more details, see the section on
working with pipeline packages in production.
In some cases, you might prefer downloading the data manually, for example to place it into a custom directory. You can download the package via your browser from the latest releases, or configure your own download script using the URL of the archive file. The archive consists of a package directory that contains another directory with the pipeline data.
You can place the pipeline package directory anywhere on your local file system.
Since the spacy download command installs the pipeline as
a Python package, we always recommend running it from the command line, just
like you install other Python packages with
pip install. However, if you need
to, or if you want to integrate the download process into another CLI command,
you can also import and call the
download function used by the CLI via Python.
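A minimal sketch; note that this triggers a pip install under the hood, so it needs network access:

```python
import spacy

# Programmatic equivalent of: python -m spacy download en_core_web_sm
spacy.cli.download("en_core_web_sm")
```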
To load a pipeline package, use spacy.load() with
the package name or a path to the data directory:
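Both forms look like this, assuming the en_core_web_sm package is installed; the directory path is a placeholder:

```python
import spacy

# Load by installed package name...
nlp = spacy.load("en_core_web_sm")
# ...or from a data directory
nlp = spacy.load("/path/to/en_core_web_sm")
```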
If you’ve installed a trained pipeline via spacy download
or directly via pip, you can also
import it and then call its load() method
with no arguments:
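Assuming the en_core_web_sm package is installed:

```python
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")
```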
How you choose to load your trained pipelines ultimately depends on personal
preference. However, for larger code bases, we usually recommend native
imports, as this will make it easier to integrate pipeline packages with your
existing build process, continuous integration workflow and testing framework.
It’ll also prevent you from ever trying to load a package that is not installed,
as your code will raise an
ImportError immediately, instead of failing
somewhere down the line when calling
spacy.load(). For more details, see the
section on working with pipeline packages in production.
If your application depends on one or more trained pipeline packages, you’ll usually want to integrate them into your continuous integration workflow and build process. While spaCy provides a range of useful helpers for downloading and loading pipeline packages, the underlying functionality is entirely based on native Python packaging. This allows your application to handle a spaCy pipeline like any other package dependency.
The download command is mostly intended as a
convenient, interactive wrapper. It performs compatibility checks and prints
detailed error messages and warnings. However, if you’re downloading pipeline
packages as part of an automated build process, this only adds an unnecessary
layer of complexity. If you know which packages your application needs, you
should be specifying them directly.
Because pipeline packages are valid Python packages, you can add them to your
requirements.txt. If you’re running your own internal PyPi
installation, you can upload the pipeline packages there. pip’s
requirements file format
supports both package names to download via a PyPi server, as well as direct
URLs.
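A requirements.txt might then look like this (the version pins are illustrative):

```text
spacy>=3.7.0,<4.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl
```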
All pipeline packages are versioned and specify their spaCy dependency. This
ensures cross-compatibility and lets you specify exact version requirements for
each pipeline. If you’ve trained your own pipeline, you can use the
spacy package command to generate the required
metadata and turn it into a loadable package.
Pipeline packages are regular Python packages, so you can also import them as a
package using Python’s native
import syntax, and then call its load() method
to load the data and return an nlp object.
In general, this approach is recommended for larger code bases, as it’s more
“native”, and doesn’t rely on spaCy’s loader to resolve string names to
packages. If a package can’t be imported, Python will raise an ImportError
immediately. And if a package is imported but not used, any linter will catch it.
Similarly, it’ll give you more flexibility when writing tests that require
loading pipelines. For example, instead of writing your own try/except
logic around spaCy’s loader, you can use pytest’s importorskip()
method to only run a test if a specific pipeline package or version is
installed. Each pipeline package exposes a
__version__ attribute which you can
also use to perform your own version compatibility checks before loading it.
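For example, in a pytest test module (a sketch; the minimum version shown is illustrative):

```python
import pytest

# Skip all tests in this module if the pipeline package isn't installed,
# or if it's older than the required version
en_core_web_sm = pytest.importorskip("en_core_web_sm", minversion="3.7")

def test_en_core_web_sm():
    nlp = en_core_web_sm.load()
    doc = nlp("This is a sentence.")
    assert len(doc) > 0
```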