Models & Languages

spaCy’s trained pipelines can be installed as Python packages. This means that they’re a component of your application, just like any other module. They’re versioned and can be defined as a dependency in your requirements.txt. Trained pipelines can be installed from a download URL or a local directory, manually or via pip. Their data can be located anywhere on your file system.

Quickstart

Install a default trained pipeline package, get the code to load it from within spaCy and an example to test it. For more options, see the section on available packages below.

# Install the default English pipeline package
python -m spacy download en_core_web_sm

# Load it via spacy.load
import spacy
nlp = spacy.load("en_core_web_sm")

# Or import the package as a module
import en_core_web_sm
nlp = en_core_web_sm.load()

# Test it
doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])

Language support

spaCy currently provides support for the following languages. You can help by improving the existing language data and extending the tokenization patterns. See here for details on how to contribute to development. Also see the training documentation for how to train your own pipelines on your data.

Language          Code  Language Data  Pipelines
Catalan           ca    lang/ca        4 packages
Chinese           zh    lang/zh        4 packages
Danish            da    lang/da        4 packages
Dutch             nl    lang/nl        3 packages
English           en    lang/en        4 packages
French            fr    lang/fr        4 packages
German            de    lang/de        4 packages
Greek             el    lang/el        3 packages
Italian           it    lang/it        3 packages
Japanese          ja    lang/ja        3 packages
Lithuanian        lt    lang/lt        3 packages
Macedonian        mk    lang/mk        3 packages
Multi-language    xx    lang/xx        2 packages
Norwegian Bokmål  nb    lang/nb        3 packages
Polish            pl    lang/pl        3 packages
Portuguese        pt    lang/pt        3 packages
Romanian          ro    lang/ro        3 packages
Russian           ru    lang/ru        3 packages
Spanish           es    lang/es        4 packages
Afrikaans         af    lang/af        none yet
Albanian          sq    lang/sq        none yet
Arabic            ar    lang/ar        none yet
Armenian          hy    lang/hy        none yet
Basque            eu    lang/eu        none yet
Bengali           bn    lang/bn        none yet
Bulgarian         bg    lang/bg        none yet
Croatian          hr    lang/hr        none yet
Czech             cs    lang/cs        none yet
Estonian          et    lang/et        none yet
Finnish           fi    lang/fi        none yet
Gujarati          gu    lang/gu        none yet
Hebrew            he    lang/he        none yet
Hindi             hi    lang/hi        none yet
Hungarian         hu    lang/hu        none yet
Icelandic         is    lang/is        none yet
Indonesian        id    lang/id        none yet
Irish             ga    lang/ga        none yet
Kannada           kn    lang/kn        none yet
Korean            ko    lang/ko        none yet
Kyrgyz            ky    lang/ky        none yet
Latvian           lv    lang/lv        none yet
Ligurian          lij   lang/lij       none yet
Luxembourgish     lb    lang/lb        none yet
Malayalam         ml    lang/ml        none yet
Marathi           mr    lang/mr        none yet
Nepali            ne    lang/ne        none yet
Persian           fa    lang/fa        none yet
Sanskrit          sa    lang/sa        none yet
Serbian           sr    lang/sr        none yet
Setswana          tn    lang/tn        none yet
Sinhala           si    lang/si        none yet
Slovak            sk    lang/sk        none yet
Slovenian         sl    lang/sl        none yet
Swedish           sv    lang/sv        none yet
Tagalog           tl    lang/tl        none yet
Tamil             ta    lang/ta        none yet
Tatar             tt    lang/tt        none yet
Telugu            te    lang/te        none yet
Thai              th    lang/th        none yet
Turkish           tr    lang/tr        none yet
Ukrainian         uk    lang/uk        none yet
Urdu              ur    lang/ur        none yet
Vietnamese        vi    lang/vi        none yet
Yoruba            yo    lang/yo        none yet

Multi-language support

spaCy also supports pipelines trained on more than one language. This is especially useful for named entity recognition. The language ID used for multi-language or language-neutral pipelines is xx. The language class, a generic subclass containing only the base language data, can be found in lang/xx.

To train a pipeline using the neutral multi-language class, you can set lang = "xx" in your training config. You can also import the MultiLanguage class directly, or call spacy.blank("xx") for lazy-loading.
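
As a minimal sketch of the two options just mentioned, assuming spaCy v3 is installed, the following creates a blank multi-language pipeline both ways. The sample text is an arbitrary illustration and not part of the original guide.

```python
import spacy
from spacy.lang.xx import MultiLanguage

# Option 1: let spaCy look up the language class by its ID
nlp = spacy.blank("xx")

# Option 2: instantiate the imported class directly
nlp_direct = MultiLanguage()

# Both pipelines report the language-neutral ID
print(nlp.lang, nlp_direct.lang)

# A blank pipeline has no trained components, but it can still tokenize
doc = nlp("Hello world")
print([token.text for token in doc])
```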

Chinese language support

The Chinese language class supports three word segmentation options: char, jieba and pkuseg.

config.cfg

[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "char"

Segmenter  Description
char       Character segmentation: This is the default segmentation option. It’s enabled when you create a new Chinese language class or call spacy.blank("zh").
jieba      Jieba: To use Jieba for word segmentation, set the option segmenter to "jieba".
pkuseg     PKUSeg: Supported as of spaCy v2.3.0 to provide better segmentation for Chinese OntoNotes and the provided Chinese pipelines. Enable PKUSeg by setting the tokenizer option segmenter to "pkuseg".
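
To illustrate the default option, here is a small sketch, assuming spaCy v3 is installed, that creates a blank Chinese pipeline and tokenizes a short string with character segmentation. The sample text is an arbitrary choice made for this example.

```python
import spacy

# A blank Chinese pipeline uses the "char" segmenter by default,
# so no extra segmentation package is needed for this example
nlp = spacy.blank("zh")

# With character segmentation, every character becomes its own token
doc = nlp("中国人")
tokens = [token.text for token in doc]
print(tokens)
```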

The initialize method for the Chinese tokenizer class supports the following config settings for loading pkuseg models:

Name              Description
pkuseg_model      Name of a model provided by spacy-pkuseg or the path to a local model directory. Type: str
pkuseg_user_dict  Optional path to a file with one word per line which overrides the default pkuseg user dictionary. Defaults to "default", the default provided dictionary. Type: str

The initialization settings are typically provided in the training config and the data is loaded in before training and serialized with the model. This allows you to load the data from a local path and save out your pipeline and config, without requiring the same local path at runtime. See the usage guide on the config lifecycle for more background on this.

config.cfg

[initialize]

[initialize.tokenizer]
pkuseg_model = "/path/to/model"
pkuseg_user_dict = "default"

You can also initialize the tokenizer for a blank language class by calling its initialize method:

Examples

from spacy.lang.zh import Chinese

# Initialize the pkuseg tokenizer
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})

# Load spaCy's OntoNotes model
nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes")

# Load pkuseg's "news" model
nlp.tokenizer.initialize(pkuseg_model="news")

# Load local model
nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")

# Override the user dictionary
nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes", pkuseg_user_dict="/path/to/user_dict")

You can also modify the user dictionary on-the-fly:

# Append words to user dict
nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])

# Remove all words from user dict and replace with new words
nlp.tokenizer.pkuseg_update_user_dict(["中国"], reset=True)

# Remove all words from user dict
nlp.tokenizer.pkuseg_update_user_dict([], reset=True)

The Chinese pipelines provided by spaCy include a custom pkuseg model trained only on Chinese OntoNotes 5.0, since the models provided by pkuseg include data restricted to research use. For research use, pkuseg provides models for several different domains: "mixed" (equivalent to "default" from pkuseg packages), "news", "web", "medicine" and "tourism". For other uses, pkuseg provides a simple training API:

import spacy_pkuseg as pkuseg
from spacy.lang.zh import Chinese

# Train pkuseg model
pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")

# Load pkuseg model in spaCy Chinese tokenizer
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")

Japanese language support

The Japanese language class uses SudachiPy for word segmentation and part-of-speech tagging. The default Japanese language class and the provided Japanese pipelines use SudachiPy split mode A. The tokenizer config can be used to configure the split mode to A, B or C.

config.cfg

[nlp.tokenizer]
@tokenizers = "spacy.ja.JapaneseTokenizer"
split_mode = "A"

Installing and using trained pipelines

The easiest way to download a trained pipeline is via spaCy’s download command. It takes care of finding the best-matching package compatible with your spaCy installation.

# Download best-matching version of a package for your spaCy installation
python -m spacy download en_core_web_sm

# Download exact package version
python -m spacy download en_core_web_sm-3.0.0 --direct

The download command will install the package via pip and place the package in your site-packages directory.

# On the command line
pip install -U spacy
python -m spacy download en_core_web_sm

# In Python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")

If you’re in a Jupyter notebook or similar environment, you can use the ! prefix to execute commands. Make sure to restart your kernel or runtime after installation (just like you would when installing other Python packages) to make sure that the installed pipeline package can be found.

!python -m spacy download en_core_web_sm

Installation via pip

To download a trained pipeline directly using pip, point pip install to the URL or local path of the wheel file or archive. Installing the wheel is usually more efficient. To find the direct link to a package, head over to the releases, right-click the archive link and copy it to your clipboard.

# With external URL
$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl
$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz

# With local file
$ pip install /Users/you/en_core_web_sm-3.0.0-py3-none-any.whl
$ pip install /Users/you/en_core_web_sm-3.0.0.tar.gz

By default, this will install the pipeline package into your site-packages directory. You can then use spacy.load to load it via its package name or import it explicitly as a module. If you need to download pipeline packages as part of an automated process, we recommend using pip with a direct link, instead of relying on spaCy’s download command.

You can also add the direct download link to your application’s requirements.txt. For more details, see the section on working with pipeline packages in production.
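
For automated setups, the direct link can be assembled from the package name and version. The sketch below is an assumption-laden illustration: pipeline_download_url is a hypothetical helper, and the URL pattern is inferred only from the two example URLs shown above, so other releases may not follow it exactly.

```python
# Base URL taken from the example download links above
RELEASES_URL = "https://github.com/explosion/spacy-models/releases/download"


def pipeline_download_url(name: str, version: str, wheel: bool = True) -> str:
    """Build a direct download URL for a pipeline package (hypothetical helper)."""
    release = f"{name}-{version}"
    # Wheels and source archives differ only in the file name suffix
    filename = f"{release}-py3-none-any.whl" if wheel else f"{release}.tar.gz"
    return f"{RELEASES_URL}/{release}/{filename}"


wheel_url = pipeline_download_url("en_core_web_sm", "3.0.0")
sdist_url = pipeline_download_url("en_core_web_sm", "3.0.0", wheel=False)
print(f"pip install {wheel_url}")
```

The resulting string can be passed to pip install or written to a requirements file by a build script.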

Manual download and installation

In some cases, you might prefer downloading the data manually, for example to place it into a custom directory. You can download the package via your browser from the latest releases, or configure your own download script using the URL of the archive file. The archive consists of a package directory that contains another directory with the pipeline data.

Directory structure

└── en_core_web_md-3.0.0.tar.gz       # downloaded archive
    ├── setup.py                      # setup file for pip installation
    ├── meta.json                     # copy of pipeline meta
    └── en_core_web_md                # 📦 pipeline package
        ├── __init__.py               # init for pip installation
        └── en_core_web_md-3.0.0      # pipeline data
            ├── config.cfg            # pipeline config
            ├── meta.json             # pipeline meta
            └── ...                   # directories with component data

You can place the pipeline package directory anywhere on your local file system.
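
If a download script unpacks the archive into a custom location, it may need to locate the innermost data directory afterwards. The following sketch uses only the standard library; find_pipeline_data_dir is a hypothetical helper that assumes the layout shown above, where only the data directory contains both config.cfg and meta.json. The demo recreates that layout with empty files in a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_pipeline_data_dir(root: Path) -> Optional[Path]:
    """Return the first directory under root holding config.cfg and meta.json."""
    for config_path in sorted(root.rglob("config.cfg")):
        if (config_path.parent / "meta.json").is_file():
            return config_path.parent
    return None


# Demo: rebuild the unpacked archive layout with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / "en_core_web_md-3.0.0"
    data_dir = package_dir / "en_core_web_md" / "en_core_web_md-3.0.0"
    data_dir.mkdir(parents=True)
    (package_dir / "meta.json").touch()  # copy of pipeline meta, no config.cfg here
    (data_dir / "config.cfg").touch()
    (data_dir / "meta.json").touch()

    found = find_pipeline_data_dir(Path(tmp))
    found_parts = found.relative_to(tmp).parts if found else ()
    print(found_parts)
```

The path returned by such a helper is the kind of data directory that spacy.load accepts.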

Installation from Python

Since the spacy download command installs the pipeline as a Python package, we always recommend running it from the command line, just like you install other Python packages with pip install. However, if you need to, or if you want to integrate the download process into another CLI command, you can also import and call the download function used by the CLI via Python.

import spacy
spacy.cli.download("en_core_web_sm")
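
When wrapping the download in your own tooling, one possible pattern is to skip it if the package is already importable. This is only a sketch: ensure_pipeline is a hypothetical helper, not part of spaCy, and it relies on the standard library's importlib.util.find_spec for the check.

```python
import importlib.util


def ensure_pipeline(name: str) -> bool:
    """Download a pipeline package only if it is not importable yet.

    Returns True if a download was triggered, False otherwise.
    """
    if importlib.util.find_spec(name) is not None:
        return False
    # Imported lazily so the check above does not require loading spaCy
    from spacy.cli import download

    download(name)
    return True
```

Calling ensure_pipeline("en_core_web_sm") would then only trigger a download when the package is missing. As with notebooks, a freshly installed package may not be found until the Python process is restarted.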

Using trained pipelines with spaCy

To load a pipeline package, use spacy.load with the package name or a path to the data directory:

import spacy
nlp = spacy.load("en_core_web_sm")           # load package "en_core_web_sm"
nlp = spacy.load("/path/to/en_core_web_sm")  # load package from a directory

doc = nlp("This is a sentence.")

Importing pipeline packages as modules

If you’ve installed a trained pipeline via spacy download or directly via pip, you can also import it and then call its load() method with no arguments:

import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")

How you choose to load your trained pipelines ultimately depends on personal preference. However, for larger code bases, we usually recommend native imports, as this will make it easier to integrate pipeline packages with your existing build process, continuous integration workflow and testing framework. It’ll also prevent you from ever trying to load a package that is not installed, as your code will raise an ImportError immediately, instead of failing somewhere down the line when calling spacy.load(). For more details, see the section on working with pipeline packages in production.

Using trained pipelines in production

If your application depends on one or more trained pipeline packages, you’ll usually want to integrate them into your continuous integration workflow and build process. While spaCy provides a range of useful helpers for downloading and loading pipeline packages, the underlying functionality is entirely based on native Python packaging. This allows your application to handle a spaCy pipeline like any other package dependency.

Downloading and requiring package dependencies

spaCy’s built-in download command is mostly intended as a convenient, interactive wrapper. It performs compatibility checks and prints detailed error messages and warnings. However, if you’re downloading pipeline packages as part of an automated build process, this only adds an unnecessary layer of complexity. If you know which packages your application needs, you should specify them directly.

Because pipeline packages are valid Python packages, you can add them to your application’s requirements.txt. If you’re running your own internal PyPI installation, you can upload the pipeline packages there. pip’s requirements file format supports both package names to download via a PyPI server and direct URLs.

requirements.txt

spacy>=3.0.0,<4.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz#egg=en_core_web_sm

Specifying #egg= with the package name tells pip which package to expect from the download URL. This way, the package won’t be re-downloaded and overwritten if it’s already installed - just like when you’re downloading a package from PyPI.
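
The #egg= part is an ordinary URL fragment. As a rough illustration using only the standard library, the sketch below extracts the package name from such a requirement line. The egg_name function is a hypothetical helper written for this example and does not reflect how pip itself processes the fragment.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def egg_name(requirement_url: str) -> Optional[str]:
    """Return the package name given in the URL's #egg= fragment, if any."""
    fragment = urlsplit(requirement_url).fragment
    values = parse_qs(fragment).get("egg")
    return values[0] if values else None


url = (
    "https://github.com/explosion/spacy-models/releases/download/"
    "en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz#egg=en_core_web_sm"
)
print(egg_name(url))
```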

All pipeline packages are versioned and specify their spaCy dependency. This ensures cross-compatibility and lets you specify exact version requirements for each pipeline. If you’ve trained your own pipeline, you can use the spacy package command to generate the required metadata and turn it into a loadable package.

Loading and testing pipeline packages

Pipeline packages are regular Python packages, so you can also import them as a package using Python’s native import syntax, and then call the load method to load the data and return an nlp object:

import en_core_web_sm
nlp = en_core_web_sm.load()

In general, this approach is recommended for larger code bases, as it’s more “native”, and doesn’t rely on spaCy’s loader to resolve string names to packages. If a package can’t be imported, Python will raise an ImportError immediately. And if a package is imported but not used, any linter will catch that.
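
The fail-fast behavior can be demonstrated without any pipeline installed. In the sketch below, load_pipeline is a hypothetical helper that imports a package by name via importlib so the example runs anywhere; in application code, a plain import statement as shown above is the more native choice. The package name used here is deliberately one that does not exist.

```python
import importlib


def load_pipeline(package_name: str):
    """Import an installed pipeline package by name and return its nlp object."""
    # Raises ImportError right away if the package is not installed
    module = importlib.import_module(package_name)
    return module.load()


try:
    nlp = load_pipeline("pipeline_that_is_not_installed")
except ImportError as err:
    nlp = None
    print(f"Missing pipeline package: {err}")
```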

Similarly, it’ll give you more flexibility when writing tests that require loading pipelines. For example, instead of writing your own try and except logic around spaCy’s loader, you can use pytest’s importorskip() function to only run a test if a specific pipeline package or version is installed. Each pipeline package exposes a __version__ attribute which you can also use to perform your own version compatibility checks before loading it.
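
A version check based on __version__ could look like the following sketch. Both functions are hypothetical helpers, and comparing only the major and minor parts is an assumption made for this example, not a rule stated in this guide. A SimpleNamespace stands in for an imported pipeline package so the code runs without one installed.

```python
from types import SimpleNamespace


def major_minor(version: str) -> tuple:
    """Return the (major, minor) part of a version string such as '3.0.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def has_expected_version(package, expected: str) -> bool:
    """Compare a package's __version__ with an expected major.minor version."""
    return major_minor(package.__version__) == major_minor(expected)


# Stand-in for an imported pipeline package such as en_core_web_sm
fake_pipeline = SimpleNamespace(__version__="3.0.0")
print(has_expected_version(fake_pipeline, "3.0"))
```

With a real package, you would pass the imported module itself, for example has_expected_version(en_core_web_sm, "3.0"), before calling its load method.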