Get started

Models & Languages

spaCy’s models can be installed as Python packages. This means that they’re a component of your application, just like any other module. They’re versioned and can be defined as a dependency in your requirements.txt. Models can be installed from a download URL or a local directory, manually or via pip. Their data can be located anywhere on your file system.

Quickstart

Install a default model, get the code to load it from within spaCy and an example to test it. For more options, see the section on available models below.

For example, to install and try out the small English model:

python -m spacy download en_core_web_sm

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])

Alternatively, import the installed package directly and call its load() method:

import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])

The same pattern applies to the other languages: replace the model name with, for example, de_core_news_sm (German), fr_core_news_sm (French), ja_core_news_sm (Japanese), zh_core_web_sm (Chinese) or xx_ent_wiki_sm (multi-language NER).

Language support

spaCy currently provides support for the following languages. You can help by improving the existing language data and extending the tokenization patterns. See here for details on how to contribute to model development.

Language | Code | Language Data | Models
Chinese | zh | lang/zh | 3 models
Danish | da | lang/da | 3 models
Dutch | nl | lang/nl | 3 models
English | en | lang/en | 3 models
French | fr | lang/fr | 3 models
German | de | lang/de | 3 models
Greek | el | lang/el | 3 models
Italian | it | lang/it | 3 models
Japanese | ja | lang/ja | 3 models
Lithuanian | lt | lang/lt | 3 models
Multi-language | xx | lang/xx | 1 model
Norwegian Bokmål | nb | lang/nb | 3 models
Polish | pl | lang/pl | 3 models
Portuguese | pt | lang/pt | 3 models
Romanian | ro | lang/ro | 3 models
Spanish | es | lang/es | 3 models
Afrikaans | af | lang/af | none yet
Albanian | sq | lang/sq | none yet
Arabic | ar | lang/ar | none yet
Armenian | hy | lang/hy | none yet
Basque | eu | lang/eu | none yet
Bengali | bn | lang/bn | none yet
Bulgarian | bg | lang/bg | none yet
Catalan | ca | lang/ca | none yet
Croatian | hr | lang/hr | none yet
Czech | cs | lang/cs | none yet
Estonian | et | lang/et | none yet
Finnish | fi | lang/fi | none yet
Gujarati | gu | lang/gu | none yet
Hebrew | he | lang/he | none yet
Hindi | hi | lang/hi | none yet
Hungarian | hu | lang/hu | none yet
Icelandic | is | lang/is | none yet
Indonesian | id | lang/id | none yet
Irish | ga | lang/ga | none yet
Kannada | kn | lang/kn | none yet
Korean | ko | lang/ko | none yet
Latvian | lv | lang/lv | none yet
Ligurian | lij | lang/lij | none yet
Luxembourgish | lb | lang/lb | none yet
Macedonian | mk | lang/mk | none yet
Malayalam | ml | lang/ml | none yet
Marathi | mr | lang/mr | none yet
Nepali | ne | lang/ne | none yet
Persian | fa | lang/fa | none yet
Russian | ru | lang/ru | none yet
Serbian | sr | lang/sr | none yet
Sinhala | si | lang/si | none yet
Slovak | sk | lang/sk | none yet
Slovenian | sl | lang/sl | none yet
Swedish | sv | lang/sv | none yet
Tagalog | tl | lang/tl | none yet
Tamil | ta | lang/ta | none yet
Tatar | tt | lang/tt | none yet
Telugu | te | lang/te | none yet
Thai | th | lang/th | none yet
Turkish | tr | lang/tr | none yet
Ukrainian | uk | lang/uk | none yet
Urdu | ur | lang/ur | none yet
Vietnamese | vi | lang/vi | none yet
Yoruba | yo | lang/yo | none yet

Multi-language support v2.0

As of v2.0, spaCy supports models trained on more than one language. This is especially useful for named entity recognition. The language ID used for multi-language or language-neutral models is xx. The language class, a generic subclass containing only the base language data, can be found in lang/xx.

To load your model with the neutral, multi-language class, simply set "language": "xx" in your model package’s meta.json. You can also import the class directly, or call util.get_lang_class() for lazy-loading.
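Both loading styles look like this (a minimal sketch; the multi-language class is exported as MultiLanguage from spacy.lang.xx):

# Standard import of the multi-language class
from spacy.lang.xx import MultiLanguage
nlp = MultiLanguage()

# With lazy-loading via the language code
from spacy.util import get_lang_class
nlp = get_lang_class("xx")()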

Chinese language support v2.3

The Chinese language class supports three word segmentation options:

  1. Jieba: Chinese uses Jieba for word segmentation by default. It’s enabled when you create a new Chinese language class or call spacy.blank("zh").
  2. Character segmentation: Character segmentation is enabled by setting Chinese.Defaults.use_jieba = False before initializing the language class, which disables Jieba (see the example after this list). As of spaCy v2.3.0, use_jieba can also be configured via the meta tokenizer config options.
  3. PKUSeg: In spaCy v2.3.0, support for PKUSeg was added to provide better segmentation for Chinese OntoNotes and the new Chinese models.
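As a rough sketch of the first two options, using only the spacy.blank() and Chinese.Defaults.use_jieba APIs mentioned above (the example sentence is illustrative):

# Option 1: Jieba word segmentation (the default)
import spacy
nlp = spacy.blank("zh")
doc = nlp("这是一个句子")

# Option 2: character segmentation
from spacy.lang.zh import Chinese
Chinese.Defaults.use_jieba = False  # must be set before the language class is initialized
nlp = Chinese()
doc = nlp("这是一个句子")  # one token per character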

The meta argument of the Chinese language class supports the following tokenizer config settings:

Name | Type | Description
pkuseg_model | unicode | Required: Name of a model provided by pkuseg or the path to a local model directory.
pkuseg_user_dict | unicode | Optional path to a file with one word per line which overrides the default pkuseg user dictionary.
require_pkuseg | bool | Overrides all jieba settings (optional but strongly recommended).

Examples

# Load "default" model cfg = {"pkuseg_model": "default", "require_pkuseg": True} nlp = Chinese(meta={"tokenizer": {"config": cfg}}) # Load local model cfg = {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True} nlp = Chinese(meta={"tokenizer": {"config": cfg}}) # Override the user directory cfg = {"pkuseg_model": "default", "require_pkuseg": True, "pkuseg_user_dict": "/path"} nlp = Chinese(meta={"tokenizer": {"config": cfg}})

You can also modify the user dictionary on-the-fly:

# Append words to user dict
nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])

# Remove all words from user dict and replace with new words
nlp.tokenizer.pkuseg_update_user_dict(["中国"], reset=True)

# Remove all words from user dict
nlp.tokenizer.pkuseg_update_user_dict([], reset=True)

The Chinese models provided by spaCy include a custom pkuseg model trained only on Chinese OntoNotes 5.0, since the models provided by pkuseg include data restricted to research use. For research use, pkuseg provides models for several different domains ("default", "news", "web", "medicine", "tourism"), and for other uses, pkuseg provides a simple training API:

import pkuseg
from spacy.lang.zh import Chinese

# Train pkuseg model
pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")
# Load pkuseg model in spaCy Chinese tokenizer
nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True}}})

Japanese language support v2.3

The Japanese language class uses SudachiPy for word segmentation and part-of-speech tagging. The default Japanese language class and the provided Japanese models use SudachiPy split mode A.

The meta argument of the Japanese language class can be used to configure the split mode to A, B or C.
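For example, a config following the same meta tokenizer pattern as the Chinese examples above might look like this (the split_mode key and mode "B" are illustrative; check the Japanese language docs for the exact option name):

from spacy.lang.ja import Japanese

# Configure SudachiPy to use split mode B instead of the default A
cfg = {"split_mode": "B"}
nlp = Japanese(meta={"tokenizer": {"config": cfg}})
doc = nlp("これは文章です。")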

Installing and using models

The easiest way to download a model is via spaCy’s download command. It takes care of finding the best-matching model compatible with your spaCy installation.

# Download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_sm

# Out-of-the-box: download best-matching default model and create shortcut link
python -m spacy download en

# Download exact model version (doesn't create shortcut link)
python -m spacy download en_core_web_sm-2.2.0 --direct

The download command will install the model via pip and place the package in your site-packages directory.

# In the shell: install spaCy and download the model
pip install spacy
python -m spacy download en_core_web_sm

# In Python: load the installed model
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")

Installation via pip

To download a model directly using pip, point pip install to the URL or local path of the archive file. To find the direct link to a model, head over to the model releases, right-click on the archive link and copy it to your clipboard.

# With external URL
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

# With local file
pip install /Users/you/en_core_web_sm-2.2.0.tar.gz

By default, this will install the model into your site-packages directory. You can then use spacy.load() to load it via its package name, create a shortcut link to assign it a custom name, or import it explicitly as a module. If you need to download models as part of an automated process, we recommend using pip with a direct link, instead of relying on spaCy’s download command.

You can also add the direct download link to your application’s requirements.txt. For more details, see the section on working with models in production.

Manual download and installation

In some cases, you might prefer downloading the data manually, for example to place it into a custom directory. You can download the model via your browser from the latest releases, or configure your own download script using the URL of the archive file. The archive consists of a model directory that contains another directory with the model data.

Directory structure

└── en_core_web_md-2.2.0.tar.gz       # downloaded archive
    ├── meta.json                     # model meta data
    ├── setup.py                      # setup file for pip installation
    └── en_core_web_md                # 📦 model package
        ├── __init__.py               # init for pip installation
        ├── meta.json                 # model meta data
        └── en_core_web_md-2.2.0      # model data

You can place the model package directory anywhere on your local file system. To use it with spaCy, assign it a name by creating a shortcut link for the data directory.

Using models with spaCy

To load a model, use spacy.load with the model’s shortcut link, package name or a path to the data directory:

import spacy
nlp = spacy.load("en_core_web_sm")           # load model package "en_core_web_sm"
nlp = spacy.load("/path/to/en_core_web_sm")  # load package from a directory
nlp = spacy.load("en")                       # load model with shortcut link "en"

doc = nlp("This is a sentence.")

While previous versions of spaCy required you to maintain a data directory containing the models for each installation, you can now choose how and where you want to keep your data. For example, you could download all models manually and put them into a local directory. Whenever your spaCy projects need a model, you create a shortcut link to tell spaCy to load it from there. This means you’ll never end up with duplicate data.

The link command will create a symlink in the spacy/data directory.

python -m spacy link [package name or path] [shortcut] [--force]

The first argument is the package name (if the model was installed via pip), or a local path to the model package. The second argument is the internal name you want to use for the model. Setting the --force flag will overwrite any existing links.

Examples

# set up shortcut link to load installed package as "en_default"
python -m spacy link en_core_web_md en_default

# set up shortcut link to load local model as "my_amazing_model"
python -m spacy link /Users/you/model my_amazing_model

Importing models as modules

If you’ve installed a model via spaCy’s downloader, or directly via pip, you can also import it and then call its load() method with no arguments:

import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")

How you choose to load your models ultimately depends on personal preference. However, for larger code bases, we usually recommend native imports, as this will make it easier to integrate models with your existing build process, continuous integration workflow and testing framework. It’ll also prevent you from ever trying to load a model that is not installed, as your code will raise an ImportError immediately, instead of failing somewhere down the line when calling spacy.load().

For more details, see the section on working with models in production.

Using your own models

If you’ve trained your own model, for example for additional languages or custom named entities, you can save its state using the Language.to_disk() method. To make the model more convenient to deploy, we recommend wrapping it as a Python package.
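A minimal sketch of saving and re-loading such a model (the directory path is a placeholder):

import spacy

nlp = spacy.load("en_core_web_sm")  # or your own trained Language object

# Save the pipeline's current state to a directory
nlp.to_disk("/path/to/my_model")

# Load it back from that directory
nlp_reloaded = spacy.load("/path/to/my_model")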

For more information and a detailed guide on how to package your model, see the documentation on saving and loading models.

Using models in production

If your application depends on one or more models, you’ll usually want to integrate them into your continuous integration workflow and build process. While spaCy provides a range of useful helpers for downloading, linking and loading models, the underlying functionality is entirely based on native Python packages. This allows your application to handle a model like any other package dependency.

For an example of an automated model training and build process, see this overview of how we’re training and packaging our models for spaCy.

Downloading and requiring model dependencies

spaCy’s built-in download command is mostly intended as a convenient, interactive wrapper. It performs compatibility checks and prints detailed error messages and warnings. However, if you’re downloading models as part of an automated build process, this only adds an unnecessary layer of complexity. If you know which models your application needs, you should be specifying them directly.

Because all models are valid Python packages, you can add them to your application’s requirements.txt. If you’re running your own internal PyPI installation, you can upload the models there. pip’s requirements file format supports both package names to download via a PyPI server and direct URLs.

requirements.txt

spacy>=2.2.0,<3.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en_core_web_sm

Specifying #egg= with the package name tells pip which package to expect from the download URL. This way, the package won’t be re-downloaded and overwritten if it’s already installed, just like when you’re downloading a package from PyPI.

All models are versioned and specify their spaCy dependency. This ensures cross-compatibility and lets you specify exact version requirements for each model. If you’ve trained your own model, you can use the package command to generate the required meta data and turn it into a loadable package.
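A sketch of that workflow (all paths and the package name below are placeholders):

# Generate a loadable package from a model data directory
python -m spacy package /path/to/my_model /output/packages

# Build a source archive that can be installed or added to requirements.txt
cd /output/packages/en_my_model-1.0.0
python setup.py sdist
pip install dist/en_my_model-1.0.0.tar.gz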

Loading and testing models

Downloading models directly via pip won’t call spaCy’s link package command, which creates symlinks for model shortcuts. This means that you’ll have to run this command separately, or use the native import syntax to load the models:

import en_core_web_sm
nlp = en_core_web_sm.load()

In general, this approach is recommended for larger code bases, as it’s more “native”, and doesn’t depend on symlinks or rely on spaCy’s loader to resolve string names to model packages. If a model can’t be imported, Python will raise an ImportError immediately. And if a model is imported but not used, any linter will catch that.

Similarly, it’ll give you more flexibility when writing tests that require loading models. For example, instead of writing your own try and except logic around spaCy’s loader, you can use pytest’s importorskip() method to only run a test if a specific model or model version is installed. Each model package exposes a __version__ attribute which you can also use to perform your own version compatibility checks before loading a model.
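For example, a test module could start like this (the test itself is illustrative):

import pytest

# Skip all tests in this module unless en_core_web_sm >= 2.2.0 is installed
en_core_web_sm = pytest.importorskip("en_core_web_sm", minversion="2.2.0")


def test_model_loads_and_tags():
    nlp = en_core_web_sm.load()
    doc = nlp("This is a sentence.")
    assert len(doc) == 5
    assert doc[0].pos_  # the model assigned a part-of-speech tag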