
What's New in v3.0

New features, backwards incompatibilities and migration guide

spaCy v3.0 features all new transformer-based pipelines that bring spaCy’s accuracy right up to the current state-of-the-art. You can use any pretrained transformer to train your own pipelines, and even share one transformer between multiple components with multi-task learning. Training is now fully configurable and extensible, and you can define your own custom models using PyTorch, TensorFlow and other frameworks. The new spaCy projects system lets you describe whole end-to-end workflows in a single file, giving you an easy path from prototype to production, and making it easy to clone and adapt best-practice projects for your own use cases.

New Features

This section contains an overview of the most important new features and improvements. The API docs include additional deprecation notes. New methods and functions that were introduced in this version are marked with the tag v3.0.

Transformer-based pipelines

spaCy v3.0 features all new transformer-based pipelines that bring spaCy’s accuracy right up to the current state-of-the-art. You can use any pretrained transformer to train your own pipelines, and even share one transformer between multiple components with multi-task learning. spaCy’s transformer support interoperates with PyTorch and the HuggingFace transformers library, giving you access to thousands of pretrained models for your pipelines.
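Here's a minimal sketch of what this looks like in practice, assuming the en_core_web_trf package and spacy-transformers are installed (the example text is illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

print(doc.ents)
# the shared transformer output is stored on the doc by spacy-transformers
print(doc._.trf_data.tensors[0].shape)
```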

Pipeline components listening to shared embedding component
Pipeline | Parser | Tagger | NER
en_core_web_trf (spaCy v3) | 95.1 | 97.8 | 89.8
en_core_web_lg (spaCy v3) | 92.0 | 97.4 | 85.5
en_core_web_lg (spaCy v2) | 91.9 | 97.2 | 85.5

Full pipeline accuracy on the OntoNotes 5.0 corpus (reported on the development set).

Named Entity Recognition System | OntoNotes | CoNLL '03
spaCy RoBERTa (2020) | 89.8 | 91.6
Stanza (StanfordNLP)¹ | 88.8 | 92.1
Flair² | 89.7 | 93.1

Named entity recognition accuracy on the OntoNotes 5.0 and CoNLL-2003 corpora. See NLP-progress for more results. Project template: benchmarks/ner_conll03. ¹ Qi et al. (2020). ² Akbik et al. (2018).

New trained transformer-based pipelines

Package | Language | Transformer | Tagger | Parser | NER
en_core_web_trf | English | roberta-base | 97.8 | 95.2 | 89.9
de_dep_news_trf | German | bert-base-german-cased | 99.0 | 95.8 | -
es_dep_news_trf | Spanish | bert-base-spanish-wwm-cased | 98.2 | 94.6 | -
fr_dep_news_trf | French | camembert-base | 95.7 | 94.4 | -
zh_core_web_trf | Chinese | bert-base-chinese | 92.5 | 76.6 | 75.4

New training workflow and config system

spaCy v3.0 introduces a comprehensive and extensible system for configuring your training runs. A single configuration file describes every detail of your training run, with no hidden defaults, making it easy to rerun your experiments and track changes. You can use the quickstart widget or the init config command to get started. Instead of providing lots of arguments on the command line, you only need to pass your config.cfg file to spacy train. Training config files include all settings and hyperparameters for training your pipeline. Some settings can also be registered functions that you can swap out and customize, making it easy to implement your own custom models and architectures.

Illustration of pipeline lifecycle

Custom models using any framework

spaCy’s new configuration system makes it easy to customize the neural network models used by the different pipeline components. You can also implement your own architectures via spaCy’s machine learning library Thinc that provides various layers and utilities, as well as thin wrappers around frameworks like PyTorch, TensorFlow and MXNet. Component models all follow the same unified Model API and each Model can also be used as a sublayer of a larger network, allowing you to freely combine implementations from different frameworks into a single model.
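Here's a small sketch of mixing frameworks with Thinc, assuming PyTorch is installed (the layer sizes are arbitrary):

```python
import torch.nn
from thinc.api import Linear, PyTorchWrapper, chain

# wrap a PyTorch module as a Thinc Model and compose it with a native Thinc layer
wrapped_torch_layer = PyTorchWrapper(torch.nn.Linear(32, 32))
model = chain(wrapped_torch_layer, Linear(nO=2, nI=32))
```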

Illustration of Pipe methods

Manage end-to-end workflows with projects

spaCy projects let you manage and share end-to-end spaCy workflows for different use cases and domains, and orchestrate training, packaging and serving your custom pipelines. You can start off by cloning a pre-defined project template, adjust it to fit your needs, load in your data, train a pipeline, export it as a Python package, upload your outputs to a remote storage and share your results with your team.

Illustration of project workflow and commands

spaCy projects also make it easy to integrate with other tools in the data science and machine learning ecosystem, including DVC for data version control, Prodigy for creating labelled data, Streamlit for building interactive apps, FastAPI for serving models in production, Ray for parallel training, Weights & Biases for experiment tracking, and more!

Parallel and distributed training with Ray

Ray is a fast and simple framework for building and running distributed applications. You can use Ray to train spaCy on one or more remote machines, potentially speeding up your training process. The Ray integration is powered by a lightweight extension package, spacy-ray, that automatically adds the ray command to your spaCy CLI if it’s installed in the same environment. You can then run spacy ray train for parallel training.

Illustration of setup

New built-in pipeline components

spaCy v3.0 includes several new trainable and rule-based components that you can add to your pipeline and customize for your use case:

Name | Description
SentenceRecognizer | Trainable component for sentence segmentation.
Morphologizer | Trainable component to predict morphological features.
Lemmatizer | Standalone component for rule-based and lookup lemmatization.
AttributeRuler | Component for setting token attributes using match patterns.
Transformer | Component for using transformer models in your pipeline, accessing outputs and aligning tokens. Provided via spacy-transformers.
TrainablePipe | Base class for trainable pipeline components.
Multi-label TextCategorizer | Trainable component for multi-label text classification.

New and improved pipeline component APIs

Defining, configuring, reusing, training and analyzing pipeline components is now easier and more convenient. The @Language.component and @Language.factory decorators let you register your component and define its default configuration and metadata, such as the attribute values it assigns and requires. Any custom component can be included during training, and sourcing components from existing trained pipelines lets you mix and match custom pipelines. The nlp.analyze_pipes method outputs structured information about the current pipeline and its components, including the attributes they assign, the scores they compute during training and whether any required attributes aren't set.
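Here's a minimal sketch of pipeline analysis, using a blank English pipeline and the built-in tagger and entity_linker factories:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
# the entity linker requires doc.ents, which nothing in this pipeline sets
nlp.add_pipe("entity_linker")

# prints a summary of assigned/required attributes and flags the missing ones
analysis = nlp.analyze_pipes(pretty=True)
```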

Dependency matching

The new DependencyMatcher lets you match patterns within the dependency parse using Semgrex operators. It follows the same API as the token-based Matcher. A pattern added to the dependency matcher consists of a list of dictionaries, with each dictionary describing a token to match and its relation to an existing token in the pattern.

Dependency matcher pattern
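Here's a minimal sketch of a dependency matcher pattern (the pattern name, attributes and example sentence are illustrative):

```python
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    # anchor token: a verb with the lemma "found"
    {"RIGHT_ID": "founded", "RIGHT_ATTRS": {"LEMMA": "found"}},
    # the subject of that verb, attached via the nsubj relation
    {"LEFT_ID": "founded", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
matcher.add("FOUNDED", [pattern])

doc = nlp("Smith founded a healthcare company in 2005.")
for match_id, token_ids in matcher(doc):
    print([doc[i].text for i in token_ids])
```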

Type hints and type-based data validation

spaCy v3.0 officially drops support for Python 2 and now requires Python 3.6+. This also means that the code base can take full advantage of type hints. spaCy’s user-facing API that’s implemented in pure Python (as opposed to Cython) now comes with type hints. The new version of spaCy’s machine learning library Thinc also features extensive type support, including custom types for models and arrays, and a custom mypy plugin that can be used to type-check model definitions.

For data validation, spaCy v3.0 adopts pydantic. It also powers the data validation of Thinc’s config system, which lets you register custom functions with typed arguments, reference them in your config and see validation errors if the argument values don’t match.
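For example, here's a sketch of a registered function with typed arguments (the registry name and function are hypothetical). If a config references "fruit_lookup.v1" and passes a value that doesn't match the declared types, resolving the config raises a validation error:

```python
from typing import List

import spacy

# hypothetical registered function: the argument types are validated by
# Thinc/pydantic against the values provided in the config
@spacy.registry.misc("fruit_lookup.v1")
def create_fruit_lookup(fruits: List[str], case_sensitive: bool = False) -> List[str]:
    return fruits if case_sensitive else [fruit.lower() for fruit in fruits]
```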

New methods, attributes and commands

The following methods, attributes and commands are new in spaCy v3.0.

Name | Description
Token.lex | Access a token's Lexeme.
Token.morph | Access a token's morphological analysis.
Doc.spans | Named span groups to store and access collections of potentially overlapping spans. Uses the new SpanGroup data structure.
Doc.has_annotation | Check whether a doc has annotation on a token attribute.
Language.select_pipes | Context manager for enabling or disabling specific pipeline components for a block.
Language.disable_pipe, Language.enable_pipe | Disable or enable a loaded pipeline component (but don't remove it).
Language.analyze_pipes | Analyze components and their interdependencies.
Language.resume_training | Experimental: continue training a trained pipeline and initialize "rehearsal" for components that implement a rehearse method to prevent catastrophic forgetting.
@Language.factory, @Language.component | Decorators for registering pipeline component factories and simple stateless component functions.
Language.has_factory | Check whether a component factory is registered on a language class.
Language.get_factory_meta, Language.get_pipe_meta | Get the FactoryMeta with component metadata for a factory or instance name.
Language.config | The config used to create the current nlp object. An instance of Config that can be saved to disk and used for training.
Language.components, Language.component_names | All available components and component names, including disabled components that are not run as part of the pipeline.
Language.disabled | Names of disabled components that are not run as part of the pipeline.
TrainablePipe.score | Method on pipeline components that returns a dictionary of evaluation scores.
registry | Function registry to map functions to string names that can be referenced in configs.
util.load_meta, util.load_config | Updated helpers for loading a pipeline's meta.json and config.cfg.
util.get_installed_models | Names of all pipeline packages installed in the environment.
init config, init fill-config, debug config | CLI commands for initializing, auto-filling and debugging training configs.
init vectors | Convert word vectors for use with spaCy.
init labels | Generate JSON files for the labels in the data to speed up training.
project | Suite of CLI commands for cloning, running and managing spaCy projects.
ray | Suite of CLI commands for parallel training with Ray, provided by the spacy-ray extension package.

New and updated documentation

To help you get started with spaCy v3.0 and the new features, we’ve added several new or rewritten documentation pages, including a new usage guide on embeddings, transformers and transfer learning, a guide on training pipelines and models rewritten from scratch, a page explaining the new spaCy projects and updated usage documentation on custom pipeline components. We’ve also added a bunch of new illustrations and new API reference pages documenting spaCy’s machine learning model architectures and the expected data formats. API pages about pipeline components now include more information, like the default config and implementation, and we’ve adopted a more detailed format for documenting argument and return types.

Library architecture

Backwards Incompatibilities

As always, we’ve tried to keep the breaking changes to a minimum and focus on changes that were necessary to support the new features, fix problems or improve usability. The following section lists the relevant changes to the user-facing API. For specific examples of how to rewrite your code, check out the migration guide.

API changes

  • Pipeline package symlinks, the link command and shortcut names are now deprecated. There can be many different trained pipelines and not just one “English model”, so you should always use the full package name like en_core_web_sm explicitly.
  • A pipeline’s meta.json is now only used to provide meta information like the package name, author, license and labels. It’s not used to construct the processing pipeline anymore. This is all defined in the config.cfg, which also includes all settings used to train the pipeline.
  • The train, pretrain and debug data commands now only take a config.cfg.
  • Language.add_pipe now takes the string name of the component factory instead of the component function.
  • Custom pipeline components now need to be decorated with the @Language.component or @Language.factory decorator.
  • The Language.update, Language.evaluate and TrainablePipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.
  • The begin_training methods have been renamed to initialize and now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples.
  • Matcher.add and PhraseMatcher.add now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
  • The Doc flags like Doc.is_parsed or Doc.is_tagged have been replaced by Doc.has_annotation.
  • The spacy.gold module has been renamed to spacy.training.
  • The PRON_LEMMA symbol and -PRON- as an indicator for pronoun lemmas have been removed.
  • The TAG_MAP and MORPH_RULES in the language data have been replaced by the more flexible AttributeRuler.
  • The Lemmatizer is now a standalone pipeline component and doesn’t provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
  • Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference using the keyword-only tag.
  • The textcat pipeline component is now only applicable to classification of mutually exclusive classes, i.e. one predicted class per input sentence or document. To perform multi-label classification, use the new textcat_multilabel component instead.

Removed or renamed API

Removed | Replacement
Language.disable_pipes | Language.select_pipes, Language.disable_pipe, Language.enable_pipe
Language.begin_training, Pipe.begin_training, … | Language.initialize, Pipe.initialize, …
Doc.is_tagged, Doc.is_parsed, … | Doc.has_annotation
GoldParse | Example
GoldCorpus | Corpus
KnowledgeBase.load_bulk, KnowledgeBase.dump | KnowledgeBase.from_disk, KnowledgeBase.to_disk
KnowledgeBase.get_candidates | KnowledgeBase.get_alias_candidates
Matcher.pipe, PhraseMatcher.pipe | not needed
gold.offsets_from_biluo_tags, gold.spans_from_biluo_tags, gold.biluo_tags_from_offsets | training.biluo_tags_to_offsets, training.biluo_tags_to_spans, training.offsets_to_biluo_tags
spacy init-model | spacy init vectors
spacy debug-data | spacy debug data
spacy profile | spacy debug profile
spacy link, util.set_data_path, util.get_data_path | not needed, symlinks are deprecated

The following methods, attributes and arguments were removed in v3.0. Most of them had been deprecated for a while, many previously raised errors, and many were internals. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.

Removed | Replacement
Doc.tokens_from_list | Doc.__init__
Doc.merge, Span.merge | Doc.retokenize
Token.string, Span.string, Span.upper, Span.lower | Span.text, Token.text
Language.tagger, Language.parser, Language.entity | Language.get_pipe
keyword arguments like vocab=False on to_disk, from_disk, to_bytes, from_bytes | exclude=["vocab"]
n_threads argument on Tokenizer, Matcher, PhraseMatcher | n_process
verbose argument on Language.evaluate | logging (DEBUG)
SentenceSegmenter hook, SimilarityHook | user hooks, Sentencizer, SentenceRecognizer

Migrating from v2.x

Downloading and loading trained pipelines

Symlinks and shortcuts like en have been deprecated for a while and are no longer supported. There are many different trained pipelines with different capabilities, not just one "English model". In order to download and load a package, you should always use its full name – for instance, en_core_web_sm.

Custom pipeline components and factories

Custom pipeline components now have to be registered explicitly using the @Language.component or @Language.factory decorator. For simple functions that take a Doc and return it, all you have to do is add the @Language.component decorator to it and assign it a name:

Stateless function components
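Here's a minimal sketch of a stateless function component (the component name and the sentence-boundary logic are illustrative):

```python
from spacy.language import Language

@Language.component("custom_sentence_boundaries")
def custom_sentence_boundaries(doc):
    # treat an ellipsis token as the end of a sentence
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

# the component can now be added to any pipeline by its string name:
# nlp.add_pipe("custom_sentence_boundaries", before="parser")
```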

For class components that are initialized with settings and/or the shared nlp object, you can use the @Language.factory decorator. Also make sure that the method used to initialize the factory takes two named arguments: nlp (the current nlp object) and name (the string name of the component instance).

Stateful class components
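Here's a sketch of a stateful class component registered with @Language.factory (the component name and setting are illustrative):

```python
from spacy.language import Language

@Language.factory("acronym_expander", default_config={"case_sensitive": False})
class AcronymExpander:
    def __init__(self, nlp, name, case_sensitive: bool):
        self.name = name
        self.case_sensitive = case_sensitive

    def __call__(self, doc):
        # add your processing logic here and always return the doc
        return doc
```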

Instead of decorating your class, you could also add a factory function that takes the arguments nlp and name and returns an instance of your component:

Stateful class components with factory function
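Here's a sketch of the same illustrative component created via a factory function instead of decorating the class itself:

```python
from spacy.language import Language

class AcronymExpander:
    def __init__(self, nlp, name, case_sensitive):
        self.name = name
        self.case_sensitive = case_sensitive

    def __call__(self, doc):
        return doc

@Language.factory("acronym_expander", default_config={"case_sensitive": False})
def create_acronym_expander(nlp, name, case_sensitive: bool):
    return AcronymExpander(nlp, name=name, case_sensitive=case_sensitive)
```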

The @Language.component and @Language.factory decorators now take care of adding an entry to the component factories, so spaCy knows how to load a component back in from its string name. You won’t have to write to Language.factories manually anymore.

Adding components to the pipeline

The nlp.add_pipe method now takes the string name of the component factory instead of a callable component. This allows spaCy to track and serialize components that have been added and their settings.

nlp.add_pipe now also returns the pipeline component itself, so you can access its attributes. The nlp.create_pipe method is now mostly internals and you typically shouldn’t have to use it in your code.

If you need to add a component from an existing trained pipeline, you can now use the source argument on nlp.add_pipe. This will check that the component is compatible, and take care of porting over all config. During training, you can also reference existing trained components in your config and decide whether or not they should be updated with more data.
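Here's a short sketch of both patterns, assuming en_core_web_sm is installed:

```python
import spacy

nlp = spacy.blank("en")
# v3: add components by the string name of their factory
sentencizer = nlp.add_pipe("sentencizer")

# copy a trained component from an existing pipeline
source_nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("ner", source=source_nlp)
```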

Configuring pipeline components with settings

Because pipeline components are now added using their string names, you won’t have to instantiate the component classes directly anymore. To configure the component, you can now use the config argument on nlp.add_pipe.

The config corresponds to the component settings in the config.cfg and will overwrite the default config defined by the component.
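Here's a minimal sketch using the built-in sentencizer and its punct_chars setting:

```python
import spacy

nlp = spacy.blank("en")
# the config dict overrides the component's default settings
nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?", "…"]})
```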

Adding match patterns

The Matcher.add, PhraseMatcher.add and DependencyMatcher.add methods now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
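Here's a sketch of the updated call signature (the pattern and match ID are illustrative):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

patterns = [[{"LOWER": "hello"}, {"LOWER": "world"}]]
# v3: patterns are passed as one list; on_match is an optional keyword argument
matcher.add("HelloWorld", patterns, on_match=None)
```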

Migrating attributes in tokenizer exceptions

Tokenizer exceptions are now only allowed to set ORTH and NORM values as part of the token attributes. Exceptions for other attributes such as TAG and LEMMA should be moved to an AttributeRuler component:
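Here's a sketch of the migration, assuming a v2-style exception that also set a LEMMA on the contraction "n't" (the pattern and attributes are illustrative):

```python
import spacy

nlp = spacy.blank("en")
# tokenizer exceptions may now only set ORTH and NORM
nlp.tokenizer.add_special_case("don't", [{"ORTH": "do"}, {"ORTH": "n't", "NORM": "not"}])

# TAG/LEMMA overrides move to an AttributeRuler component
ruler = nlp.add_pipe("attribute_ruler")
ruler.add(patterns=[[{"ORTH": "n't"}]], attrs={"LEMMA": "not"})
```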

Migrating tag maps and morph rules

Instead of defining a tag_map and morph_rules in the language data, spaCy v3.0 now manages mappings and exceptions with a separate and more flexible pipeline component, the AttributeRuler. See the usage guide for examples. If you have tag maps and morph rules in the v2.x format, you can load them into the attribute ruler before training using the [initialize] block of your config.

Using Lexeme Tables

To use tables like lexeme_prob when training a model from scratch, you need to add an entry to the initialize block in your config. Here’s what that looks like for the existing trained pipelines:

config.cfg (excerpt)
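A sketch of that block, assuming the tables are provided by the spacy-lookups-data package and loaded via the built-in spacy.LookupsDataLoader.v1 function:

```ini
[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_prob"]
```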

The AttributeRuler also provides two handy helper methods load_from_tag_map and load_from_morph_rules that let you load in your existing tag map or morph rules:
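Here's a sketch, assuming YOUR_TAG_MAP is a v2-style tag map mapping fine-grained tags to coarse-grained POS values:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# illustrative excerpt of a v2-style tag map
YOUR_TAG_MAP = {"NN": {"POS": "NOUN"}, "VBZ": {"POS": "VERB"}}
ruler.load_from_tag_map(YOUR_TAG_MAP)
```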

Migrating Doc flags

The Doc flags Doc.is_tagged, Doc.is_parsed, Doc.is_nered and Doc.is_sentenced are deprecated in v3.0 and replaced by the Doc.has_annotation method, which refers to the token attribute symbols (the same symbols used in Matcher patterns):
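Here's a sketch of the equivalent checks, assuming en_core_web_sm is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")

# v2: doc.is_tagged, doc.is_parsed, doc.is_sentenced, doc.is_nered
# v3: check for annotation on the corresponding token attributes
doc.has_annotation("TAG")
doc.has_annotation("DEP")
doc.has_annotation("SENT_START")
doc.has_annotation("ENT_IOB")
```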

Training pipelines and models

To train your pipelines, you should now pretty much always use the spacy train CLI. You shouldn’t have to put together your own training scripts anymore, unless you really want to. The training commands now use a flexible config file that describes all training settings and hyperparameters, as well as your pipeline, components and architectures to use. The --code argument lets you pass in code containing custom registered functions that you can reference in your config. To get started, check out the quickstart widget.

Binary .spacy training data format

spaCy v3.0 uses a new binary training data format created by serializing a DocBin, which represents a collection of Doc objects. This means that you can train spaCy pipelines using the same format it outputs: annotated Doc objects. The binary format is extremely efficient in storage, especially when packing multiple documents together. You can convert your existing JSON-formatted data using the spacy convert command, which outputs .spacy files.
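You can also create .spacy files directly from Python; here's a minimal sketch (the text and entity annotation are illustrative):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp.make_doc("I visited New York")
# annotate the entity "New York" (characters 10-18)
doc.ents = [doc.char_span(10, 18, label="GPE")]

# a DocBin holds a collection of Doc objects and serializes them efficiently
doc_bin = DocBin(docs=[doc])
doc_bin.to_disk("./train.spacy")
```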

Training config

The easiest way to get started with a training config is to use the init config command or the quickstart widget. You can define your requirements, and it will auto-generate a starter config with the best-matching default settings.

If you've exported a starter config from our quickstart widget, you can use the init fill-config command to fill it in with all default values. You can then use the auto-generated config.cfg for training.

Modifying tokenizer settings

If you were using a base model with spacy train to customize the tokenizer settings in v2, your modifications can be provided in the [initialize.before_init] callback.

Write a registered callback that modifies the tokenizer settings and specify this callback in your config:

functions.py
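Here's a sketch of such a callback (the registry name and the special case it adds are illustrative):

```python
import spacy

@spacy.registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        # modify nlp.tokenizer here, e.g. add a special-case rule
        nlp.tokenizer.add_special_case("lemme", [{"ORTH": "lem"}, {"ORTH": "me"}])
    return customize_tokenizer
```

In your config, you would then reference this callback by its registered name under [initialize.before_init].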

When training, provide the file containing the function above via the --code option.

The train step requires the --code option with your registered functions from the [initialize] block, but since those callbacks are only required during the initialization step, you don’t need to provide them with the final pipeline package. However, to make it easier for others to replicate your training setup, you can choose to package the initialization callbacks with the pipeline package or to publish them separately.

Training via the Python API

For most use cases, you shouldn’t have to write your own training scripts anymore. Instead, you can use spacy train with a config file and custom registered functions if needed. You can even register callbacks that can modify the nlp object at different stages of its lifecycle to fully customize it before training.

If you do decide to use the internal training API from Python, you should only need a few small modifications to convert your scripts from spaCy v2.x to v3.x. The Example.from_dict classmethod takes a reference Doc and a dictionary of annotations, similar to the “simple training style” in spaCy v2.x:

Migrating Doc and GoldParse

Migrating simple training style
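Here's a minimal sketch of the new pattern (the text and entity annotation are illustrative):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

# v3: build an Example from a reference Doc and a dict of annotations,
# then pass batches of Examples to nlp.update / nlp.evaluate
doc = nlp.make_doc("I visited New York")
example = Example.from_dict(doc, {"entities": [(10, 18, "GPE")]})
```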

The Language.update, Language.evaluate and TrainablePipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.

Training loop

Language.begin_training and TrainablePipe.begin_training have been renamed to Language.initialize and TrainablePipe.initialize, and the methods now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples. The data examples are used to initialize the models of trainable pipeline components, which includes validating the network, inferring missing shapes and setting up the label scheme.
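Here's a sketch of a minimal v3-style training loop (the toy data and number of iterations are illustrative):

```python
import random

import spacy
from spacy.training import Example

TRAIN_DATA = [("I visited New York", {"entities": [(10, 18, "GPE")]})]

nlp = spacy.blank("en")
nlp.add_pipe("ner")

def get_examples():
    # initialize() takes a function that returns Example objects
    return [Example.from_dict(nlp.make_doc(text), annots) for text, annots in TRAIN_DATA]

optimizer = nlp.initialize(get_examples)
for _ in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annots in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annots)
        nlp.update([example], sgd=optimizer, losses=losses)
```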

Packaging trained pipelines

The spacy package command now automatically builds the installable .tar.gz sdist of the Python package, so you don’t have to run this step manually anymore. To disable the behavior, you can set --build none. You can also choose to build a binary wheel (which installs more efficiently) by setting --build wheel, or to build both the sdist and wheel by setting --build sdist,wheel.

Data utilities and gold module

The spacy.gold module has been renamed to spacy.training and the conversion utilities now follow the naming format of x_to_y. This mostly affects internals, but if you’ve been using the span offset conversion utilities offsets_to_biluo_tags, biluo_tags_to_offsets or biluo_tags_to_spans, you’ll have to change your names and imports:
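Here's a sketch of the rename (the text and entity offsets are illustrative):

```python
# v2.x
# from spacy.gold import biluo_tags_from_offsets

# v3.x
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("I visited New York")
tags = offsets_to_biluo_tags(doc, [(10, 18, "GPE")])
# ['O', 'O', 'B-GPE', 'L-GPE']
```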

Migration notes for plugin maintainers

Thanks to everyone who’s been contributing to the spaCy ecosystem by developing and maintaining one of the many awesome plugins and extensions. We’ve tried to make it as easy as possible for you to upgrade your packages for spaCy v3.0. The most common use case for plugins is providing pipeline components and extension attributes. When migrating your plugin, double-check the following:

  • Use the @Language.factory decorator to register your component and assign it a name. This allows users to refer to your components by name and serialize pipelines referencing them. Remove all manual entries to the Language.factories.
  • Make sure your component factories take at least two named arguments: nlp (the current nlp object) and name (the instance name of the added component so you can identify multiple instances of the same component).
  • Update all references to nlp.add_pipe in your docs to use string names instead of the component functions.

Using GPUs in Jupyter notebooks

In Jupyter notebooks, run prefer_gpu, require_gpu or require_cpu in the same cell as spacy.load to ensure that the model is loaded on the correct device.

Due to a bug related to contextvars (see the bug report), the GPU settings may not be preserved correctly across cells, resulting in models being loaded on the wrong device or only partially on GPU.
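Here's a sketch of the recommended pattern, assuming en_core_web_trf and a working GPU setup are available:

```python
import spacy

# run in the SAME notebook cell as spacy.load
spacy.require_gpu()
nlp = spacy.load("en_core_web_trf")
```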