What's New in v3.5
spaCy v3.5 introduces three new CLI commands, apply, benchmark and
find-threshold, adds fuzzy matching, provides improvements to our entity
linking functionality, and includes a range of language updates and bug fixes.
New CLI commands
The apply CLI can be used to apply a pipeline to one or more .txt, .jsonl or
.spacy input files, saving the annotated docs in a single .spacy file.
The benchmark CLI has been added to extend the existing
evaluate functionality with a wider range of profiling subcommands.
The benchmark accuracy CLI is introduced as an alias for
evaluate. The new
benchmark speed CLI performs warmup rounds before measuring the speed in words
per second on batches of randomly shuffled documents from the provided data.
The output is the mean performance using batches (nlp.pipe) with a 95%
confidence interval, as reported for example when profiling en_core_web_sm
on CPU.
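To make the reported statistic concrete, here is a minimal sketch of how a mean words-per-second figure with a 95% confidence interval could be derived from per-batch measurements. It uses a percentile bootstrap built on the standard library. That choice, the helper name mean_with_ci and the sample numbers are assumptions for illustration, not spaCy’s actual implementation or real benchmark results.

```python
import random
import statistics


def mean_with_ci(samples, iterations=2000, alpha=0.05, seed=0):
    """Return (mean, low, high) using a percentile bootstrap.

    Hypothetical helper: spaCy's own benchmark code may compute its
    interval differently.
    """
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(iterations)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    low = boot_means[int((alpha / 2) * iterations)]
    high = boot_means[int((1 - alpha / 2) * iterations) - 1]
    return statistics.fmean(samples), low, high


# Made-up words-per-second measurements, one per timed batch.
wps = [18250.0, 19010.0, 18760.0, 19120.0, 18540.0, 18980.0, 18830.0, 19200.0]
mean, low, high = mean_with_ci(wps)
print(f"Mean: {mean:.1f} words/s (95% CI: {low - mean:+.1f} {high - mean:+.1f})")
```

The interval is printed as offsets from the mean, so a narrow interval indicates that the batches ran at a consistent speed.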
The find-threshold CLI runs a series of trials
across threshold values from 0.0 to 1.0 and identifies the best threshold
for the provided score metric.
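The following command runs 20 trials for the spancat component in
my_pipeline, recording the spans_sc_f score for each value of the threshold
[components.spancat.threshold] from 0.0 to 1.0:
$ spacy find-threshold my_pipeline data.spacy spancat threshold spans_sc_f --n_trials 20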
The find-threshold CLI can be used with textcat_multilabel, spancat and
custom components with thresholds that are applied while predicting or scoring.
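The search itself can be pictured as a simple sweep: evaluate the score at evenly spaced thresholds and keep the best one. The sketch below illustrates that idea in plain Python. The toy f_score function and the (score, gold label) pairs are made up and stand in for evaluating a real pipeline on development data, so treat it as an illustration of the technique rather than the CLI’s implementation.

```python
def linspace(start, stop, n):
    """Return n evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def f_score(threshold, scored_items):
    """Toy F1: an item is predicted positive if its score >= threshold."""
    tp = sum(1 for score, gold in scored_items if score >= threshold and gold)
    fp = sum(1 for score, gold in scored_items if score >= threshold and not gold)
    fn = sum(1 for score, gold in scored_items if score < threshold and gold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up (model score, gold label) pairs standing in for evaluation data.
items = [(0.95, True), (0.80, True), (0.65, False), (0.55, True),
         (0.40, False), (0.30, False), (0.20, True), (0.10, False)]

n_trials = 20
# One trial per threshold value between 0.0 and 1.0.
results = {t: f_score(t, items) for t in linspace(0.0, 1.0, n_trials)}
best_threshold = max(results, key=results.get)
print(f"best threshold: {best_threshold:.3f}, score: {results[best_threshold]:.3f}")
```

When several thresholds tie for the best score, this sketch keeps the lowest one. A real search may break ties differently.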
Fuzzy matching
New FUZZY operators support fuzzy matching with the Matcher. By default, the
FUZZY operator allows a Levenshtein edit distance of at least 2 and up to 30%
of the pattern string length. FUZZY1..FUZZY9 can be used to specify the exact
number of allowed edits.
Note that FUZZY uses Levenshtein edit distance rather than Damerau-Levenshtein
edit distance, so a transposition like teh for the counts as two edits, one
insertion and one deletion.
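The effect of counting transpositions as two edits is easy to check with a small reference implementation. The sketch below computes plain Levenshtein distance and models the default edit limit as max(2, round(0.3 * len(pattern))). That formula is an assumption about how the two bounds combine, and both function names are hypothetical rather than part of spaCy’s API.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def default_max_edits(pattern_text: str) -> int:
    """Assumed default limit: at least 2 edits, up to 30% of the pattern length."""
    return max(2, round(0.3 * len(pattern_text)))


print(levenshtein("teh", "the"))        # 2: a transposition costs two edits
print(default_max_edits("the"))         # 2
print(default_max_edits("definitely"))  # 3
```

Under this assumption, a short pattern like the still tolerates a single transposition, while longer patterns allow proportionally more edits.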
If you’d prefer an alternate fuzzy matching algorithm, you can provide your own
custom method to the
Matcher or as a config option for an entity ruler and span ruler.
FUZZY and REGEX with lists
The FUZZY and REGEX operators are also now supported for lists with IN and NOT_IN.
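As an illustration, list-based patterns nest the IN or NOT_IN operator inside FUZZY or REGEX. Token patterns are plain Python data, so the sketch below only builds and prints them. The word lists, regular expressions and the match key mentioned in the comment are illustrative choices.

```python
# Token patterns are plain Python data, so they can be written and inspected
# without loading a pipeline. The word lists and regexes are illustrative.

# Fuzzy-match the token text against any entry in a list.
fuzzy_in_pattern = [{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]

# Match tokens whose text matches none of the listed regular expressions.
regex_not_in_pattern = [{"TEXT": {"REGEX": {"NOT_IN": ["^awe(some)?$", "^wonder(ful)?"]}}}]

# With spaCy v3.5+ installed, each pattern would be registered in the usual way,
# e.g. matcher.add("POSITIVE", [fuzzy_in_pattern]) on a spacy.matcher.Matcher.
for name, pattern in [("fuzzy IN", fuzzy_in_pattern), ("regex NOT_IN", regex_not_in_pattern)]:
    print(name, "->", pattern)
```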
Entity linking generalization
The knowledge base used for entity linking is now easier to customize and has a
new default implementation, InMemoryLookupKB.
Additional features and improvements
- Language updates:
  - Extended support for Slovenian
  - Fixed lookup fallback for French and Catalan lemmatizers
  - Switch Russian and Ukrainian lemmatizers to pymorphy3
  - Support for editorial punctuation in Ancient Greek
  - Update to Russian tokenizer exceptions
  - Small fix for Dutch stop words
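- Allow up to typer v0.7.x, mypy 0.990 and typing_extensions v4.4.x.
- New spacy.ConsoleLogger.v3 with expanded progress tracking.
- Improved scoring behavior for textcat with spacy.textcat_scorer.v2 and spacy.textcat_multilabel_scorer.v2.
- Updates so that downstream components can train properly on a frozen tok2vec or transformer layer.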
- Allow interpolation of variables in directory names in projects.
- Support for local file system remotes for projects.
- Improve UX around displacy.serve when the default port is in use.
- Optional before_update callback that is invoked at the start of each training step.
- Improve performance of SpanGroup and fix typing issues for SpanGroup and Span objects.
- Patch a security vulnerability in extracting tar files.
- Add equality definition for Vectors.
- Ensure Vocab.to_disk respects the exclude setting for lookups and vectors.
- Correctly handle missing annotations in the edit tree lemmatizer.
Trained pipeline updates
- The CNN pipelines add IS_SPACE as a tok2vec feature for tagger and morphologizer components to improve tagging of non-whitespace vs. whitespace tokens.
- The transformer pipelines require spacy-transformers v1.2, which uses the exact alignment from tokenizers for fast tokenizers instead of the heuristic alignment from spacy-alignments. For all trained pipelines except ja_core_news_trf, the alignments between spaCy tokens and transformer tokens may be slightly different. More details about the spacy-transformers changes are available in the v1.2.0 release notes.
Notes about upgrading from v3.4
Validation of textcat values
An error is now raised when unsupported values are given as input to train a
textcat or textcat_multilabel model. Ensure that values are 0.0 or 1.0,
as explained in the docs.
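If your training data is generated by a script, it can help to check category values before training rather than waiting for the error. The sketch below is one way to do that. The check_cats helper is hypothetical and not part of spaCy, and the labels are made up.

```python
def check_cats(cats):
    """Raise ValueError if any category value is not equal to 0.0 or 1.0.

    Hypothetical pre-flight check for training data; not part of spaCy.
    """
    for label, value in cats.items():
        if value not in (0.0, 1.0):
            raise ValueError(f"Unsupported value {value!r} for label {label!r}")
    return cats


check_cats({"POSITIVE": 1.0, "NEGATIVE": 0.0})  # passes silently

try:
    check_cats({"POSITIVE": 0.7, "NEGATIVE": 0.3})
except ValueError as err:
    print(err)
```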
Using the default knowledge base
As KnowledgeBase is now an abstract class, you should call the constructor of
the new InMemoryLookupKB instead when you want to use spaCy’s default KB
implementation.
If you’ve written a custom KB that inherits from
KnowledgeBase, you’ll need to
implement its abstract methods, or alternatively inherit from InMemoryLookupKB
instead.
Updated scorers for tokenization and textcat
We fixed a bug that inflated the
token_acc scores in v3.0-v3.4. The reported
token_acc will drop from v3.4 to v3.5, but if
token_p/r/f stay the same,
your tokenization performance has not changed from v3.4.
For new textcat or textcat_multilabel configs, the new default v2 scorers:
- ignore threshold for textcat, so the reported cats_p/r/f may increase slightly in v3.5 even though the underlying predictions are unchanged
- report the performance of only the final textcat or textcat_multilabel component in the pipeline by default
- allow custom scorers to be used to score multiple textcat and textcat_multilabel components with Scorer.score_cats by restricting the evaluation to the component’s provided labels
Pipeline package version compatibility
When you’re loading a pipeline package trained with an earlier version of spaCy v3, you will see a warning telling you that the pipeline may be incompatible. This doesn’t necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results.
If you’re using one of the trained pipelines we provide, you should
run spacy download to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
spacy validate.
If you’ve trained your own custom pipeline and you’ve confirmed that it’s still
working as expected, you can update the spaCy version requirements in the
meta.json.
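The version requirement lives in the spacy_version field of the package’s meta.json. As a minimal sketch, the snippet below rewrites that field with the standard library. The helper name, the package name and the version ranges are illustrative assumptions, and it runs against a throwaway file rather than a real pipeline directory.

```python
import json
import tempfile
from pathlib import Path


def update_spacy_version(meta_path, new_range):
    """Rewrite the "spacy_version" field of a meta.json file in place."""
    meta_path = Path(meta_path)
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    meta["spacy_version"] = new_range
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return meta


# Demonstrate on a throwaway file; the package name and version ranges are
# illustrative assumptions, not values taken from a real pipeline.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "meta.json"
    path.write_text(
        json.dumps({"name": "my_pipeline", "spacy_version": ">=3.4.0,<3.5.0"}),
        encoding="utf8",
    )
    updated = update_spacy_version(path, ">=3.4.0,<3.6.0")
    print(updated["spacy_version"])
```

Only widen the range after you have confirmed that the pipeline still behaves as expected on your evaluation data.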
Updating v3.4 configs
To update a config from spaCy v3.4 with the new v3.5 settings, run init fill-config.
In many cases (spacy train, spacy.load), the new defaults will be filled in
automatically, but you’ll need to fill in the new settings to run
debug config and debug data.