What's New in v3.5
spaCy v3.5 introduces three new CLI commands, apply, benchmark and
find-threshold, adds fuzzy matching, provides improvements to our entity
linking functionality, and includes a range of language updates and bug fixes.
New CLI commands
apply CLI
The apply CLI can be used to apply a pipeline to one or more .txt, .jsonl or
.spacy input files, saving the annotated docs in a single .spacy file.
benchmark CLI
The benchmark CLI has been added to extend the existing evaluate functionality
with a wider range of profiling subcommands.
The benchmark accuracy CLI is introduced as an alias for evaluate. The new
benchmark speed CLI performs warmup rounds before measuring the speed in words
per second on batches of randomly shuffled documents from the provided data.
The output is the mean performance using batches (nlp.pipe) with a 95%
confidence interval, e.g., when profiling en_core_web_sm on CPU.
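
A mean with a 95% confidence interval can be obtained in several ways. The sketch below uses a percentile bootstrap over per-batch words-per-second values with only the standard library. The function name mean_with_ci, the sample numbers and the choice of a bootstrap are illustrative assumptions, not a description of how the CLI computes its output.

```python
import random
import statistics


def mean_with_ci(samples, iterations=2000, seed=0):
    """Return (mean, low, high) where low/high bound a 95% percentile bootstrap interval."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(iterations)
    )
    low = boot_means[int(iterations * 0.025)]
    high = boot_means[int(iterations * 0.975)]
    return statistics.fmean(samples), low, high


# Hypothetical words-per-second measurements, one value per batch.
wps_per_batch = [10250.0, 9980.0, 10410.0, 10120.0, 9875.0, 10300.0, 10050.0, 10190.0]
mean, low, high = mean_with_ci(wps_per_batch)
print(f"Mean: {mean:.1f} words/s (95% CI: {low:.1f} to {high:.1f})")
```

More batches narrow the interval, which is one reason to measure on a reasonably large sample of documents.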
find-threshold CLI
The find-threshold CLI runs a series of trials across threshold values from 0.0
to 1.0 and identifies the best threshold for the provided score metric.
For example, running 20 trials for the spancat component in my_pipeline records
the spans_sc_f score for each value of the threshold
[components.spancat.threshold] from 0.0 to 1.0.
The find-threshold CLI can be used with textcat_multilabel, spancat and
custom components with thresholds that are applied while predicting or scoring.
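
Conceptually, a threshold search of this kind evaluates a score at evenly spaced threshold values and keeps the best one. The sketch below shows that idea in plain Python. The helper find_best_threshold, the toy predictions and the F-score function are hypothetical and stand in for a real pipeline and evaluation data; this is not the CLI's implementation.

```python
def find_best_threshold(score_fn, n_trials=20):
    """Evaluate score_fn at n_trials evenly spaced thresholds in [0.0, 1.0]."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    thresholds = [i / (n_trials - 1) for i in range(n_trials)]
    scores = {t: score_fn(t) for t in thresholds}
    # On ties, max() keeps the first (lowest) threshold with the top score.
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical predictions: (confidence, whether the prediction is correct).
predictions = [
    (0.95, True), (0.80, True), (0.65, False),
    (0.55, True), (0.30, False), (0.10, False),
]


def f_score(threshold):
    """F1 when every prediction at or above the threshold is accepted."""
    tp = sum(1 for conf, gold in predictions if conf >= threshold and gold)
    fp = sum(1 for conf, gold in predictions if conf >= threshold and not gold)
    fn = sum(1 for conf, gold in predictions if conf < threshold and gold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


best, scores = find_best_threshold(f_score, n_trials=20)
print(f"Best threshold: {best:.3f} (score {scores[best]:.3f})")
```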
Fuzzy matching
New FUZZY operators support fuzzy matching with the Matcher. By default, the
FUZZY operator allows a Levenshtein edit distance of at least 2 and up to 30%
of the pattern string length. FUZZY1..FUZZY9 can be used to specify the exact
number of allowed edits.
Note that FUZZY uses Levenshtein edit distance rather than Damerau-Levenshtein
edit distance, so a transposition like teh for the counts as two edits, one
insertion and one deletion.
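
To make the transposition point concrete, here is a small standard-library sketch of plain Levenshtein distance. The function names levenshtein and within_edits are illustrative and this is not spaCy's internal implementation; it only shows why teh and the are two edits apart, so a budget of one edit would not cover them.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions and substitutions only."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,                      # delete char_a
                    curr[j - 1] + 1,                  # insert char_b
                    prev[j - 1] + (char_a != char_b),  # substitute or keep
                )
            )
        prev = curr
    return prev[-1]


def within_edits(token: str, pattern: str, max_edits: int) -> bool:
    """True if token is at most max_edits Levenshtein edits away from pattern."""
    return levenshtein(token, pattern) <= max_edits


print(levenshtein("teh", "the"))
print(within_edits("teh", "the", 1), within_edits("teh", "the", 2))
```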
If you’d prefer an alternate fuzzy matching algorithm, you can provide your own
custom method to the Matcher or as a config option for an entity ruler and
span ruler.
FUZZY and REGEX with lists
The FUZZY and REGEX operators are also now supported for lists with IN and
NOT_IN.
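
As a minimal sketch, the token patterns below show what list-valued FUZZY and REGEX operators can look like. The variable names, word list and regular expressions are made-up examples, and the block only builds the pattern data without importing spaCy.

```python
# Fuzzy-match the token text against any of several candidate strings.
fuzzy_in_pattern = [{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]

# Match tokens whose text matches none of the listed regular expressions.
regex_not_in_pattern = [{"TEXT": {"REGEX": {"NOT_IN": ["^awe(some)?$", "^wonder(ful)?"]}}}]

for pattern in (fuzzy_in_pattern, regex_not_in_pattern):
    print(pattern)
```

Each pattern can then be registered on a Matcher in the usual way, for instance with matcher.add("POSITIVE", [fuzzy_in_pattern]), where the key "POSITIVE" is a hypothetical name.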
Entity linking generalization
The knowledge base used for entity linking is now easier to customize and has a
new default implementation InMemoryLookupKB.
Additional features and improvements
- Language updates:
  - Extended support for Slovenian
  - Fixed lookup fallback for French and Catalan lemmatizers
  - Switch Russian and Ukrainian lemmatizers to pymorphy3
  - Support for editorial punctuation in Ancient Greek
  - Update to Russian tokenizer exceptions
  - Small fix for Dutch stop words
- Allow up to typer v0.7.x, mypy 0.990 and typing_extensions v4.4.x.
- New spacy.ConsoleLogger.v3 with expanded progress tracking.
- Improved scoring behavior for textcat with spacy.textcat_scorer.v2 and spacy.textcat_multilabel_scorer.v2.
- Updates so that downstream components can train properly on a frozen tok2vec or transformer layer.
- Allow interpolation of variables in directory names in projects.
- Support for local file system remotes for projects.
- Improve UX around displacy.serve when the default port is in use.
- Optional before_update callback that is invoked at the start of each training step.
- Improve performance of SpanGroup and fix typing issues for SpanGroup and Span objects.
- Patch a security vulnerability in extracting tar files.
- Add equality definition for Vectors.
- Ensure Vocab.to_disk respects the exclude setting for lookups and vectors.
- Correctly handle missing annotations in the edit tree lemmatizer.
Trained pipeline updates
- The CNN pipelines add IS_SPACE as a tok2vec feature for tagger and morphologizer components to improve tagging of non-whitespace vs. whitespace tokens.
- The transformer pipelines require spacy-transformers v1.2, which uses the exact alignment from tokenizers for fast tokenizers instead of the heuristic alignment from spacy-alignments. For all trained pipelines except ja_core_news_trf, the alignments between spaCy tokens and transformer tokens may be slightly different. More details about the spacy-transformers changes are available in the v1.2.0 release notes.
Notes about upgrading from v3.4
Validation of textcat values
An error is now raised when unsupported values are given as input to train a
textcat or textcat_multilabel model. Ensure that values are 0.0 or 1.0, as
explained in the docs.
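
If training data is assembled by hand, a quick pre-check can catch unsupported values before training starts. The sketch below is a hypothetical helper (check_cats is not part of spaCy) that assumes categories are stored as a dict mapping label to value, and rejects anything other than 0.0 or 1.0.

```python
def check_cats(cats):
    """Raise ValueError unless every category value is exactly 0.0 or 1.0."""
    bad = {label: value for label, value in cats.items() if value not in (0.0, 1.0)}
    if bad:
        raise ValueError(f"Unsupported category values (expected 0.0 or 1.0): {bad}")
    return cats


# Hypothetical annotations for two documents.
check_cats({"POSITIVE": 1.0, "NEGATIVE": 0.0})  # passes silently

try:
    check_cats({"POSITIVE": 0.7, "NEGATIVE": 0.3})  # soft scores are rejected
except ValueError as err:
    print(err)
```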
Using the default knowledge base
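As KnowledgeBase is now an abstract class, you should call the constructor of
the new InMemoryLookupKB instead when you want to use spaCy’s default KB
implementation, for example
InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=64) in place of
KnowledgeBase(vocab=nlp.vocab, entity_vector_length=64).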
If you’ve written a custom KB that inherits from KnowledgeBase, you’ll need to
implement its abstract methods, or alternatively inherit from InMemoryLookupKB
instead.
Updated scorers for tokenization and textcat
We fixed a bug that inflated the token_acc scores in v3.0-v3.4. The reported
token_acc will drop from v3.4 to v3.5, but if token_p/r/f stay the same,
your tokenization performance has not changed from v3.4.
For new textcat or textcat_multilabel configs, the new default v2 scorers:
- ignore threshold for textcat, so the reported cats_p/r/f may increase slightly in v3.5 even though the underlying predictions are unchanged
- report the performance of only the final textcat or textcat_multilabel component in the pipeline by default
- allow custom scorers to be used to score multiple textcat and textcat_multilabel components with Scorer.score_cats by restricting the evaluation to the component’s provided labels
Pipeline package version compatibility
When you’re loading a pipeline package trained with an earlier version of spaCy v3, you will see a warning telling you that the pipeline may be incompatible. This doesn’t necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results.
If you’re using one of the trained pipelines we provide, you should run
spacy download to update to the latest version. To see an overview of all
installed packages and their compatibility, you can run spacy validate.
If you’ve trained your own custom pipeline and you’ve confirmed that it’s still
working as expected, you can update the spaCy version requirements in the
meta.json.
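
The version requirement lives in the "spacy_version" field of meta.json and can be edited by hand. For several pipelines, a small script may be more convenient. The sketch below is one way to do it with the standard library; the helper name set_spacy_version, the demo file and the version range shown are illustrative assumptions, so use the range you have actually tested.

```python
import json
import tempfile
from pathlib import Path


def set_spacy_version(meta_path, version_range):
    """Rewrite the "spacy_version" field of a meta.json file in place."""
    path = Path(meta_path)
    meta = json.loads(path.read_text(encoding="utf8"))
    meta["spacy_version"] = version_range
    path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return meta


# Demo on a throwaway file; a real meta.json sits in the pipeline directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "meta.json"
    demo_meta = {"name": "my_pipeline", "spacy_version": ">=3.4.0,<3.5.0"}
    demo_path.write_text(json.dumps(demo_meta), encoding="utf8")
    updated = set_spacy_version(demo_path, ">=3.5.0,<3.6.0")
    print(updated["spacy_version"])
```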
Updating v3.4 configs
To update a config from spaCy v3.4 with the new v3.5 settings, run
init fill-config.
In many cases (spacy train, spacy.load), the new defaults will be filled in
automatically, but you’ll need to fill in the new settings to run debug config
and debug data.