What's New in v3.5
spaCy v3.5 introduces three new CLI commands, apply, benchmark and
find-threshold, adds fuzzy matching, provides improvements to our entity
linking functionality, and includes a range of language updates and bug fixes.
New CLI commands
apply CLI
The apply CLI can be used to apply a pipeline to one or more
.txt, .jsonl or .spacy input files, saving the annotated docs in a single
.spacy file.
benchmark CLI
The benchmark CLI has been added to extend the existing
evaluate functionality with a wider range of profiling subcommands.
The benchmark accuracy CLI is introduced as an alias for evaluate. The new
benchmark speed CLI performs warmup rounds before measuring the speed in words
per second on batches of randomly shuffled documents from the provided data.
The output is the mean performance using batches (nlp.pipe) with a 95%
confidence interval, for example when profiling en_core_web_sm on CPU.
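As a rough illustration of how a words-per-second figure with a 95% confidence interval can be derived, the sketch below times batches after a few warmup rounds and summarizes the rates. It is not spaCy's implementation: the helper names, whitespace-based word counting and the normal-approximation interval are all assumptions made for this example.

```python
import math
import statistics
import time
from typing import Callable, List, Sequence, Tuple


def mean_with_ci95(samples: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    mean = statistics.mean(samples)
    # Standard error of the mean; 1.96 is the two-sided 95% z-value.
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, mean - half_width, mean + half_width


def words_per_second(
    process_batch: Callable[[List[str]], object],
    batches: Sequence[List[str]],
    warmup_rounds: int = 3,
) -> List[float]:
    """Measure one words-per-second rate per batch after untimed warmup passes."""
    for _ in range(warmup_rounds):
        for batch in batches:
            process_batch(batch)
    rates = []
    for batch in batches:
        # Assumption: words are approximated by whitespace splitting.
        n_words = sum(len(text.split()) for text in batch)
        start = time.perf_counter()
        process_batch(batch)
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
        rates.append(n_words / elapsed)
    return rates
```

With a loaded pipeline, `process_batch` could be something like `lambda batch: list(nlp.pipe(batch))`, and the resulting rates can be passed to `mean_with_ci95`.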
find-threshold CLI
The find-threshold CLI runs a series of trials
across threshold values from 0.0 to 1.0 and identifies the best threshold
for the provided score metric.
For example, find-threshold can run 20 trials for the spancat component in
my_pipeline, recording the spans_sc_f score for each value of the threshold
[components.spancat.threshold] from 0.0 to 1.0.
The find-threshold CLI can be used with textcat_multilabel, spancat and
custom components with thresholds that are applied while predicting or scoring.
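The idea behind such a threshold search can be sketched in plain Python. This is a conceptual illustration only, not the CLI's code: `f_score`, `find_best_threshold` and the toy scores below are hypothetical, and the metric is a simple F1 over binary decisions.

```python
from typing import Callable, Sequence, Tuple


def f_score(scores: Sequence[float], gold: Sequence[bool], threshold: float) -> float:
    """F1 of the predictions obtained by keeping scores >= threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def find_best_threshold(
    metric: Callable[[float], float], n_trials: int = 11
) -> Tuple[float, float]:
    """Evaluate evenly spaced thresholds in [0.0, 1.0] and return the best pair."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    thresholds = [i / (n_trials - 1) for i in range(n_trials)]
    results = [(t, metric(t)) for t in thresholds]
    # On ties, max() keeps the first (lowest) threshold with the top score.
    return max(results, key=lambda result: result[1])


# Toy data: model scores and whether each candidate is correct.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
gold = [True, True, True, False, False]
best_threshold, best_f = find_best_threshold(lambda t: f_score(scores, gold, t))
```

On this toy data the search settles on the lowest threshold that separates the correct candidates from the incorrect ones.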
Fuzzy matching
New FUZZY operators support fuzzy matching
with the Matcher. By default, the FUZZY operator allows a Levenshtein edit
distance of at least 2 and up to 30% of the pattern string length. FUZZY1..FUZZY9
can be used to set the maximum number of allowed edits explicitly.
Note that FUZZY uses Levenshtein edit distance rather than Damerau-Levenshtein
edit distance, so a transposition like teh for the counts as two edits, one
insertion and one deletion.
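To make the edit counting concrete, here is a minimal pure-Python Levenshtein distance. It is an illustration rather than spaCy's own implementation, and `within_default_budget` encodes one reading of the default described above (at least 2 edits, or 30% of the pattern length if that is larger), which is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def within_default_budget(text: str, pattern: str) -> bool:
    """Assumed default budget: max(2, 30% of the pattern length, rounded)."""
    max_edits = max(2, round(0.3 * len(pattern)))
    return levenshtein(text, pattern) <= max_edits


# A transposition costs two edits under plain Levenshtein distance.
transposition_cost = levenshtein("teh", "the")
```

Because a transposition costs two edits, a budget of at least 2 is what lets a swapped pair of letters still match.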
If you’d prefer an alternate fuzzy matching algorithm, you can provide your own
custom method to the Matcher or as a config option for an entity ruler and
span ruler.
FUZZY and REGEX with lists
The FUZZY and REGEX operators are also now supported for lists with IN and
NOT_IN:
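A sketch of what such patterns can look like is shown below. The word lists and regular expressions are illustrative, and `build_matcher` is a hypothetical helper; spaCy is imported inside the function so the pattern definitions can be inspected without it installed.

```python
# Token patterns combining FUZZY and REGEX with IN / NOT_IN lists.
fuzzy_in_pattern = [{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]
regex_not_in_pattern = [
    {"TEXT": {"REGEX": {"NOT_IN": ["^awe(some)?$", "^wonder(ful)?"]}}}
]


def build_matcher(vocab):
    """Register both patterns on a new Matcher (requires spaCy v3.5 or later)."""
    from spacy.matcher import Matcher

    matcher = Matcher(vocab)
    matcher.add("FUZZY_IN", [fuzzy_in_pattern])
    matcher.add("REGEX_NOT_IN", [regex_not_in_pattern])
    return matcher
```

Given a pipeline `nlp`, `build_matcher(nlp.vocab)(nlp("some text"))` would return the matches for both rules.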
Entity linking generalization
The knowledge base used for entity linking is now easier to customize and has a
new default implementation InMemoryLookupKB.
Additional features and improvements
- Language updates:
  - Extended support for Slovenian
  - Fixed lookup fallback for French and Catalan lemmatizers
  - Switch Russian and Ukrainian lemmatizers to pymorphy3
  - Support for editorial punctuation in Ancient Greek
  - Update to Russian tokenizer exceptions
  - Small fix for Dutch stop words
- Allow up to typer v0.7.x, mypy 0.990 and typing_extensions v4.4.x.
- New spacy.ConsoleLogger.v3 with expanded progress tracking.
- Improved scoring behavior for textcat with spacy.textcat_scorer.v2 and spacy.textcat_multilabel_scorer.v2.
- Updates so that downstream components can train properly on a frozen tok2vec or transformer layer.
- Allow interpolation of variables in directory names in projects.
- Support for local file system remotes for projects.
- Improve UX around displacy.serve when the default port is in use.
- Optional before_update callback that is invoked at the start of each training step.
- Improve performance of SpanGroup and fix typing issues for SpanGroup and Span objects.
- Patch a security vulnerability in extracting tar files.
- Add equality definition for Vectors.
- Ensure Vocab.to_disk respects the exclude setting for lookups and vectors.
- Correctly handle missing annotations in the edit tree lemmatizer.
Trained pipeline updates
- The CNN pipelines add IS_SPACE as a tok2vec feature for tagger and morphologizer components to improve tagging of non-whitespace vs. whitespace tokens.
- The transformer pipelines require spacy-transformers v1.2, which uses the exact alignment from tokenizers for fast tokenizers instead of the heuristic alignment from spacy-alignments. For all trained pipelines except ja_core_news_trf, the alignments between spaCy tokens and transformer tokens may be slightly different. More details about the spacy-transformers changes can be found in the v1.2.0 release notes.
Notes about upgrading from v3.4
Validation of textcat values
An error is now raised when unsupported values are given as input to train a
textcat or textcat_multilabel model. Ensure that values are 0.0 or 1.0,
as explained in the docs.
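Before upgrading, it may help to scan training data for values that would now be rejected. The check below is a hypothetical pre-flight helper, not part of spaCy, and it assumes annotations are available as plain label-to-value dictionaries.

```python
from typing import Dict, Iterable, List, Tuple


def invalid_cats(cats: Dict[str, float]) -> List[str]:
    """Return the labels whose value is neither 0.0 nor 1.0."""
    return [label for label, value in cats.items() if value not in (0.0, 1.0)]


def find_invalid_examples(
    examples: Iterable[Tuple[str, Dict[str, float]]]
) -> List[Tuple[str, List[str]]]:
    """Collect (text, offending labels) for every example with unsupported values."""
    problems = []
    for text, cats in examples:
        bad_labels = invalid_cats(cats)
        if bad_labels:
            problems.append((text, bad_labels))
    return problems


# Hypothetical training examples: the second uses soft labels.
examples = [
    ("Great product", {"POSITIVE": 1.0, "NEGATIVE": 0.0}),
    ("It was okay", {"POSITIVE": 0.7, "NEGATIVE": 0.3}),
]
problems = find_invalid_examples(examples)
```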
Using the default knowledge base
As KnowledgeBase is now an abstract class, you should call the constructor of
the new InMemoryLookupKB instead when you want to use spaCy’s default KB
implementation:
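A minimal sketch of the change, wrapped in a hypothetical helper: the v3.4-style call is shown in a comment, and the vector length of 64 is only an example value. spaCy is imported inside the function, so calling it requires spaCy v3.5 or later.

```python
def make_default_kb(vocab, entity_vector_length: int = 64):
    """Create spaCy's default in-memory knowledge base."""
    # v3.4 and earlier:
    #   from spacy.kb import KnowledgeBase
    #   kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
    from spacy.kb import InMemoryLookupKB

    return InMemoryLookupKB(vocab=vocab, entity_vector_length=entity_vector_length)
```

For instance, `make_default_kb(spacy.blank("en").vocab)` would build an empty KB for a blank English pipeline.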
If you’ve written a custom KB that inherits from KnowledgeBase, you’ll need to
implement its abstract methods, or alternatively inherit from InMemoryLookupKB
instead.
Updated scorers for tokenization and textcat
We fixed a bug that inflated the token_acc scores in v3.0-v3.4. The reported
token_acc will drop from v3.4 to v3.5, but if token_p/r/f stay the same,
your tokenization performance has not changed from v3.4.
For new textcat or textcat_multilabel configs, the new default v2 scorers:
- ignore threshold for textcat, so the reported cats_p/r/f may increase slightly in v3.5 even though the underlying predictions are unchanged
- report the performance of only the final textcat or textcat_multilabel component in the pipeline by default
- allow custom scorers to be used to score multiple textcat and textcat_multilabel components with Scorer.score_cats by restricting the evaluation to the component’s provided labels
Pipeline package version compatibility
When you’re loading a pipeline package trained with an earlier version of spaCy v3, you will see a warning telling you that the pipeline may be incompatible. This doesn’t necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results.
If you’re using one of the trained pipelines we provide, you should
run spacy download to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
spacy validate.
If you’ve trained your own custom pipeline and you’ve confirmed that it’s still
working as expected, you can update the spaCy version requirements in the
meta.json:
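One way to script that edit is sketched below. The helper names are hypothetical, and the requirement string in the usage note is an assumed example of a v3.5 range; only the spacy_version key of meta.json is touched.

```python
import json
from pathlib import Path
from typing import Union


def with_spacy_version(meta: dict, requirement: str) -> dict:
    """Return a copy of the meta dict with an updated spacy_version requirement."""
    updated = dict(meta)
    updated["spacy_version"] = requirement
    return updated


def update_meta_file(path: Union[str, Path], requirement: str) -> None:
    """Rewrite a meta.json file in place with the new requirement."""
    path = Path(path)
    meta = json.loads(path.read_text(encoding="utf8"))
    updated = with_spacy_version(meta, requirement)
    path.write_text(json.dumps(updated, indent=2), encoding="utf8")
```

For example, `update_meta_file("my_pipeline/meta.json", ">=3.5.0,<3.6.0")` would replace a v3.4 range with a v3.5 one, assuming that path and range fit your package.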
Updating v3.4 configs
To update a config from spaCy v3.4 with the new v3.5 settings, run
init fill-config:
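If you want to drive this step from a Python script, a thin wrapper around the CLI is one option. The function names and the config file names in the usage note are placeholders; the command passes the existing config as the base path, followed by the output path.

```python
import subprocess
import sys
from typing import List


def fill_config_command(base_path: str, output_path: str) -> List[str]:
    """Build the `spacy init fill-config` invocation for the current interpreter."""
    return [
        sys.executable, "-m", "spacy", "init", "fill-config",
        base_path, output_path,
    ]


def fill_config(base_path: str, output_path: str) -> None:
    """Run the command and raise CalledProcessError if it fails."""
    subprocess.run(fill_config_command(base_path, output_path), check=True)
```

Calling `fill_config("config-v3.4.cfg", "config-v3.5.cfg")` would write the filled config to the second path.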
In many cases (spacy train,
spacy.load), the new defaults will be filled in
automatically, but you’ll need to fill in the new settings to run
debug config and debug data.