What's New in v3.3
spaCy v3.3 improves the speed of core pipeline components, adds a new trainable lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.
Speed improvements
v3.3 includes a slew of speed improvements:
- Speed up parser and NER by using constant-time head lookups.
- Support unnormalized softmax probabilities in spacy.Tagger.v2 to speed up inference for tagger, morphologizer, senter and trainable lemmatizer.
- Speed up parser projectivization functions.
- Replace Ragged with faster AlignmentArray in Example for training.
- Improve Matcher speed.
- Improve serialization speed for empty Doc.spans.
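The unnormalized softmax item rests on a simple property: normalizing scores into probabilities does not change which class scores highest, so a component that only needs the best label can skip the normalization at inference time. The sketch below illustrates that property in plain Python. It is not spaCy's or Thinc's implementation, and the function names and scores are made up for the example.

```python
import math


def softmax(scores):
    """Normalized softmax probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical raw scores for three tags on a single token.
raw_scores = [1.2, 3.4, -0.5]
probs = softmax(raw_scores)

# The best-scoring tag is the same with or without normalization.
best_raw = argmax(raw_scores)
best_normalized = argmax(probs)
```

Because exp is monotonic and the normalizer is a shared positive constant, the ranking of the classes is preserved, which is why skipping the normalization step is safe when only the top label is needed.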
For longer texts, the trained pipeline speeds improve 15% or more in prediction. We benchmarked en_core_web_md (same components as in v3.2) and de_core_news_md (with the new trainable lemmatizer) across a range of text sizes on Linux (Intel Xeon W-2265) and OS X (M1) to compare spaCy v3.2 vs. v3.3:
Intel Xeon W-2265
Model | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
---|---|---|---|---|
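en_core_web_md (same components) | 100 | 17292 | 17441 | 0.86% |
en_core_web_md (same components) | 1000 | 15408 | 16024 | 4.00% |
en_core_web_md (same components) | 10000 | 12798 | 15346 | 19.91% |
de_core_news_md (+v3.3 trainable lemmatizer) | 100 | 20221 | 19321 | -4.45% |
de_core_news_md (+v3.3 trainable lemmatizer) | 1000 | 17480 | 17345 | -0.77% |
de_core_news_md (+v3.3 trainable lemmatizer) | 10000 | 14513 | 17036 | 17.38% |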
Apple M1
Model | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
---|---|---|---|---|
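en_core_web_md (same components) | 100 | 18272 | 18408 | 0.74% |
en_core_web_md (same components) | 1000 | 18794 | 19248 | 2.42% |
en_core_web_md (same components) | 10000 | 15144 | 17513 | 15.64% |
de_core_news_md (+v3.3 trainable lemmatizer) | 100 | 19227 | 19591 | 1.89% |
de_core_news_md (+v3.3 trainable lemmatizer) | 1000 | 20047 | 20628 | 2.90% |
de_core_news_md (+v3.3 trainable lemmatizer) | 10000 | 15921 | 18546 | 16.49% |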
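The Diff column in both tables is consistent with the relative change in words per second from v3.2 to v3.3. The short sketch below recomputes three of the Intel Xeon values under that assumption; the helper name is made up for the example.

```python
def relative_diff(old_wps, new_wps):
    """Relative change in words per second, as a percentage."""
    return (new_wps - old_wps) / old_wps * 100


# (model, avg. words/doc, v3.2 words/sec, v3.3 words/sec) from the Intel Xeon table.
rows = [
    ("en_core_web_md", 100, 17292, 17441),
    ("en_core_web_md", 10000, 12798, 15346),
    ("de_core_news_md", 100, 20221, 19321),
]

diffs = [round(relative_diff(old, new), 2) for _, _, old, new in rows]
for (model, size, _, _), diff in zip(rows, diffs):
    print(f"{model} {size}: {diff:+.2f}%")
```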
Trainable lemmatizer
The new trainable lemmatizer component uses edit trees to transform tokens into lemmas. Try out the trainable lemmatizer with the training quickstart!
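To give a rough feel for how an edit tree can turn a token into a lemma, here is a simplified, self-contained sketch. It is not spaCy's implementation, and every name in it is hypothetical. It assumes a tree is built by splitting a form/lemma pair around their longest common substring and recursing on what is left on either side, with plain string replacements at the leaves.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Replace:
    """Leaf node: replace exactly `old` with `new`."""
    old: str
    new: str


@dataclass(frozen=True)
class Match:
    """Interior node: keep the middle, edit the prefix and suffix."""
    prefix_len: int
    suffix_len: int
    left: Optional["Node"]
    right: Optional["Node"]


Node = Union[Replace, Match]


def longest_common_substring(a, b):
    """Return (start_in_a, start_in_b, length) of the first longest match."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def build_tree(form, lemma):
    """Build an edit tree that rewrites `form` into `lemma`."""
    if not form and not lemma:
        return None  # nothing to edit on this side
    start_f, start_l, length = longest_common_substring(form, lemma)
    if length == 0:
        return Replace(form, lemma)
    end_f = start_f + length
    return Match(
        prefix_len=start_f,
        suffix_len=len(form) - end_f,
        left=build_tree(form[:start_f], lemma[:start_l]),
        right=build_tree(form[end_f:], lemma[start_l + length:]),
    )


def apply_tree(node, form):
    """Apply a tree to a form; return None if the tree does not fit."""
    if node is None:
        return "" if form == "" else None
    if isinstance(node, Replace):
        return node.new if form == node.old else None
    if len(form) < node.prefix_len + node.suffix_len:
        return None
    split = len(form) - node.suffix_len
    left = apply_tree(node.left, form[:node.prefix_len])
    right = apply_tree(node.right, form[split:])
    if left is None or right is None:
        return None
    return left + form[node.prefix_len:split] + right


# A tree learned from one pair generalizes to forms with the same pattern.
tree = build_tree("walked", "walk")
lemma = apply_tree(tree, "jumped")
no_fit = apply_tree(tree, "running")
```

In this sketch the tree learned from "walked" and "walk" strips a trailing "ed" from any sufficiently long form, and reports that it does not apply to "running". A trainable component would then predict which tree to apply to each token.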
displaCy support for overlapping spans and arcs
displaCy now supports overlapping spans with a new span style and multiple arcs with different labels between the same tokens for dep visualizations.
Overlapping spans can be visualized for any spans key in doc.spans.
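A minimal sketch of the span style, assuming a blank English pipeline and a spans key named custom (the key name, text and labels are arbitrary choices for the example):

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")

# Two overlapping spans stored under the arbitrary key "custom".
doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]

# Render to an HTML string; jupyter=False skips notebook auto-detection.
html = displacy.render(
    doc, style="span", options={"spans_key": "custom"}, jupyter=False
)
```

The returned markup can be written to a file, or displacy.serve can be called with the same arguments to view it in a browser.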
Additional features and improvements
- Config comparisons with spacy debug diff-config.
- Span suggester debugging with SpanCategorizer.set_candidates.
- Big endian support with thinc-bigendian-ops and updates to make floret, murmurhash, Thinc and spaCy endian neutral.
- Initial support for Lower Sorbian and Upper Sorbian.
- Language updates for English, French, Italian, Japanese, Korean, Norwegian, Russian, Slovenian, Spanish, Turkish, Ukrainian and Vietnamese.
- New noun chunks for Finnish.
Trained pipelines
New trained pipelines
v3.3 introduces new CPU/CNN pipelines for Finnish, Korean and Swedish, which use the new trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.
Package | Language | UPOS | Parser LAS | NER F |
---|---|---|---|---|
fi_core_news_sm | Finnish | 92.5 | 71.9 | 75.9 |
fi_core_news_md | Finnish | 95.9 | 78.6 | 80.6 |
fi_core_news_lg | Finnish | 96.2 | 79.4 | 82.4 |
ko_core_news_sm | Korean | 86.1 | 65.6 | 71.3 |
ko_core_news_md | Korean | 94.7 | 80.9 | 83.1 |
ko_core_news_lg | Korean | 94.7 | 81.3 | 85.3 |
sv_core_news_sm | Swedish | 95.0 | 75.9 | 74.7 |
sv_core_news_md | Swedish | 96.3 | 78.5 | 79.3 |
sv_core_news_lg | Swedish | 96.3 | 79.1 | 81.1 |
Pipeline updates
The following languages switch from lookup or rule-based lemmatizers to the new trainable lemmatizer: Danish, Dutch, German, Greek, Italian, Lithuanian, Norwegian, Polish, Portuguese and Romanian. The overall lemmatizer accuracy improves for all of these pipelines, but be aware that the types of errors may look quite different from the lookup-based lemmatizers. If you’d prefer to continue using the previous lemmatizer, you can switch from the trainable lemmatizer to a non-trainable lemmatizer.
Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
---|---|---|
da_core_news_md | 84.9 | 94.8 |
de_core_news_md | 73.4 | 97.7 |
el_core_news_md | 56.5 | 88.9 |
fi_core_news_md | - | 86.2 |
it_core_news_md | 86.6 | 97.2 |
ko_core_news_md | - | 90.0 |
lt_core_news_md | 71.1 | 84.8 |
nb_core_news_md | 76.7 | 97.1 |
nl_core_news_md | 81.5 | 94.0 |
pl_core_news_md | 87.1 | 93.7 |
pt_core_news_md | 76.7 | 96.9 |
ro_core_news_md | 81.8 | 95.5 |
sv_core_news_md | - | 95.5 |
In addition, the vectors in the English pipelines are deduplicated to improve the pruned vectors in the md models and reduce the lg model size.
Notes about upgrading from v3.2
Span comparisons
Span comparisons involving ordering (<, <=, >, >=) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order.
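The effect on sorting can be pictured with a plain-Python stand-in. The SpanKey class below is hypothetical and compares labels as strings, which is an assumption of this sketch; how spaCy represents label and KB ID values when comparing spans is not covered here. If your code relies on a particular order, an explicit sort key makes that order independent of the comparison rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class SpanKey:
    """Hypothetical stand-in for a span; field order defines comparison order."""
    start: int
    end: int
    label: str
    kb_id: str = ""


a = SpanKey(0, 2, "ORG")
b = SpanKey(0, 2, "PERSON")
c = SpanKey(0, 1, "ORG")

# With all four attributes compared, a and b no longer tie on position alone.
ordered = sorted([b, a, c])

# An explicit key sorts on positions only; ties keep their input order.
by_position = sorted([b, a, c], key=lambda s: (s.start, s.end))
```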
Whitespace annotation
During training, annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens in order to allow custom whitespace annotation.
Doc.from_docs
Doc.from_docs now includes Doc.tensor by default and supports excludes with an exclude argument in the same format as Doc.to_bytes. The supported exclude fields are spans, tensor and user_data.
Docs including Doc.tensor may be quite a bit larger in RAM, so to exclude Doc.tensor as in v3.2, pass exclude=["tensor"] to Doc.from_docs.
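A short sketch of both behaviours, assuming a blank English pipeline. A blank pipeline does not set Doc.tensor, so the example attaches placeholder tensors with one row per token purely for illustration:

```python
import numpy
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
docs = [nlp("This is one doc."), nlp("And this is another.")]

# Placeholder tensors (one row per token), only so there is something to merge.
for doc in docs:
    doc.tensor = numpy.zeros((len(doc), 4), dtype="float32")

# v3.3 default: the merged doc carries the stacked tensors.
merged_default = Doc.from_docs(docs)

# v3.2-style behaviour: leave the tensor out of the merged doc.
merged_without_tensor = Doc.from_docs(docs, exclude=["tensor"])
```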
Using trained pipelines with floret vectors
If you’re running a new trained pipeline for Finnish, Korean or Swedish on new texts and working with Doc objects, you shouldn’t notice any difference with floret vectors vs. default vectors.
If you use vectors for similarity comparisons, there are a few differences, mainly because a floret pipeline doesn’t include any kind of frequency-based word list similar to the list of in-vocabulary vector keys with default vectors.
- If your workflow iterates over the vector keys, you should use an external word list instead.
- Vectors.most_similar is not supported because there’s no fixed list of vectors to compare your vectors to.
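The first point can be sketched as follows. A blank English pipeline stands in for a floret pipeline so the example runs without downloading a package, and external_word_list is a hypothetical list that would normally come from your own frequency list or lexicon:

```python
import spacy

# Stand-in for a floret pipeline such as one of the new Finnish, Korean or
# Swedish packages; a blank pipeline keeps the sketch runnable.
nlp = spacy.blank("en")

# Hypothetical external word list.
external_word_list = ["apple", "banana", "cherry"]

# Instead of iterating over the keys in nlp.vocab.vectors, look up each word.
lexemes = [nlp.vocab[word] for word in external_word_list]
```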
Pipeline package version compatibility
When you’re loading a pipeline package trained with an earlier version of spaCy v3, you will see a warning telling you that the pipeline may be incompatible. This doesn’t necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results.
If you’re using one of the trained pipelines we provide, you should run spacy download to update to the latest version. To see an overview of all installed packages and their compatibility, you can run spacy validate.
If you’ve trained your own custom pipeline and you’ve confirmed that it’s still working as expected, you can update the spaCy version requirements in the meta.json, for example by changing spacy_version from ">=3.2.0,<3.3.0" to ">=3.3.0,<3.4.0".
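If you prefer to script that change, a minimal sketch could look like the following. The helper name and the temporary meta.json are made up for the example, and the version range assumes you are moving a pipeline to the v3.3 series:

```python
import json
import tempfile
from pathlib import Path


def update_spacy_version(meta_path, new_range=">=3.3.0,<3.4.0"):
    """Rewrite the spacy_version requirement in a pipeline's meta.json."""
    path = Path(meta_path)
    meta = json.loads(path.read_text(encoding="utf8"))
    meta["spacy_version"] = new_range
    path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return meta


# Demonstration on a throwaway meta.json with only two fields.
with tempfile.TemporaryDirectory() as tmp:
    meta_file = Path(tmp) / "meta.json"
    meta_file.write_text(
        json.dumps({"name": "my_pipeline", "spacy_version": ">=3.2.0,<3.3.0"}),
        encoding="utf8",
    )
    updated = update_spacy_version(meta_file)
    on_disk = json.loads(meta_file.read_text(encoding="utf8"))
```

Only do this after checking the pipeline against your own test suite or evaluation data, since the new range tells spaCy the package is compatible.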
Updating v3.2 configs
To update a config from spaCy v3.2 with the new v3.3 settings, run init fill-config, for example python -m spacy init fill-config config-v3.2.cfg config-v3.3.cfg.
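When the update is part of a Python script rather than a shell session, the same command can be assembled and run with subprocess. This is a sketch only: the helper name and the two config file names are placeholders, and the call itself is left commented out because it needs spaCy and an existing input config.

```python
import sys


def fill_config_command(old_config, new_config):
    """Build the argument list for spaCy's init fill-config command."""
    return [
        sys.executable, "-m", "spacy", "init", "fill-config",
        old_config, new_config,
    ]


cmd = fill_config_command("config-v3.2.cfg", "config-v3.3.cfg")

# To run it for real (requires spaCy and the input file):
# import subprocess
# subprocess.run(cmd, check=True)
```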
In many cases (spacy train, spacy.load), the new defaults will be filled in automatically, but you’ll need to fill in the new settings to run debug config and debug data.
To see the speed improvements for the Tagger architecture, edit your config to switch from spacy.Tagger.v1 to spacy.Tagger.v2 and then run init fill-config.
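The architecture switch is a one-line change in the config. The sketch below applies it to an abbreviated, hypothetical excerpt of a tagger block; a real config has more settings, and running init fill-config afterwards fills in whatever the new architecture adds.

```python
def switch_tagger_architecture(config_text):
    """Point every spacy.Tagger.v1 reference at spacy.Tagger.v2."""
    return config_text.replace('"spacy.Tagger.v1"', '"spacy.Tagger.v2"')


# Abbreviated, hypothetical excerpt of a v3.2 config.
old_config = """[components.tagger.model]
@architectures = "spacy.Tagger.v1"
nO = null
"""

new_config = switch_tagger_architecture(old_config)
```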