
What's New in v3.4

New features and how to upgrade

spaCy v3.4 brings typing and speed improvements, new word vectors for the English CNN pipelines, and new trained pipelines for Croatian. This release also includes prebuilt Linux aarch64 wheels for all spaCy dependencies distributed by Explosion.

Typing improvements

spaCy v3.4 supports pydantic v1.9 and mypy 0.950+ through extensive updates to types in Thinc v8.1.

Speed improvements

  • The parser now uses the C saxpy/sgemm routines provided by the Ops implementation, which allows it to use Accelerate through thinc-apple-ops.
  • Improved speed of vector lookups.
  • Improved speed for Example.get_aligned_parse and Example.get_aligned.

Additional features and improvements

  • Min/max {n,m} operator for Matcher patterns.
  • Language updates:
    • Improved tokenization for Cyrillic combining diacritics.
    • Improved English tokenizer exceptions for contractions with this/that/these/those.
  • Updated spacy project clone to try both main and master branches by default.
  • Added confidence threshold for named entity linker.
  • Improved handling of Typer optional default values for init_config_cli.
  • Added cycle detection in parser projectivization methods.
  • Added counts for NER labels in debug data.
  • Support for adding NVTX ranges to TrainablePipe components.
  • Support env variable SPACY_NUM_BUILD_JOBS to specify the number of build jobs to run in parallel with pip.
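
As an illustration of the min/max operator listed above, the sketch below defines a token pattern that requires a token to repeat between two and three times. The pattern contents, the VERY_GOOD key and the find_matches helper are hypothetical names chosen for this example; only the "{n,m}" value for OP comes from the feature itself. spaCy is imported lazily inside the helper, so the pattern can be inspected without spaCy installed.

```python
# Token pattern: "very" repeated two or three times, followed by "good".
# The "{2,3}" value uses the min/max operator added in v3.4.
PATTERN = [
    {"LOWER": "very", "OP": "{2,3}"},
    {"LOWER": "good"},
]


def find_matches(nlp, text):
    """Return the text of every span in `text` that matches PATTERN.

    Hypothetical helper for illustration; requires spaCy v3.4 or later.
    """
    from spacy.matcher import Matcher  # imported lazily on first call

    matcher = Matcher(nlp.vocab)
    matcher.add("VERY_GOOD", [PATTERN])
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]
```

With spaCy installed, the helper could be called as find_matches(spacy.blank("en"), "This is very very good").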

Trained pipelines

New trained pipelines

v3.4 introduces new CPU/CNN pipelines for Croatian, which use the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.

Package            UPOS    Parser LAS    NER F
hr_core_news_sm    96.6    77.5          76.1
hr_core_news_md    97.3    80.1          81.8
hr_core_news_lg    97.5    80.4          83.0

Pipeline updates

All CNN pipelines have been extended with whitespace augmentation.

The English CNN pipelines have new word vectors:

Package           Model Version    TAG     Parser LAS    NER F
en_core_web_md    v3.3.0           97.3    90.1          84.6
en_core_web_md    v3.4.0           97.2    90.3          85.5
en_core_web_lg    v3.3.0           97.4    90.1          85.3
en_core_web_lg    v3.4.0           97.3    90.2          85.6

Notes about upgrading from v3.3

Doc.has_vector

Doc.has_vector now matches Token.has_vector and Span.has_vector: it returns True if at least one token in the doc has a vector rather than checking only whether the vocab contains vectors.
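
The rule described above can be pictured with a small stand-alone sketch. It is a simplification for illustration, not spaCy's implementation: any_token_has_vector is a hypothetical helper, and the tokens are plain stand-in objects that only carry a has_vector attribute.

```python
from types import SimpleNamespace


def any_token_has_vector(doc):
    """Simplified picture of the v3.4 rule: True if any token has a vector."""
    return any(token.has_vector for token in doc)


# Stand-in tokens (not real spaCy objects): one without and one with a vector.
tokens = [
    SimpleNamespace(text="xyzzy", has_vector=False),
    SimpleNamespace(text="dog", has_vector=True),
]

print(any_token_has_vector(tokens))      # True: one token has a vector
print(any_token_has_vector(tokens[:1]))  # False: no token has a vector
```

Code that previously used Doc.has_vector only to check whether a pipeline ships vectors may need to be reviewed, because a doc whose tokens all lack vectors now yields False.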

Using trained pipelines with floret vectors

If you’re using a trained pipeline for Croatian, Finnish, Korean or Swedish with new texts and working with Doc objects, you shouldn’t notice any difference between floret vectors and default vectors.

If you use vectors for similarity comparisons, there are a few differences, mainly because a floret pipeline doesn’t include any kind of frequency-based word list similar to the list of in-vocabulary vector keys with default vectors.

  • If your workflow iterates over the vector keys, you should use an external word list instead, because a floret pipeline has no list of in-vocabulary keys to iterate over.

  • Vectors.most_similar is not supported because there’s no fixed list of vectors to compare your vectors to.
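
The first point above can be sketched as follows. The helper name lexemes_from_word_list and the word list are hypothetical; the only spaCy behaviour relied on is that indexing a vocab with a string returns the lexeme for that string.

```python
def lexemes_from_word_list(vocab, words):
    """Look up one lexeme per word from an external word list.

    `vocab` is expected to be a spaCy Vocab (for example nlp.vocab); any
    object that supports vocab[word] works for the purposes of this sketch.
    """
    return [vocab[word] for word in words]
```

With a loaded floret pipeline, this could replace a loop over nlp.vocab.vectors, for example lexemes_from_word_list(nlp.vocab, external_word_list), where external_word_list is a list of strings you supply.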

Pipeline package version compatibility

When you’re loading a pipeline package trained with an earlier version of spaCy v3, you will see a warning telling you that the pipeline may be incompatible. This doesn’t necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results.

If you’re using one of the trained pipelines we provide, you should run spacy download to update to the latest version. To see an overview of all installed packages and their compatibility, you can run spacy validate.

If you’ve trained your own custom pipeline and you’ve confirmed that it’s still working as expected, you can update the spacy_version requirement in the pipeline’s meta.json so that it includes v3.4.
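
A minimal sketch of that edit, using only the standard library. The helper name update_spacy_version and the default range ">=3.3.0,<3.5.0" are assumptions chosen for illustration; pick the range that matches the versions you have actually tested.

```python
import json
from pathlib import Path


def update_spacy_version(meta_path, new_range=">=3.3.0,<3.5.0"):
    """Rewrite the "spacy_version" field of a pipeline's meta.json.

    Returns the previous value so the change can be logged or reverted.
    The default range is an example value, not a recommendation.
    """
    path = Path(meta_path)
    meta = json.loads(path.read_text(encoding="utf8"))
    old_range = meta.get("spacy_version")
    meta["spacy_version"] = new_range
    path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return old_range
```

Editing the file by hand works just as well; the function only makes the change repeatable across several pipeline directories.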

Updating v3.3 configs

To update a config from spaCy v3.3 with the new v3.4 settings, run spacy init fill-config on it, passing the existing config and an output path.
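
If you prefer to script this step, the command can be wrapped in a few lines of Python. This is a sketch under assumptions: the helper names and the file names in the usage note are hypothetical, and the wrapper simply invokes the spacy init fill-config command with a base config and an output path.

```python
import subprocess
import sys
from typing import List


def fill_config_command(base_path: str, output_path: str) -> List[str]:
    """Build the command line for `spacy init fill-config`."""
    return [
        sys.executable, "-m", "spacy", "init", "fill-config",
        base_path, output_path,
    ]


def fill_config(base_path: str, output_path: str) -> None:
    """Run fill-config; raises CalledProcessError if the command fails."""
    subprocess.run(fill_config_command(base_path, output_path), check=True)
```

For example, fill_config("config-v3.3.cfg", "config-v3.4.cfg") would write the filled-in config to the second path, leaving the original file untouched.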

In many cases (spacy train, spacy.load), the new defaults will be filled in automatically, but you’ll need to fill in the new settings to run debug config and debug data.