spaCy v3.4 brings typing and speed improvements along with new vectors for the English CNN pipelines and new trained pipelines for Croatian. This release also includes prebuilt Linux aarch64 wheels for all spaCy dependencies distributed by Explosion.
spaCy v3.4 supports pydantic v1.9 and mypy 0.950+ through extensive updates to types in Thinc v8.1.
- For the parser, use C `sgemm` provided by the `Ops` implementation in order to use Accelerate through `thinc-apple-ops`.
- Improved speed of vector lookups.
- Improved speed for `Example.get_aligned_parse` and `Example.get_aligned`.
- Language updates:
- Improve tokenization for Cyrillic combining diacritics.
- Improve English tokenizer exceptions for contractions with this/that/these/those.
- `spacy project clone` now tries both `main` and `master` branches by default.
- Added confidence threshold for named entity linker.
- Improved handling of Typer optional default values for
- Added cycle detection in parser projectivization methods.
- Added counts for NER labels in `debug data`.
- Support for adding NVTX ranges to `TrainablePipe` components.
- Support env variable `SPACY_NUM_BUILD_JOBS` to specify the number of build jobs to run in parallel with `pip`.
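The entity linker confidence threshold mentioned above is a pipeline-config setting. A hypothetical fragment might look like the following; the key name `threshold` and the value shown are assumptions for illustration, so check the `EntityLinker` API documentation for the exact parameter name and default:

```ini
[components.entity_linker]
factory = "entity_linker"
# hypothetical: entities whose candidate score falls below this
# threshold are not linked
threshold = 0.4
```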
v3.4 introduces new CPU/CNN pipelines for Croatian, which use the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.
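To see why Bloom embeddings with subwords leave no out-of-vocabulary words, consider the following toy sketch. It is not floret's actual implementation: the table size, n-gram range, number of hashes, and the use of `hashlib` are all stand-ins chosen for the demo. The key idea is that any string, seen or unseen, hashes to rows of a fixed table, so every word gets a vector.

```python
# Toy illustration of Bloom embeddings over subwords (NOT floret's
# real hash function or parameters).
import hashlib

NUM_ROWS = 16    # tiny embedding table for the demo
DIM = 4          # vector width
NUM_HASHES = 2   # each key is mapped to several rows (Bloom-style)

# deterministic "embedding table" filled with small floats
table = [[(r * DIM + c) * 0.01 for c in range(DIM)] for r in range(NUM_ROWS)]

def rows_for(key: str):
    """Map a string key to NUM_HASHES row indices via hashing."""
    for seed in range(NUM_HASHES):
        digest = hashlib.md5(f"{seed}:{key}".encode()).hexdigest()
        yield int(digest, 16) % NUM_ROWS

def ngrams(word: str, n_min: int = 3, n_max: int = 5):
    """Character n-grams of the word, with boundary markers."""
    marked = f"<{word}>"
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            yield marked[i : i + n]

def vector(word: str):
    """Sum the hashed table rows for the word and all of its subwords."""
    vec = [0.0] * DIM
    for key in [word, *ngrams(word)]:
        for row in rows_for(key):
            for c in range(DIM):
                vec[c] += table[row][c]
    return vec

# Every string hashes into the table -- there is no out-of-vocabulary case.
print(vector("spaCy"))
print(vector("sxyzzy"))  # never "seen" during training, still gets a vector
```

Because the table has a fixed number of rows, the vectors stay compact regardless of how many distinct words the pipeline encounters.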
| Package | UPOS | Parser LAS | NER F |
| --- | --- | --- | --- |
All CNN pipelines have been extended with whitespace augmentation.
The English CNN pipelines have new word vectors:
| Package | Model Version | TAG | Parser LAS | NER F |
| --- | --- | --- | --- | --- |
`Doc.has_vector` now matches `Token.has_vector` and `Span.has_vector`: it returns
`True` if at least one token in the doc has a vector, rather than
checking only whether the vocab contains vectors.
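The new semantics can be modeled with a couple of stand-in classes; these are not spaCy's classes, just a minimal sketch of the behavior change:

```python
# Stand-in classes illustrating the v3.4 Doc.has_vector semantics
# (NOT spaCy's actual implementation).

class Token:
    def __init__(self, vector=None):
        self.vector = vector

    @property
    def has_vector(self):
        return self.vector is not None

class Doc:
    def __init__(self, tokens):
        self.tokens = tokens

    @property
    def has_vector(self):
        # v3.4 behavior: True if at least one token has a vector,
        # instead of only checking whether the vocab stores vectors
        return any(t.has_vector for t in self.tokens)

doc = Doc([Token(), Token([0.1, 0.2])])
print(doc.has_vector)             # True: one token has a vector
print(Doc([Token()]).has_vector)  # False: no token has a vector
```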
Using trained pipelines with floret vectors
If you’re using a trained pipeline for Croatian, Finnish, Korean or Swedish with
new texts and working with
Doc objects, you shouldn’t notice any difference
between floret vectors and default vectors.
If you use vectors for similarity comparisons, there are a few differences, mainly because a floret pipeline doesn’t include any kind of frequency-based word list similar to the list of in-vocabulary vector keys with default vectors.
If your workflow iterates over the vector keys, you should use an external word list instead:
```diff
- lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+ lexemes = [nlp.vocab[word] for word in external_word_list]
```
`Vectors.most_similar` is not supported because there’s no fixed list of vectors to compare your vectors to.
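A most-similar lookup can still be done by ranking an external word list by cosine similarity. The sketch below uses toy two-dimensional vectors so it stays self-contained; with a real pipeline you would fill `word_vectors` from something like `nlp.vocab[word].vector` for each word in your external list.

```python
# Ranking an external word list by cosine similarity, as a replacement
# for Vectors.most_similar with floret vectors. The vectors below are
# toy data, not output from a real pipeline.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(query_vec, word_vectors, n=2):
    """Return the n words from word_vectors closest to query_vec."""
    scored = [(word, cosine(query_vec, vec)) for word, vec in word_vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]

word_vectors = {   # stand-in for vectors looked up from an external word list
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
}
print(most_similar([1.0, 0.05], word_vectors))
```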
When you’re loading a pipeline package trained with an earlier version of spaCy v3, you will see a warning telling you that the pipeline may be incompatible. This doesn’t necessarily have to be true, but we recommend running your pipelines against your test suite or evaluation data to make sure there are no unexpected results.
If you’re using one of the trained pipelines we provide, you should run
`spacy download` to update to the latest version. To
see an overview of all installed packages and their compatibility, you can run
`spacy validate`.
If you’ve trained your own custom pipeline and you’ve confirmed that it’s still
working as expected, you can update the spaCy version requirements in the
`meta.json` of your pipeline:
```diff
- "spacy_version": ">=3.3.0,<3.4.0",
+ "spacy_version": ">=3.3.0,<3.5.0",
```
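If you maintain several custom pipelines, the edit above can be scripted. This is a small sketch, not an official spaCy tool; the demo writes to a throwaway file, and for a real package you would point it at that package's own `meta.json`:

```python
# Sketch: widen the spacy_version requirement in a pipeline's meta.json.
import json
import tempfile
from pathlib import Path

def widen_spacy_version(meta_path, new_requirement=">=3.3.0,<3.5.0"):
    """Rewrite the spacy_version field of a meta.json file in place."""
    path = Path(meta_path)
    meta = json.loads(path.read_text())
    meta["spacy_version"] = new_requirement
    path.write_text(json.dumps(meta, indent=2))
    return meta

# demo on a temporary file (a real pipeline's meta.json lives inside
# the installed package directory)
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "meta.json"
    demo.write_text(json.dumps({"spacy_version": ">=3.3.0,<3.4.0"}))
    print(widen_spacy_version(demo)["spacy_version"])
```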
Updating v3.3 configs
To update a config from spaCy v3.3 with the new v3.4 settings, run
```bash
python -m spacy init fill-config config-v3.3.cfg config-v3.4.cfg
```