What's New in v2.3
spaCy v2.3 features new pretrained models for five languages, word vectors for all language models, and decreased model size and loading times for models with vectors. We’ve added pretrained models for Chinese, Danish, Japanese, Polish and Romanian and updated the training data and vectors for most languages. Model packages with vectors are about 2× smaller on disk and load 2-4× faster. For the full changelog, see the release notes on GitHub. For more details and a behind-the-scenes look at the new release, see our blog post.
Expanded model families with vectors
With new model families for Chinese, Danish, Japanese, Polish and Romanian plus `md` and `lg` models with word vectors for all languages, this release provides a total of 46 model packages. For models trained using Universal Dependencies corpora, the training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) and Dutch has been extended to include both UD Dutch Alpino and LassySmall.
Chinese
This release adds support for pkuseg for word segmentation, and the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The Chinese tokenizer can be initialized with both pkuseg and custom models, and the pkuseg user dictionary is easy to customize. Note that pkuseg doesn't yet ship with pre-compiled wheels for Python 3.8. See the usage documentation for details on how to install it on Python 3.8.
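As a rough sketch, initializing the Chinese tokenizer with the default pkuseg model and extending the user dictionary might look like the following. The `pkuseg_model` config keys and `pkuseg_update_user_dict()` are assumptions based on the v2.3 Chinese tokenizer; check the usage documentation for the exact API and model download steps.

```python
from spacy.lang.zh import Chinese

# Initialize with the "default" pkuseg model (assumed config keys)
cfg = {"pkuseg_model": "default", "require_pkuseg": True}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})

# Extend the pkuseg user dictionary with custom entries
nlp.tokenizer.pkuseg_update_user_dict(["北京大学"])

doc = nlp("我在北京大学读书。")
print([token.text for token in doc])
```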
Japanese
The updated Japanese language class switches to SudachiPy for word segmentation and part-of-speech tagging. Using SudachiPy greatly simplifies installing spaCy for Japanese, which is now possible with a single command: `pip install spacy[ja]`.
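After installation, a quick way to check the Japanese pipeline is to create a blank model, which tokenizes with SudachiPy. This is a minimal sketch; the sample sentence is arbitrary.

```python
import spacy

# A blank Japanese pipeline uses SudachiPy for word segmentation
nlp = spacy.blank("ja")
doc = nlp("これはペンです。")
print([token.text for token in doc])
```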
Small CLI updates
- `spacy debug-data` provides the coverage of the vectors in a base model with `spacy debug-data lang train dev -b base_model`
- `spacy evaluate` supports `blank:lg` (e.g. `spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy without loading a model
- `spacy train` on GPU restricts the CPU timing evaluation to the first iteration
Backwards incompatibilities
- If you’re training new models, you’ll want to install the package `spacy-lookups-data`, which now includes both the lemmatization tables (as in v2.2) and the normalization tables (new in v2.3). If you’re using pretrained models, nothing changes, because the relevant tables are included in the model packages.
- Due to the updated Universal Dependencies training data, the fine-grained part-of-speech tags will change for many provided language models. The coarse-grained part-of-speech tagset remains the same, but the mapping from particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech tagsets contain new merged tags related to contracted forms, such as `ADP_DET` for French "au", which maps to UPOS `ADP` based on the head "à". This increases the accuracy of the models by improving the alignment between spaCy’s tokenization and Universal Dependencies multi-word tokens used for contractions.
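As an illustration of the merged tags, the snippet below inspects the fine- and coarse-grained tags of a contracted form. It assumes an updated French model such as `fr_core_news_sm` is installed; the sentence is arbitrary.

```python
import spacy

nlp = spacy.load("fr_core_news_sm")
doc = nlp("Il va au marché.")
for token in doc:
    # With the v2.3 French models, "au" carries a merged fine-grained tag
    # such as ADP_DET, while token.pos_ remains a single UPOS tag
    print(token.text, token.tag_, token.pos_)
```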
Migrating from spaCy 2.2
Tokenizer settings
In spaCy v2.2.2-v2.2.4, there was a change to the precedence of `token_match` that gave prefixes and suffixes priority over `token_match`, which caused problems for many custom tokenizer configurations. This has been reverted in v2.3 so that `token_match` has priority over prefixes and suffixes, as in v2.2.1 and earlier versions.
A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a comma at the end of a URL) before applying the match. See the full tokenizer documentation and try out `nlp.tokenizer.explain()` when debugging your tokenizer configuration.
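For example, `nlp.tokenizer.explain()` reports which rule produced each token, which makes it easier to see whether a prefix, suffix, `token_match` or `url_match` rule applied. A minimal sketch; a blank English pipeline is enough since only the tokenizer is needed, and the input string is arbitrary.

```python
import spacy

nlp = spacy.blank("en")
# Each tuple pairs the rule that matched with the resulting substring
for rule, substring in nlp.tokenizer.explain("(See https://example.com),"):
    print(rule, repr(substring))
```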
Warnings configuration
spaCy’s custom warnings have been replaced with native Python `warnings`. Instead of setting `SPACY_WARNING_IGNORE`, use the `warnings` filters to manage warnings.
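For example, a standard filter can silence a single warning by matching the code in its message. W007 here is just an illustrative code; substitute whichever warning you want to manage.

```python
import warnings

# Ignore one specific spaCy warning by matching its code in the warning message
warnings.filterwarnings("ignore", message=r"\[W007\]")

# Or escalate spaCy-style warning codes to errors while debugging
# warnings.filterwarnings("error", message=r"\[W\d+\]")
```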
Normalization tables
The normalization tables have moved from the language data in `spacy/lang` to the package `spacy-lookups-data`. If you’re adding data for a new language, the normalization table should be added to `spacy-lookups-data`. See adding norm exceptions.
No preloaded vocab for models with vectors
To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer loaded on initialization for models with vectors. As you process texts, the lexemes will be added to the vocab automatically, just as in small models without vectors.
To see the number of unique vectors and the number of words with vectors, check `nlp.meta['vectors']`. For example, `en_core_web_md` has 20000 unique vectors and 684830 words with vectors:
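(A sketch; the exact contents of the meta dict depend on the model package.)

```python
import spacy

nlp = spacy.load("en_core_web_md")
print(nlp.meta["vectors"])
# e.g. {'width': 300, 'vectors': 20000, 'keys': 684830, 'name': 'en_core_web_md.vectors'}
```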
If required, for instance if you are working directly with word vectors rather than processing texts, you can load all lexemes for words with vectors at once:
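(A minimal sketch, assuming a model with vectors is loaded as `nlp`.)

```python
# Touch every entry in the vectors table so its lexeme is added to the vocab
for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]
```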
If your workflow previously iterated over `nlp.vocab`, a similar alternative is to iterate over words with vectors instead:
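(A sketch of the change, again assuming a model with vectors is loaded as `nlp`.)

```python
# Before (v2.2): iterate over all preloaded lexemes
# lexemes = [w for w in nlp.vocab]

# Now: iterate over the words that actually have vectors
lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
```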
Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to the set of words with vectors. For English, v2.2 `md`/`lg` models have 1.3M provided lexemes but only 685K words with vectors. The vectors have been updated for most languages in v2.3, but the English models contain the same vectors in both v2.2 and v2.3.
Lexeme.is_oov and Token.is_oov
In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not have a word vector. This is equivalent to `token.orth not in nlp.vocab.vectors`.
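For example (assuming `en_core_web_md` is installed; the tokens are arbitrary):

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("apple blargh")
for token in doc:
    # is_oov is True exactly when the token has no word vector
    print(token.text, token.is_oov, token.orth in nlp.vocab.vectors)
```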
Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored probability and cluster features. The probability and cluster features are no longer included in the provided medium and large models (see the next section).
Probability and cluster features
The `Token.prob` and `Token.cluster` features, which are no longer used by the core pipeline components as of spaCy v2, are no longer provided in the pretrained models to reduce the model size. To keep these features available for users relying on them, the `prob` and `cluster` features for the most frequent 1M tokens have been moved to `spacy-lookups-data` as extra features for the relevant languages (English, German, Greek and Spanish).
The extra tables are loaded lazily, so if you have `spacy-lookups-data` installed and your code accesses `Token.prob`, the full table is loaded into the model vocab, which will take a few seconds on initial loading. When you save this model after loading the `prob` table, the full `prob` table will be saved as part of the model vocab.
To load the probability table into a provided model, first make sure you have `spacy-lookups-data` installed. To load the table, remove the empty provided `lexeme_prob` table and then access `Lexeme.prob` for any word to load the table from `spacy-lookups-data`:
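(A sketch, assuming `spacy-lookups-data` and `en_core_web_md` are installed and that the model vocab exposes the extra tables via `lookups_extra`.)

```python
import spacy

nlp = spacy.load("en_core_web_md")

# Remove the empty placeholder table that ships with the model
if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
    nlp.vocab.lookups_extra.remove_table("lexeme_prob")

# Accessing .prob for any lexeme triggers loading of the full table
# from spacy-lookups-data (takes a few seconds the first time)
print(nlp.vocab["the"].prob)
```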
If you’d like to include custom `cluster`, `prob`, or `sentiment` tables as part of a new model, add the data to `spacy-lookups-data` under the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can initialize your `Vocab` with the `lookups_extra` argument with a `Lookups` object that includes the tables `lexeme_cluster`, `lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is currently only used to provide a custom `oov_prob`. See examples in the data directory in `spacy-lookups-data`.
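A sketch of the second option, initializing a `Vocab` with custom extra tables. The table contents here are made-up illustrative values; see `spacy-lookups-data` for the real table formats.

```python
from spacy.lookups import Lookups
from spacy.vocab import Vocab

lookups = Lookups()
# Illustrative values only
lookups.add_table("lexeme_cluster", {"dog": 2, "cat": 4})
lookups.add_table("lexeme_prob", {"dog": -10.5, "cat": -11.2})
lookups.add_table("lexeme_settings", {"oov_prob": -20.0})

vocab = Vocab(lookups_extra=lookups)
```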
Initializing new models without extra lookups tables
When you initialize a new model with `spacy init-model`, the `prob` table from `spacy-lookups-data` may be loaded as part of the initialization. If you’d like to omit this extra data as in spaCy’s provided v2.3 models, use the new flag `--omit-extra-lookups`.
Tag maps in provided models vs. blank models
The tag maps in the provided models may differ from the tag maps in the spaCy library. You can access the tag map in a loaded model under `nlp.vocab.morphology.tag_map`.
The tag map from `spacy.lang.lg.tag_map` is still used when a blank model is initialized. If you want to provide an alternate tag map, update `nlp.vocab.morphology.tag_map` after initializing the model or, if you’re using the train CLI, use the new `--tag-map-path` option to provide the tag map as a JSON dict.
If you want to export a tag map from a provided model for use with the train CLI, you can save it as a JSON dict. To only use string keys as required by JSON and to make it easier to read and edit, any internal integer IDs need to be converted back to strings:
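(A sketch of one way to do this, using `srsly`, which ships with spaCy; the model name is just an example, and feature keys or values that are already strings are kept as-is.)

```python
import spacy
import srsly

nlp = spacy.load("en_core_web_sm")

tag_map = {}
for tag, morph in nlp.vocab.morphology.tag_map.items():
    tag_map[tag] = {}
    for feat, val in morph.items():
        # Convert internal integer IDs back to strings; leave booleans untouched
        if isinstance(feat, int):
            feat = nlp.vocab.strings[feat]
        if isinstance(val, int) and not isinstance(val, bool):
            val = nlp.vocab.strings[val]
        tag_map[tag][feat] = val

srsly.write_json("tag_map.json", tag_map)
```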