Trained Models & Pipelines
```bash
python -m spacy download en_core_web_sm
```

```python
import spacy
nlp = spacy.load("en_core_web_sm")

# Or import the package as a module
import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])
```
Package naming conventions
In general, spaCy expects all pipeline packages to follow the naming convention [lang]_[name]. For spaCy’s pipelines, we also chose to divide the name into three components:

- Type: Capabilities (e.g. core for a general-purpose pipeline with tagging, parsing, lemmatization and named entity recognition, or dep for only tagging, parsing and lemmatization).
- Genre: Type of text the pipeline is trained on, e.g. web or news.
- Size: Package size indicator, sm, md, lg or trf. trf pipelines have no static word vectors. For pipelines with default vectors, md has a reduced word vector table with 20k unique vectors for ~500k words and lg has a large word vector table with ~500k entries. For pipelines with floret vectors, md vector tables have 50k entries and lg vector tables have 200k entries.
For example, en_core_web_sm is a small English pipeline trained on written web text (blogs, news, comments) that includes vocabulary, syntax and entities.
Additionally, the pipeline package versioning reflects both the compatibility with spaCy, as well as the model version. A package version a.b.c translates to:

- a: spaCy major version. For example, 2 for spaCy v2.x.
- b: spaCy minor version. For example, 3 for spaCy v2.3.x.
- c: Model version. Different model config: e.g. from being trained on different data, with different parameters, for different numbers of iterations, with different vectors, etc.
For a detailed compatibility overview, see the compatibility.json in the spacy-models repository. This is also the source of spaCy’s internal compatibility check, performed when you run the download command.
Trained pipeline design
The spaCy v3 trained pipelines are designed to be efficient and configurable. For example, multiple components can share a common “token-to-vector” model and it’s easy to swap out or disable the lemmatizer. The pipelines are optimized for speed and size and work best when run in full.
When modifying a trained pipeline, it’s important to understand how the components depend on each other. Unlike spaCy v2, where the tagger, parser and ner components were all independent, some v3 components depend on earlier components in the pipeline. As a result, disabling or reordering components can affect the annotation quality or lead to warnings and errors.
Main changes from spaCy v2 models:

- The Tok2Vec component may be a separate, shared component. A component like a tagger or parser can listen to an earlier tok2vec or transformer rather than having its own separate tok2vec layer.
- Rule-based exceptions move from individual components to the attribute_ruler. Lemma and POS exceptions move from the tokenizer exceptions to the attribute ruler, and the tag map and morph rules move from the tagger to the attribute ruler.
- The lemmatizer tables and processing move from the vocab and tagger to a separate lemmatizer component.
CNN/CPU pipeline design
In the sm/md/lg models:

- The tagger, morphologizer and parser components listen to the tok2vec component. If the lemmatizer is trainable (v3.3+), the lemmatizer also listens to tok2vec.
- The attribute_ruler maps token.tag to token.pos if there is no morphologizer. The attribute_ruler additionally makes sure whitespace is tagged consistently and copies token.pos to token.tag if there is no tagger. For English, the attribute ruler can improve its mapping from token.tag to token.pos if dependency parses from a parser are present, but the parser is not required.
- The lemmatizer component for many languages requires token.pos annotation from either the tagger+attribute_ruler or the morphologizer.
- The ner component is independent with its own internal tok2vec layer.
CNN/CPU pipelines with floret vectors
The Finnish, Korean and Swedish md and lg pipelines use floret vectors instead of default vectors. If you’re running a trained pipeline on texts and working with Doc objects, you shouldn’t notice any difference with floret vectors. With floret vectors, no tokens are out-of-vocabulary, so Token.is_oov will return False for all tokens.
If you access vectors directly for similarity comparisons, there are a few differences because floret vectors don’t include a fixed word list like the vector keys for default vectors.
- If your workflow iterates over the vector keys, you need to use an external word list instead.
- Vectors.most_similar is not supported because there’s no fixed list of vectors to compare your vectors to.
Transformer pipeline design
In the transformer (trf) models, the tagger, parser and ner (if present) all listen to the transformer component. The attribute_ruler and lemmatizer have the same configuration as in the CNN models.
Modifying the default pipeline
For faster processing, you may only want to run a subset of the components in a trained pipeline. The disable and exclude arguments to spacy.load let you control which components are loaded and run. Disabled components are loaded in the background so it’s possible to re-enable them in the same pipeline later with nlp.enable_pipe. To skip loading a component completely, use exclude instead of disable.
Disable part-of-speech tagging and lemmatization
To disable part-of-speech tagging and lemmatization, disable the tagger, attribute_ruler and lemmatizer components (plus the morphologizer, for languages that include one).
Use senter rather than parser for fast sentence segmentation
If you need fast sentence segmentation without dependency parses, disable the parser and use the senter component instead:
The senter component is ~10× faster than the parser and more accurate than the rule-based sentencizer.
Switch from trainable lemmatizer to default lemmatizer
Since v3.3, a number of pipelines use a trainable lemmatizer. You can check whether the lemmatizer is trainable:
If you’d like to switch to a non-trainable lemmatizer that’s similar to v3.2 or earlier, you can replace the trainable lemmatizer with the default non-trainable lemmatizer:
Switch from rule-based to lookup lemmatization
For the Dutch, English, French, Greek, Macedonian, Norwegian and Spanish pipelines, you can swap out a trainable or rule-based lemmatizer for a lookup lemmatizer:
Disable everything except NER
For the non-transformer models, the
ner component is independent, so you can
disable everything else:
In the transformer models, ner listens to the transformer component, so you can only disable the components related to tagging, parsing and lemmatization.
Move NER to the end of the pipeline
For access to POS and LEMMA features in an entity_ruler, move ner to the end of the pipeline after the attribute_ruler and lemmatizer.