Models

Downloadable pretrained models for spaCy

The models directory includes two types of pretrained models:

  1. Core models: General-purpose pretrained models to predict named entities, part-of-speech tags and syntactic dependencies. They can be used out of the box and fine-tuned on more specific data.
  2. Starter models: Transfer learning starter packs with pretrained weights you can initialize your models with to achieve better accuracy. They can include word vectors (which will be used as features during training) or other pretrained representations like BERT. These models don’t include components for specific tasks like NER or text classification and are intended to be used as base models when training your own models (see the example after this list).
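
As a rough illustration of the difference, using the real package names en_core_web_sm (a core model) and en_vectors_web_lg (a vectors-only starter), and assuming both are installed:

import spacy

# A core model ships with trained pipeline components.
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)                 # ['tagger', 'parser', 'ner']

# A starter model mainly provides pretrained representations (here: word
# vectors) to initialize your own models with; it has no task components.
vectors = spacy.load("en_vectors_web_lg")
print(vectors.pipe_names)             # []
print(vectors.vocab.vectors.shape)    # (number of vectors, vector width)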

Quickstart

Install a default model, get the code to load it from within spaCy and test it.

The quickstart widget lets you pick a language, a loading style and additional options, and generates the install and usage code for you. Core models are available for Chinese, Danish, Dutch, English, French, German, Greek, Italian, Japanese, Lithuanian, Norwegian Bokmål, Polish, Portuguese, Romanian and Spanish, plus a multi-language NER model (xx_ent_wiki_sm). For example, for the small English model:

python -m spacy download en_core_web_sm

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])

Alternatively, an installed model package can be imported and loaded directly:

import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])

Model architecture

spaCy v2.0 features new neural models for tagging, parsing and entity recognition. The models have been designed and implemented from scratch specifically for spaCy, to give you an unmatched balance of speed, size and accuracy. A novel bloom embedding strategy with subword features is used to support huge vocabularies in tiny tables. Convolutional layers with residual connections, layer normalization and maxout non-linearity are used, giving much better efficiency than the standard BiLSTM solution.
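
To make the bloom embedding idea more concrete, here is a minimal NumPy sketch of a hashed embedding with subword features. The table size, the seeds and the use of Python's built-in hash() (which is randomized between runs) are illustrative assumptions; spaCy's real layers use MurmurHash inside Thinc.

import numpy as np

def bloom_embed(key, table, seeds=(0, 1, 2, 3)):
    # Hash the same string with several seeds and sum the matching rows,
    # so a collision under one hash is compensated by the others.
    rows = [hash((seed, key)) % table.shape[0] for seed in seeds]
    return table[rows].sum(axis=0)

n_rows, width = 5000, 96                       # tiny table, huge effective vocab
table = np.random.normal(scale=0.1, size=(n_rows, width))

# Subword features of a single token: lowercase form, prefix, suffix, shape.
word = "spaCy"
shape = "".join("X" if c.isupper() else "x" for c in word)   # crude shape feature
features = [word.lower(), word[:1], word[-3:], shape]
vector = np.concatenate([bloom_embed(f, table) for f in features])
print(vector.shape)                            # (384,) i.e. 4 features * 96

Compare this with the (embed_lower | embed_prefix | embed_suffix | embed_shape) expression shown further down: each embed_* is such a hashed table, and the Maxout layer then mixes the concatenated features back down to the token width.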

The parser and NER use an imitation learning objective to deliver accuracy in line with the latest research systems, even when evaluated from raw text. With these innovations, spaCy v2.0’s models are 10× smaller, 20% more accurate, and even cheaper to run than the previous generation. The current architecture hasn’t been published yet, but in the meantime we prepared a video that explains how the models work, with particular focus on NER.

The parsing model is a blend of recent results. The two recent inspirations have been the work of Eliyahu Kiperwasser and Yoav Goldberg at Bar-Ilan [1], and the SyntaxNet team from Google. The foundation of the parser is still based on the work of Joakim Nivre [2], who introduced the transition-based framework [3], the arc-eager transition system, and the imitation learning objective. The model is implemented using Thinc, spaCy’s machine learning library. We first predict context-sensitive vectors for each word in the input:

(embed_lower | embed_prefix | embed_suffix | embed_shape)
    >> Maxout(token_width)
    >> convolution ** 4

This convolutional layer is shared between the tagger, parser and NER, and will also be shared by the future neural lemmatizer. Because the parser shares these layers with the tagger, the parser does not require tag features. I got this trick from David Weiss’s “Stack Combination” paper [4].

To boost the representation, the tagger actually predicts a “super tag” with POS, morphology and dependency label [5]. The tagger predicts these supertags by adding a softmax layer onto the convolutional layer – so, we’re teaching the convolutional layer to give us a representation that’s one affine transform away from this informative lexical information. This is obviously good for the parser (which backprops to the convolutions, too). The parser model makes a state vector by concatenating the vector representations for its context tokens. The current context tokens are:

Context tokens and what they refer to:

  • S0, S1, S2: Top three words on the stack.
  • B0, B1: First two words of the buffer.
  • S0L1, S1L1, S2L1, B0L1, B1L1 and S0L2, S1L2, S2L2, B0L2, B1L2: Leftmost and second-leftmost children of S0, S1, S2, B0 and B1.
  • S0R1, S1R1, S2R1, B0R1, B1R1 and S0R2, S1R2, S2R2, B0R2, B1R2: Rightmost and second-rightmost children of S0, S1, S2, B0 and B1.

This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately, there’s a way to structure the computation to save some expense (and make it more GPU-friendly).
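
As a sketch of that bookkeeping (the token indices and the zero-padding trick below are made up for illustration, not spaCy's internals):

import numpy as np

T = 128                                   # token vector width
n_context = 13                            # context tokens per parser state

tokvecs = np.random.normal(size=(40, T))  # CNN output for a 40-token sentence

# Hypothetical indices of the context tokens for one state; -1 marks an
# empty slot (e.g. fewer than three words on the stack).
context = np.array([5, 3, 1, 6, 7, 4, 2, 0, -1, -1, 8, -1, -1])

padded = np.vstack([tokvecs, np.zeros((1, T))])    # row -1 acts as a zero pad
state_vector = padded[context].reshape(n_context * T)
print(state_vector.shape)                          # (1664,) = 13 * 128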

The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks with a non-monotonic transition [4]). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden weights for each positional feature with respect to the words in the batch. (Note that our token vectors come from the CNN — so we can’t play this trick over the vocabulary. That’s how Stanford’s NN parser [3] works — and why its model is so big.)
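
A NumPy sketch of why the rearrangement is valid, with made-up sizes and with the bias and maxout pieces of the real layer omitted:

import numpy as np

N, T, H, F = 40, 128, 64, 13            # tokens, token width, hidden width, features

tokvecs = np.random.normal(size=(N, T))
W = np.random.normal(size=(F, T, H))    # one (T, H) weight block per positional feature

# One parser state: indices of its 13 context tokens (illustrative values).
state = np.random.randint(0, N, size=F)

# Naive: concatenate 13 token vectors, multiply by the full (13*T, H) matrix.
naive = tokvecs[state].reshape(F * T) @ W.reshape(F * T, H)

# Pre-computed: one (N, T) @ (T, 13*H) multiplication up front...
precomputed = tokvecs @ W.transpose(1, 0, 2).reshape(T, F * H)      # (N, 13*H)
# ...then each state only gathers one H-sized slice per feature and sums them.
cheap = sum(precomputed[state[f], f * H:(f + 1) * H] for f in range(F))

print(np.allclose(naive, cheap))        # True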

This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity. The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier. We don’t have to worry about variable-length batch sizes, and we don’t have to implement the dynamic oracle in CUDA to train.

Currently the parser’s loss function is multi-label log loss [6], as the dynamic oracle allows multiple states to be zero-cost. The resulting gradient for a gold (zero-cost) class’s score is as follows, where Z is the sum of exp(score) over all classes and gZ is the sum of exp(score) over the gold classes:

(exp(score) / Z) - (exp(score) / gZ)
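
A minimal NumPy sketch of that gradient for a single state, assuming a boolean mask of the zero-cost (gold) classes and that non-gold classes only receive the exp(score) / Z term; the max-subtraction is a numerical-stability convenience, not part of the formula above:

import numpy as np

def parser_gradient(scores, is_gold):
    # scores: raw scores for each transition, shape (n_classes,)
    # is_gold: boolean mask of the zero-cost (gold) classes
    exp_scores = np.exp(scores - scores.max())   # stabilised exponentials
    Z = exp_scores.sum()                         # over all classes
    gZ = exp_scores[is_gold].sum()               # over gold classes only
    d_scores = exp_scores / Z
    d_scores[is_gold] -= exp_scores[is_gold] / gZ
    return d_scores

scores = np.array([2.0, 0.5, -1.0, 1.5])
is_gold = np.array([True, False, False, True])
print(parser_gradient(scores, is_gold))          # the gradients sum to 0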

Model naming conventions

In general, spaCy expects all model packages to follow the naming convention of [lang]_[name]. For spaCy’s models, we also chose to divide the name into three components:

  1. Type: Model capabilities (e.g. core for general-purpose model with vocabulary, syntax, entities and word vectors, or depent for only vocab, syntax and entities).
  2. Genre: Type of text the model is trained on, e.g. web or news.
  3. Size: Model size indicator, sm, md or lg.

For example, en_core_web_sm is a small English model trained on written web text (blogs, news, comments), that includes vocabulary, vectors, syntax and entities.
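
A tiny illustrative helper (not part of spaCy) that splits such a package name into its components:

def parse_model_name(name):
    # "en_core_web_sm" -> ('en', 'core', 'web', 'sm')
    lang, model_type, genre, size = name.split("_")
    return lang, model_type, genre, size

print(parse_model_name("en_core_web_sm"))   # ('en', 'core', 'web', 'sm')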

Model versioning

Additionally, the model versioning reflects both the compatibility with spaCy and the major and minor model version. A model version a.b.c translates to:

  • a: spaCy major version. For example, 2 for spaCy v2.x.
  • b: Model major version. Models with a different major version can’t be loaded by the same code. For example, changing the width of the model, adding hidden layers or changing the activation changes the model major version.
  • c: Model minor version. Same model structure, but different parameter values, e.g. from being trained on different data, for different numbers of iterations, etc.

For a detailed compatibility overview, see the compatibility.json in the models repository. This is also the source of spaCy’s internal compatibility check, performed when you run the download command.
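
For example, a model’s meta.json records both its own version and the spaCy range it was built for, so a quick sanity check on the a.b.c scheme might look like this (an illustrative sketch, not the logic of the official download or validate commands):

import spacy

nlp = spacy.load("en_core_web_sm")
model_version = nlp.meta["version"]          # e.g. "2.2.5"
required_spacy = nlp.meta["spacy_version"]   # e.g. ">=2.2.2"

# Per the a.b.c scheme above, a is the spaCy major version the model targets.
model_major = model_version.split(".")[0]
spacy_major = spacy.__version__.split(".")[0]
print(model_major == spacy_major)
print("model", model_version, "requires spaCy", required_spacy)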