Model Architectures

Pre-defined model architectures included with the core library

A model architecture is a function that wires up a Model instance, which you can then use in a pipeline component or as a layer of a larger network. This page documents spaCy’s built-in architectures that are used for different NLP tasks. All trainable built-in components expect a model argument defined in the config and document their default architecture. Custom architectures can be registered using the @spacy.registry.architectures decorator and used as part of the training config. Also see the usage documentation on layers and model architectures.

Tok2Vec architectures

spacy.Tok2Vec.v2

Construct a tok2vec model out of two subnetworks: one for embedding and one for encoding. See the “Embed, Encode, Attend, Predict” blog post for background.

Name	Description
`embed`	Embed tokens into context-independent word vector representations. For example, CharacterEmbed or MultiHashEmbed. Model[List[Doc], List[Floats2d]]
`encode`	Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, MaxoutWindowEncoder. Model[List[Floats2d], List[Floats2d]]
CREATES	The model using the architecture. Model[List[Doc], List[Floats2d]]

spacy.HashEmbedCNN.v2

Build spaCy’s “standard” tok2vec layer. This layer is defined by a MultiHashEmbed embedding layer that uses subword features, and a MaxoutWindowEncoder encoding layer consisting of a CNN and a layer-normalized maxout activation function.

Name	Description
`width`	The width of the input and output. These are required to be the same, so that residual connections can be used. Recommended values are `96`, `128` or `300`. int
`depth`	The number of convolutional layers to use. Recommended values are between `2` and `8`. int
`embed_size`	The number of rows in the hash embedding tables. This can be surprisingly small, due to the use of the hash embeddings. Recommended values are between `2000` and `10000`. int
`window_size`	The number of tokens on either side to concatenate during the convolutions. The receptive field of the CNN will be `depth * window_size * 2 + 1`, so a 4-layer network with a window size of `2` will be sensitive to 17 words at a time. Recommended value is `1`. int
`maxout_pieces`	The number of pieces to use in the maxout non-linearity. If `1`, the `Mish` non-linearity is used instead. Recommended values are `1`-`3`. int
`subword_features`	Whether to also embed subword features, specifically the prefix, suffix and word shape. This is recommended for alphabetic languages like English, but not if single-character tokens are used for a language such as Chinese. bool
`pretrained_vectors`	Whether to also use static vectors. bool
CREATES	The model using the architecture. Model[List[Doc], List[Floats2d]]
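
The receptive-field arithmetic quoted for `window_size` can be checked with a few lines of Python. This is only a sketch of the formula as stated in the table; `receptive_field` is a hypothetical helper written for illustration and is not part of spaCy.

```python
def receptive_field(depth: int, window_size: int) -> int:
    """Tokens visible to the top layer, using the formula from the table.

    Hypothetical helper for illustration; not a spaCy function.
    """
    return depth * window_size * 2 + 1


# A 4-layer network with a window size of 2, as in the table.
print(receptive_field(depth=4, window_size=2))  # 17
# The same depth with the recommended window size of 1.
print(receptive_field(depth=4, window_size=1))  # 9
```

Because the receptive field grows with both values, a larger `depth` can compensate for keeping `window_size` at its recommended value of `1`.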

spacy.Tok2VecListener.v1

A listener is used as a sublayer within a component such as a DependencyParser, EntityRecognizer or TextCategorizer. Usually you’ll have multiple listeners connecting to a single upstream Tok2Vec component that’s earlier in the pipeline. The listener layers act as proxies, passing the predictions from the Tok2Vec component into downstream components, and communicating gradients back upstream.

Instead of defining its own Tok2Vec instance, a model architecture like Tagger can define a listener as its tok2vec argument that connects to the shared tok2vec component in the pipeline.

Listeners work by caching the Tok2Vec output for a given batch of Docs. This means that in order for a component to work with the listener, the batch of Docs passed to the listener must be the same as the batch of Docs passed to the Tok2Vec. As a result, any manipulation of the Docs which would affect Tok2Vec output, such as to create special contexts or remove Docs for which no prediction can be made, must happen inside the model, after the call to the Tok2Vec component.

Name	Description
`width`	The width of the vectors produced by the “upstream” `Tok2Vec` component. int
`upstream`	A string to identify the “upstream” `Tok2Vec` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Tok2Vec` component. You’ll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. str
CREATES	The model using the architecture. Model[List[Doc], List[Floats2d]]
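
The caching behaviour described above, and the reason the listener must see the same batch as the Tok2Vec component, can be illustrated with a toy sketch. The classes `ToyTok2Vec` and `ToyListener` are hypothetical stand-ins written in plain Python: they use strings in place of Doc objects and token counts in place of vectors, and they do not reproduce spaCy's actual implementation or gradient handling.

```python
from typing import List, Optional, Tuple


class ToyListener:
    """Hypothetical stand-in for a listener layer: returns cached output."""

    def __init__(self) -> None:
        self._batch_id: Optional[Tuple[str, ...]] = None
        self._outputs: Optional[List[int]] = None

    def receive(self, batch_id: Tuple[str, ...], outputs: List[int]) -> None:
        # Called by the upstream component after it has processed a batch.
        self._batch_id = batch_id
        self._outputs = outputs

    def __call__(self, docs: List[str]) -> List[int]:
        # The cache is only valid for exactly the batch the upstream saw.
        if self._outputs is None or tuple(docs) != self._batch_id:
            raise ValueError("listener got a different batch than the upstream")
        return self._outputs


class ToyTok2Vec:
    """Hypothetical stand-in for the shared upstream component."""

    def __init__(self) -> None:
        self.listeners: List[ToyListener] = []

    def __call__(self, docs: List[str]) -> List[int]:
        outputs = [len(doc.split()) for doc in docs]  # pretend "vectors"
        for listener in self.listeners:
            listener.receive(tuple(docs), outputs)
        return outputs


tok2vec = ToyTok2Vec()
listener = ToyListener()
tok2vec.listeners.append(listener)

docs = ["a b", "c d e"]
tok2vec(docs)
print(listener(docs))  # same batch: the cached output is returned

try:
    listener(docs[:1])  # a Doc was removed before reaching the listener
except ValueError as err:
    print("error:", err)
```

The failing call shows why filtering or modifying the batch has to happen inside the model, after the upstream output has been retrieved.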

spacy.MultiHashEmbed.v2

Construct an embedding layer that separately embeds a number of lexical attributes using hash embedding, concatenates the results, and passes it through a feed-forward subnetwork to build a mixed representation. The features used can be configured with the attrs argument. The suggested attributes are NORM, PREFIX, SUFFIX and SHAPE. This lets the model take into account some subword information, without constructing a fully character-based representation. If pretrained vectors are available, they can be included in the representation as well, with the vectors table kept static (i.e. it’s not updated).

Name	Description
`width`	The output width. Also used as the width of the embedding tables. Recommended values are between `64` and `300`. If static vectors are included, a learned linear layer is used to map the vectors to the specified width before concatenating it with the other embedding outputs. A single maxout layer is then used to reduce the concatenated vectors to the final width. int
`attrs`	The token attributes to embed. A separate embedding table will be constructed for each attribute. List[Union[int, str]]
`rows`	The number of rows for each embedding table. Can be low, due to the hashing trick: recommended values are between `1000` and `10000`, and between `2000` and `10000` rows is generally sufficient, even for very large vocabularies. A number of rows must be specified for each table, so the `rows` list must be of the same length as the `attrs` parameter. List[int]
`include_static_vectors`	Whether to also use static word vectors. Requires a vectors table to be loaded in the `Doc` objects’ vocab. bool
CREATES	The model using the architecture. Model[List[Doc], List[Floats2d]]
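
The relationship between `attrs` and `rows`, and the idea behind the hashing trick, can be sketched in plain Python. This is a simplified illustration under stated assumptions: `hashed_rows` is a hypothetical helper, it uses a single `zlib.crc32` hash per attribute, and the real layer may hash keys differently. The attribute values and row counts below are example inputs only.

```python
import zlib
from typing import Dict, List


def hashed_rows(token_attrs: Dict[str, str], attrs: List[str], rows: List[int]) -> List[int]:
    """Map each attribute value to a row in that attribute's own table.

    Hypothetical helper: one table per attribute, one hash per value.
    """
    if len(attrs) != len(rows):
        raise ValueError("`rows` must have one entry per attribute in `attrs`")
    indices = []
    for attr, n_rows in zip(attrs, rows):
        key = f"{attr}={token_attrs[attr]}".encode("utf8")
        # Any value, however rare, lands in one of the n_rows slots.
        indices.append(zlib.crc32(key) % n_rows)
    return indices


attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 1000, 2500, 2500]  # example sizes within the recommended range
token = {"NORM": "jumping", "PREFIX": "j", "SUFFIX": "ing", "SHAPE": "xxxx"}

print(hashed_rows(token, attrs, rows))
```

Since values are reduced modulo the table size, the tables never need to grow with the vocabulary, which is why a few thousand rows can be enough.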

spacy.CharacterEmbed.v2

Construct an embedded representation based on character embeddings, using a feed-forward network. A fixed number of UTF-8 byte characters are used for each word, taken from the beginning and end of the word equally. Padding is used in the center for words that are too short.

For instance, let’s say nC=4, and the word is “jumping”. The characters used will be "jung" (two from the start, two from the end). If we had nC=8, the characters would be "jumpping": 4 from the start, 4 from the end. This ensures that the final character is always in the last position, instead of being in an arbitrary position depending on the word length.
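
The selection rule in this example can be written out as a short sketch. It works on Python characters rather than UTF-8 bytes, and `select_chars` and the `_` padding symbol are hypothetical names chosen for illustration. How very short words are handled here (repeating the available characters at both ends, with padding in the center) is an assumption based on the description above, not a statement about spaCy's implementation.

```python
def select_chars(word: str, nC: int, pad: str = "_") -> str:
    """Take nC // 2 characters from each end of `word`, padding the center.

    Hypothetical helper; operates on characters, not UTF-8 bytes.
    """
    half = nC // 2
    n = min(half, len(word))
    start = word[:n]
    end = word[len(word) - n:]
    padding = pad * (nC - 2 * n)
    return start + padding + end


print(select_chars("jumping", 4))  # jung
print(select_chars("jumping", 8))  # jumpping
print(select_chars("a", 4))        # a__a
```

In every case the last character of the word ends up in the final position of the output, which is the property the paragraph above describes.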

The characters are embedded in an embedding table with a given number of rows, and the vectors concatenated. A hash-embedded vector of the NORM of the word is also concatenated on, and the result is then passed through a feed-forward network to construct a single vector to represent the information.

Name	Description
`width`	The width of the output vector and the `NORM` hash embedding. int
`rows`	The number of rows in the `NORM` hash embedding table. int
`nM`	The dimensionality of the character embeddings. Recommended values are between `16` and `64`. int
`nC`	The number of UTF-8 bytes to embed per word. Recommended values are between `3` and `8`, although it may depend on the length of words in the language. int
CREATES	The model using the architecture. Model[List[Doc], List[Floats2d]]

spacy.MaxoutWindowEncoder.v2

Encode context using convolutions with maxout activation, layer normalization and residual connections.

Name	Description
`width`	The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. int
`window_size`	The number of words to concatenate around each token to construct the convolution. Recommended value is `1`. int
`maxout_pieces`	The number of maxout pieces to use. Recommended values are `2` or `3`. int
`depth`	The number of convolutional layers. Recommended value is `4`. int
CREATES	The model using the architecture. Model[List[Floats2d], List[Floats2d]]

spacy.MishWindowEncoder.v2

Encode context using convolutions with Mish activation, layer normalization and residual connections.

Name	Description
`width`	The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. int
`window_size`	The number of words to concatenate around each token to construct the convolution. Recommended value is `1`. int
`depth`	The number of convolutional layers. Recommended value is `4`. int
CREATES	The model using the architecture. Model[List[Floats2d], List[Floats2d]]

spacy.TorchBiLSTMEncoder.v1

Encode context using bidirectional LSTM layers. Requires PyTorch.

Name	Description
`width`	The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. int
`depth`	The number of recurrent layers, for instance `depth=2` results in stacking two LSTMs together. int
`dropout`	Creates a Dropout layer on the outputs of each LSTM layer except the last layer. Set to 0.0 to disable this functionality. float
CREATES	The model using the architecture. Model[List[Floats2d], List[Floats2d]]

spacy.StaticVectors.v2

Embed Doc objects with their vocab’s vectors table, applying a learned linear projection to control the dimensionality. Unknown tokens are mapped to a zero vector. See the documentation on static vectors for details.

Name	Description
`nO`	The output width of the layer, after the linear projection. Optional[int]
`nM`	The width of the static vectors. Optional[int]
`dropout`	Optional dropout rate. If set, it’s applied per dimension over the whole batch. Defaults to `None`. Optional[float]
`init_W`	The initialization function. Defaults to `glorot_uniform_init`. Callable[[Ops, Tuple[int, …]], FloatsXd]
`key_attr`	This setting is ignored in spaCy v3.6+. To set a custom key attribute for vectors, configure it through `Vectors` or `spacy init vectors`. Defaults to `"ORTH"`. str
CREATES	The model using the architecture. Model[List[Doc], Ragged]
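
The lookup-then-project behaviour can be sketched without any dependencies. In this illustration, `embed_static` is a hypothetical helper, the vectors table is a plain dictionary, and the projection is a matrix `W` with `nO` rows and `nM` columns and no bias term (an assumption made for this sketch).

```python
from typing import Dict, List


def embed_static(
    tokens: List[str],
    vectors: Dict[str, List[float]],
    W: List[List[float]],
) -> List[List[float]]:
    """Look up each token and project it from width nM to width nO.

    Hypothetical helper; unknown tokens are mapped to a zero vector.
    """
    nM = len(W[0])
    output = []
    for token in tokens:
        vector = vectors.get(token, [0.0] * nM)
        output.append([sum(w * x for w, x in zip(row, vector)) for row in W])
    return output


vectors = {"cat": [1.0, 2.0, 3.0]}         # nM = 3
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]     # nO = 2
print(embed_static(["cat", "zzz"], vectors, W))
# [[1.0, 5.0], [0.0, 0.0]]
```

Without a bias term, the zero vector for the unknown token stays zero after the projection.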

spacy.FeatureExtractor.v1

Extract arrays of input features from Doc objects. Expects a list of feature names to extract, which should refer to token attributes.

Name	Description
`columns`	The token attributes to extract. List[Union[int, str]]
CREATES	The created feature extraction layer. Model[List[Doc], List[Ints2d]]

Transformer architectures

The following architectures are provided by the package spacy-transformers. See the usage documentation for how to integrate the architectures into your training config.

spacy-transformers.TransformerModel.v3

Load and wrap a transformer model from the HuggingFace transformers library. You can use any transformer that has pretrained weights and a PyTorch implementation. The name variable is passed through to the underlying library, so it can be either a string or a path. If it’s a string, the pretrained weights will be downloaded via the transformers library if they are not already available locally.

In order to support longer documents, the TransformerModel layer allows you to pass in a get_spans function that will divide up the Doc objects before passing them through the transformer. Your spans are allowed to overlap or exclude tokens. This layer is usually used directly by the Transformer component, which allows you to share the transformer weights across your pipeline. For a layer that’s configured for use in other components, see Tok2VecTransformer.

Name	Description
`name`	Any model name that can be loaded by `transformers.AutoModel`. str
`get_spans`	Function that takes a batch of `Doc` objects and returns lists of `Span` objects to process by the transformer. See here for built-in options and examples. Callable[[List[Doc]], List[Span]]
`tokenizer_config`	Tokenizer settings passed to `transformers.AutoTokenizer`. Dict[str, Any]
`transformer_config`	Transformer settings passed to `transformers.AutoConfig`. Dict[str, Any]
`mixed_precision`	Replace whitelisted ops by half-precision counterparts. Speeds up training and prediction on GPUs with Tensor Cores and reduces GPU memory use. bool
`grad_scaler_config`	Configuration to pass to `thinc.api.PyTorchGradScaler` during training when `mixed_precision` is enabled. Dict[str, Any]
CREATES	The model using the architecture. Model[List[Doc], FullTransformerBatch]

The transformer_config argument was added in spacy-transformers.TransformerModel.v2.
The mixed_precision and grad_scaler_config arguments were added in spacy-transformers.TransformerModel.v3.

The other arguments are shared between all versions.
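
To make the idea of a `get_spans` function more concrete, here is a minimal sketch of one possible strategy: fixed-size windows that advance by a stride. It works on plain lists of tokens rather than Doc and Span objects, and `strided_windows`, `window` and `stride` are hypothetical names for this illustration, not the library's built-in span getters.

```python
from typing import List, Sequence


def strided_windows(tokens: Sequence[str], window: int, stride: int) -> List[List[str]]:
    """Split one document's tokens into windows of up to `window` tokens.

    Hypothetical helper: stride < window gives overlapping spans,
    stride > window skips (excludes) some tokens.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans = []
    start = 0
    while start < len(tokens):
        spans.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the end of the document has been covered
        start += stride
    return spans


tokens = [f"t{i}" for i in range(10)]
# Overlapping spans: t3 and t6 each appear in two windows.
print(strided_windows(tokens, window=4, stride=3))
# Excluding tokens: t2 and t5 are never passed on.
print(strided_windows(tokens[:7], window=2, stride=3))
```

Overlapping windows give tokens near a boundary more context, at the cost of processing some tokens more than once.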

spacy-transformers.TransformerListener.v1

Create a TransformerListener layer, which will connect to a Transformer component earlier in the pipeline. The layer takes a list of Doc objects as input, and produces a list of 2-dimensional arrays as output, with each array having one row per token. Most spaCy models expect a sublayer with this signature, making it easy to connect them to a transformer model via this sublayer. Transformer models usually operate over wordpieces, which usually don’t align one-to-one against spaCy tokens. The layer therefore requires a reduction operation in order to calculate a single token vector given zero or more wordpiece vectors.

Name	Description
`pooling`	A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see `reduce_mean`) is usually a good choice. Model[Ragged, Floats2d]
`grad_factor`	Reweight gradients from the component before passing them upstream. You can set this to `0` to “freeze” the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. float
`upstream`	A string to identify the “upstream” `Transformer` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Transformer` component. You’ll almost never have multiple upstream `Transformer` components, so the wildcard string will almost always be fine. str
CREATES	The model using the architecture. Model[List[Doc], List[Floats2d]]
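
The reduction from wordpiece vectors to token vectors can be sketched as mean pooling over a ragged layout: one flat list of wordpiece vectors plus the number of wordpieces belonging to each token. `mean_pool` is a hypothetical helper written in plain Python. Returning a zero vector for a token with no wordpieces is an assumption made for this sketch.

```python
from typing import List


def mean_pool(pieces: List[List[float]], lengths: List[int]) -> List[List[float]]:
    """Average the wordpiece vectors belonging to each token.

    Hypothetical helper; `lengths[i]` is the wordpiece count of token i.
    """
    width = len(pieces[0]) if pieces else 0
    pooled = []
    start = 0
    for n in lengths:
        if n == 0:
            # Assumption: a token with no wordpieces gets a zero vector.
            pooled.append([0.0] * width)
        else:
            chunk = pieces[start:start + n]
            pooled.append([sum(column) / n for column in zip(*chunk)])
        start += n
    return pooled


pieces = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
lengths = [2, 0, 1]  # token 0 has two wordpieces, token 1 none, token 2 one
print(mean_pool(pieces, lengths))
# [[2.0, 3.0], [0.0, 0.0], [5.0, 6.0]]
```

The output always has one row per token, which matches the signature expected by downstream components.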

spacy-transformers.Tok2VecTransformer.v3

Use a transformer as a Tok2Vec layer directly. This does not allow multiple components to share the transformer weights and does not allow the transformer to set annotations into the Doc object, but it’s a simpler solution if you only need the transformer within one component.

Name	Description
`get_spans`	Function that takes a batch of `Doc` objects and returns lists of `Span` objects to process by the transformer. See here for built-in options and examples. Callable[[List[Doc]], List[Span]]
`tokenizer_config`	Tokenizer settings passed to `transformers.AutoTokenizer`. Dict[str, Any]
`transformer_config`	Settings to pass to the transformers forward pass. Dict[str, Any]
`pooling`	A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see `reduce_mean`) is usually a good choice. Model[Ragged, Floats2d]
`grad_factor`	Reweight gradients from the component before passing them upstream. You can set this to `0` to “freeze” the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. float
`mixed_precision`	Replace whitelisted ops by half-precision counterparts. Speeds up training and prediction on GPUs with Tensor Cores and reduces GPU memory use. bool
`grad_scaler_config`	Configuration to pass to `thinc.api.PyTorchGradScaler` during training when `mixed_precision` is enabled. Dict[str, Any]
CREATES	The model using the architecture. Model[List[Doc], List[Floats2d]]

The transformer_config argument was added in spacy-transformers.Tok2VecTransformer.v2.
The mixed_precision and grad_scaler_config arguments were added in spacy-transformers.Tok2VecTransformer.v3.

The other arguments are shared between all versions.

Curated Transformer architectures

The following architectures are provided by the package spacy-curated-transformers. See the usage documentation for how to integrate the architectures into your training config.

When loading the model from the Hugging Face Hub, the model config’s parameters must be the same as the hyperparameters used by the pre-trained model. The init fill-curated-transformer CLI command can be used to automatically fill in these values.

spacy-curated-transformers.AlbertTransformer.v1

Construct an ALBERT transformer model.

Name	Description
`vocab_size`	Vocabulary size. int
`with_spans`	Callback that constructs a span generator model. Callable
`piece_encoder`	The piece encoder to segment input tokens. Model
`attention_probs_dropout_prob`	Dropout probability of the self-attention layers. float
`embedding_width`	Width of the embedding representations. int
`hidden_act`	Activation used by the point-wise feed-forward layers. str
`hidden_dropout_prob`	Dropout probability of the point-wise feed-forward and embedding layers. float
`hidden_width`	Width of the final representations. int
`intermediate_width`	Width of the intermediate projection layer in the point-wise feed-forward layer. int
`layer_norm_eps`	Epsilon for layer normalization. float
`max_position_embeddings`	Maximum length of position embeddings. int
`model_max_length`	Maximum length of model inputs. int
`num_attention_heads`	Number of self-attention heads. int
`num_hidden_groups`	Number of layer groups whose constituents share parameters. int
`num_hidden_layers`	Number of hidden layers. int
`padding_idx`	Index of the padding meta-token. int
`type_vocab_size`	Type vocabulary size. int
`mixed_precision`	Use mixed-precision training. bool
`grad_scaler_config`	Configuration passed to the PyTorch gradient scaler. dict
CREATES	The model using the architecture. Model

spacy-curated-transformers.BertTransformer.v1

Construct a BERT transformer model.

Name	Description
`vocab_size`	Vocabulary size. int
`with_spans`	Callback that constructs a span generator model. Callable
`piece_encoder`	The piece encoder to segment input tokens. Model
`attention_probs_dropout_prob`	Dropout probability of the self-attention layers. float
`hidden_act`	Activation used by the point-wise feed-forward layers. str
`hidden_dropout_prob`	Dropout probability of the point-wise feed-forward and embedding layers. float
`hidden_width`	Width of the final representations. int
`intermediate_width`	Width of the intermediate projection layer in the point-wise feed-forward layer. int
`layer_norm_eps`	Epsilon for layer normalization. float
`max_position_embeddings`	Maximum length of position embeddings. int
`model_max_length`	Maximum length of model inputs. int
`num_attention_heads`	Number of self-attention heads. int
`num_hidden_layers`	Number of hidden layers. int
`padding_idx`	Index of the padding meta-token. int
`type_vocab_size`	Type vocabulary size. int
`mixed_precision`	Use mixed-precision training. bool
`grad_scaler_config`	Configuration passed to the PyTorch gradient scaler. dict
CREATES	The model using the architecture. Model

spacy-curated-transformers.CamembertTransformer.v1

Construct a CamemBERT transformer model.

Name	Description
`vocab_size`	Vocabulary size. int
`with_spans`	Callback that constructs a span generator model. Callable
`piece_encoder`	The piece encoder to segment input tokens. Model
`attention_probs_dropout_prob`	Dropout probability of the self-attention layers. float
`hidden_act`	Activation used by the point-wise feed-forward layers. str
`hidden_dropout_prob`	Dropout probability of the point-wise feed-forward and embedding layers. float
`hidden_width`	Width of the final representations. int
`intermediate_width`	Width of the intermediate projection layer in the point-wise feed-forward layer. int
`layer_norm_eps`	Epsilon for layer normalization. float
`max_position_embeddings`	Maximum length of position embeddings. int
`model_max_length`	Maximum length of model inputs. int
`num_attention_heads`	Number of self-attention heads. int
`num_hidden_layers`	Number of hidden layers. int
`padding_idx`	Index of the padding meta-token. int
`type_vocab_size`	Type vocabulary size. int
`mixed_precision`	Use mixed-precision training. bool
`grad_scaler_config`	Configuration passed to the PyTorch gradient scaler. dict
CREATES	The model using the architecture. Model

spacy-curated-transformers.RobertaTransformer.v1

Construct a RoBERTa transformer model.

Name	Description
`vocab_size`	Vocabulary size. int
`with_spans`	Callback that constructs a span generator model. Callable
`piece_encoder`	The piece encoder to segment input tokens. Model
`attention_probs_dropout_prob`	Dropout probability of the self-attention layers. float
`hidden_act`	Activation used by the point-wise feed-forward layers. str
`hidden_dropout_prob`	Dropout probability of the point-wise feed-forward and embedding layers. float
`hidden_width`	Width of the final representations. int
`intermediate_width`	Width of the intermediate projection layer in the point-wise feed-forward layer. int
`layer_norm_eps`	Epsilon for layer normalization. float
`max_position_embeddings`	Maximum length of position embeddings. int
`model_max_length`	Maximum length of model inputs. int
`num_attention_heads`	Number of self-attention heads. int
`num_hidden_layers`	Number of hidden layers. int
`padding_idx`	Index of the padding meta-token. int
`type_vocab_size`	Type vocabulary size. int
`mixed_precision`	Use mixed-precision training. bool
`grad_scaler_config`	Configuration passed to the PyTorch gradient scaler. dict
CREATES	The model using the architecture. Model

spacy-curated-transformers.XlmrTransformer.v1

Construct an XLM-RoBERTa transformer model.

Name	Description
`vocab_size`	Vocabulary size. int
`with_spans`	Callback that constructs a span generator model. Callable
`piece_encoder`	The piece encoder to segment input tokens. Model
`attention_probs_dropout_prob`	Dropout probability of the self-attention layers. float
`hidden_act`	Activation used by the point-wise feed-forward layers. str
`hidden_dropout_prob`	Dropout probability of the point-wise feed-forward and embedding layers. float
`hidden_width`	Width of the final representations. int
`intermediate_width`	Width of the intermediate projection layer in the point-wise feed-forward layer. int
`layer_norm_eps`	Epsilon for layer normalization. float
`max_position_embeddings`	Maximum length of position embeddings. int
`model_max_length`	Maximum length of model inputs. int
`num_attention_heads`	Number of self-attention heads. int
`num_hidden_layers`	Number of hidden layers. int
`padding_idx`	Index of the padding meta-token. int
`type_vocab_size`	Type vocabulary size. int
`mixed_precision`	Use mixed-precision training. bool
`grad_scaler_config`	Configuration passed to the PyTorch gradient scaler. dict
CREATES	The model using the architecture. Model

spacy-curated-transformers.ScalarWeight.v1

Construct a model that accepts a list of transformer layer outputs and returns a weighted representation of the same.

Name	Description
`num_layers`	Number of transformer hidden layers. int
`dropout_prob`	Dropout probability. float
`mixed_precision`	Use mixed-precision training. bool
`grad_scaler_config`	Configuration passed to the PyTorch gradient scaler. dict
CREATES	The model using the architecture. Model[ScalarWeightInT, ScalarWeightOutT]
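
One common way to build a weighted representation of several layer outputs is a softmax-normalized scalar mix, sketched below for a single token. This is an illustration of the general idea only: `scalar_mix` is a hypothetical helper, and the softmax normalization is an assumption, so the library's actual weighting may differ.

```python
import math
from typing import List


def scalar_mix(layer_outputs: List[List[float]], weights: List[float]) -> List[float]:
    """Combine one vector per layer into a single vector.

    Hypothetical helper; assumes softmax-normalized layer weights.
    """
    if len(layer_outputs) != len(weights):
        raise ValueError("need exactly one weight per layer")
    # Softmax over the raw weights (shifted by the max for stability).
    top = max(weights)
    exps = [math.exp(w - top) for w in weights]
    total = sum(exps)
    probs = [e / total for e in exps]
    width = len(layer_outputs[0])
    return [
        sum(p * layer[i] for p, layer in zip(probs, layer_outputs))
        for i in range(width)
    ]


layers = [[1.0, 2.0], [3.0, 4.0]]  # two layers, width 2
# Equal raw weights reduce to a plain average of the layers.
print(scalar_mix(layers, [0.0, 0.0]))
# A much larger weight on the second layer pulls the result toward it.
print(scalar_mix(layers, [0.0, 5.0]))
```

In a trained model the raw weights would be learned parameters, letting each downstream task decide which layers matter most.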

spacy-curated-transformers.TransformerLayersListener.v1

Construct a listener layer that communicates with one or more upstream Transformer components. This layer extracts the output of the last transformer layer and performs pooling over the individual pieces of each Doc token, returning their corresponding representations. The upstream name should either be the wildcard string ’*’, or the name of the Transformer component.

In almost all cases, the wildcard string will suffice, as there’ll only be one upstream Transformer component. But in certain situations, e.g. when you have disjoint datasets for certain tasks, or you’d like to use a pre-trained pipeline but a downstream task requires its own token representations, you could end up with more than one Transformer component in the pipeline.

Name	Description
`layers`	The number of layers produced by the upstream transformer component, excluding the embedding layer. int
`width`	The width of the vectors produced by the upstream transformer component. int
`pooling`	Model that is used to perform pooling over the piece representations. Model
`upstream_name`	A string to identify the ‘upstream’ Transformer component to communicate with. str
`grad_factor`	Factor to multiply gradients with. float
CREATES	A model that returns the relevant vectors from an upstream transformer component. Model[List[Doc], List[Floats2d]]

spacy-curated-transformers.LastTransformerLayerListener.v1

Construct a listener layer that communicates with one or more upstream Transformer components. This layer extracts the output of the last transformer layer and performs pooling over the individual pieces of each Doc token, returning their corresponding representations. The upstream name should either be the wildcard string ’*’, or the name of the Transformer component.

Name	Description
`width`	The width of the vectors produced by the upstream transformer component. int
`pooling`	Model that is used to perform pooling over the piece representations. Model
`upstream_name`	A string to identify the ‘upstream’ Transformer component to communicate with. str
`grad_factor`	Factor to multiply gradients with. float
CREATES	A model that returns the relevant vectors from an upstream transformer component. Model[List[Doc], List[Floats2d]]

spacy-curated-transformers.ScalarWeightingListener.v1

Construct a listener layer that communicates with one or more upstream Transformer components. This layer calculates a weighted representation of all transformer layer outputs and performs pooling over the individual pieces of each Doc token, returning their corresponding representations.

Requires its upstream Transformer components to return all layer outputs from their models. The upstream name should either be the wildcard string ’*’, or the name of the Transformer component.

Name	Description
`width`	The width of the vectors produced by the upstream transformer component. int
`weighting`	Model that is used to perform the weighting of the different layer outputs. Model
`pooling`	Model that is used to perform pooling over the piece representations. Model
`upstream_name`	A string to identify the ‘upstream’ Transformer component to communicate with. str
`grad_factor`	Factor to multiply gradients with. float
CREATES	A model that returns the relevant vectors from an upstream transformer component. Model[List[Doc], List[Floats2d]]

spacy-curated-transformers.BertWordpieceEncoder.v1

Construct a WordPiece piece encoder model that accepts a list of token sequences or documents and returns a corresponding list of piece identifiers. This encoder also splits each token on punctuation characters, as expected by most BERT models.