Language class
A text-processing pipeline

Usually you’ll load this once per process as nlp and pass the instance around your application. A Language object is created when you call spacy.load(). It contains the shared vocabulary and language data, optional model data loaded from a model package or a path, and a processing pipeline of components, such as the tagger or parser, that are called on a document in order. You can also add your own pipeline components: callables that take a Doc object, modify it and return it.
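As a rough illustration of the flow described above, the following toy model creates a document from text and passes it through each component in order. It is plain Python and not spaCy's implementation; ToyDoc, ToyLanguage, count_words and flag_long are hypothetical names used only for this sketch.

```python
class ToyDoc:
    """Hypothetical stand-in for a Doc: holds text, words and annotations."""

    def __init__(self, text):
        self.text = text
        self.words = text.split()
        self.annotations = {}


class ToyLanguage:
    """Hypothetical stand-in for Language: make_doc plus an ordered pipeline."""

    def __init__(self, make_doc=ToyDoc):
        self.make_doc = make_doc
        self.pipeline = []  # list of (name, component) tuples, in order

    def __call__(self, text):
        doc = self.make_doc(text)
        for name, component in self.pipeline:
            doc = component(doc)  # each component takes a doc and returns it
        return doc


def count_words(doc):
    doc.annotations["n_words"] = len(doc.words)
    return doc


def flag_long(doc):
    # Relies on the annotation set by the previous component.
    doc.annotations["is_long"] = doc.annotations["n_words"] > 3
    return doc


nlp = ToyLanguage()
nlp.pipeline.append(("count_words", count_words))
nlp.pipeline.append(("flag_long", flag_long))
doc = nlp("This is a short sentence.")
```

Because flag_long reads what count_words wrote, swapping the two entries would fail, which is why the order of the pipeline matters.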

Language.__init__ method

Initialize a Language object.

Name | Type | Description
vocab | Vocab | A Vocab object. If True, a vocab is created via Language.Defaults.create_vocab.
make_doc | callable | A function that takes text and returns a Doc object. Usually a Tokenizer.
meta | dict | Custom meta data for the Language class. Is written to by models to add model meta data.

Language.__call__ method

Apply the pipeline to some text. The text can span multiple sentences, and can contain arbitrary whitespace. Alignment into the original string is preserved.

Name | Type | Description
text | unicode | The text to be processed.
disable | list | Names of pipeline components to disable.
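To make "alignment into the original string is preserved" concrete, here is a minimal sketch with a hypothetical whitespace tokenizer (tokenize_with_offsets is not a spaCy function). It keeps a start offset for every token, so each token can be found again in the unmodified text regardless of how much whitespace surrounds it.

```python
import re


def tokenize_with_offsets(text):
    """Hypothetical whitespace tokenizer that keeps each token's start offset."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


text = "Hello   world.\n\nSecond  sentence here."
tokens = tokenize_with_offsets(text)

# Every token can be located again in the untouched original string.
aligned = all(text[start:start + len(tok)] == tok for tok, start in tokens)
```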

Language.pipe method

Process texts as a stream, and yield Doc objects in order. This is usually more efficient than processing texts one-by-one.

Name | Type | Description
texts | - | A sequence of unicode objects.
as_tuples | bool | If set to True, inputs should be a sequence of (text, context) tuples. Output will then be a sequence of (doc, context) tuples. Defaults to False.
batch_size | int | The number of texts to buffer.
disable | list | Names of pipeline components to disable.
component_cfg v2.1 | dict | Config parameters for specific pipeline components, keyed by component name.
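The streaming behavior described above (lazy, in order, buffered in batches, optionally carrying a context object) can be sketched in plain Python. toy_pipe and its process argument are hypothetical stand-ins, and the default batch size of 2 is chosen only for the example; this is not how spaCy implements Language.pipe.

```python
from itertools import islice


def toy_pipe(texts, process, as_tuples=False, batch_size=2):
    """Hypothetical stand-in for a streaming pipe: lazy, ordered, batched."""
    stream = iter(texts)
    while True:
        batch = list(islice(stream, batch_size))  # buffer up to batch_size items
        if not batch:
            return
        for item in batch:
            if as_tuples:
                text, context = item
                yield process(text), context  # context is passed through as-is
            else:
                yield process(item)


data = [("first text", {"id": 1}), ("second text", {"id": 2}), ("third", {"id": 3})]
results = list(toy_pipe(data, process=str.upper, as_tuples=True))
```

The as_tuples form is convenient when each text has metadata, such as a database ID, that must stay attached to the processed result.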

Language.update method

Update the models in the pipeline.

Name | Type | Description
docs | iterable | A batch of Doc objects or unicode. If unicode, a Doc object will be created from the text.
golds | iterable | A batch of GoldParse objects or dictionaries. Dictionaries will be used to create GoldParse objects. For the available keys and their usage, see GoldParse.__init__.
drop | float | The dropout rate.
sgd | callable | An optimizer.
component_cfg v2.1 | dict | Config parameters for specific pipeline components, keyed by component name.

Language.begin_training method

Allocate models, pre-process training data and acquire an optimizer.

Name | Type | Description
gold_tuples | iterable | Gold-standard training data.
component_cfg v2.1 | dict | Config parameters for specific pipeline components, keyed by component name.
**cfg | - | Config parameters (sent to all components).

Language.use_params contextmanager, method

Replace weights of models in the pipeline with those provided in the params dictionary. Can be used as a context manager, in which case, models go back to their original weights after the block.

Name | Type | Description
params | dict | A dictionary of parameters keyed by model ID.
**cfg | - | Config parameters.
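The swap-and-restore pattern described above can be sketched with contextlib. ToyModel and this use_params function are hypothetical and only model the behavior (weights replaced inside the block, originals restored afterwards); they are not spaCy's code.

```python
from contextlib import contextmanager


class ToyModel:
    """Hypothetical model with an ID and a single list of weights."""

    def __init__(self, model_id, weights):
        self.id = model_id
        self.weights = weights


@contextmanager
def use_params(models, params):
    # Remember the current weights of every model that is about to be replaced.
    backup = {m.id: m.weights for m in models if m.id in params}
    for m in models:
        if m.id in params:
            m.weights = params[m.id]
    try:
        yield
    finally:
        # Restore the original weights when the block ends, even on errors.
        for m in models:
            if m.id in backup:
                m.weights = backup[m.id]


models = [ToyModel(1, [0.1, 0.2]), ToyModel(2, [0.3])]
averaged = {1: [0.15, 0.15]}  # keyed by model ID

with use_params(models, averaged):
    inside = [m.weights for m in models]
after = [m.weights for m in models]
```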

Language.preprocess_gold method

Can be called before training to pre-process gold data. By default, it handles nonprojectivity and adds missing tags to the tag map.

Name | Type | Description
docs_golds | iterable | Tuples of Doc and GoldParse objects.

Language.create_pipe method v2.0

Create a pipeline component from a factory.

Name | Type | Description
name | unicode | Factory name to look up in Language.factories.
config | dict | Configuration parameters to initialize component.

Language.add_pipe method v2.0

Add a component to the processing pipeline. Valid components are callables that take a Doc object, modify it and return it. Only one of before, after, first or last can be set. Default behavior is last=True.

Name | Type | Description
component | callable | The pipeline component.
name | unicode | Name of pipeline component. Overwrites existing component.name attribute if available. If no name is set and the component exposes no name attribute, component.__name__ is used. An error is raised if the name already exists in the pipeline.
before | unicode | Component name to insert component directly before.
after | unicode | Component name to insert component directly after.
first | bool | Insert component first / not first in the pipeline.
last | bool | Insert component last / not last in the pipeline.
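The placement and naming rules above can be sketched as a small function over a list of (name, component) tuples. This is a toy model of the documented behavior, not spaCy's implementation; the standalone add_pipe function, the error messages and my_component are assumptions made for the example.

```python
def add_pipe(pipeline, component, name=None, before=None, after=None,
             first=None, last=None):
    """Toy version of the insertion rules; pipeline is a list of (name, component)."""
    # Explicit name wins, then a name attribute, then the function's __name__.
    name = name or getattr(component, "name", None) or component.__name__
    names = [n for n, _ in pipeline]
    if name in names:
        raise ValueError("'{}' already exists in pipeline".format(name))
    if sum(bool(arg) for arg in (before, after, first, last)) > 1:
        raise ValueError("Only one of before, after, first or last can be set")
    if before is not None:
        index = names.index(before)
    elif after is not None:
        index = names.index(after) + 1
    elif first:
        index = 0
    else:
        index = len(pipeline)  # default behavior: last=True
    pipeline.insert(index, (name, component))


def my_component(doc):
    return doc


pipeline = [("tagger", my_component), ("parser", my_component)]
add_pipe(pipeline, my_component, name="print_info", before="parser")
add_pipe(pipeline, my_component, name="cleanup")
add_pipe(pipeline, my_component, name="setup", first=True)
order = [name for name, _ in pipeline]
```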

Language.has_pipe method v2.0

Check whether a component is present in the pipeline. Equivalent to name in nlp.pipe_names.

Name | Type | Description
name | unicode | Name of the pipeline component to check.

Language.get_pipe method v2.0

Get a pipeline component for a given component name.

Name | Type | Description
name | unicode | Name of the pipeline component to get.

Language.replace_pipe method v2.0

Replace a component in the pipeline.

Name | Type | Description
name | unicode | Name of the component to replace.
component | callable | The pipeline component to insert.

Language.rename_pipe method v2.0

Rename a component in the pipeline. Useful to create custom names for pre-defined and pre-loaded components. To change the default name of a component added to the pipeline, you can also use the name argument on add_pipe.

Name | Type | Description
old_name | unicode | Name of the component to rename.
new_name | unicode | New name of the component.

Language.remove_pipe method v2.0

Remove a component from the pipeline. Returns the removed component name and component function.

Name | Type | Description
name | unicode | Name of the component to remove.
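Since the pipeline is described as an ordered list of (name, component) tuples, the lookup, replace, rename and remove operations can be pictured as simple list manipulations. The functions below are hypothetical stand-ins that mirror the documented behavior on a plain list; error handling for unknown names is left to the built-in exceptions.

```python
def has_pipe(pipeline, name):
    return name in [n for n, _ in pipeline]


def get_pipe(pipeline, name):
    for n, component in pipeline:
        if n == name:
            return component
    raise KeyError(name)


def replace_pipe(pipeline, name, component):
    index = [n for n, _ in pipeline].index(name)
    pipeline[index] = (name, component)  # same name and position, new callable


def rename_pipe(pipeline, old_name, new_name):
    index = [n for n, _ in pipeline].index(old_name)
    pipeline[index] = (new_name, pipeline[index][1])


def remove_pipe(pipeline, name):
    index = [n for n, _ in pipeline].index(name)
    return pipeline.pop(index)  # the removed (name, component) tuple


def tag(doc):
    return doc


def parse(doc):
    return doc


def better_parse(doc):
    return doc


pipeline = [("tagger", tag), ("parser", parse)]
replace_pipe(pipeline, "parser", better_parse)
rename_pipe(pipeline, "parser", "custom_parser")
removed = remove_pipe(pipeline, "tagger")
```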

Language.disable_pipes contextmanager, method v2.0

Disable one or more pipeline components. If used as a context manager, the pipeline will be restored to the initial state at the end of the block. Otherwise, a DisabledPipes object is returned, which has a .restore() method you can use to undo your changes.

Name | Type | Description
*disabled | unicode | Names of pipeline components to disable.
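Both usage styles described above (as a context manager, or with an explicit restore call) can be modeled with one small class. ToyDisabledPipes is a hypothetical name and a simplified sketch of the behavior, not spaCy's DisabledPipes class.

```python
class ToyDisabledPipes:
    """Toy model: removes components on creation, puts them back on restore()."""

    def __init__(self, pipeline, *names):
        self.pipeline = pipeline
        self.original = list(pipeline)  # snapshot of the initial state
        pipeline[:] = [(n, c) for n, c in pipeline if n not in names]

    def restore(self):
        self.pipeline[:] = self.original

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.restore()


def identity(doc):
    return doc


pipeline = [("tagger", identity), ("parser", identity), ("ner", identity)]

# Style 1: context manager, restored automatically at the end of the block.
with ToyDisabledPipes(pipeline, "tagger", "parser"):
    names_inside = [n for n, _ in pipeline]
names_after = [n for n, _ in pipeline]

# Style 2: keep the returned object and call restore() manually.
disabled = ToyDisabledPipes(pipeline, "ner")
names_disabled = [n for n, _ in pipeline]
disabled.restore()
names_restored = [n for n, _ in pipeline]
```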

Language.to_disk method v2.0

Save the current state to a directory. If a model is loaded, this will include the model.

Name | Type | Description
path | unicode / Path | A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or Path-like objects.
exclude | list | Names of pipeline components or serialization fields to exclude.

Language.from_disk method v2.0

Load state from a directory. Modifies the object in place and returns it. If the saved Language object contains a model, the model will be loaded. Note that this method is commonly used via subclasses like English or German, which makes language-specific functionality such as the lexical attribute getters available to the loaded object.

Name | Type | Description
path | unicode / Path | A path to a directory. Paths may be either strings or Path-like objects.
exclude | list | Names of pipeline components or serialization fields to exclude.

Language.to_bytes method

Serialize the current state to a binary string.

Name | Type | Description
exclude | list | Names of pipeline components or serialization fields to exclude.

Language.from_bytes method

Load state from a binary string. Note that this method is commonly used via subclasses like English or German, which makes language-specific functionality such as the lexical attribute getters available to the loaded object.

Name | Type | Description
bytes_data | bytes | The data to load from.
exclude | list | Names of pipeline components or serialization fields to exclude.

Attributes

Name | Type | Description
vocab | Vocab | A container for the lexical types.
tokenizer | Tokenizer | The tokenizer.
make_doc | lambda text: Doc | Create a Doc object from unicode text.
pipeline | list | List of (name, component) tuples describing the current processing pipeline, in order.
pipe_names v2.0 | list | List of pipeline component names, in order.
meta | dict | Custom meta data for the Language class. If a model is loaded, contains meta data of the model.
path v2.0 | Path | Path to the model data directory, if a model is loaded. Otherwise None.

Class attributes

Name | Type | Description
Defaults | class | Settings, data and factory methods for creating the nlp object and processing pipeline.
lang | unicode | Two-letter language ID, i.e. ISO code.
factories v2.0 | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name.

Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Name | Description
vocab | The shared Vocab.
tokenizer | Tokenization rules and exceptions.
meta | The meta data, available as Language.meta.
… | String names of pipeline components, e.g. "ner".
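The effect of the exclude argument can be pictured with a toy state dictionary whose keys are the field names listed above. The to_bytes and from_bytes functions below are hypothetical stand-ins that use JSON for simplicity; spaCy's actual binary format and internals are not shown here.

```python
import json


def to_bytes(state, exclude=()):
    """Serialize every field of a state dict except the excluded ones."""
    kept = {key: value for key, value in state.items() if key not in exclude}
    return json.dumps(kept, sort_keys=True).encode("utf8")


def from_bytes(state, bytes_data, exclude=()):
    """Load fields into an existing state dict, skipping excluded names."""
    loaded = json.loads(bytes_data.decode("utf8"))
    for key, value in loaded.items():
        if key not in exclude:
            state[key] = value
    return state  # modified in place and returned


state = {
    "vocab": ["apple", "pear"],
    "tokenizer": {"rules": 2},
    "meta": {"lang": "en"},
    "ner": {"labels": ["ORG"]},  # a pipeline component, keyed by its name
}
data = to_bytes(state, exclude=["ner"])

fresh = {"vocab": [], "meta": {"lang": "xx"}}
restored = from_bytes(fresh, data, exclude=["meta"])
```

Here the "ner" entry never reaches the byte string, and the "meta" entry is present in the data but skipped on load, so the existing value in fresh is kept.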