Loading a language processing pipeline

The standard entry point into spaCy is the spacy.load() function, which constructs a language processing pipeline. The standard variable name for the language processing pipeline is nlp, for Natural Language Processing. The nlp variable is usually an instance of class spacy.language.Language. For English, the spacy.en.English class is the default.

You'll use the nlp instance to produce Doc objects. You'll then use the Doc object to access linguistic annotations to help you with whatever text processing task you're trying to do.

import spacy                         # See "Installing spaCy"
nlp = spacy.load('en')               # You are here.
doc = nlp(u'Hello, spacy!')          # See "Using the pipeline"
print([(w.text, w.pos_) for w in doc]) # See "Doc, Span and Token"

The load function takes the following positional arguments:

lang_id An ID that is resolved to a class or factory function by spacy.util.get_lang_class(). Common values are 'en' for the English pipeline, or 'de' for the German pipeline. You can register your own factory function or class with spacy.util.set_lang_class().
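The ID lookup described above can be pictured as a small registry. The sketch below is a hypothetical stand-in written for illustration, not spaCy's own code: ToyEnglish, the _LANGUAGES dictionary and the error message are all assumptions, and the functions only borrow the names get_lang_class and set_lang_class to show the idea of resolving an ID and forwarding keyword arguments to the factory.

```python
# Hypothetical stand-in for the ID-to-factory lookup described above.
# This is NOT spaCy's implementation; every name here is illustrative.

_LANGUAGES = {}  # maps a lang_id string to a class or factory function


def set_lang_class(lang_id, factory):
    """Register a class or factory function under a language ID."""
    _LANGUAGES[lang_id] = factory


def get_lang_class(lang_id):
    """Resolve a language ID, failing loudly for unknown IDs."""
    try:
        return _LANGUAGES[lang_id]
    except KeyError:
        raise RuntimeError("Language not supported: %s" % lang_id)


def load(lang_id, **overrides):
    """Resolve the ID, then forward all keyword arguments to the factory."""
    factory = get_lang_class(lang_id)
    return factory(**overrides)


class ToyEnglish(object):
    """Minimal placeholder for a pipeline class such as spacy.en.English."""
    lang = 'en'

    def __init__(self, **overrides):
        # Keep the keyword arguments so we can see what was forwarded.
        self.overrides = overrides


set_lang_class('en', ToyEnglish)
nlp = load('en', path=None)
```

In this sketch, registering a different factory under 'en' changes what load('en') returns without touching the calling code, which is the point of resolving the ID through a registry.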

All keyword arguments are passed forward to the pipeline factory. No keyword arguments are required. The built-in factories (e.g. spacy.en.English, spacy.de.German), which are subclasses of Language, respond to the following keyword arguments:

path Where to load the data from. If None, the default data path is fetched via spacy.util.get_data_path(). You can configure this default using spacy.util.set_data_path(). The data path is expected to be either a string, or an object responding to the pathlib.Path interface. If the path is a string, it will be immediately transformed into a pathlib.Path object. spaCy promises to never manipulate or open file-system paths as strings. All access to the file-system is done via the pathlib.Path interface. spaCy also promises to never check the type of path objects. This allows you to customize the loading behaviours in arbitrary ways, by creating your own object that implements the pathlib.Path interface.
pipeline A sequence of functions that take the Doc object and modify it in-place. See Customizing the pipeline.
create_pipeline Callback to construct the pipeline sequence. It should accept the nlp instance as its only argument, and return a sequence of functions that take the Doc object and modify it in-place. See Customizing the pipeline. If a value is supplied to the pipeline keyword argument, the create_pipeline keyword argument is ignored.
make_doc A function that takes the input and returns a document object.
create_make_doc Callback to construct the make_doc function. It should accept the nlp instance as its only argument. To use the built-in annotation processes, the make_doc function it constructs should return an object of type Doc. If a value is supplied to the make_doc keyword argument, the create_make_doc keyword argument is ignored.
vocab Supply a pre-built Vocab instance, instead of constructing one.
add_vectors Callback that installs word vectors into the Vocab instance. The add_vectors callback should take a Vocab instance as its only argument, and set the word vectors and vectors_length in-place. See Word Vectors and Similarities.
tagger Supply a pre-built tagger, instead of creating one.
parser Supply a pre-built parser, instead of creating one.
entity Supply a pre-built entity recognizer, instead of creating one.
matcher Supply a pre-built matcher, instead of creating one.
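To make the precedence rules for pipeline, create_pipeline, make_doc and create_make_doc concrete, here is a minimal sketch. ToyDoc, ToyLanguage, count_words and flag_shouting are hypothetical stand-ins that do not reproduce spaCy's Doc or Language classes, and the call order (make_doc first, then each pipeline function in sequence) is an assumption made for the illustration.

```python
# Stand-in classes only: ToyDoc and ToyLanguage are hypothetical and do not
# reproduce spaCy's Doc or Language implementations.


class ToyDoc(object):
    def __init__(self, text):
        self.words = text.split()
        self.annotations = {}


def count_words(doc):
    # A pipeline component: takes the doc and modifies it in-place.
    doc.annotations['n_words'] = len(doc.words)


def flag_shouting(doc):
    # Another component: records whether any word is fully upper-case.
    doc.annotations['shouting'] = any(w.isupper() for w in doc.words)


class ToyLanguage(object):
    def __init__(self, pipeline=None, create_pipeline=None,
                 make_doc=None, create_make_doc=None):
        # An explicit make_doc wins; otherwise ask the callback, else default.
        if make_doc is not None:
            self.make_doc = make_doc
        elif create_make_doc is not None:
            self.make_doc = create_make_doc(self)
        else:
            self.make_doc = ToyDoc
        # Same precedence: pipeline is used and create_pipeline is ignored.
        if pipeline is not None:
            self.pipeline = list(pipeline)
        elif create_pipeline is not None:
            self.pipeline = list(create_pipeline(self))
        else:
            self.pipeline = []

    def __call__(self, text):
        doc = self.make_doc(text)
        for process in self.pipeline:
            process(doc)  # each function annotates the doc in-place
        return doc


# Both arguments are supplied, so create_pipeline is never called.
nlp = ToyLanguage(pipeline=[count_words],
                  create_pipeline=lambda nlp: [flag_shouting])
doc = nlp(u'Hello, spacy!')
print(doc.annotations)  # prints {'n_words': 2}
```

Because the callbacks receive the nlp instance, a create_pipeline function in this sketch could build components that share state held on that instance, whereas a plain pipeline list has to be fully constructed before the instance exists.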