Top-level Functions
spacy.load function
Load a pipeline using the name of an installed package, a string path or a
Path-like object.
spaCy will try resolving the load argument in this order. If a pipeline is
loaded from a string name, spaCy will assume it’s a Python package and import it
and call the package’s own load() method. If a pipeline is loaded from a path,
spaCy will assume it’s a data directory, load its config.cfg and use the
language and pipeline information to construct the Language class. The data will
be loaded in via Language.from_disk. Loading a pipeline from a package will also
import any custom code, if present, whereas loading from a directory does not.
For these cases, you need to manually import your custom code.
Name | Description |
---|---|
name | Pipeline to load, i.e. package name or path. Union[str,Path] |
keyword-only | |
vocab | Optional shared vocab to pass in on initialization. If True (default), a new Vocab object will be created. Union[Vocab, bool] |
disable | Name(s) of pipeline component(s) to disable. Disabled pipes will be loaded but they won’t be run unless you explicitly enable them by calling nlp.enable_pipe. Is merged with the config entry nlp.disabled . Union[str, Iterable[str]] |
enable v3.4 | Name(s) of pipeline component(s) to enable. All other pipes will be disabled. Union[str, Iterable[str]] |
exclude v3.0 | Name(s) of pipeline component(s) to exclude. Excluded components won’t be loaded. Union[str, Iterable[str]] |
config v3.0 | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. "components.name.value" . Union[Dict[str, Any],Config] |
RETURNS | A Language object with the loaded pipeline. Language |
Essentially, spacy.load() is a convenience wrapper that reads the pipeline’s
config.cfg, uses the language and pipeline information to construct a Language
object, loads in the model data and weights, and returns it.
Abstract example
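The steps above can be sketched as follows. This is a minimal illustration, not spaCy's actual implementation: it assumes a blank English pipeline with a `sentencizer` component saved to a temporary directory (both picked only for this example), so that it runs without any trained pipeline package installed.

```python
import tempfile

import spacy
from spacy.util import get_lang_class

# Build a tiny pipeline and save it, so there is a data directory to load.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as data_path:
    nlp.to_disk(data_path)

    # Loading from a path reads config.cfg and restores the components.
    loaded = spacy.load(data_path)
    # Excluded components are not loaded at all.
    without_sents = spacy.load(data_path, exclude=["sentencizer"])
    # Disabled components are loaded but not run.
    disabled_sents = spacy.load(data_path, disable=["sentencizer"])

    # Roughly the same steps written out by hand:
    lang_cls = get_lang_class("en")  # 1. get the Language subclass
    manual = lang_cls()              # 2. initialize it
    manual.add_pipe("sentencizer")   # 3. add the pipeline components
    manual.from_disk(data_path)      # 4. load the data from disk

print(loaded.pipe_names, without_sents.pipe_names, disabled_sents.disabled)
```

A disabled component still shows up in `nlp.disabled` and can be switched back on with `nlp.enable_pipe`, whereas an excluded component is absent from the loaded pipeline.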
spacy.blank function
Create a blank pipeline of a given language class. This function is the twin of
spacy.load().
Name | Description |
---|---|
name | IETF language tag, such as ‘en’, of the language class to load. str |
keyword-only | |
vocab | Optional shared vocab to pass in on initialization. If True (default), a new Vocab object will be created. Union[Vocab, bool] |
config v3.0 | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. "components.name.value" . Union[Dict[str, Any],Config] |
meta | Optional meta overrides for nlp.meta . Dict[str, Any] |
RETURNS | An empty Language object of the appropriate subclass. Language |
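A short usage sketch: the pipeline name passed via `meta` is an arbitrary value chosen for illustration.

```python
import spacy

# A blank pipeline has a tokenizer but no pipeline components.
nlp = spacy.blank("en", meta={"name": "demo_pipeline"})
doc = nlp("This is a sentence.")
tokens = [token.text for token in doc]

print(nlp.lang, nlp.pipe_names, nlp.meta["name"], tokens)
```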
spacy.info function
The same as the info command. Pretty-print information about your installation,
installed pipelines and local setup from within spaCy.
Name | Description |
---|---|
model | Optional pipeline, i.e. a package name or path. Optional[str] |
keyword-only | |
markdown | Print information as Markdown. bool |
silent | Don’t print anything, just return. bool |
spacy.explain function
Get a description for a given POS tag, dependency label or entity type. For a
list of available terms, see glossary.py.
Name | Description |
---|---|
term | Term to explain. str |
RETURNS | The explanation, or None if not found in the glossary. Optional[str] |
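For illustration, a lookup of one known term and one made-up term (`NOT_A_REAL_LABEL` is a hypothetical label that is assumed not to be in the glossary):

```python
import spacy

known = spacy.explain("NOUN")
unknown = spacy.explain("NOT_A_REAL_LABEL")  # hypothetical, not in the glossary

print(known)    # description of the coarse-grained POS tag
print(unknown)  # None, because the term is not in the glossary
```

Some spaCy versions may additionally emit a warning when the term is not found.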
spacy.prefer_gpu function
Allocate data and perform operations on GPU, if available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any pipelines.
Name | Description |
---|---|
gpu_id | Device index to select. Defaults to 0 . int |
RETURNS | Whether the GPU was activated. bool |
spacy.require_gpu function
Allocate data and perform operations on GPU. Will raise an error if no GPU is available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any pipelines.
Name | Description |
---|---|
gpu_id | Device index to select. Defaults to 0 . int |
RETURNS | True bool |
spacy.require_cpu function v3.0.0
Allocate data and perform operations on CPU. If data has already been allocated on GPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any pipelines.
Name | Description |
---|---|
RETURNS | True bool |
displaCy
As of v2.0, spaCy comes with a built-in visualization suite. For more info and examples, see the usage guide on visualizing spaCy.
displacy.serve method
Serve a dependency parse tree or named entity visualization to view it in your browser. Will run a simple web server.
Name | Description |
---|---|
docs | Document(s) or span(s) to visualize. Union[Iterable[Union[Doc,Span]],Doc,Span] |
style v3.3 | Visualization style, "dep" , "ent" or "span" . Defaults to "dep" . str |
page | Render markup as full HTML page. Defaults to True . bool |
minify | Minify HTML markup. Defaults to False . bool |
options | Visualizer-specific options, e.g. colors. Dict[str, Any] |
manual | Don’t parse Doc and instead expect a dict or list of dicts. See here for formats and examples. Defaults to False . bool |
port | Port to serve visualization. Defaults to 5000 . int |
host | Host to serve visualization. Defaults to "0.0.0.0" . str |
auto_select_port v3.5 | If True , automatically switch to a different port if the specified port is already in use. Defaults to False . bool |
displacy.render method
Render a dependency parse tree or named entity visualization.
Name | Description |
---|---|
docs | Document(s) or span(s) to visualize. Union[Iterable[Union[Doc,Span, dict]],Doc,Span, dict] |
style | Visualization style, "dep" , "ent" or "span" v3.3. Defaults to "dep" . str |
page | Render markup as full HTML page. Defaults to False . bool |
minify | Minify HTML markup. Defaults to False . bool |
options | Visualizer-specific options, e.g. colors. Dict[str, Any] |
manual | Don’t parse Doc and instead expect a dict or list of dicts. See here for formats and examples. Defaults to False . bool |
jupyter | Explicitly enable or disable “Jupyter mode” to return markup ready to be rendered in a notebook. Detected automatically if None (default). Optional[bool] |
RETURNS | The rendered HTML markup. str |
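A minimal sketch of rendering a Doc, assuming a blank English pipeline with one entity set by hand (so no trained pipeline is required). Passing `jupyter=False` makes the call return the markup as a string even inside a notebook.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("But Google is starting from behind.")
# Set an entity manually, since a blank pipeline has no entity recognizer.
doc.ents = [Span(doc, 1, 2, label="ORG")]

html = displacy.render(doc, style="ent", page=False, jupyter=False)
print(html[:80])
```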
displacy.parse_deps method
Generate dependency parse in {'words': [], 'arcs': []} format. For use with
the manual=True argument in displacy.render.
Name | Description |
---|---|
orig_doc | Doc or span to parse dependencies. Union[Doc,Span] |
options | Dependency parse specific visualisation options. Dict[str, Any] |
RETURNS | Generated dependency parse keyed by words and arcs. dict |
displacy.parse_ents method
Generate named entities in [{start: i, end: i, label: 'label'}] format. For
use with the manual=True argument in displacy.render.
Name | Description |
---|---|
doc | Doc to parse entities. Doc |
options | NER-specific visualisation options. Dict[str, Any] |
RETURNS | Generated entities keyed by text (original text) and ents. dict |
displacy.parse_spans method
Generate spans in [{start_token: i, end_token: i, label: 'label'}] format. For
use with the manual=True argument in displacy.render.
Name | Description |
---|---|
doc | Doc to parse spans. Doc |
options | Span-specific visualisation options. Dict[str, Any] |
RETURNS | Generated spans keyed by text (original text), spans and tokens. dict |
Visualizer data structures
You can use displaCy’s data format to manually render data. This can be useful if you want to visualize output from other libraries. You can find examples of displaCy’s different data formats below.
Dependency Visualizer data structure
Dictionary Key | Description |
---|---|
words | List of dictionaries describing a word token (see structure below). List[Dict[str, Any]] |
arcs | List of dictionaries describing the relations between words (see structure below). List[Dict[str, Any]] |
Optional | |
title | Title of the visualization. Optional[str] |
settings | Dependency Visualizer options (see here). Dict[str, Any] |
Dictionary Key | Description |
---|---|
text | Text content of the word. str |
tag | Fine-grained part-of-speech. str |
lemma | Base form of the word. Optional[str] |
Dictionary Key | Description |
---|---|
start | The index of the starting token. int |
end | The index of the ending token. int |
label | The type of dependency relation. str |
dir | Direction of the relation (left , right ). str |
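As an illustration of the format above, the following sketch renders a hand-written parse with `manual=True`. The sentence, tags and arcs are example values only.

```python
from spacy import displacy

dep_data = {
    "words": [
        {"text": "This", "tag": "DT"},
        {"text": "is", "tag": "VBZ"},
        {"text": "a", "tag": "DT"},
        {"text": "sentence", "tag": "NN"},
    ],
    "arcs": [
        # start and end are token indices; dir is the arrow direction.
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "attr", "dir": "right"},
    ],
}

html = displacy.render(dep_data, style="dep", manual=True, jupyter=False)
```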
Named Entity Recognition data structure
Dictionary Key | Description |
---|---|
text | String representation of the document text. str |
ents | List of dictionaries describing entities (see structure below). List[Dict[str, Any]] |
Optional | |
title | Title of the visualization. Optional[str] |
settings | Entity Visualizer options (see here). Dict[str, Any] |
Dictionary Key | Description |
---|---|
start | The index of the first character of the entity. int |
end | The index one past the last character of the entity, i.e. the end offset is not inclusive. int |
label | Label attached to the entity. str |
Optional | |
kb_id | KnowledgeBase ID. str |
kb_url | KnowledgeBase URL. str |
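A small sketch of the entity format, using an example sentence with one hand-annotated entity. The character offsets follow Python slice semantics, so `text[start:end]` gives the entity text.

```python
from spacy import displacy

ent_data = {
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None,
}
entity_text = ent_data["text"][4:10]

html = displacy.render(ent_data, style="ent", manual=True, jupyter=False)
```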
Span Classification data structure
Dictionary Key | Description |
---|---|
text | String representation of the document text. str |
spans | List of dictionaries describing spans (see structure below). List[Dict[str, Any]] |
tokens | List of word tokens. List[str] |
Optional | |
title | Title of the visualization. Optional[str] |
settings | Span Visualizer options (see here). Dict[str, Any] |
Dictionary Key | Description |
---|---|
start_token | The index of the first token of the span in tokens . int |
end_token | The index one past the last token of the span in tokens, i.e. the end index is not inclusive. int |
label | Label attached to the span. str |
Optional | |
kb_id | KnowledgeBase ID. str |
kb_url | KnowledgeBase URL. str |
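A sketch of the span format with two overlapping spans. It assumes spaCy v3.3 or later (where the "span" style is available) and treats `end_token` as exclusive, in the same way as `Span.end`.

```python
from spacy import displacy

span_data = {
    "text": "Welcome to the Bank of China.",
    "tokens": ["Welcome", "to", "the", "Bank", "of", "China", "."],
    "spans": [
        # Covers tokens 3, 4 and 5 ("Bank of China").
        {"start_token": 3, "end_token": 6, "label": "ORG"},
        # Covers token 5 only ("China") and overlaps with the span above.
        {"start_token": 5, "end_token": 6, "label": "GPE"},
    ],
}
org_tokens = span_data["tokens"][3:6]

html = displacy.render(span_data, style="span", manual=True, jupyter=False)
```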
Visualizer options
The options argument lets you specify additional settings for each visualizer.
If a setting is not present in the options, the default value will be used.
Dependency Visualizer options
Name | Description |
---|---|
fine_grained | Use fine-grained part-of-speech tags (Token.tag_ ) instead of coarse-grained tags (Token.pos_ ). Defaults to False . bool |
add_lemma | Print the lemmas in a separate row below the token texts. Defaults to False . bool |
collapse_punct | Attach punctuation to tokens. Can make the parse more readable, as it avoids the long arcs otherwise needed to attach punctuation. Defaults to True . bool |
collapse_phrases | Merge noun phrases into one token. Defaults to False . bool |
compact | “Compact mode” with square arrows that takes up less space. Defaults to False . bool |
color | Text color. Can be provided in any CSS legal format as a string e.g.: "#00ff00" , "rgb(0, 255, 0)" , "hsl(120, 100%, 50%)" and "green" all correspond to the color green (without transparency). Defaults to "#000000" . str |
bg | Background color. Can be provided in any CSS legal format as a string e.g.: "#00ff00" , "rgb(0, 255, 0)" , "hsl(120, 100%, 50%)" and "green" all correspond to the color green (without transparency). Defaults to "#ffffff" . str |
font | Font name or font family for all text. Defaults to "Arial" . str |
offset_x | Spacing on left side of the SVG in px. Defaults to 50 . int |
arrow_stroke | Width of arrow path in px. Defaults to 2 . int |
arrow_width | Width of arrow head in px. Defaults to 10 in regular mode and 8 in compact mode. int |
arrow_spacing | Spacing between arrows in px to avoid overlaps. Defaults to 20 in regular mode and 12 in compact mode. int |
word_spacing | Vertical spacing between words and arcs in px. Defaults to 45 . int |
distance | Distance between words in px. Defaults to 175 in regular mode and 150 in compact mode. int |
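For example, the following sketch passes a few of these options when rendering a hand-written parse. The colors and font are arbitrary example values.

```python
from spacy import displacy

dep_data = {
    "words": [
        {"text": "This", "tag": "DT"},
        {"text": "is", "tag": "VBZ"},
        {"text": "a", "tag": "DT"},
        {"text": "sentence", "tag": "NN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "attr", "dir": "right"},
    ],
}
options = {
    "compact": True,
    "bg": "#09a3d5",
    "color": "white",
    "font": "Source Sans Pro",
}

html = displacy.render(
    dep_data, style="dep", manual=True, options=options, jupyter=False
)
```

Settings that are left out of `options`, such as `distance`, fall back to their defaults.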
Named Entity Visualizer options
Name | Description |
---|---|
ents | Entity types to highlight or None for all types (default). Optional[List[str]] |
colors | Color overrides. Entity types should be mapped to color names or values. Dict[str, str] |
template | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use {bg} , {text} and {label} . See templates.py for examples. Optional[str] |
kb_url_template v3.2.1 | Optional template to construct the KB URL for the entity to link to. Expects a Python format string with a single field to fill in. Optional[str] |
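A brief sketch of the `ents` and `colors` options, using a hand-annotated example and an arbitrary color value:

```python
from spacy import displacy

ent_data = {
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None,
}
# Highlight only ORG entities and override their background color.
options = {"ents": ["ORG"], "colors": {"ORG": "#aa9cfc"}}

html = displacy.render(
    ent_data, style="ent", manual=True, options=options, jupyter=False
)
```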
Span Visualizer options
Name | Description |
---|---|
spans_key | Which spans key to render spans from. Default is "sc" . str |
templates | Dictionary containing the keys "span" , "slice" , and "start" . These dictate how the overall span, a span slice, and the starting token will be rendered. Optional[Dict[str, str]] |
kb_url_template | Optional template to construct the KB URL for the entity to link to. Expects a Python format string with a single field to fill in. Optional[str] |
colors | Color overrides. Entity types should be mapped to color names or values. Dict[str, str] |
By default, displaCy comes with colors for all entity types used by
spaCy’s trained pipelines for both entity and span visualizer. If
you’re using custom entity types, you can use the colors setting to add your
own colors for them. Your application or pipeline package can also expose a
spacy_displacy_colors entry point to add custom labels and their colors
automatically.
By default, displaCy links to # for entities without a kb_id set on their
span. If you wish to link an entity to its URL, consider using the
kb_url_template option from above. For example, if the kb_id on a span is Q95
and this is a Wikidata identifier, then this option can be set to
https://www.wikidata.org/wiki/{}. Clicking on your entity in the rendered HTML
should redirect you to its Wikidata page, in this case
https://www.wikidata.org/wiki/Q95.
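The Wikidata case can be sketched as follows, assuming a blank English pipeline and an entity with its `kb_id` set by hand (spaCy v3.2.1 or later is needed for `kb_url_template`):

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("But Google is starting from behind.")
# Attach the Wikidata identifier to the entity span.
doc.ents = [Span(doc, 1, 2, label="ORG", kb_id="Q95")]

options = {"kb_url_template": "https://www.wikidata.org/wiki/{}"}
parsed = displacy.parse_ents(doc, options)
html = displacy.render(doc, style="ent", options=options, jupyter=False)

print(parsed["ents"][0]["kb_url"])
```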
registry v3.0
spaCy’s function registry extends Thinc’s registry and allows you
to map strings to functions. You can register functions to create architectures,
optimizers, schedules and more, and then refer to them and set their arguments
in your config file. Python type hints are used to validate the inputs. See the
Thinc docs for details on the registry methods and our helper library catalogue
for some background on the concept of function registries. spaCy also uses the
function registry for language subclasses, model architecture, lookups and
pipeline component factories.
Registry name | Description |
---|---|
architectures | Registry for functions that create model architectures. Can be used to register custom model architectures and reference them in the config.cfg . |
augmenters | Registry for functions that create data augmentation callbacks for corpora and other training data iterators. |
batchers | Registry for training and evaluation data batchers. |
callbacks | Registry for custom callbacks to modify the nlp object before training. |
displacy_colors | Registry for custom color scheme for the displacy NER visualizer. Automatically reads from entry points. |
factories | Registry for functions that create pipeline components. Added automatically when you use the @spacy.component decorator and also reads from entry points. |
initializers | Registry for functions that create initializers. |
languages | Registry for language-specific Language subclasses. Automatically reads from entry points. |
layers | Registry for functions that create layers. |
loggers | Registry for functions that log training results. |
lookups | Registry for large lookup tables available via vocab.lookups . |
losses | Registry for functions that create losses. |
misc | Registry for miscellaneous functions that return data assets, knowledge bases or anything else you may need. |
optimizers | Registry for functions that create optimizers. |
readers | Registry for file and data readers, including training and evaluation data readers like Corpus . |
schedules | Registry for functions that create schedules. |
scorers | Registry for functions that create scoring methods for use with the Scorer . Scoring methods are called with Iterable[Example] and arbitrary **kwargs and return scores as Dict[str, Any] . |
tokenizers | Registry for tokenizer factories. Registered functions should return a callback that receives the nlp object and returns a Tokenizer or a custom callable. |
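A minimal sketch of registering a function and then referring to it by its string name. The function name `demo_label_list.v1` and its argument are hypothetical and exist only for this example.

```python
import spacy


@spacy.registry.misc("demo_label_list.v1")
def create_label_list(extra_label: str = "PRODUCT"):
    """Hypothetical asset function returning a list of labels."""
    return ["PERSON", "ORG", extra_label]


# Look the function up by name and call it directly.
func = spacy.registry.misc.get("demo_label_list.v1")
labels = func(extra_label="EVENT")

# The same reference as it would appear in a config block: the "@misc" key
# names the registry, and the remaining keys are passed as arguments.
config = {"labels": {"@misc": "demo_label_list.v1", "extra_label": "EVENT"}}
resolved = spacy.registry.resolve(config)
```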
spacy-transformers registry
The following registries are added by the spacy-transformers package.
See the Transformer API reference and usage docs for details.
Registry name | Description |
---|---|
span_getters | Registry for functions that take a batch of Doc objects and return a list of Span objects to process by the transformer, e.g. sentences. |
annotation_setters | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of Doc objects and a FullTransformerBatch and can set additional annotations on the Doc . |
Loggers v3.0
A logger records the training results. When a logger is created, two functions
are returned: one for logging the information for each training step, and a
second function that is called to finalize the logging when the training is
finished. To log each training step, a dictionary is passed on from
spacy train, including information such as the training loss and the accuracy
scores on the development set.
The built-in, default logger is the ConsoleLogger, which prints results to the
console in tabular format and saves them to a jsonl file. The spacy-loggers
package, included as a dependency of spaCy, enables other loggers, such as one
that sends results to a Weights & Biases dashboard.
Instead of using one of the built-in loggers, you can implement your own.
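A custom logger could look roughly like the sketch below. The registered name `demo_list_logger.v1`, the `records` argument and the `"step"` and `"score"` keys read from the info dictionary are assumptions made for this example. The sketch also assumes that the step function may receive None for steps without an evaluation.

```python
import sys
from typing import IO, Any, Dict, List, Optional

import spacy


@spacy.registry.loggers("demo_list_logger.v1")
def demo_list_logger(records: List[Dict[str, Any]]):
    """Hypothetical logger that collects evaluation results in a list."""

    def setup_logger(nlp, stdout: IO = sys.stdout, stderr: IO = sys.stderr):
        def log_step(info: Optional[Dict[str, Any]]) -> None:
            # Skip steps that carry no evaluation results.
            if info is not None:
                records.append({"step": info["step"], "score": info["score"]})

        def finalize() -> None:
            stdout.write(f"Logged {len(records)} evaluation steps\n")

        # The two functions described above: per-step logging and finalizing.
        return log_step, finalize

    return setup_logger


# Simulate what the training loop would do with the logger.
records: List[Dict[str, Any]] = []
log_step, finalize = demo_list_logger(records)(spacy.blank("en"))
log_step(None)
log_step({"step": 100, "score": 0.5})
finalize()
```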
spacy.ConsoleLogger.v2 registered function
Writes the results of a training step to the console in a tabular format and
saves them to a jsonl file.
Note that the cumulative loss keeps increasing within one epoch, but should start decreasing across epochs.
Name | Description |
---|---|
progress_bar | Whether the logger should print a progress bar tracking the steps till the next evaluation pass (default: False ). bool |
console_output | Whether the logger should print the logs in the console (default: True ). bool |
output_file | The file to save the training logs to (default: None ). Optional[Union[str,Path]] |
spacy.ConsoleLogger.v3 registered function
Writes the results of a training step to the console in a tabular format and
optionally saves them to a jsonl file.
Name | Description |
---|---|
progress_bar | Type of progress bar to show in the console: "train" , "eval" or None . The bar tracks the number of steps until training.max_steps and training.eval_frequency are reached respectively (default: None ). Optional[str] |
console_output | Whether the logger should print the logs in the console (default: True ). bool |
output_file | The file to save the training logs to (default: None ). Optional[Union[str,Path]] |
Readers
File readers v3.0
The following file readers are provided by our serialization library srsly.
All registered functions take one argument path, pointing to the file path to
load.
Name | Description |
---|---|
srsly.read_json.v1 | Read data from a JSON file. |
srsly.read_jsonl.v1 | Read data from a JSONL (newline-delimited JSON) file. |
srsly.read_yaml.v1 | Read data from a YAML file. |
srsly.read_msgpack.v1 | Read data from a binary MessagePack file. |
spacy.read_labels.v1 registered function
Read a JSON-formatted labels file generated with init labels. Typically used
in the [initialize] block of the training config to speed up the model
initialization process and provide pre-generated label sets.
Name | Description |
---|---|
path | The path to the labels file generated with init labels . Path |
require | Whether to require the file to exist. If set to False and the labels file doesn’t exist, the loader will return None and the initialize method will extract the labels from the data. Defaults to False . bool |
CREATES | The list of labels. List[str] |
Corpus readers v3.0
Corpus readers are registered functions that load data and return a function
that takes the current nlp object and yields Example objects that can be used
for training and pretraining. You can replace the built-in readers with your
own registered function in the @readers registry to customize the data loading
and streaming.
spacy.Corpus.v1 registered function
The Corpus reader manages annotated corpora and can be used for training and
development datasets in the DocBin (.spacy) format. Also see the Corpus class.
Name | Description |
---|---|
path | The directory or filename to read from. Expects data in spaCy’s binary .spacy format. Union[str,Path] |
gold_preproc | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See Corpus for details. bool |
max_length | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to 0 for no limit. int |
limit | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. int |
augmenter | Apply some simple data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don’t have smart quotes, or only have smart quotes, etc. Defaults to None . Optional[Callable] |
CREATES | The corpus reader. Corpus |
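The behavior can be tried out directly with the Corpus class. The following sketch writes two example documents to a temporary .spacy file (the file name and texts are arbitrary) and reads them back with a limit of one example.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin
from spacy.training import Corpus

nlp = spacy.blank("en")
docs = [nlp("I like London."), nlp("Berlin is nice too.")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    # Serialize the docs in spaCy's binary DocBin format.
    DocBin(docs=docs).to_disk(path)

    # The corpus is called with the nlp object and yields Example objects.
    corpus = Corpus(path, limit=1)
    examples = list(corpus(nlp))

print(len(examples), examples[0].reference.text)
```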
spacy.JsonlCorpus.v1 registered function
Create Example objects from a JSONL (newline-delimited JSON) file of texts
keyed by "text". Can be used to read the raw text corpus for language model
pretraining from a JSONL file. Also see the JsonlCorpus class.
Name | Description |
---|---|
path | The directory or filename to read from. Expects newline-delimited JSON with a key "text" for each record. Union[str,Path] |
min_length | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to 0 , which indicates no limit. int |
max_length | Maximum document length (in tokens). Longer documents will be skipped. Defaults to 0 , which indicates no limit. int |
limit | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. int |
CREATES | The corpus reader. JsonlCorpus |
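A small sketch using the JsonlCorpus class directly. The file name and texts are arbitrary, and `min_length=3` is chosen so that the two-token text is skipped.

```python
import tempfile
from pathlib import Path

import spacy
import srsly
from spacy.training.corpus import JsonlCorpus

nlp = spacy.blank("en")
records = [{"text": "Hi."}, {"text": "This is a slightly longer text."}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "raw_text.jsonl"
    srsly.write_jsonl(path, records)

    # "Hi." has only two tokens and falls below the minimum length.
    corpus = JsonlCorpus(path, min_length=3)
    texts = [example.reference.text for example in corpus(nlp)]

print(texts)
```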
Batchers v3.0
A data batcher implements a batching strategy that essentially turns a stream of
items into a stream of batches, with each batch consisting of one item or a list
of items. During training, the models update their weights after processing one
batch at a time. Typical batching strategies include presenting the training
data as a stream of batches with similar sizes, or with increasing batch sizes.
See the Thinc documentation on
schedules
for a few standard examples.
Instead of using one of the built-in batchers listed here, you can also implement your own, which may or may not use a custom schedule.
spacy.batch_by_words.v1 registered function
Create minibatches of roughly a given number of words. If any examples are
longer than the specified batch length, they will appear in a batch by
themselves, or be discarded if discard_oversize is set to True. The argument
seqs can be a list of strings, Doc objects or Example objects.
Name | Description |
---|---|
seqs | The sequences to minibatch. Iterable[Any] |
size | The target number of words per batch. Can also be a block referencing a schedule, e.g. compounding . Union[int, Sequence[int]] |
tolerance | What percentage of the size to allow batches to exceed. float |
discard_oversize | Whether to discard sequences that by themselves exceed the tolerated size. bool |
get_length | Optional function that receives a sequence item and returns its length. Defaults to the built-in len() if not set. Optional[Callable[[Any], int]] |
CREATES | The batcher that takes an iterable of items and returns batches. Callable[[Iterable[Any]], Iterable[List[Any]]] |
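The handling of oversized examples can be sketched with the `minibatch_by_words` helper from `spacy.training.batchers`, on the assumption that the registered function wraps this helper. Plain lists of strings stand in for tokenized documents, so the default `len()` gives the number of words.

```python
from spacy.training.batchers import minibatch_by_words

short_a = ["a"] * 3
short_b = ["b"] * 4
oversized = ["c"] * 20  # longer than size plus tolerance (10 + 2 words)
short_c = ["d"] * 5
seqs = [short_a, short_b, oversized, short_c]

# With discard_oversize=False, the oversized item ends up in its own batch.
batches = list(
    minibatch_by_words(seqs, size=10, tolerance=0.2, discard_oversize=False)
)
# With discard_oversize=True, the oversized item is dropped.
kept = list(
    minibatch_by_words(seqs, size=10, tolerance=0.2, discard_oversize=True)
)

print([len(batch) for batch in batches], [len(batch) for batch in kept])
```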
spacy.batch_by_sequence.v1 registered function
Create a batcher that creates batches of the specified size.
Name | Description |
---|---|
size | The target number of items per batch. Can also be a block referencing a schedule, e.g. compounding . Union[int, Sequence[int]] |
get_length | Optional function that receives a sequence item and returns its length. Defaults to the built-in len() if not set. Optional[Callable[[Any], int]] |
CREATES | The batcher that takes an iterable of items and returns batches. Callable[[Iterable[Any]], Iterable[List[Any]]] |
spacy.batch_by_padded.v1 registered function
Minibatch a sequence by the size of padded batches that would result, with sequences binned by length within a window. The padded size is defined as the maximum length of sequences within the batch multiplied by the number of sequences in the batch.
Name | Description |
---|---|
size | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. compounding . Union[int, Sequence[int]] |
buffer | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. int |
discard_oversize | Whether to discard sequences that are by themselves longer than the largest padded batch size. bool |
get_length | Optional function that receives a sequence item and returns its length. Defaults to the built-in len() if not set. Optional[Callable[[Any], int]] |
CREATES | The batcher that takes an iterable of items and returns batches. Callable[[Iterable[Any]], Iterable[List[Any]]] |
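The padded size defined above can be worked through in plain Python. The helper names `padded_size` and `padding_waste` are hypothetical and are not part of spaCy. The example shows why binning sequences by length helps: two batches with the same padded size can contain very different amounts of padding.

```python
from typing import Sequence


def padded_size(batch: Sequence[Sequence[str]]) -> int:
    """Longest sequence length times the number of sequences in the batch."""
    if not batch:
        return 0
    return max(len(seq) for seq in batch) * len(batch)


def padding_waste(batch: Sequence[Sequence[str]]) -> int:
    """Number of padding slots: padded size minus the real item count."""
    return padded_size(batch) - sum(len(seq) for seq in batch)


similar = [["a"] * 9, ["b"] * 10, ["c"] * 10]  # similar lengths
mixed = [["a"] * 2, ["b"] * 3, ["c"] * 10]     # very different lengths

print(padded_size(similar), padding_waste(similar))  # 30 1
print(padded_size(mixed), padding_waste(mixed))      # 30 15
```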
Augmenters v3.0
Data augmentation is the process of applying small modifications to the training data. It can be especially useful for punctuation and case replacement – for example, if your corpus only uses smart quotes and you want to include variations using regular quotes, or to make the model less sensitive to capitalization by including a mix of capitalized and lowercase examples. See the usage guide for details and examples.
spacy.orth_variants.v1 registered function
Create a data augmentation callback that uses orth-variant replacement. The callback can be added to a corpus or other data iterator during training. It’s especially useful for punctuation and case replacement, to help generalize beyond corpora that don’t have smart quotes, or only have smart quotes etc.
Name | Description |
---|---|
level | The percentage of texts that will be augmented. float |
lower | The percentage of texts that will be lowercased. float |
orth_variants | A dictionary containing the single and paired orth variants. Typically loaded from a JSON file. See en_orth_variants.json for an example. Dict[str, Dict[List[Union[str, List[str]]]]] |
CREATES | A function that takes the current nlp object and an Example and yields augmented Example objects. Callable[[Language,Example], Iterator[Example]] |
spacy.lower_case.v1 registered function
Create a data augmentation callback that lowercases documents. The callback can be added to a corpus or other data iterator during training. It’s especially useful for making the model less sensitive to capitalization.
Name | Description |
---|---|
level | The percentage of texts that will be augmented. float |
CREATES | A function that takes the current nlp object and an Example and yields augmented Example objects. Callable[[Language,Example], Iterator[Example]] |
Callbacks v3.0
The config supports callbacks at several points in the lifecycle that can be
used to modify the nlp object.
spacy.copy_from_base_model.v1 registered function
Copy the tokenizer and/or vocab from the specified models. It’s similar to the
v2 base model option and useful in combination with sourced components when
fine-tuning an existing pipeline. The vocab includes the lookups and the vectors
from the specified model. Intended for use in [initialize.before_init].
Name | Description |
---|---|
tokenizer | The pipeline to copy the tokenizer from. Defaults to None . Optional[str] |
vocab | The pipeline to copy the vocab from. The vocab includes the lookups and vectors. Defaults to None . Optional[str] |
CREATES | A function that takes the current nlp object and modifies its tokenizer and vocab . Callable[[Language], None] |
spacy.models_with_nvtx_range.v1 registered function
Recursively wrap the models in each pipe using NVTX range markers. These markers aid in GPU profiling by attributing specific operations to a Model’s forward or backprop passes.
Name | Description |
---|---|
forward_color | Color identifier for forward passes. Defaults to -1 . int |
backprop_color | Color identifier for backpropagation passes. Defaults to -1 . int |
CREATES | A function that takes the current nlp and wraps forward/backprop passes in NVTX ranges. Callable[[Language],Language] |
spacy.models_and_pipes_with_nvtx_range.v1 registered function v3.4
Recursively wrap both the models and methods of each pipe using
NVTX range markers. By default, the following methods are wrapped: pipe,
predict, set_annotations, update, rehearse, get_loss, initialize, begin_update
and finish_update.
Name | Description |
---|---|
forward_color | Color identifier for model forward passes. Defaults to -1 . int |
backprop_color | Color identifier for model backpropagation passes. Defaults to -1 . int |
additional_pipe_functions | Additional pipeline methods to wrap. Keys are pipeline names and values are lists of method identifiers. Defaults to None . Optional[Dict[str, List[str]]] |
CREATES | A function that takes the current nlp and wraps pipe models and methods in NVTX ranges. Callable[[Language],Language] |
Training data and alignment
training.offsets_to_biluo_tags function
Encode labelled spans into per-token tags, using the BILUO scheme (Begin, In,
Last, Unit, Out). Returns a list of strings, describing the tags. Each tag
string will be in the form of either "", "O" or "{action}-{label}", where
action is one of "B", "I", "L", "U". The string "-" is used where the entity
offsets don’t align with the tokenization in the Doc object. The training
algorithm will view these as missing values. O denotes a non-entity token. B
denotes the beginning of a multi-token entity, I the inside of an entity of
three or more tokens, and L the end of an entity of two or more tokens. U
denotes a single-token entity.
Name | Description |
---|---|
doc | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. Doc |
entities | A sequence of (start, end, label) triples. start and end should be character-offset integers denoting the slice into the original string. List[Tuple[int, int, Union[str, int]]] |
missing | The label used for missing values, e.g. if tokenization doesn’t align with the entity offsets. Defaults to "O" . str |
RETURNS | A list of strings, describing the BILUO tags. List[str] |
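For illustration, the following sketch encodes one single-token and one multi-token entity, assuming the default English tokenizer of a blank pipeline:

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")

# A single-token entity is tagged with U (Unit).
doc = nlp("I like London.")
single = offsets_to_biluo_tags(doc, [(7, 13, "LOC")])

# A three-token entity is tagged with B, I and L.
doc2 = nlp("I like New York City.")
multi = offsets_to_biluo_tags(doc2, [(7, 20, "GPE")])

print(single)  # ['O', 'O', 'U-LOC', 'O']
print(multi)   # ['O', 'O', 'B-GPE', 'I-GPE', 'L-GPE', 'O']
```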
training.biluo_tags_to_offsets function
Encode per-token tags following the BILUO scheme into entity offsets.
Name | Description |
---|---|
doc | The document that the BILUO tags refer to. Doc |
tags | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "" , "O" or "{action}-{label}" , where action is one of "B" , "I" , "L" , "U" . List[str] |
RETURNS | A sequence of (start, end, label) triples. start and end will be character-offset integers denoting the slice into the original string. List[Tuple[int, int, str]] |
training.biluo_tags_to_spans function
Encode per-token tags following the BILUO scheme into Span objects. This can be
used to create entity spans from token-based tags, e.g. to overwrite the
doc.ents.
Name | Description |
---|---|
doc | The document that the BILUO tags refer to. Doc |
tags | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "" , "O" or "{action}-{label}" , where action is one of "B" , "I" , "L" , "U" . List[str] |
RETURNS | A sequence of Span objects with added entity labels. List[Span] |
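A short sketch of converting tags back into character offsets and into Span objects, again assuming the default English tokenizer of a blank pipeline:

```python
import spacy
from spacy.training import biluo_tags_to_offsets, biluo_tags_to_spans

nlp = spacy.blank("en")
doc = nlp("I like London.")
tags = ["O", "O", "U-LOC", "O"]

# Character offsets into the original text.
offsets = biluo_tags_to_offsets(doc, tags)

# Span objects, which can be used to overwrite doc.ents.
spans = biluo_tags_to_spans(doc, tags)
doc.ents = spans

print(offsets)  # [(7, 13, 'LOC')]
print([(ent.text, ent.label_) for ent in doc.ents])  # [('London', 'LOC')]
```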
training.biluo_to_iob function
Convert a sequence of BILUO tags to IOB tags. This is useful if you want to use the BILUO tags with a model that only supports IOB tags.
Name | Description |
---|---|
tags | A sequence of BILUO tags. Iterable[str] |
RETURNS | A list of IOB tags. List[str] |
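A short sketch of the tag mapping, with made-up tags: the L tag of a multi-token entity becomes I, and a single-token U tag becomes B.

```python
from spacy.training import biluo_to_iob

# Illustrative tags: a three-token LOC entity followed by a one-token PER entity.
tags = ["O", "B-LOC", "I-LOC", "L-LOC", "O", "U-PER"]
iob_tags = biluo_to_iob(tags)
print(iob_tags)
```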
training.iob_to_biluo function
Convert a sequence of IOB tags to BILUO tags. This is useful if you want to use the IOB tags with a model that only supports BILUO tags.
Name | Description |
---|---|
tags | A sequence of IOB tags. Iterable[str] |
RETURNS | A list of BILUO tags. List[str] |
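The opposite conversion, again with illustrative tags: the last I tag of an entity is rewritten as L.

```python
from spacy.training import iob_to_biluo

# Illustrative tags: a two-token LOC entity.
tags = ["O", "O", "B-LOC", "I-LOC", "O"]
biluo_tags = iob_to_biluo(tags)
print(biluo_tags)
```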
Utility functions
spaCy comes with a small collection of utility functions located in
spacy/util.py. Because utility functions are mostly intended for internal use
within spaCy, their behavior may change with future releases. The functions
documented on this page should be safe to use and we’ll try to ensure
backwards compatibility. However, we recommend having additional tests in
place if your application depends on any of spaCy’s utilities.
util.get_lang_class function
Import and load a Language class. Allows lazy-loading language data and
importing languages using the two-letter language code. To add a language
code for a custom language class, you can register it using the
@registry.languages decorator.
Name | Description |
---|---|
lang | Two-letter language code, e.g. "en" . str |
RETURNS | The respective subclass. Language |
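For instance, a sketch that looks up the English class by its code and instantiates it, which gives a blank pipeline object:

```python
from spacy.util import get_lang_class

# Look up the Language subclass for "en"; the language data is imported on demand.
lang_cls = get_lang_class("en")
nlp = lang_cls()
print(lang_cls.__name__, nlp.lang)
```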
util.lang_class_is_loaded function
Check whether a Language subclass is already loaded. Language subclasses are
loaded lazily to avoid expensive setup code associated with the language data.
Name | Description |
---|---|
name | Two-letter language code, e.g. "en" . str |
RETURNS | Whether the class has been loaded. bool |
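A small sketch of the check. It assumes that calling util.get_lang_class first is what triggers the lazy load for "en".

```python
from spacy.util import get_lang_class, lang_class_is_loaded

# Force the English class to be imported, then check that it is registered.
get_lang_class("en")
english_loaded = lang_class_is_loaded("en")
print(english_loaded)
```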
util.load_model function
Load a pipeline from a package or data path. If called with a string name,
spaCy will assume the pipeline is a Python package and import and call its
load() method. If called with a path, spaCy will assume it’s a data directory,
read the language and pipeline settings from the config.cfg and create a
Language object. The model data will then be loaded in via Language.from_disk.
Name | Description |
---|---|
name | Package name or path. str |
keyword-only | |
vocab | Optional shared vocab to pass in on initialization. If True (default), a new Vocab object will be created. Union[Vocab, bool] |
disable | Name(s) of pipeline component(s) to disable. Disabled pipes will be loaded but they won’t be run unless you explicitly enable them by calling nlp.enable_pipe . Union[str, Iterable[str]] |
enable v3.4 | Name(s) of pipeline component(s) to enable. All other pipes will be disabled, but can be enabled again using nlp.enable_pipe . Union[str, Iterable[str]] |
exclude | Name(s) of pipeline component(s) to exclude. Excluded components won’t be loaded. Union[str, Iterable[str]] |
config v3.0 | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. "nlp.pipeline" . Union[Dict[str, Any],Config] |
RETURNS | A Language object with the loaded pipeline. Language |
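The difference between disable and exclude can be sketched with a throwaway pipeline saved to a temporary directory. The directory, the "sentencizer" component and the variable names are assumptions for the example; no installed pipeline package is required.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import util

# Build a tiny pipeline and save it, so there is a data directory to load from.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = Path(tmp_dir)
    nlp.to_disk(model_dir)

    # Load from a path: config.cfg is read and the data is loaded from disk.
    loaded = util.load_model(model_dir)
    # Disabled components are loaded but not run.
    disabled = util.load_model(model_dir, disable=["sentencizer"])
    # Excluded components are not loaded at all.
    excluded = util.load_model(model_dir, exclude=["sentencizer"])

print(loaded.pipe_names)
print(disabled.pipe_names, disabled.component_names)
print(excluded.component_names)
```

Here pipe_names lists only the active components, while component_names also includes disabled ones, so the disabled component still shows up in the second pipeline but not in the third.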
util.load_model_from_init_py function
A helper function to use in the load() method of a pipeline package’s
__init__.py.
Name | Description |
---|---|
init_file | Path to package’s __init__.py , i.e. __file__ . Union[str,Path] |
keyword-only | |
vocab v3.0 | Optional shared vocab to pass in on initialization. If True (default), a new Vocab object will be created. Union[Vocab, bool] |
disable | Name(s) of pipeline component(s) to disable. Disabled pipes will be loaded but they won’t be run unless you explicitly enable them by calling nlp.enable_pipe . Union[str, Iterable[str]] |
enable v3.4 | Name(s) of pipeline component(s) to enable. All other pipes will be disabled, but can be enabled again using nlp.enable_pipe . Union[str, Iterable[str]] |
exclude v3.0 | Name(s) of pipeline component(s) to exclude. Excluded components won’t be loaded. Union[str, Iterable[str]] |
config v3.0 | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. "nlp.pipeline" . Union[Dict[str, Any],Config] |
RETURNS | A Language object with the loaded pipeline. Language |
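A sketch of how a pipeline package’s __init__.py might use this helper. The load function and its **overrides parameter are the package’s own code and are shown only as a plausible shape; the snippet is meant to live inside a package, where __file__ points at that package’s __init__.py.

```python
from spacy.util import load_model_from_init_py


def load(**overrides):
    # Forward keyword arguments such as disable, exclude or config to the helper.
    return load_model_from_init_py(__file__, **overrides)
```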
util.load_config function v3.0
Load a pipeline’s config.cfg from a file path. The config typically includes
details about the components and how they’re created, as well as all training
settings and hyperparameters.
Name | Description |
---|---|
path | Path to the pipeline’s config.cfg . Union[str,Path] |
overrides | Optional config overrides to replace in loaded config. Can be provided as nested dict, or as flat dict with keys in dot notation, e.g. "nlp.pipeline" . Dict[str, Any] |
interpolate | Whether to interpolate the config and replace variables like ${paths.train} with their values. Defaults to False . bool |
RETURNS | The pipeline’s config. Config |
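The effect of overrides and interpolate can be sketched with a small config written to a temporary file. The config contents and paths below are invented for the example and are far smaller than a real pipeline config.

```python
import tempfile
from pathlib import Path

from spacy import util

# A deliberately tiny, hypothetical config with one variable reference.
CONFIG_TEXT = """
[paths]
train = "corpus/train.spacy"

[nlp]
lang = "en"
pipeline = []

[corpora]
train_path = ${paths.train}
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.cfg"
    config_path.write_text(CONFIG_TEXT, encoding="utf8")

    # Default: variables are kept as written.
    raw = util.load_config(config_path)
    # Override one value in dot notation and resolve the variables.
    resolved = util.load_config(
        config_path, overrides={"nlp.lang": "de"}, interpolate=True
    )

print(raw["corpora"]["train_path"])
print(resolved["corpora"]["train_path"], resolved["nlp"]["lang"])
```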
util.load_meta function v3.0
Get a pipeline’s meta.json from a file path and validate its contents. The
meta typically includes details about author, licensing, data sources and
version.
Name | Description |
---|---|
path | Path to the pipeline’s meta.json . Union[str,Path] |
RETURNS | The pipeline’s meta data. Dict[str, Any] |
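As a sketch, the snippet below saves a blank pipeline to a temporary directory so that a meta.json exists, then reads it back. The temporary directory is an assumption of the example; in practice the path would point into an existing pipeline directory.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import util

nlp = spacy.blank("en")

with tempfile.TemporaryDirectory() as tmp_dir:
    # Saving a pipeline writes meta.json next to config.cfg.
    nlp.to_disk(tmp_dir)
    meta = util.load_meta(Path(tmp_dir) / "meta.json")

print(meta["lang"], meta["name"], meta["version"])
```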
util.get_installed_models function v3.0
List all pipeline packages installed in the current environment. This will
include any spaCy pipeline that was packaged with spacy package. Under the
hood, pipeline packages expose a Python entry point that spaCy can check,
without having to load the nlp object.
Name | Description |
---|---|
RETURNS | The string names of the pipelines installed in the current environment. List[str] |
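A minimal usage sketch. The result depends entirely on the environment, so the list may well be empty.

```python
from spacy import util

# Names come from package entry points; no pipeline is loaded here.
installed = util.get_installed_models()
for name in installed:
    print(name)
```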
util.is_package function
Check if a string maps to a package installed via pip. Mainly used to validate pipeline packages.
Name | Description |
---|---|
name | Name of package. str |
RETURNS | True if installed package, False if not. bool |
util.get_package_path function
Get path to an installed package. Mainly used to resolve the location of pipeline packages. Currently imports the package to find its path.
Name | Description |
---|---|
package_name | Name of installed package. str |
RETURNS | Path to pipeline package directory. Path |
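Both package helpers can be tried without a pipeline package installed. The sketch below uses the spacy package itself as a stand-in and a made-up name for the negative case; with a real pipeline you would pass its package name instead.

```python
from spacy import util

# "spacy" stands in for a pipeline package; the second name is deliberately bogus.
found = util.is_package("spacy")
missing = util.is_package("surely_not_an_installed_package_xyz")

# Importing the package is how the path is resolved.
package_dir = util.get_package_path("spacy")
print(found, missing, package_dir)
```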
util.is_in_jupyter function
Check if the user is running spaCy from a Jupyter notebook by detecting the
IPython kernel. Mainly used for the displacy visualizer.
Name | Description |
---|---|
RETURNS | True if in Jupyter, False if not. bool |
util.compile_prefix_regex function
Compile a sequence of prefix rules into a regex object.
Name | Description |
---|---|
entries | The prefix rules, e.g. lang.punctuation.TOKENIZER_PREFIXES . Iterable[Union[str,Pattern]] |
RETURNS | The regex object to be used for Tokenizer.prefix_search . Pattern |
util.compile_suffix_regex function
Compile a sequence of suffix rules into a regex object.
Name | Description |
---|---|
entries | The suffix rules, e.g. lang.punctuation.TOKENIZER_SUFFIXES . Iterable[Union[str,Pattern]] |
RETURNS | The regex object to be used for Tokenizer.suffix_search . Pattern |
util.compile_infix_regex function
Compile a sequence of infix rules into a regex object.
Name | Description |
---|---|
entries | The infix rules, e.g. lang.punctuation.TOKENIZER_INFIXES . Iterable[Union[str,Pattern]] |
RETURNS | The regex object to be used for Tokenizer.infix_finditer . Pattern |
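The three compile helpers are typically used together when customizing a tokenizer. The sketch below uses tiny, made-up rule lists rather than the language defaults and replaces all three tokenizer rules of a blank English pipeline with them.

```python
import spacy
from spacy.util import (
    compile_infix_regex,
    compile_prefix_regex,
    compile_suffix_regex,
)

# Hypothetical rule sets; real pipelines extend the defaults of their language.
prefix_re = compile_prefix_regex(["§", r"\$"])
suffix_re = compile_suffix_regex(["%", "km"])
infix_re = compile_infix_regex([r"/", r"~"])

# Prefix rules match at the start, suffix rules at the end, infix rules anywhere.
prefix_match = prefix_re.search("$100").group()
suffix_match = suffix_re.search("50km").group()
infix_matches = [m.group() for m in infix_re.finditer("and/or")]

# Plug the compiled patterns into the tokenizer of a blank pipeline.
nlp = spacy.blank("en")
nlp.tokenizer.prefix_search = prefix_re.search
nlp.tokenizer.suffix_search = suffix_re.search
nlp.tokenizer.infix_finditer = infix_re.finditer

doc = nlp("§12 and/or 50km")
print([token.text for token in doc])
```

Replacing the defaults wholesale, as done here, drops the language’s usual punctuation handling; appending to the existing rule lists is the gentler option.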
util.minibatch function
Iterate over batches of items. size may be an iterator, so that batch-size can
vary on each step.
Name | Description |
---|---|
items | The items to batch up. Iterable[Any] |
size | The batch size(s). Union[int, Sequence[int]] |
YIELDS | The batches. |
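A sketch of both modes: a fixed integer size, and a size that changes per step. The varying sizes are passed as an endless iterator built with itertools, which is an assumption of this example so that the size source cannot run out before the items do.

```python
import itertools

from spacy.util import minibatch

items = list(range(10))

# Fixed batch size: the last batch holds whatever is left over.
fixed = list(minibatch(items, size=4))

# Varying batch size: 1, then 2, then 4 for every remaining step.
sizes = itertools.chain([1, 2], itertools.repeat(4))
varying = list(minibatch(items, size=sizes))

print(fixed)
print(varying)
```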
util.filter_spans function
Filter a sequence of Span objects and remove duplicates or overlaps. Useful
for creating named entities (where one token can only be part of one entity)
or when merging spans with Retokenizer.merge. When spans overlap, the (first)
longest span is preferred over shorter spans.
Name | Description |
---|---|
spans | The spans to filter. Iterable[Span] |
RETURNS | The filtered spans. List[Span] |
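The preference for the longest span can be sketched with a few overlapping slices of one doc. The sentence and the chosen slices are illustrative.

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("The New York City subway is busy")

spans = [
    doc[1:3],  # "New York"
    doc[1:4],  # "New York City" (longest, wins over its overlaps)
    doc[3:5],  # "City subway" (overlaps the longest span)
    doc[1:3],  # duplicate of the first span
    doc[6:7],  # "busy" (no overlap, kept)
]
filtered = filter_spans(spans)
print([span.text for span in filtered])
```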
util.get_words_and_spaces function v3.0
Given a list of words and a text, reconstruct the original tokens and return a
list of words and spaces that can be used to create a Doc. This can help
recover destructive tokenization that didn’t preserve any whitespace
information.
Name | Description |
---|---|
words | The list of words. Iterable[str] |
text | The original text. str |
RETURNS | A list of words and a list of boolean values indicating whether the word at this position is followed by a space. Tuple[List[str], List[bool]] |
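A sketch of recovering whitespace for an externally tokenized text and building a Doc from the result. The word list and text are made up, and a fresh Vocab is used only to keep the example self-contained.

```python
from spacy.tokens import Doc
from spacy.util import get_words_and_spaces
from spacy.vocab import Vocab

# Tokens from some other tokenizer that did not record whitespace.
words = ["Hello", ",", "world", "!"]
text = "Hello, world!"

aligned_words, spaces = get_words_and_spaces(words, text)
doc = Doc(Vocab(), words=aligned_words, spaces=spaces)
print(aligned_words, spaces)
print(doc.text)
```

Because the spaces list records which tokens are followed by a space, doc.text reproduces the original string.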