
TextCategorizer
class
New in spaCy v2.0.
Add text categorization models to spaCy pipelines.

The model supports classification with multiple, non-mutually exclusive labels. The architecture can be changed, but by default the TextCategorizer uses a convolutional neural network (CNN) to assign position-sensitive vectors to each word in the document. It uses its own CNN rather than sharing weights with the other pipeline components. The resulting document tensor is summarized by concatenating max and mean pooling, and a multilayer perceptron then predicts an output vector of length nr_class, to which a logistic activation is applied elementwise. Each output value is the probability that the corresponding class is present.
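The pooling and output steps described above can be sketched in plain Python. This is only an illustration, not spaCy's implementation: the helper names (`pool_max_mean`, `logistic`, `predict_labels`), the toy token vectors, and the single linear layer standing in for the multilayer perceptron are all assumptions made for the example.

```python
import math


def pool_max_mean(token_vectors):
    """Summarize per-token vectors by concatenating max and mean pooling."""
    width = len(token_vectors[0])
    max_pool = [max(vec[i] for vec in token_vectors) for i in range(width)]
    mean_pool = [sum(vec[i] for vec in token_vectors) / len(token_vectors)
                 for i in range(width)]
    return max_pool + mean_pool


def logistic(x):
    """Logistic (sigmoid) activation for a single value."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(token_vectors, weights, biases):
    """Hypothetical single linear layer standing in for the MLP.

    weights has one row per class; each row has 2 * width entries.
    """
    summary = pool_max_mean(token_vectors)
    logits = [sum(w * s for w, s in zip(row, summary)) + b
              for row, b in zip(weights, biases)]
    # Each class gets an independent probability; they need not sum to 1.
    return [logistic(z) for z in logits]


# Toy document: three tokens, each with a 2-dimensional vector.
doc_tensor = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
summary = pool_max_mean(doc_tensor)

# Two classes (nr_class = 2); weights are arbitrary illustration values.
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
biases = [0.0, 0.0]
scores = predict_labels(doc_tensor, weights, biases)
```

Because the logistic function is applied to each output separately, both classes can receive a high score at the same time, which is what makes the labels non-mutually exclusive.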

TextCategorizer.Model
classmethod

Initialise a model for the pipe. The model should implement the thinc.neural.Model API. Wrappers are under development for most major machine learning libraries.

Name | Type | Description
**kwargs | - | Parameters for initialising the model.
returns | object | The initialised model.

TextCategorizer.__init__
method

Create a new pipeline instance.

Name | Type | Description
vocab | Vocab | The shared vocabulary.
model | thinc.neural.Model or True | The model powering the pipeline component. If no model is supplied, the model is created when you call begin_training, from_disk or from_bytes.
**cfg | - | Configuration parameters.
returns | TextCategorizer | The newly constructed object.

TextCategorizer.__call__
method

Apply the pipe to one document. The document is modified in place, and returned. Both TextCategorizer.__call__ and TextCategorizer.pipe should delegate to the TextCategorizer.predict and TextCategorizer.set_annotations methods.

Name | Type | Description
doc | Doc | The document to process.
returns | Doc | The processed document.

TextCategorizer.pipe
method

Apply the pipe to a stream of documents. Both TextCategorizer.__call__ and TextCategorizer.pipe should delegate to the TextCategorizer.predict and TextCategorizer.set_annotations methods.

Name | Type | Description
stream | iterable | A stream of documents.
batch_size | int | The number of texts to buffer. Defaults to 128.
n_threads | int | The number of worker threads to use. If -1, OpenMP will decide how many to use at run time. Default is -1.
yields | Doc | Processed documents in the order of the original text.
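The delegation pattern described for __call__ and pipe can be sketched with a toy component. Everything here is hypothetical and independent of spaCy: `ToyPipe`, the dict-based documents, and the "LONG" label exist only to show how both entry points route through predict and set_annotations, and how pipe buffers batch_size documents while preserving order.

```python
from itertools import islice


class ToyPipe:
    """Hypothetical stand-in for a pipeline component; not spaCy code."""

    def predict(self, docs):
        # Compute scores without modifying the docs.
        return [{"LONG": float(len(doc["text"]) > 10)} for doc in docs]

    def set_annotations(self, docs, scores):
        # Write the precomputed scores onto each doc.
        for doc, doc_scores in zip(docs, scores):
            doc["cats"] = doc_scores

    def __call__(self, doc):
        # A single doc is handled as a batch of one.
        scores = self.predict([doc])
        self.set_annotations([doc], scores)
        return doc

    def pipe(self, stream, batch_size=128):
        stream = iter(stream)
        while True:
            batch = list(islice(stream, batch_size))
            if not batch:
                break
            scores = self.predict(batch)
            self.set_annotations(batch, scores)
            # Yield docs in the order they arrived.
            for doc in batch:
                yield doc


toy = ToyPipe()
docs = [{"text": "short"}, {"text": "a much longer text"}, {"text": "tiny"}]
processed = list(toy.pipe(docs, batch_size=2))
single = toy({"text": "hello"})
```

Keeping the scoring in predict and the side effects in set_annotations means the two entry points cannot drift apart, since neither contains model logic of its own.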

TextCategorizer.predict
method

Apply the pipeline's model to a batch of docs, without modifying them.

Name | Type | Description
docs | iterable | The documents to predict.
returns | - | Scores from the model.

TextCategorizer.set_annotations
method

Modify a batch of documents, using pre-computed scores.

Name | Type | Description
docs | iterable | The documents to modify.
scores | - | The scores to set, produced by TextCategorizer.predict.

TextCategorizer.update
method

Learn from a batch of documents and gold-standard information, updating the pipe's model. Delegates to TextCategorizer.predict and TextCategorizer.get_loss.

Name | Type | Description
docs | iterable | A batch of documents to learn from.
golds | iterable | The gold-standard data. Must have the same length as docs.
drop | float | The dropout rate.
sgd | callable | The optimizer. Should take two arguments, weights and gradient, and an optional ID.
losses | dict | Optional record of the loss during training. The value keyed by the model's name is updated.
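To make the shapes of the sgd and losses arguments concrete, here is a minimal sketch in plain Python. It is not spaCy's or Thinc's optimizer: `make_toy_sgd`, `record_loss`, the `key` parameter name, and the choice to accumulate losses by addition are assumptions for illustration.

```python
def make_toy_sgd(learn_rate=0.1):
    """Return a hypothetical optimizer with the call shape described above."""
    def sgd(weights, gradient, key=None):
        # Plain gradient descent, updating the weights list in place.
        # The optional key (ID) is accepted but unused in this sketch.
        for i, g in enumerate(gradient):
            weights[i] -= learn_rate * g
    return sgd


def record_loss(losses, name, loss):
    """Add a batch loss under the model's name, if a dict was passed."""
    if losses is not None:
        losses.setdefault(name, 0.0)
        losses[name] += loss


weights = [1.0, -1.0]
sgd = make_toy_sgd(learn_rate=0.5)
sgd(weights, [2.0, -2.0], key="textcat-layer-0")

losses = {}
record_loss(losses, "textcat", 0.25)
record_loss(losses, "textcat", 0.5)
```

Passing the same losses dict to every update call lets a training loop read a running total per component after each epoch.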

TextCategorizer.get_loss
method

Find the loss and gradient of loss for the batch of documents and their predicted scores.

Name | Type | Description
docs | iterable | The batch of documents.
golds | iterable | The gold-standard data. Must have the same length as docs.
scores | - | Scores representing the model's predictions.
returns | tuple | The loss and the gradient, i.e. (loss, gradient).
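The (loss, gradient) return shape can be illustrated with a small example. The text above does not say which objective is used, so this sketch assumes mean squared error purely as an example; `toy_get_loss` and the nested-list score format are hypothetical.

```python
def toy_get_loss(scores, truths):
    """Mean squared error over a batch, plus its gradient w.r.t. the scores.

    Assumption: squared error is used here only as an example objective.
    """
    n_docs = len(scores)
    loss = 0.0
    gradient = []
    for score_row, truth_row in zip(scores, truths):
        diffs = [s - t for s, t in zip(score_row, truth_row)]
        loss += sum(d * d for d in diffs)
        # d/ds of (s - t)^2 / n_docs is 2 * (s - t) / n_docs.
        gradient.append([2.0 * d / n_docs for d in diffs])
    return loss / n_docs, gradient


# Two docs, two labels each; the second doc is predicted perfectly.
scores = [[0.5, 0.5], [1.0, 0.0]]
truths = [[1.0, 0.0], [1.0, 0.0]]
loss, gradient = toy_get_loss(scores, truths)
```

The gradient has the same shape as the scores, so it can be passed straight back through the model during an update.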

TextCategorizer.begin_training
method

Initialise the pipe for training, using data examples if available. If no model has been initialised yet, the model is added.

Name | Type | Description
gold_tuples | iterable | Optional gold-standard annotations from which to construct GoldParse objects.
pipeline | list | Optional list of Pipe components that this component is part of.
sgd | callable | An optional optimizer. Should take two arguments, weights and gradient, and an optional ID. Will be created via create_optimizer if not set.
returns | callable | An optimizer.

TextCategorizer.create_optimizer
method

Create an optimizer for the pipeline component.

Name | Type | Description
returns | callable | The optimizer.

TextCategorizer.use_params
method
contextmanager

Modify the pipe's model, to use the given parameter values.

Name | Type | Description
params | - | The parameter values to use in the model. At the end of the context, the original parameters are restored.
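The swap-and-restore behaviour of a context manager like this can be sketched with the standard library. `ToyModel` and its dict of parameters are hypothetical; the sketch only shows the general pattern, not how spaCy stores or swaps weights.

```python
from contextlib import contextmanager


class ToyModel:
    """Hypothetical model holding its parameters in a dict."""

    def __init__(self, params):
        self.params = dict(params)

    @contextmanager
    def use_params(self, params):
        backup = dict(self.params)
        self.params.update(params)
        try:
            yield
        finally:
            # Restore the original values even if the body raised.
            self.params = backup


model = ToyModel({"W": 1.0, "b": 0.0})
with model.use_params({"W": 0.5}):
    inside = dict(model.params)
after = dict(model.params)
```

A pattern like this is useful when a temporary set of values, for instance a different set of weights for evaluation, should apply only for the duration of a block.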

TextCategorizer.add_label
method

Add a new label to the pipe.

Name | Type | Description
label | unicode | The label to add.

TextCategorizer.to_disk
method

Serialize the pipe to disk.

Name | Type | Description
path | unicode or Path | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects.

TextCategorizer.from_disk
method

Load the pipe from disk. Modifies the object in place and returns it.

Name | Type | Description
path | unicode or Path | A path to a directory. Paths may be either strings or Path-like objects.
returns | TextCategorizer | The modified TextCategorizer object.

TextCategorizer.to_bytes
method

Serialize the pipe to a bytestring.

Name | Type | Description
**exclude | - | Named attributes to prevent from being serialized.
returns | bytes | The serialized form of the TextCategorizer object.

TextCategorizer.from_bytes
method

Load the pipe from a bytestring. Modifies the object in place and returns it.

Name | Type | Description
bytes_data | bytes | The data to load from.
**exclude | - | Named attributes to prevent from being loaded.
returns | TextCategorizer | The TextCategorizer object.
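The byte round trip and the effect of **exclude can be illustrated with a toy class. This is a sketch under stated assumptions, not spaCy's serialization format: `ToySerializable`, its `labels` and `cfg` attributes, the JSON encoding, and the convention of passing an excluded name as a keyword such as `cfg=True` are all hypothetical.

```python
import json


class ToySerializable:
    """Hypothetical object with two serializable attributes."""

    def __init__(self, labels=None, cfg=None):
        self.labels = list(labels or [])
        self.cfg = dict(cfg or {})

    def to_bytes(self, **exclude):
        data = {"labels": self.labels, "cfg": self.cfg}
        # Skip any attribute named in exclude.
        kept = {k: v for k, v in data.items() if k not in exclude}
        return json.dumps(kept, sort_keys=True).encode("utf8")

    def from_bytes(self, bytes_data, **exclude):
        data = json.loads(bytes_data.decode("utf8"))
        for key, value in data.items():
            if key not in exclude:
                setattr(self, key, value)
        # Modify in place and return self, as described above.
        return self


original = ToySerializable(["POSITIVE"], {"threshold": 0.5})
restored = ToySerializable().from_bytes(original.to_bytes())
partial = ToySerializable().from_bytes(original.to_bytes(), cfg=True)
```

Returning the object from from_bytes allows construction and loading to be chained in one expression, as in the last two lines.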