Language

A text processing pipeline.

Attributes

Name       Type               Description
vocab      Vocab              A container for the lexical types.
tokenizer  Tokenizer          Find word boundaries and create a Doc object.
tagger     Tagger             Annotate Doc objects with POS tags.
parser     DependencyParser   Annotate Doc objects with syntactic dependencies.
entity     EntityRecognizer   Annotate Doc objects with named entities.
matcher    Matcher            Rule-based sequence matcher.
make_doc   lambda text: Doc   Create a Doc object from unicode text.
pipeline   -                  Sequence of annotation functions.
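
The relationship between make_doc and pipeline can be illustrated without spaCy itself. The sketch below is a toy stand-in, not spaCy's implementation: ToyDoc, toy_make_doc, toy_tagger, toy_length_counter and run_pipeline are hypothetical names, and it assumes that each annotation function takes a Doc-like object and modifies it in place.

```python
class ToyDoc:
    """Hypothetical stand-in for Doc: a list of words plus annotations."""

    def __init__(self, words):
        self.words = words
        self.annotations = {}


def toy_make_doc(text):
    # Plays the role of make_doc: turn raw text into a Doc-like object.
    return ToyDoc(text.split())


def toy_tagger(doc):
    # Plays the role of one annotation function: annotate the doc in place.
    doc.annotations["tags"] = ["UPPER" if w.isupper() else "OTHER" for w in doc.words]


def toy_length_counter(doc):
    # A second annotation function, to show that order is preserved.
    doc.annotations["n_words"] = len(doc.words)


def run_pipeline(text, make_doc, pipeline):
    # Create the doc once, then apply every annotation function in sequence.
    doc = make_doc(text)
    for annotate in pipeline:
        annotate(doc)
    return doc


doc = run_pipeline("NASA launched a rocket", toy_make_doc, [toy_tagger, toy_length_counter])
print(doc.words)
print(doc.annotations)
```

Under this assumption, adding a processing step amounts to placing another callable in the sequence.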

Language.__init__

Create or load the pipeline.

Name         Type       Description
**overrides  -          Keyword arguments indicating which defaults to override.
return       Language   The newly constructed object.
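
One way to picture keyword overrides is a constructor that builds each component from a default factory unless the caller supplies a replacement. This is a minimal sketch of that general pattern, not spaCy's actual constructor: ToyLanguage, default_factories and the string placeholders are all hypothetical.

```python
class ToyLanguage:
    """Hypothetical class showing defaults that keyword arguments can override."""

    # Each entry builds the default value for one attribute.
    default_factories = {
        "tokenizer": lambda: "default-tokenizer",
        "tagger": lambda: "default-tagger",
    }

    def __init__(self, **overrides):
        for name, factory in self.default_factories.items():
            if name in overrides:
                # The caller supplied a replacement: use it as given.
                value = overrides[name]
            else:
                # No override: fall back to the default factory.
                value = factory()
            setattr(self, name, value)


plain = ToyLanguage()
custom = ToyLanguage(tagger="my-tagger")
print(plain.tagger, custom.tagger, custom.tokenizer)
```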

Language.__call__

Apply the pipeline to a single text.

Name    Type      Description
text    unicode   The text to be processed.
tag     bool      Whether to apply the part-of-speech tagger.
parse   bool      Whether to apply the syntactic dependency parser.
entity  bool      Whether to apply the named entity recognizer.
return  Doc       A container for accessing the linguistic annotations.
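
The effect of the three boolean flags can be sketched with a toy function that records which steps ran. This is an illustration only, not spaCy's code: apply_pipeline and the dictionary standing in for a Doc are hypothetical, and the default value of True for each flag is an assumption, since the table above does not state the defaults.

```python
def apply_pipeline(text, tag=True, parse=True, entity=True):
    # A plain dict stands in for the Doc; "applied" records the steps that ran.
    doc = {"text": text, "applied": []}
    # Pair each (hypothetical) step name with the flag that controls it.
    steps = [("tagger", tag), ("parser", parse), ("entity", entity)]
    for name, enabled in steps:
        if enabled:
            doc["applied"].append(name)
    return doc


full = apply_pipeline("Some text")
no_parse = apply_pipeline("Some text", parse=False)
print(full["applied"])
print(no_parse["applied"])
```

Disabling a step that is not needed skips its work for that text while leaving the remaining steps in their usual order.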

Language.pipe

Process texts as a stream, and yield Doc objects in order. Supports GIL-free multi-threading.

Name        Type   Description
texts       -      A sequence of unicode objects.
n_threads   int    The number of worker threads to use. If -1, OpenMP will decide how many to use at run time. Default is 2.
batch_size  int    The number of texts to buffer.
yield       Doc    Containers for accessing the linguistic annotations.
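
The buffering behaviour described by batch_size can be sketched with a plain generator. This is a single-threaded toy, not spaCy's implementation, and it does not model n_threads: toy_pipe, toy_process_batch and seen_batches are hypothetical names, and upper-casing stands in for real annotation.

```python
from itertools import islice

seen_batches = []  # Records each buffered batch, for illustration only.


def toy_process_batch(batch):
    # Stand-in for annotating one batch of texts.
    seen_batches.append(list(batch))
    return [text.upper() for text in batch]


def toy_pipe(texts, process_batch, batch_size):
    # Buffer up to batch_size texts at a time and yield results in input order.
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        for result in process_batch(batch):
            yield result


results = list(toy_pipe(["a", "b", "c"], toy_process_batch, batch_size=2))
print(results)
print(seen_batches)
```

Because toy_pipe is a generator, it only pulls as many texts from the input as it needs for the current batch, so the input can itself be a lazy stream.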

Language.save_to_directory

Save the Vocab, StringStore and pipeline to a directory.

Name    Type                     Description
path    string or pathlib.Path   Path to save the model.
return  None                     -