Tokenizer

Segment text, and create Doc objects with the discovered segment boundaries.

Attributes

vocab (Vocab): The vocab object of the parent Doc.
prefix_search (callable): A function to find segment boundaries from the start of a string. Returns the length of the segment, or None.
suffix_search (callable): A function to find segment boundaries from the end of a string. Returns the length of the segment, or None.
infix_finditer (callable): A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of re.MatchObject objects.

Tokenizer.load

Load a Tokenizer, reading unsupplied components from the path.

path (Path): The path to load from.
vocab (Vocab): A storage container for lexical types.
rules (dict): Exceptions and special cases for the tokenizer.
prefix_search (callable): A function matching the signature of re.compile(string).search, to match prefixes.
suffix_search (callable): A function matching the signature of re.compile(string).search, to match suffixes.
infix_finditer (callable): A function matching the signature of re.compile(string).finditer, to find infixes.
Returns (Tokenizer): The newly constructed object.
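A minimal sketch of loading a tokenizer, based on the parameters listed above. The model directory path is a placeholder; any component not supplied as an argument is read from that path.

    from pathlib import Path
    from spacy.vocab import Vocab
    from spacy.tokenizer import Tokenizer

    vocab = Vocab()
    # '/path/to/model' is a hypothetical directory containing tokenizer data
    tokenizer = Tokenizer.load(Path('/path/to/model'), vocab)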

Tokenizer.__init__

Create a Tokenizer, which produces Doc objects from unicode text.

vocab (Vocab): A storage container for lexical types.
rules (dict): Exceptions and special cases for the tokenizer.
prefix_search (callable): A function matching the signature of re.compile(string).search, to match prefixes.
suffix_search (callable): A function matching the signature of re.compile(string).search, to match suffixes.
infix_finditer (callable): A function matching the signature of re.compile(string).finditer, to find infixes.
Returns (Tokenizer): The newly constructed object.
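For example, a tokenizer can be built from compiled regular expressions. The patterns below are illustrative only: they strip a leading or trailing quote or bracket and split on hyphens.

    import re
    from spacy.vocab import Vocab
    from spacy.tokenizer import Tokenizer

    # Illustrative patterns, not the defaults shipped with any particular model
    prefix_re = re.compile(r'''^[\["']''')
    suffix_re = re.compile(r'''[\]"']$''')
    infix_re = re.compile(r'''[-~]''')

    tokenizer = Tokenizer(Vocab(),
                          rules={},
                          prefix_search=prefix_re.search,
                          suffix_search=suffix_re.search,
                          infix_finditer=infix_re.finditer)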

Tokenizer.__call__

Tokenize a string.

string (unicode): The string to tokenize.
Returns (Doc): A container for linguistic annotations.
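A short usage sketch, continuing from the tokenizer constructed above:

    doc = tokenizer("Hello, world!")
    print([token.text for token in doc])  # token strings produced by the custom rules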

Tokenizer.pipe

Tokenize a stream of texts.

texts: A sequence of unicode texts.
batch_size (int): The number of texts to accumulate in an internal buffer.
n_threads (int): The number of threads to use, if the implementation supports multi-threading. The default tokenizer is single-threaded.
Yields (Doc): A sequence of Doc objects, in order.
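A brief sketch of streaming tokenization over a list of texts, using the tokenizer from above:

    texts = ["First document.", "Second document.", "Third document."]
    for doc in tokenizer.pipe(texts, batch_size=50):
        print(len(doc))  # number of tokens in each Doc, yielded in input order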

Tokenizer.find_infix

Find internal split points of the string.

string (unicode): The string to split.
Returns (List[re.MatchObject]): A list of objects that have .start() and .end() methods, denoting the placement of internal segment separators, e.g. hyphens.
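For instance, with the hyphen-splitting infix pattern sketched above:

    matches = tokenizer.find_infix("well-known")
    print([(m.start(), m.end()) for m in matches])  # e.g. [(4, 5)] for the hyphen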

Tokenizer.find_prefix

Find the length of a prefix that should be segmented from the string, or None if no prefix rules match.

string (unicode): The string to segment.
Returns (int / None): The length of the prefix if present, otherwise None.
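Continuing the illustrative prefix pattern above, which matches an opening quote or bracket:

    length = tokenizer.find_prefix('"Hello')
    print(length)  # 1 with the sketched pattern: the leading quote matches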

Tokenizer.find_suffix

Find the length of a suffix that should be segmented from the string, or None if no suffix rules match.

string (unicode): The string to segment.
Returns (int / None): The length of the suffix if present, otherwise None.
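Analogously, with the illustrative suffix pattern above, which matches a closing quote or bracket:

    length = tokenizer.find_suffix('Hello"')
    print(length)  # 1 with the sketched pattern: the trailing quote matches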

Tokenizer.add_special_case

Add a special-case tokenization rule.

string (unicode): The string to specially tokenize.
token_attrs: A sequence of dicts, where each dict describes a token and its attributes. The ORTH fields of the attributes must exactly match the string when they are concatenated.
Returns: None.
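For example, a contraction can be split into two tokens whose ORTH values concatenate back to the original string. A minimal sketch, using the tokenizer constructed above:

    from spacy.attrs import ORTH

    tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "n't"}])
    print([token.text for token in tokenizer("don't")])  # should print ['do', "n't"]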