Containers

Token class

An individual token — i.e. a word, punctuation symbol, whitespace, etc.
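In practice you rarely construct a Token directly; you get one by indexing into a Doc. A minimal sketch (assumes the spacy package is installed; a blank pipeline only tokenizes, so no model download is needed):

```python
import spacy

# A blank pipeline provides the tokenizer only; no statistical model required.
nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")

token = doc[0]  # tokens are obtained by indexing the parent Doc
assert token.text == "Give"
assert token.doc is doc  # each token keeps a reference to its parent document
```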

Token.__init__ method

Construct a Token object.

Name | Type | Description
vocab | Vocab | A storage container for lexical types.
doc | Doc | The parent document.
offset | int | The index of the token within the document.

Token.__len__ method

The number of unicode characters in the token, i.e. token.text.

Name | Type | Description
RETURNS | int | The number of unicode characters in the token.
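For example (a sketch using a blank English pipeline, so no model is needed):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")
assert len(doc[0]) == 4  # "Give" has four unicode characters
```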

Token.set_extension classmethod (v2.0)

Define a custom attribute on the Token which becomes available via Token._. For details, see the documentation on custom attributes.

Name | Type | Description
name | unicode | Name of the attribute to set by the extension. For example, 'my_attr' will be available as token._.my_attr.
default | - | Optional default value of the attribute if no getter or method is defined.
method | callable | Set a custom method on the object, for example token._.compare(other_token).
getter | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the ._ attribute.
setter | callable | Setter function that takes the Token and a value, and modifies the object. Is called when the user writes to the Token._ attribute.
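A sketch of a getter-based extension (the attribute name is_fruit and the fruit list are purely illustrative):

```python
import spacy
from spacy.tokens import Token

# Register a computed attribute; force=True overwrites any earlier registration
# with the same name, which is convenient in interactive sessions.
def fruit_getter(token):
    return token.text in ("apple", "pear", "banana")

Token.set_extension("is_fruit", getter=fruit_getter, force=True)

nlp = spacy.blank("en")
doc = nlp("I have an apple")
assert doc[3]._.is_fruit        # "apple"
assert not doc[0]._.is_fruit    # "I"
```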

Token.get_extension classmethod (v2.0)

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Name | Type | Description
name | unicode | Name of the extension.

Token.has_extension classmethod (v2.0)

Check whether an extension has been registered on the Token class.

Name | Type | Description
name | unicode | Name of the extension to check.

Token.remove_extension classmethod

Remove a previously registered extension.

Name | Type | Description
name | unicode | Name of the extension.
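The three lookup methods can be sketched together (the attribute name my_attr is illustrative):

```python
from spacy.tokens import Token

Token.set_extension("my_attr", default=False, force=True)
assert Token.has_extension("my_attr")

# get_extension returns the (default, method, getter, setter) tuple
assert Token.get_extension("my_attr") == (False, None, None, None)

# remove_extension unregisters the extension and returns the removed tuple
removed = Token.remove_extension("my_attr")
assert removed == (False, None, None, None)
assert not Token.has_extension("my_attr")
```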

Token.check_flag method

Check the value of a boolean flag.

Name | Type | Description
flag_id | int | The attribute ID of the flag to check.
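For example, using the built-in boolean flag IDs from spacy.attrs (no model needed):

```python
import spacy
from spacy.attrs import IS_TITLE, IS_PUNCT

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")
assert doc[0].check_flag(IS_TITLE)      # "Give" is titlecased
assert not doc[0].check_flag(IS_PUNCT)
assert doc[3].check_flag(IS_PUNCT)      # "!"
```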

Token.similarity method (needs model)

Compute a semantic similarity estimate. Defaults to cosine over vectors.

Name | Type | Description
other | - | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects.

Token.nbor method

Get a neighboring token.

Name | Type | Description
i | int | The relative position of the token to get. Defaults to 1.
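A quick sketch (tokenization only, no model needed):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")
assert doc[0].nbor().text == "it"      # defaults to i=1, the next token
assert doc[1].nbor(-1).text == "Give"  # negative offsets look backwards
```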

Token.is_ancestor method (needs model)

Check whether this token is a parent, grandparent, etc. of another in the dependency tree.

Name | Type | Description
descendant | Token | Another token.

Token.ancestors property (needs model)

A sequence of this token’s syntactic ancestors (parents, grandparents, etc.).

Name | Type | Description
YIELDS | Token | A sequence of ancestor tokens.

Token.conjuncts property (needs model)

A tuple of coordinated tokens, not including the token itself.

Name | Type | Description
RETURNS | tuple | The coordinated tokens.

Token.children property (needs model)

A sequence of the token’s immediate syntactic children.

Name | Type | Description
YIELDS | Token | A child token.

Token.lefts property (needs model)

The leftward immediate children of the word, in the syntactic dependency parse.

Name | Type | Description
YIELDS | Token | A left-child of the token.

Token.rights property (needs model)

The rightward immediate children of the word, in the syntactic dependency parse.

Name | Type | Description
YIELDS | Token | A right-child of the token.

Token.n_lefts property (needs model)

The number of leftward immediate children of the word, in the syntactic dependency parse.

Name | Type | Description
RETURNS | int | The number of leftward immediate children.

Token.n_rights property (needs model)

The number of rightward immediate children of the word, in the syntactic dependency parse.

Name | Type | Description
RETURNS | int | The number of rightward immediate children.

Token.subtree property (needs model)

A sequence containing the token and all the token’s syntactic descendants.

Name | Type | Description
YIELDS | Token | A descendant token, or the token itself.

Token.is_sent_start property (v2.0)

A boolean value indicating whether the token starts a sentence. None if unknown. Defaults to True for the first token in the Doc.

Name | Type | Description
RETURNS | bool | Whether the token starts a sentence.
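A sketch with a blank pipeline, where no sentence boundaries have been set:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")
assert doc[0].is_sent_start          # the first token always starts a sentence
assert doc[1].is_sent_start is None  # unknown until boundaries are assigned
```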

Token.has_vector property (needs model)

A boolean value indicating whether a word vector is associated with the token.

Name | Type | Description
RETURNS | bool | Whether the token has a vector data attached.

Token.vector property (needs model)

A real-valued meaning representation.

Name | Type | Description
RETURNS | numpy.ndarray[ndim=1, dtype='float32'] | A 1-dimensional array representing the token’s vector.

Token.vector_norm property (needs model)

The L2 norm of the token’s vector representation.

Name | Type | Description
RETURNS | float | The L2 norm of the vector representation.

Attributes

Name | Type | Description
doc | Doc | The parent document.
sent (v2.0.12) | Span | The sentence span that this token is a part of.
text | unicode | Verbatim text content.
text_with_ws | unicode | Text content, with trailing space character if present.
whitespace_ | unicode | Trailing space character if present.
orth | int | ID of the verbatim text content.
orth_ | unicode | Verbatim text content (identical to Token.text). Exists mostly for consistency with the other attributes.
vocab | Vocab | The vocab object of the parent Doc.
head | Token | The syntactic parent, or “governor”, of this token.
left_edge | Token | The leftmost token of this token’s syntactic descendants.
right_edge | Token | The rightmost token of this token’s syntactic descendants.
i | int | The index of the token within the parent document.
ent_type | int | Named entity type.
ent_type_ | unicode | Named entity type.
ent_iob | int | IOB code of named entity tag. 3 means the token begins an entity, 2 means it is outside an entity, 1 means it is inside an entity, and 0 means no entity tag is set.
ent_iob_ | unicode | IOB code of named entity tag. 3 means the token begins an entity, 2 means it is outside an entity, 1 means it is inside an entity, and 0 means no entity tag is set.
ent_id | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution.
ent_id_ | unicode | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution.
lemma | int | Base form of the token, with no inflectional suffixes.
lemma_ | unicode | Base form of the token, with no inflectional suffixes.
norm | int | The token’s norm, i.e. a normalized form of the token text. Usually set in the language’s tokenizer exceptions or norm exceptions.
norm_ | unicode | The token’s norm, i.e. a normalized form of the token text. Usually set in the language’s tokenizer exceptions or norm exceptions.
lower | int | Lowercase form of the token.
lower_ | unicode | Lowercase form of the token text. Equivalent to Token.text.lower().
shape | int | Transform of the token’s string, to show orthographic features. For example, “Xxxx” or “dd”.
shape_ | unicode | Transform of the token’s string, to show orthographic features. For example, “Xxxx” or “dd”.
prefix | int | Hash value of a length-N substring from the start of the token. Defaults to N=1.
prefix_ | unicode | A length-N substring from the start of the token. Defaults to N=1.
suffix | int | Hash value of a length-N substring from the end of the token. Defaults to N=3.
suffix_ | unicode | Length-N substring from the end of the token. Defaults to N=3.
is_alpha | bool | Does the token consist of alphabetic characters? Equivalent to token.text.isalpha().
is_ascii | bool | Does the token consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in token.text).
is_digit | bool | Does the token consist of digits? Equivalent to token.text.isdigit().
is_lower | bool | Is the token in lowercase? Equivalent to token.text.islower().
is_upper | bool | Is the token in uppercase? Equivalent to token.text.isupper().
is_title | bool | Is the token in titlecase? Equivalent to token.text.istitle().
is_punct | bool | Is the token punctuation?
is_left_punct | bool | Is the token a left punctuation mark, e.g. (?
is_right_punct | bool | Is the token a right punctuation mark, e.g. )?
is_space | bool | Does the token consist of whitespace characters? Equivalent to token.text.isspace().
is_bracket | bool | Is the token a bracket?
is_quote | bool | Is the token a quotation mark?
is_currency (v2.0.8) | bool | Is the token a currency symbol?
like_url | bool | Does the token resemble a URL?
like_num | bool | Does the token represent a number? e.g. “10.9”, “10”, “ten”, etc.
like_email | bool | Does the token resemble an email address?
is_oov | bool | Is the token out-of-vocabulary?
is_stop | bool | Is the token part of a “stop list”?
pos | int | Coarse-grained part-of-speech.
pos_ | unicode | Coarse-grained part-of-speech.
tag | int | Fine-grained part-of-speech.
tag_ | unicode | Fine-grained part-of-speech.
dep | int | Syntactic dependency relation.
dep_ | unicode | Syntactic dependency relation.
lang | int | Language of the parent document’s vocabulary.
lang_ | unicode | Language of the parent document’s vocabulary.
prob | float | Smoothed log probability estimate of token’s type.
idx | int | The character offset of the token within the parent document.
sentiment | float | A scalar value indicating the positivity or negativity of the token.
lex_id | int | Sequential ID of the token’s lexical type.
rank | int | Sequential ID of the token’s lexical type, used to index into tables, e.g. for word vectors.
cluster | int | Brown cluster ID.
_ | Underscore | User space for adding custom attribute extensions.
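Many of these attributes are lexical and available without a statistical model; a quick sketch:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple paid $1 billion")
token = doc[0]

assert token.text == "Apple"
assert token.lower_ == "apple"
assert token.shape_ == "Xxxxx"          # orthographic shape of the string
assert token.is_alpha and token.is_title
assert doc[2].is_currency               # "$"
assert doc[3].like_num                  # "1"
assert token.idx == 0                   # character offset in the parent text
```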