Token

An individual token — i.e. a word, punctuation symbol, whitespace, etc.

Attributes

Name | Type | Description
vocab | Vocab | The vocab object of the parent Doc.
doc | Doc | The parent document.
i | int | The index of the token within the parent document.
ent_type | int | Named entity type.
ent_type_ | unicode | Named entity type.
ent_iob | int | IOB code of named entity tag. 1="I", 2="O", 3="B". 0 means no tag is assigned.
ent_iob_ | unicode | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.
ent_id | int | ID of the entity the token is an instance of, if any.
ent_id_ | unicode | ID of the entity the token is an instance of, if any.
lemma | int | Base form of the word, with no inflectional suffixes.
lemma_ | unicode | Base form of the word, with no inflectional suffixes.
orth | int | The word's string.
orth_ | unicode | The word's string.
lower | int | Lower-case form of the word.
lower_ | unicode | Lower-case form of the word.
shape | int | A transform of the word's string, to show orthographic features.
shape_ | unicode | A transform of the word's string, to show orthographic features.
prefix | int | Integer ID of a length-N substring from the start of the word. Defaults to N=1.
prefix_ | unicode | A length-N substring from the start of the word. Defaults to N=1.
suffix | int | A length-N substring from the end of the word. Defaults to N=3.
suffix_ | unicode | A length-N substring from the end of the word. Defaults to N=3.
is_alpha | bool | Equivalent to word.orth_.isalpha().
is_ascii | bool | Equivalent to not any(ord(c) >= 128 for c in word.orth_).
is_digit | bool | Equivalent to word.orth_.isdigit().
is_lower | bool | Equivalent to word.orth_.islower().
is_title | bool | Equivalent to word.orth_.istitle().
is_punct | bool | Is the word punctuation?
is_space | bool | Equivalent to word.orth_.isspace().
like_url | bool | Does the word resemble a URL?
like_num | bool | Does the word represent a number? e.g. “10.9”, “10”, “ten”, etc.
like_email | bool | Does the word resemble an email address?
is_oov | bool | Is the word out-of-vocabulary?
is_stop | bool | Is the word part of a "stop list"?
pos | int | Coarse-grained part-of-speech.
pos_ | unicode | Coarse-grained part-of-speech.
tag | int | Fine-grained part-of-speech.
tag_ | unicode | Fine-grained part-of-speech.
dep | int | Syntactic dependency relation.
dep_ | unicode | Syntactic dependency relation.
lang | int | Language of the parent document's vocabulary.
lang_ | unicode | Language of the parent document's vocabulary.
prob | float | Smoothed log probability estimate of the token's type.
idx | int | The character offset of the token within the parent document.
sentiment | float | A scalar value indicating the positivity or negativity of the token.
lex_id | int | ID of the token's lexical type.
text | unicode | Verbatim text content.
text_with_ws | unicode | Text content, with trailing space character if present.
whitespace_ | unicode | Trailing space character if present.
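
Several of the string attributes above are described as equivalents of ordinary Python string operations. The following is a minimal sketch in plain Python, not the library itself: `lexical_features` and `char_offsets` are hypothetical helper names, and the prefix and suffix lengths simply follow the defaults stated in the table (N=1 and N=3).

```python
def lexical_features(word):
    """Approximate a few string-level Token attributes for one word."""
    return {
        "lower_": word.lower(),
        "prefix_": word[:1],       # length-1 substring from the start
        "suffix_": word[-3:],      # length-3 substring from the end
        "is_alpha": word.isalpha(),
        "is_ascii": all(ord(c) < 128 for c in word),
        "is_digit": word.isdigit(),
        "is_lower": word.islower(),
        "is_title": word.istitle(),
        "is_space": word.isspace(),
    }


def char_offsets(pieces):
    """Given (text, trailing_whitespace) pairs, compute each token's idx.

    Illustrates how idx, text and whitespace_ relate: each token starts
    where the previous token's text plus trailing whitespace ended.
    """
    offsets, position = [], 0
    for text, whitespace in pieces:
        offsets.append(position)
        position += len(text) + len(whitespace)
    return offsets


features = lexical_features("Apple")

# Hypothetical tokenization of "Hello, world" into (text, whitespace_) pairs.
pieces = [("Hello", ""), (",", " "), ("world", "")]
original = "".join(text + ws for text, ws in pieces)
offsets = char_offsets(pieces)
```

Joining each piece's text and trailing whitespace rebuilds the input string, which is the relationship text_with_ws is meant to capture.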

Token.__init__

Construct a Token object.

Name | Type | Description
vocab | Vocab | A storage container for lexical types.
doc | Doc | The parent document.
offset | int | The index of the token within the document.
return | Token | The newly constructed object.

Token.__len__

Get the number of unicode characters in the token.

Name | Type | Description
return | int | The number of unicode characters in the token.

Token.check_flag

Check the value of a boolean flag.

Name | Type | Description
flag_id | int | The attribute ID of the flag to check.
return | bool | Whether the flag is set.

Token.nbor

Get a neighboring token.

Name | Type | Description
i | int | The relative position of the token to get. Defaults to 1.
return | Token | The token at position self.doc[self.i+i].
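
The relative lookup can be illustrated with a toy stand-in. This is only a sketch: `ToyToken` is a hypothetical class, not the real Token, and the explicit bounds check is an assumption of the sketch, added so that a negative position does not wrap around the way plain Python list indexing would.

```python
class ToyToken:
    """Hypothetical stand-in for a token that knows its document and index."""

    def __init__(self, doc, i):
        self.doc = doc
        self.i = i

    def nbor(self, i=1):
        # Relative lookup as described above: self.doc[self.i + i].
        target = self.i + i
        if target < 0 or target >= len(self.doc):
            # Assumption of this sketch: out-of-range neighbors are an error.
            raise IndexError("neighbor is outside the document")
        return self.doc[target]


words = ["Give", "it", "back"]
doc = []
for index in range(len(words)):
    doc.append(ToyToken(doc, index))

next_token = doc[0].nbor()       # one position to the right
earlier_token = doc[2].nbor(-2)  # two positions to the left
```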

Token.similarity

Compute a semantic similarity estimate. Defaults to cosine over vectors.

Name | Type | Description
other | - | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects.
return | float | A scalar similarity score. Higher is more similar.
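
As a rough illustration of "cosine over vectors", here is a minimal pure-Python sketch. The function name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is an assumption of the sketch rather than documented behavior.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a missing (all-zero) vector as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction score close to 1, and unrelated (orthogonal) vectors score 0, which matches "higher is more similar".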

Token.is_ancestor

Check whether this token is a parent, grandparent, etc. of another in the dependency tree.

Name | Type | Description
descendant | Token | Another token.
return | bool | Whether this token is the ancestor of the descendant.
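
The ancestor check amounts to walking head pointers upward from the descendant. The sketch below works on a toy list of head indices instead of real tokens; the function name, the example parse, and the convention that the root is its own head are assumptions made for illustration.

```python
# Toy parse of "She ate the pizza": heads[i] is the index of token i's head.
# Assumption of this sketch: the root ("ate") is its own head.
words = ["She", "ate", "the", "pizza"]
heads = [1, 1, 3, 1]


def is_ancestor(heads, candidate, descendant):
    """True if `candidate` is a parent, grandparent, etc. of `descendant`."""
    current = descendant
    while heads[current] != current:  # stop once the root is reached
        current = heads[current]
        if current == candidate:
            return True
    return False


root_over_the = is_ancestor(heads, 1, 2)   # "ate" governs "the" via "pizza"
the_over_pizza = is_ancestor(heads, 2, 3)  # a child is not an ancestor
```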

Token.vector

A real-valued meaning representation.

Name | Type | Description
return | numpy.ndarray[ndim=1, dtype='float32'] | A 1D numpy array representing the token's semantics.

Token.has_vector

A boolean value indicating whether a word vector is associated with the object.

Name | Type | Description
return | bool | Whether the token has vector data attached.

Token.head

The syntactic parent, or "governor", of this token.

Name | Type | Description
return | Token | The head.

Token.conjuncts

A sequence of coordinated tokens, including the token itself.

Name | Type | Description
yield | Token | A coordinated token.

Token.children

A sequence of the token's immediate syntactic children.

Name | Type | Description
yield | Token | A child token such that child.head==self.

Token.subtree

A sequence of all the token's syntactic descendants.

Name | Type | Description
yield | Token | A descendant token such that self.is_ancestor(descendant).
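
The difference between immediate children and the full set of descendants can be sketched on a toy list of head indices. The names `children` and `subtree` here are plain functions written for illustration, not the library's properties; the root being its own head and the document-order output are assumptions of the sketch.

```python
# Toy parse of "She ate the pizza": heads[i] is the index of token i's head.
# Assumption of this sketch: the root ("ate") is its own head.
words = ["She", "ate", "the", "pizza"]
heads = [1, 1, 3, 1]


def children(heads, i):
    """Yield indices of tokens whose head is i (the root's self-loop is skipped)."""
    for child, head in enumerate(heads):
        if head == i and child != i:
            yield child


def subtree(heads, i):
    """Yield every descendant of token i, sorted into document order."""
    found = set()
    stack = list(children(heads, i))
    while stack:
        node = stack.pop()
        found.add(node)
        stack.extend(children(heads, node))
    yield from sorted(found)


root_children = list(children(heads, 1))   # immediate children of "ate"
root_subtree = list(subtree(heads, 1))     # children plus their descendants
```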

Token.left_edge

The leftmost token of this token's syntactic descendants.

Name | Type | Description
return | Token | The first token such that self.is_ancestor(token).

Token.right_edge

The rightmost token of this token's syntactic descendants.

Name | Type | Description
return | Token | The last token such that self.is_ancestor(token).
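
The two edges can be pictured as the smallest and largest index among a token's descendants. The sketch below uses a toy list of head indices, with hypothetical function names. One assumption is labeled in the code: the token is counted as part of its own span, so that a leaf with no descendants still has an edge.

```python
# Toy parse of "She ate the fresh pizza": heads[i] is token i's head index.
# Assumption of this sketch: the root ("ate") is its own head.
words = ["She", "ate", "the", "fresh", "pizza"]
heads = [1, 1, 4, 4, 1]


def span_of(heads, i):
    """Collect token i together with all of its descendants."""
    members = {i}  # assumption: the token belongs to its own span
    changed = True
    while changed:
        changed = False
        for token, head in enumerate(heads):
            if head in members and token not in members:
                members.add(token)
                changed = True
    return members


def left_edge(heads, i):
    return min(span_of(heads, i))


def right_edge(heads, i):
    return max(span_of(heads, i))


# The noun phrase headed by "pizza" runs from "the" to "pizza".
phrase = words[left_edge(heads, 4): right_edge(heads, 4) + 1]
```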

Token.ancestors

A sequence of this token's syntactic ancestors (parent, grandparent, etc.).

Name | Type | Description
yield | Token | An ancestor token such that ancestor.is_ancestor(self).