Token class
An individual token — i.e. a word, punctuation symbol, whitespace, etc.

Token.__init__ method

Construct a Token object.

| Name | Description | Type |
| --- | --- | --- |
| vocab | A storage container for lexical types. | Vocab |
| doc | The parent document. | Doc |
| offset | The index of the token within the document. | int |

Token.__len__ method

The number of Unicode characters in the token, i.e. len(token.text).

Token.set_extension classmethod

Define a custom attribute on the Token which becomes available via Token._. For details, see the documentation on custom attributes.

| Name | Description | Type |
| --- | --- | --- |
| name | Name of the attribute to set by the extension. For example, "my_attr" will be available as token._.my_attr. | str |
| default | Optional default value of the attribute if no getter or method is defined. | Optional[Any] |
| method | Set a custom method on the object, for example token._.compare(other_token). | Optional[Callable[[Token, ...], Any]] |
| getter | Getter function that takes the object and returns an attribute value. Is called when the user accesses the ._ attribute. | Optional[Callable[[Token], Any]] |
| setter | Setter function that takes the Token and a value, and modifies the object. Is called when the user writes to the Token._ attribute. | Optional[Callable[[Token, Any], None]] |
| force | Force overwriting existing attribute. | bool |
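
As an illustration of the three extension kinds described above, here is a minimal sketch. It assumes spaCy v3 is installed and uses a blank English pipeline; the attribute names is_fruit, text_upper and same_text are made up for the example.

```python
import spacy
from spacy.tokens import Token

# force=True overwrites any earlier registration, so the snippet can be re-run.
Token.set_extension("is_fruit", default=False, force=True)
Token.set_extension("text_upper", getter=lambda token: token.text.upper(), force=True)
Token.set_extension(
    "same_text", method=lambda token, other: token.text == other.text, force=True
)

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp("I like apple pie and apple juice")

doc[2]._.is_fruit = True  # override the default for a single token
fruit_flags = [token._.is_fruit for token in doc]
upper = doc[2]._.text_upper          # computed by the getter on access
same = doc[2]._.same_text(doc[5])    # the token is passed as the first argument
```

Values written to a default-style attribute are stored per token, so only doc[2] is flagged here while the other tokens keep the default.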

Token.get_extension classmethod

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

| Name | Description | Type |
| --- | --- | --- |
| name | Name of the extension. | str |

Token.has_extension classmethod

Check whether an extension has been registered on the Token class.

| Name | Description | Type |
| --- | --- | --- |
| name | Name of the extension to check. | str |

Token.remove_extension classmethod

Remove a previously registered extension.

| Name | Description | Type |
| --- | --- | --- |
| name | Name of the extension. | str |
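
The lookup, check and removal methods can be combined as in the following sketch. The extension name demo_flag is hypothetical and only serves the example.

```python
from spacy.tokens import Token

Token.set_extension("demo_flag", default=False, force=True)

registered = Token.has_extension("demo_flag")
# The registered extension comes back as a 4-tuple.
default, method, getter, setter = Token.get_extension("demo_flag")

# Removing the extension returns the same 4-tuple that was registered.
removed = Token.remove_extension("demo_flag")
still_registered = Token.has_extension("demo_flag")
```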

Token.check_flag method

Check the value of a boolean flag.

| Name | Description | Type |
| --- | --- | --- |
| flag_id | The attribute ID of the flag to check. | int |
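
A short sketch of checking flags, assuming the built-in flag IDs exported by spacy.attrs and a blank English pipeline:

```python
import spacy
from spacy.attrs import IS_PUNCT, IS_TITLE

nlp = spacy.blank("en")
doc = nlp("Give it back!")  # tokens: Give, it, back, !

give_is_title = doc[0].check_flag(IS_TITLE)
it_is_title = doc[1].check_flag(IS_TITLE)
bang_is_punct = doc[3].check_flag(IS_PUNCT)
```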

Token.similarity method (needs model)

Compute a semantic similarity estimate. Defaults to cosine over vectors.

| Name | Description | Type |
| --- | --- | --- |
| other | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects. | Union[Doc, Span, Token, Lexeme] |
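
To make the cosine default concrete without downloading a trained pipeline, the sketch below assigns two hand-picked 2-dimensional vectors through Vocab.set_vector. The vectors are toy values chosen for easy arithmetic, not real embeddings.

```python
import numpy
import spacy

nlp = spacy.blank("en")
# Toy vectors: both have L2 norm 5, and their dot product is 24.
nlp.vocab.set_vector("cat", numpy.asarray([3.0, 4.0], dtype="float32"))
nlp.vocab.set_vector("dog", numpy.asarray([4.0, 3.0], dtype="float32"))

doc = nlp("cat dog")
score = doc[0].similarity(doc[1])  # cosine: 24 / (5 * 5) = 0.96
```

With a trained pipeline that ships word vectors, the same call works on real text without any manual setup.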

Token.nbor method

Get a neighboring token.

| Name | Description | Type |
| --- | --- | --- |
| i | The relative position of the token to get. Defaults to 1. | int |
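
A brief sketch of relative lookups with nbor, assuming a blank English pipeline:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")

next_token = doc[0].nbor()        # default offset of 1: the following token
previous_token = doc[2].nbor(-1)  # negative offsets look to the left
two_ahead = doc[0].nbor(2)
```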

Token.set_morph method

Set the morphological analysis from a UD FEATS string, hash value of a UD FEATS string, features dict or MorphAnalysis. The value None can be used to reset the morph to an unset state.

| Name | Description | Type |
| --- | --- | --- |
| features | The morphological features to set. | Union[int, dict, str, MorphAnalysis, None] |

Token.has_morph method

Check whether the token has annotated morph information. Returns False when the morph annotation is unset or missing.
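
The following sketch shows set_morph and has_morph together, using a UD FEATS string on a blank English pipeline. The feature values are hand-written for illustration.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Give it back!")
token = doc[0]

before = token.has_morph()  # nothing annotated yet

token.set_morph("Mood=Imp|VerbForm=Fin")
after = token.has_morph()
mood = token.morph.get("Mood")

token.set_morph(None)       # reset to the unset state
after_reset = token.has_morph()
```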

Token.is_ancestor method (needs model)

Check whether this token is a parent, grandparent, etc. of another in the dependency tree.

| Name | Description | Type |
| --- | --- | --- |
| descendant | Another token. | Token |

Token.ancestors property (needs model)

A sequence of the token’s syntactic ancestors (parents, grandparents, etc.).
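
The sketch below illustrates is_ancestor and ancestors. Instead of running a trained parser, it builds a Doc with a hand-written dependency parse (the heads and deps lists are an assumption made for the example), which the spaCy v3 Doc constructor accepts directly.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["I", "like", "New", "York", "in", "Autumn", "."]
# heads holds the absolute index of each token's head; the root points to itself.
heads = [1, 1, 3, 1, 1, 4, 1]
deps = ["nsubj", "ROOT", "compound", "dobj", "prep", "pobj", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

like, autumn = doc[1], doc[5]
like_above_autumn = like.is_ancestor(autumn)
autumn_above_like = autumn.is_ancestor(like)
# Ancestors are yielded from the direct head upwards to the root.
autumn_ancestors = [token.text for token in autumn.ancestors]
```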

Token.conjuncts property (needs model)

A tuple of coordinated tokens, not including the token itself.
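
A small sketch of conjuncts, again on a Doc with a hand-written parse rather than parser output (the heads and deps values are assumptions for the example):

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["I", "like", "apples", "and", "oranges"]
heads = [1, 1, 1, 2, 2]  # "oranges" attaches to "apples" with the conj relation
deps = ["nsubj", "ROOT", "dobj", "cc", "conj"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

apples_conjuncts = [token.text for token in doc[2].conjuncts]
oranges_conjuncts = [token.text for token in doc[4].conjuncts]
```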

Token.children property (needs model)

A sequence of the token’s immediate syntactic children.

Token.lefts property (needs model)

The leftward immediate children of the word in the syntactic dependency parse.

Token.rights property (needs model)

The rightward immediate children of the word in the syntactic dependency parse.

Token.n_lefts property (needs model)

The number of leftward immediate children of the word in the syntactic dependency parse.

Token.n_rights property (needs model)

The number of rightward immediate children of the word in the syntactic dependency parse.
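
The child-related properties can be compared side by side in the sketch below. The parse is hand-written (an assumption for the example) so that no trained pipeline is required.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["I", "like", "New", "York", "in", "Autumn", "."]
heads = [1, 1, 3, 1, 1, 4, 1]  # absolute head indices; "like" is the root
deps = ["nsubj", "ROOT", "compound", "dobj", "prep", "pobj", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

like, york = doc[1], doc[3]
children = [token.text for token in like.children]  # left children, then right
like_lefts = [token.text for token in like.lefts]
like_rights = [token.text for token in like.rights]
york_lefts = [token.text for token in york.lefts]
counts = (like.n_lefts, like.n_rights)
```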

Token.subtree property (needs model)

A sequence containing the token and all the token’s syntactic descendants.
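
As a sketch of subtree, the example below reads subtrees off a hand-written parse (the heads and deps lists are assumptions, not parser output):

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["I", "like", "New", "York", "in", "Autumn", "."]
heads = [1, 1, 3, 1, 1, 4, 1]
deps = ["nsubj", "ROOT", "compound", "dobj", "prep", "pobj", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

york_subtree = [token.text for token in doc[3].subtree]
in_subtree = [token.text for token in doc[4].subtree]
root_subtree = [token.text for token in doc[1].subtree]  # the whole sentence
```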

Token.is_sent_start property

A boolean value indicating whether the token starts a sentence, or None if unknown. Defaults to True for the first token in the Doc.
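
One way to see explicit True and False values is to add the rule-based sentencizer component, as in this sketch. It assumes spaCy v3, where components are added by their string name.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundaries, no training needed

doc = nlp("Hello world. Bye now.")
starts = [(token.text, token.is_sent_start) for token in doc]
```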

Token.has_vector property (needs model)

A boolean value indicating whether a word vector is associated with the token.

Token.vector property (needs model)

A real-valued meaning representation.

Token.vector_norm property (needs model)

The L2 norm of the token’s vector representation.
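
The three vector properties can be explored without a trained pipeline by registering a toy vector by hand, as sketched below. The vector values and the made-up word "zzz" are assumptions chosen for the example.

```python
import numpy
import spacy

nlp = spacy.blank("en")
nlp.vocab.set_vector("cat", numpy.asarray([3.0, 4.0], dtype="float32"))

doc = nlp("cat zzz")
known, unknown = doc[0], doc[1]

has_vectors = [token.has_vector for token in doc]
cat_vector = known.vector            # the stored 2-dimensional vector
cat_norm = float(known.vector_norm)  # sqrt(3**2 + 4**2)
unknown_norm = float(unknown.vector_norm)  # no vector stored, so the norm is 0
```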

Attributes

| Name | Description | Type |
| --- | --- | --- |
| doc | The parent document. | Doc |
| lex (v3.0) | The underlying lexeme. | Lexeme |
| sent | The sentence span that this token is a part of. | Span |
| text | Verbatim text content. | str |
| text_with_ws | Text content, with trailing space character if present. | str |
| whitespace_ | Trailing space character if present. | str |
| orth | ID of the verbatim text content. | int |
| orth_ | Verbatim text content (identical to Token.text). Exists mostly for consistency with the other attributes. | str |
| vocab | The vocab object of the parent Doc. | Vocab |
| tensor | The token’s slice of the parent Doc’s tensor. | numpy.ndarray |
| head | The syntactic parent, or “governor”, of this token. | Token |
| left_edge | The leftmost token of this token’s syntactic descendants. | Token |
| right_edge | The rightmost token of this token’s syntactic descendants. | Token |
| i | The index of the token within the parent document. | int |
| ent_type | Named entity type. | int |
| ent_type_ | Named entity type. | str |
| ent_iob | IOB code of named entity tag. 3 means the token begins an entity, 2 means it is outside an entity, 1 means it is inside an entity, and 0 means no entity tag is set. | int |
| ent_iob_ | IOB code of named entity tag. “B” means the token begins an entity, “I” means it is inside an entity, “O” means it is outside an entity, and "" means no entity tag is set. | str |
| ent_kb_id | Knowledge base ID that refers to the named entity this token is a part of, if any. | int |
| ent_kb_id_ | Knowledge base ID that refers to the named entity this token is a part of, if any. | str |
| ent_id | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | int |
| ent_id_ | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | str |
| lemma | Base form of the token, with no inflectional suffixes. | int |
| lemma_ | Base form of the token, with no inflectional suffixes. | str |
| norm | The token’s norm, i.e. a normalized form of the token text. Can be set in the language’s tokenizer exceptions. | int |
| norm_ | The token’s norm, i.e. a normalized form of the token text. Can be set in the language’s tokenizer exceptions. | str |
| lower | Lowercase form of the token. | int |
| lower_ | Lowercase form of the token text. Equivalent to Token.text.lower(). | str |
| shape | Transform of the token’s string to show orthographic features. Alphabetic characters are replaced by x or X, numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, "Xxxx" or "dd". | int |
| shape_ | Transform of the token’s string to show orthographic features. Alphabetic characters are replaced by x or X, numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, "Xxxx" or "dd". | str |
| prefix | Hash value of a length-N substring from the start of the token. Defaults to N=1. | int |
| prefix_ | A length-N substring from the start of the token. Defaults to N=1. | str |
| suffix | Hash value of a length-N substring from the end of the token. Defaults to N=3. | int |
| suffix_ | A length-N substring from the end of the token. Defaults to N=3. | str |
| is_alpha | Does the token consist of alphabetic characters? Equivalent to token.text.isalpha(). | bool |
| is_ascii | Does the token consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in token.text). | bool |
| is_digit | Does the token consist of digits? Equivalent to token.text.isdigit(). | bool |
| is_lower | Is the token in lowercase? Equivalent to token.text.islower(). | bool |
| is_upper | Is the token in uppercase? Equivalent to token.text.isupper(). | bool |
| is_title | Is the token in titlecase? Equivalent to token.text.istitle(). | bool |
| is_punct | Is the token punctuation? | bool |
| is_left_punct | Is the token a left punctuation mark, e.g. "("? | bool |
| is_right_punct | Is the token a right punctuation mark, e.g. ")"? | bool |
| is_space | Does the token consist of whitespace characters? Equivalent to token.text.isspace(). | bool |
| is_bracket | Is the token a bracket? | bool |
| is_quote | Is the token a quotation mark? | bool |
| is_currency | Is the token a currency symbol? | bool |
| like_url | Does the token resemble a URL? | bool |
| like_num | Does the token represent a number, e.g. “10.9”, “10”, “ten”, etc.? | bool |
| like_email | Does the token resemble an email address? | bool |
| is_oov | Is the token out-of-vocabulary (i.e. does it not have a word vector)? | bool |
| is_stop | Is the token part of a “stop list”? | bool |
| pos | Coarse-grained part-of-speech from the Universal POS tag set. | int |
| pos_ | Coarse-grained part-of-speech from the Universal POS tag set. | str |
| tag | Fine-grained part-of-speech. | int |
| tag_ | Fine-grained part-of-speech. | str |
| morph (v3.0) | Morphological analysis. | MorphAnalysis |
| dep | Syntactic dependency relation. | int |
| dep_ | Syntactic dependency relation. | str |
| lang | Language of the parent document’s vocabulary. | int |
| lang_ | Language of the parent document’s vocabulary. | str |
| prob | Smoothed log probability estimate of token’s word type (context-independent entry in the vocabulary). | float |
| idx | The character offset of the token within the parent document. | int |
| sentiment | A scalar value indicating the positivity or negativity of the token. | float |
| lex_id | Sequential ID of the token’s lexical type, used to index into tables, e.g. for word vectors. | int |
| rank | Sequential ID of the token’s lexical type, used to index into tables, e.g. for word vectors. | int |
| cluster | Brown cluster ID. | int |
| _ | User space for adding custom attribute extensions. | Underscore |
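
Many of the lexical attributes above are available from the tokenizer alone. The sketch below reads a handful of them on a blank English pipeline; it does not cover attributes that need a tagger, parser or entity recognizer.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello World 42!")  # tokens: Hello, World, 42, !
hello, world, number, bang = doc

texts = [token.text for token in doc]
offsets = [token.idx for token in doc]          # character offsets into doc.text
with_ws = [token.text_with_ws for token in doc]
shapes = [token.shape_ for token in doc]

# Attributes without a trailing underscore are integer IDs; the underscore
# variants are the corresponding strings.
orth_matches = hello.orth == doc.vocab.strings["Hello"]
affixes = (hello.prefix_, hello.suffix_, hello.lower_)
flags = (hello.is_alpha, hello.is_title, number.is_digit, number.like_num, bang.is_punct)
```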