
Cython Structs
C-language objects that let you group variables together in a single contiguous block.

TokenC
C struct

Cython data container for the Token object.

| Name | Type | Description |
| --- | --- | --- |
| lex | const LexemeC* | A pointer to the lexeme for the token. |
| morph | uint64_t | An ID allowing lookup of morphological attributes. |
| pos | univ_pos_t | Coarse-grained part-of-speech tag. |
| spacy | bint | A binary value indicating whether the token has trailing whitespace. |
| tag | attr_t | Fine-grained part-of-speech tag. |
| idx | int | The character offset of the token within the parent document. |
| lemma | attr_t | Base form of the token, with no inflectional suffixes. |
| sense | attr_t | Space for storing a word sense ID, currently unused. |
| head | int | Offset of the syntactic parent relative to the token. |
| dep | attr_t | Syntactic dependency relation. |
| l_kids | uint32_t | Number of left children. |
| r_kids | uint32_t | Number of right children. |
| l_edge | uint32_t | Offset of the leftmost token of this token's syntactic descendants. |
| r_edge | uint32_t | Offset of the rightmost token of this token's syntactic descendants. |
| sent_start | int | Ternary value indicating whether the token is the first word of a sentence. 0 indicates a missing value, -1 indicates False and 1 indicates True. The default value, 0, is interpreted as no sentence break. Sentence boundary detectors will usually set 0 for all tokens except tokens that follow a sentence boundary. |
| ent_iob | int | IOB code of named entity tag. 0 indicates a missing value, 1 indicates I, 2 indicates O and 3 indicates B. |
| ent_type | attr_t | Named entity type. |
| ent_id | hash_t | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
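
As a rough illustration of the encodings listed for TokenC, the pure-Python sketch below decodes the sent_start and ent_iob codes and resolves a relative head offset to an absolute token index. The helper names (decode_sent_start, decode_ent_iob, head_index) are hypothetical and not part of spaCy; only the code values are taken from the field descriptions.

```python
from typing import Optional

# Code values taken from the TokenC field descriptions.
SENT_START = {0: None, -1: False, 1: True}
ENT_IOB = {0: "", 1: "I", 2: "O", 3: "B"}


def decode_sent_start(value: int) -> Optional[bool]:
    """Map the ternary sent_start code to None (missing), False or True."""
    return SENT_START[value]


def decode_ent_iob(value: int) -> str:
    """Map the ent_iob code to its IOB letter; "" stands for a missing value."""
    return ENT_IOB[value]


def head_index(i: int, head_offset: int) -> int:
    """Resolve a head offset, stored relative to token i, to an absolute index."""
    return i + head_offset
```

For example, a token at position 3 whose head field is -2 has its syntactic parent at position 1, and a token that is its own head (offset 0) is the root under this reading.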

Token.get_struct_attr
staticmethod
nogil

Get the value of an attribute from the TokenC struct by attribute ID.

| Name | Type | Description |
| --- | --- | --- |
| token | const TokenC* | A pointer to a TokenC struct. |
| feat_name | attr_id_t | The ID of the attribute to look up. The attributes are enumerated in spacy.typedefs. |
| returns | attr_t | The value of the attribute. |

Token.set_struct_attr
staticmethod
nogil

Set the value of an attribute of the TokenC struct by attribute ID.

| Name | Type | Description |
| --- | --- | --- |
| token | const TokenC* | A pointer to a TokenC struct. |
| feat_name | attr_id_t | The ID of the attribute to set. The attributes are enumerated in spacy.typedefs. |
| value | attr_t | The value to set. |
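
The get/set pair can be pictured as a dispatch from an attribute ID to a struct field. The following is a minimal pure-Python model of that idea, not spaCy's Cython code: the AttrId values, the TokenModel class, and the choice to return 0 for an unknown ID are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class AttrId(IntEnum):
    """Hypothetical attribute IDs; the real values are defined by spaCy."""
    LEMMA = 1
    POS = 2
    TAG = 3
    DEP = 4


@dataclass
class TokenModel:
    """Stand-in for a few of the TokenC fields."""
    lemma: int = 0
    pos: int = 0
    tag: int = 0
    dep: int = 0


# Map each attribute ID to the field it selects.
_FIELDS = {
    AttrId.LEMMA: "lemma",
    AttrId.POS: "pos",
    AttrId.TAG: "tag",
    AttrId.DEP: "dep",
}


def get_struct_attr(token: TokenModel, feat_name: int) -> int:
    """Return the field selected by feat_name, or 0 for an unknown ID (assumed)."""
    field = _FIELDS.get(feat_name)
    return getattr(token, field) if field is not None else 0


def set_struct_attr(token: TokenModel, feat_name: int, value: int) -> None:
    """Write value to the field selected by feat_name; unknown IDs are ignored (assumed)."""
    field = _FIELDS.get(feat_name)
    if field is not None:
        setattr(token, field, value)
```

Addressing fields by integer ID lets generic code read or write any attribute without a separate accessor per field.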

token_by_start
function

Find a token in a TokenC* array by the offset of its first character.

| Name | Type | Description |
| --- | --- | --- |
| tokens | const TokenC* | A TokenC* array. |
| length | int | The number of tokens in the array. |
| start_char | int | The start index to search for. |
| returns | int | The index of the token in the array or -1 if not found. |

token_by_end
function

Find a token in a TokenC* array by the offset of its final character.

| Name | Type | Description |
| --- | --- | --- |
| tokens | const TokenC* | A TokenC* array. |
| length | int | The number of tokens in the array. |
| end_char | int | The end index to search for. |
| returns | int | The index of the token in the array or -1 if not found. |
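
The two lookups, token_by_start and token_by_end, can be sketched in pure Python as linear scans over the tokens' character offsets. This is an illustrative model only: the TokenOffsets class is hypothetical, the real functions operate on a TokenC* array, and the sketch assumes that the end index is exclusive (idx plus the token's length in characters).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TokenOffsets:
    """Stand-in for the TokenC fields needed by the search."""
    idx: int     # character offset of the token within the document
    length: int  # number of characters (lex.length in the real struct)


def token_by_start(tokens: Sequence[TokenOffsets], length: int, start_char: int) -> int:
    """Return the index of the token starting at start_char, or -1."""
    for i in range(length):
        if tokens[i].idx == start_char:
            return i
    return -1


def token_by_end(tokens: Sequence[TokenOffsets], length: int, end_char: int) -> int:
    """Return the index of the token ending at end_char (exclusive), or -1."""
    for i in range(length):
        if tokens[i].idx + tokens[i].length == end_char:
            return i
    return -1
```

For the text "Hello world", modelled as two tokens at offsets 0 and 6 with five characters each, token_by_start finds index 1 for offset 6, and an offset that falls inside a token yields -1.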

set_children_from_heads
function

Set attributes that allow lookup of syntactic children on a TokenC* array. This function must be called after making changes to the TokenC.head attribute, in order to make the parse tree navigation consistent.

| Name | Type | Description |
| --- | --- | --- |
| tokens | const TokenC* | A TokenC* array. |
| length | int | The number of tokens in the array. |
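
To make the role of this function concrete, here is a pure-Python sketch of how the child counts and subtree edges could be derived from the head offsets. It is a simplified model under stated assumptions, not spaCy's implementation: TokenNode is a hypothetical stand-in for TokenC, all head offsets are assumed to point inside the array, and l_edge and r_edge are stored as offsets relative to the token, following the TokenC field descriptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenNode:
    """Stand-in for the parse-related TokenC fields."""
    head: int = 0    # offset of the syntactic parent relative to this token
    l_kids: int = 0  # number of left children
    r_kids: int = 0  # number of right children
    l_edge: int = 0  # relative offset of the leftmost descendant
    r_edge: int = 0  # relative offset of the rightmost descendant


def set_children_from_heads(tokens: List[TokenNode], length: int) -> None:
    """Recompute l_kids, r_kids, l_edge and r_edge from the head offsets."""
    # Discard stale values left over from a previous parse.
    for tok in tokens[:length]:
        tok.l_kids = tok.r_kids = tok.l_edge = tok.r_edge = 0

    for i in range(length):
        # Count token i as a left or right child of its parent.
        parent = i + tokens[i].head
        if parent > i:
            tokens[parent].l_kids += 1
        elif parent < i:
            tokens[parent].r_kids += 1

        # Walk up to the root, widening each ancestor's subtree span.
        # The step limit guards against cycles in malformed input.
        ancestor = i
        for _ in range(length):
            tok = tokens[ancestor]
            tok.l_edge = min(tok.l_edge, i - ancestor)
            tok.r_edge = max(tok.r_edge, i - ancestor)
            if tok.head == 0:
                break
            ancestor += tok.head
```

With the head offsets [1, 0, 1, -2] for "She ate the pizza", the root "ate" ends up with one left child, one right child, and a subtree spanning offsets -1 to 2.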

LexemeC
C struct

Struct holding information about a lexical type. LexemeC structs are usually owned by the Vocab, and accessed through a read-only pointer on the TokenC struct.

| Name | Type | Description |
| --- | --- | --- |
| flags | flags_t | Bit-field for binary lexical flag values. |
| id | attr_t | Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed. |
| length | attr_t | Number of unicode characters in the lexeme. |
| orth | attr_t | ID of the verbatim text content. |
| lower | attr_t | ID of the lowercase form of the lexeme. |
| norm | attr_t | ID of the lexeme's norm, i.e. a normalised form of the text. |
| shape | attr_t | Transform of the lexeme's string, to show orthographic features. |
| prefix | attr_t | Length-N substring from the start of the lexeme. Defaults to N=1. |
| suffix | attr_t | Length-N substring from the end of the lexeme. Defaults to N=3. |
| cluster | attr_t | Brown cluster ID. |
| prob | float | Smoothed log probability estimate of the lexeme's type. |
| sentiment | float | A scalar value indicating positivity or negativity. |
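
The string-derived fields can be illustrated with a few plain Python helpers. This sketch works directly on strings and ignores the mapping to attr_t values used by the struct. The prefix and suffix defaults follow the field descriptions; the shape function is only one plausible transform (uppercase letters to "X", lowercase letters to "x", digits to "d"), assumed here for illustration and not taken from the source.

```python
def prefix(text: str, n: int = 1) -> str:
    """Length-n substring from the start of the text (n >= 1)."""
    return text[:n]


def suffix(text: str, n: int = 3) -> str:
    """Length-n substring from the end of the text (n >= 1)."""
    return text[-n:]


def shape(text: str) -> str:
    """Assumed orthographic transform: letters to X/x, digits to d."""
    out = []
    for ch in text:
        if ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)  # punctuation and other characters are kept as-is
    return "".join(out)
```

Under these assumptions, "Apple" has the prefix "A" and the suffix "ple", and "Apple-42" has the shape "Xxxxx-dd".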

Lexeme.get_struct_attr
staticmethod
nogil

Get the value of an attribute from the LexemeC struct by attribute ID.

| Name | Type | Description |
| --- | --- | --- |
| lex | const LexemeC* | A pointer to a LexemeC struct. |
| feat_name | attr_id_t | The ID of the attribute to look up. The attributes are enumerated in spacy.typedefs. |
| returns | attr_t | The value of the attribute. |

Lexeme.set_struct_attr
staticmethod
nogil

Set the value of an attribute of the LexemeC struct by attribute ID.

| Name | Type | Description |
| --- | --- | --- |
| lex | const LexemeC* | A pointer to a LexemeC struct. |
| feat_name | attr_id_t | The ID of the attribute to set. The attributes are enumerated in spacy.typedefs. |
| value | attr_t | The value to set. |

Lexeme.c_check_flag
staticmethod
nogil

Check the value of a binary flag attribute.

| Name | Type | Description |
| --- | --- | --- |
| lexeme | const LexemeC* | A pointer to a LexemeC struct. |
| flag_id | attr_id_t | The ID of the flag to look up. The flag IDs are enumerated in spacy.typedefs. |
| returns | bint | The boolean value of the flag. |

Lexeme.c_set_flag
staticmethod
nogil

Set the value of a binary flag attribute.

| Name | Type | Description |
| --- | --- | --- |
| lexeme | const LexemeC* | A pointer to a LexemeC struct. |
| flag_id | attr_id_t | The ID of the flag to set. The flag IDs are enumerated in spacy.typedefs. |
| value | bint | The value to set. |
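
Since flags is described as a bit-field, checking and setting a flag presumably comes down to masking a single bit. The pure-Python sketch below models that with a plain integer. The function names, the 64-bit width, and the use of flag_id as the bit position are assumptions for illustration, not confirmed details of the Cython code.

```python
FLAG_BITS = 64  # assumed width of flags_t


def check_flag(flags: int, flag_id: int) -> bool:
    """Return True if the bit at position flag_id is set."""
    return bool(flags & (1 << flag_id))


def set_flag(flags: int, flag_id: int, value: bool) -> int:
    """Return flags with the bit at position flag_id set or cleared."""
    mask = 1 << flag_id
    if value:
        flags |= mask
    else:
        flags &= ~mask
    # Keep the result within the assumed unsigned 64-bit range.
    return flags & ((1 << FLAG_BITS) - 1)
```

Unlike the Cython methods, which take a pointer to the struct, this model returns the updated bit-field because Python integers are immutable.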