
Cython Classes

Doc
cdef class

The Doc object holds an array of TokenC structs.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| mem | cymem.Pool | A memory pool. Allocated memory will be freed once the Doc object is garbage collected. |
| vocab | Vocab | A reference to the shared Vocab object. |
| c | TokenC* | A pointer to a TokenC struct. |
| length | int | The number of tokens in the document. |
| max_length | int | The underlying size of the Doc.c array. |

Doc.push_back
method

Append a token to the Doc. The token can be provided as a LexemeC or TokenC pointer, using Cython's fused types.

| Name | Type | Description |
| --- | --- | --- |
| lex_or_tok | LexemeOrToken | The word to append to the Doc. |
| has_space | bint | Whether the word has trailing whitespace. |
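
The relationship between length, max_length and push_back can be sketched in pure Python. This is only a toy model, not spaCy's implementation: DocModel, TokenRecord and the doubling growth policy are assumptions made for illustration, and the fused LexemeOrToken argument is reduced to a plain integer ID.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenRecord:
    """Hypothetical stand-in for a TokenC struct; only two fields are modeled."""
    orth: int = 0
    has_space: bool = False


class DocModel:
    """Toy token buffer that tracks length and capacity separately."""

    def __init__(self, initial_capacity: int = 2) -> None:
        # Preallocate empty records, mirroring a fixed-size array of structs.
        self.c: List[TokenRecord] = [TokenRecord() for _ in range(initial_capacity)]
        self.length = 0                     # number of tokens actually stored
        self.max_length = initial_capacity  # number of allocated slots

    def push_back(self, orth: int, has_space: bool) -> int:
        # Grow when full. Doubling is an assumption of this sketch.
        if self.length == self.max_length:
            self.c.extend(TokenRecord() for _ in range(self.max_length))
            self.max_length *= 2
        slot = self.c[self.length]
        slot.orth = orth
        slot.has_space = has_space
        self.length += 1
        return self.length - 1  # index of the token that was just written


doc = DocModel(initial_capacity=2)
for orth, has_space in [(101, True), (102, True), (103, False)]:
    doc.push_back(orth, has_space)
```

After three appends into a buffer of capacity two, length is 3 while max_length has grown to 4, which is the distinction the attribute table draws between the token count and the underlying array size.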

Token
cdef class

A Cython class providing access and methods for a TokenC struct. Note that the Token object does not own the struct. It only receives a pointer to it.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| vocab | Vocab | A reference to the shared Vocab object. |
| c | TokenC* | A pointer to a TokenC struct. |
| i | int | The offset of the token within the document. |
| doc | Doc | The parent document. |

Token.cinit
method

Create a Token object from a TokenC* pointer.

| Name | Type | Description |
| --- | --- | --- |
| vocab | Vocab | A reference to the shared Vocab. |
| c | TokenC* | A pointer to a TokenC struct. |
| offset | int | The offset of the token within the document. |
| doc | Doc | The parent document. |
| returns | Token | The newly constructed object. |
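
The point that a Token only receives a pointer and does not own the struct can be illustrated with a small pure-Python analogy. TokenView and the dictionary records are hypothetical names for this sketch; the vocab and doc arguments of Token.cinit are left out, and a Python object reference stands in for the TokenC* pointer.

```python
class TokenView:
    """Hypothetical non-owning view: keeps a reference to a record, never a copy."""

    def __init__(self, records, offset):
        self.c = records[offset]  # reference to the parent's record
        self.i = offset           # offset of the token within the document


# The parent container owns the records; the view only points at one of them.
records = [{"orth": 101}, {"orth": 102}]
token = TokenView(records, 1)

# A change made through the owner is visible through the view.
records[1]["orth"] = 999
seen_through_view = token.c["orth"]
```

Because the view shares the record with its owner, it is only meaningful for as long as the owner keeps that record alive, which is the practical consequence of not owning the struct.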

Span
cdef class

A Cython class providing access and methods for a slice of a Doc object.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The parent document. |
| start | int | The index of the first token of the span. |
| end | int | The index of the first token after the span. |
| start_char | int | The index of the first character of the span. |
| end_char | int | The character offset of the end of the span (exclusive). |
| label | attr_t | A label to attach to the span, e.g. for named entities. |
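
How the token indices relate to the character offsets can be sketched in plain Python. The sketch assumes end_char is an exclusive character offset, so that slicing the text with [start_char:end_char] yields the span text; span_offsets and the sample words are hypothetical and not part of the API.

```python
words = ["Give", "it", "back"]
spaces = [True, True, False]  # whether each token has trailing whitespace

# Rebuild the text and record where each token starts.
text = ""
token_starts = []
for word, has_space in zip(words, spaces):
    token_starts.append(len(text))
    text += word + (" " if has_space else "")


def span_offsets(start, end):
    """Map a half-open token range [start, end) to character offsets."""
    start_char = token_starts[start]
    last = end - 1  # end points at the first token after the span
    end_char = token_starts[last] + len(words[last])
    return start_char, end_char


start_char, end_char = span_offsets(1, 3)
span_text = text[start_char:end_char]
```

Treating both the token range and the character range as half-open means the span length is simply end minus start, and adjacent spans can share a boundary without overlapping.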

Lexeme
cdef class

A Cython class providing access and methods for an entry in the vocabulary.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| c | LexemeC* | A pointer to a LexemeC struct. |
| vocab | Vocab | A reference to the shared Vocab object. |
| orth | attr_t | ID of the verbatim text content. |

Vocab
cdef class

A Cython class providing access and methods for a vocabulary and other data shared across a language.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| mem | cymem.Pool | A memory pool. Allocated memory will be freed once the Vocab object is garbage collected. |
| strings | StringStore | A StringStore that maps strings to hash values and vice versa. |
| length | int | The number of entries in the vocabulary. |

Vocab.get
method

Retrieve a LexemeC* pointer from the vocabulary by its string.

| Name | Type | Description |
| --- | --- | --- |
| mem | cymem.Pool | A memory pool. Allocated memory will be freed once the Vocab object is garbage collected. |
| string | unicode | The string of the word to look up. |
| returns | const LexemeC* | The lexeme in the vocabulary. |

Vocab.get_by_orth
method

Retrieve a LexemeC* pointer from the vocabulary by its orth ID.

| Name | Type | Description |
| --- | --- | --- |
| mem | cymem.Pool | A memory pool. Allocated memory will be freed once the Vocab object is garbage collected. |
| orth | attr_t | ID of the verbatim text content. |
| returns | const LexemeC* | The lexeme in the vocabulary. |
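
The two lookups differ only in their key: one starts from a string, the other from an orth ID. The pure-Python sketch below models that relationship under several assumptions that are not stated above: VocabModel and hash_string are hypothetical, BLAKE2b stands in for whatever hash function the library uses, a missing entry is created on first lookup, and the mem argument is omitted because Python manages memory itself.

```python
import hashlib


def hash_string(string):
    """Hypothetical 64-bit string hash (BLAKE2b is a stand-in, not the real hash)."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class VocabModel:
    """Toy lexicon keyed by orth ID."""

    def __init__(self):
        self._by_orth = {}  # orth ID -> lexeme record

    @property
    def length(self):
        return len(self._by_orth)

    def get(self, string):
        # A string lookup reduces to an orth lookup once the string is hashed.
        return self.get_by_orth(hash_string(string))

    def get_by_orth(self, orth):
        lexeme = self._by_orth.get(orth)
        if lexeme is None:
            # Creating the entry on a miss is an assumption of this sketch.
            lexeme = {"orth": orth}
            self._by_orth[orth] = lexeme
        return lexeme


vocab = VocabModel()
first = vocab.get("hello")
second = vocab.get_by_orth(first["orth"])
```

Both calls hand back the same record, so the vocabulary holds a single entry per distinct word regardless of which lookup path was used.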

StringStore
cdef class

A lookup table to retrieve strings by 64-bit hashes.

Attributes

| Name | Type | Description |
| --- | --- | --- |
| mem | cymem.Pool | A memory pool. Allocated memory will be freed once the StringStore object is garbage collected. |
| keys | vector[hash_t] | A list of hash values in the StringStore. |
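
A lookup table of this kind can be modeled in a few lines of Python. The sketch is an illustration rather than the library's code: StringStoreModel, its add method and hash_string are hypothetical names, and BLAKE2b truncated to 8 bytes is used only as a convenient deterministic 64-bit hash.

```python
import hashlib


def hash_string(string):
    """Hypothetical 64-bit string hash (BLAKE2b is a stand-in, not the real hash)."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class StringStoreModel:
    """Toy two-way table between strings and 64-bit hash values."""

    def __init__(self):
        self.keys = []      # hash values in insertion order
        self._by_hash = {}  # hash value -> string

    def add(self, string):
        key = hash_string(string)
        if key not in self._by_hash:
            self._by_hash[key] = string
            self.keys.append(key)
        return key

    def __getitem__(self, key_or_string):
        # Strings map to their hash; hashes map back to the stored string.
        if isinstance(key_or_string, str):
            return hash_string(key_or_string)
        return self._by_hash[key_or_string]


store = StringStoreModel()
key = store.add("coffee")
round_trip = store[key]
```

Hashing is a pure function of the string, so the string-to-hash direction needs no storage; only the hash-to-string direction requires the table, which is why keys lists hash values rather than strings.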