Vocab

A look-up table that allows you to access Lexeme objects. The Vocab instance also provides access to the StringStore, and owns underlying C-data that is shared between Doc objects.

Attributes

Name | Type | Description
strings | StringStore | A table managing the string-to-int mapping.
vectors_length | int | The dimensionality of the word vectors, if present.

Vocab.load

Load the vocabulary from a path.

Name | Type | Description
path | Path | The path to load from.
lex_attr_getters | dict | A dictionary mapping attribute IDs to functions to compute them. Defaults to None.
lemmatizer | - | A lemmatizer. Defaults to None.
tag_map | dict | A dictionary mapping fine-grained tags to coarse-grained parts-of-speech, and optionally morphological attributes.
oov_prob | float | The default probability for out-of-vocabulary words.
return | Vocab | The newly constructed object.

Vocab.__init__

Create the vocabulary.

Name | Type | Description
lex_attr_getters | dict | A dictionary mapping attribute IDs to functions to compute them. Defaults to None.
lemmatizer | - | A lemmatizer. Defaults to None.
tag_map | dict | A dictionary mapping fine-grained tags to coarse-grained parts-of-speech, and optionally morphological attributes.
oov_prob | float | The default probability for out-of-vocabulary words.
return | Vocab | The newly constructed object.
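
To make the shape of the two dictionary arguments concrete, here is a minimal sketch in plain Python. The attribute IDs (LOWER, IS_ALPHA), the "pos" key, the tag entries, and the compute_attrs helper are all assumptions chosen for illustration; they are not taken from the library.

```python
# Hypothetical attribute IDs; real IDs would come from the library.
LOWER, IS_ALPHA = 1, 2

# Attribute ID -> function that computes the attribute from a string.
lex_attr_getters = {
    LOWER: lambda string: string.lower(),
    IS_ALPHA: lambda string: string.isalpha(),
}

# Fine-grained tag -> coarse-grained part of speech, plus optional
# morphological attributes. The "pos" key is an assumption.
tag_map = {
    "NNS": {"pos": "NOUN", "Number": "plur"},
    "VBZ": {"pos": "VERB", "Tense": "pres"},
    ".": {"pos": "PUNCT"},
}


def compute_attrs(string, getters):
    """Hypothetical helper: apply every getter to one string."""
    return {attr_id: getter(string) for attr_id, getter in getters.items()}


attrs = compute_attrs("Apples", lex_attr_getters)
```

Here attrs maps each hypothetical attribute ID to the value its getter computed for "Apples".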

Vocab.__len__

Get the number of lexemes in the vocabulary.

Name | Type | Description
return | int | The number of lexemes in the vocabulary.

Vocab.__getitem__

Retrieve a lexeme, given an int ID or a unicode string. If a previously unseen unicode string is given, a new lexeme is created and stored.

Name | Type | Description
id_or_string | int / unicode | The integer ID of a word, or its unicode string.
return | Lexeme | The lexeme indicated by the given ID.
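
The look-up behaviour described above (fetch by ID or string, create on an unseen string) can be sketched with a toy table. ToyVocab and ToyLexeme are hypothetical names, and the sequential integer IDs are an assumption made for the example; this is not the library's implementation.

```python
class ToyLexeme:
    """Hypothetical stand-in for a Lexeme: just an ID and its text."""

    def __init__(self, orth_id, text):
        self.orth_id = orth_id
        self.text = text


class ToyVocab:
    """Illustrative look-up table; not the library's implementation."""

    def __init__(self):
        self._by_string = {}  # text -> ToyLexeme
        self._by_id = {}      # integer ID -> ToyLexeme

    def __len__(self):
        return len(self._by_string)

    def __contains__(self, string):
        # Membership checks never create an entry.
        return string in self._by_string

    def __iter__(self):
        return iter(self._by_string.values())

    def __getitem__(self, id_or_string):
        if isinstance(id_or_string, int):
            # Integer IDs only resolve to entries that already exist.
            return self._by_id[id_or_string]
        lexeme = self._by_string.get(id_or_string)
        if lexeme is None:
            # Unseen string: create a new entry and store it.
            lexeme = ToyLexeme(len(self._by_id), id_or_string)
            self._by_string[id_or_string] = lexeme
            self._by_id[lexeme.orth_id] = lexeme
        return lexeme


vocab = ToyVocab()
apple = vocab["apple"]       # creates and stores a new entry
same = vocab[apple.orth_id]  # fetches the same entry by integer ID
```

In this sketch, indexing with a string has a side effect (the table may grow), while `in` and `len` only inspect it.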

Vocab.__iter__

Iterate over the lexemes in the vocabulary.

Name | Type | Description
yield | Lexeme | An entry in the vocabulary.

Vocab.__contains__

Check whether the string has an entry in the vocabulary.

Name | Type | Description
string | unicode | The ID string.
return | bool | Whether the string has an entry in the vocabulary.

Vocab.resize_vectors

Set vectors_length to a new size, and allocate more memory for the Lexeme vectors if necessary. The memory will be zeroed.

Name | Type | Description
new_size | int | The new size of the vectors.
return | None | -

Vocab.add_flag

Set a new boolean flag on words in the vocabulary.

Name | Type | Description
flag_getter | callable | A function f(unicode) -> bool, to get the flag value.
flag_id | int | An integer between 1 and 63 (inclusive), specifying the bit at which the flag will be stored. If -1, the lowest available bit will be chosen.
return | int | The integer ID by which the flag value can be checked.
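
The bit-position semantics of flag_id can be illustrated with a small sketch that packs boolean flags into one integer per word. ToyFlags, add_word, and check_flag are hypothetical names invented for this example, and the storage layout is an assumption rather than the library's actual one.

```python
class ToyFlags:
    """Illustrative only: boolean flags packed into one integer per word."""

    def __init__(self):
        self.getters = {}  # flag_id -> function f(str) -> bool
        self.flags = {}    # word -> integer whose bits hold the flag values

    def add_word(self, word):
        self.flags.setdefault(word, 0)

    def add_flag(self, flag_getter, flag_id=-1):
        if flag_id == -1:
            # Pick the lowest bit in 1..63 that is not yet in use.
            flag_id = next(i for i in range(1, 64) if i not in self.getters)
        elif not 1 <= flag_id <= 63:
            raise ValueError("flag_id must be between 1 and 63, or -1")
        self.getters[flag_id] = flag_getter
        for word in self.flags:
            if flag_getter(word):
                self.flags[word] |= 1 << flag_id     # set the bit
            else:
                self.flags[word] &= ~(1 << flag_id)  # clear the bit
        return flag_id

    def check_flag(self, word, flag_id):
        return bool(self.flags[word] & (1 << flag_id))


table = ToyFlags()
for w in ("Apple", "pear"):
    table.add_word(w)

is_title = table.add_flag(str.istitle)                       # lowest free bit
is_short = table.add_flag(lambda w: len(w) <= 4, flag_id=5)  # explicit bit
```

The returned ID is what a caller would later pass to check the flag, mirroring the return value described in the table.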

Vocab.dump

Save the lexemes' binary data to the given location.

Name | Type | Description
loc | Path | The path to save to.
return | None | -

Vocab.load_lexemes

Load the lexemes' binary data from the given location.

Name | Type | Description
loc | unicode | Path to load the lexemes.bin file from.
return | None | -

Vocab.dump_vectors

Save the word vectors to a binary file.

Name | Type | Description
loc | Path | The path to save to.
return | None | -

Vocab.load_vectors

Load vectors from a text-based file.

Name | Type | Description
file_ | buffer | The file to read from. Entries should be separated by newlines, and each entry should be whitespace delimited. The first value of the entry should be the word string, and subsequent entries should be the values of the vector.
return | int | The length of the vectors loaded.
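
The text format described above (one entry per line, whitespace delimited, word first, vector values after) can be parsed along these lines. read_text_vectors is a hypothetical helper written for illustration; the consistency check and the returned dictionary are assumptions, not documented behaviour of the library.

```python
import io


def read_text_vectors(file_):
    """Parse 'word v1 v2 ...' lines; return the vectors and their length."""
    vectors = {}
    length = 0
    for line in file_:
        pieces = line.split()
        if not pieces:
            continue  # skip blank lines
        word = pieces[0]
        values = [float(x) for x in pieces[1:]]
        if vectors and len(values) != length:
            raise ValueError("inconsistent vector length for %r" % word)
        length = len(values)
        vectors[word] = values
    return vectors, length


# An in-memory buffer stands in for a file opened in text mode.
sample = io.StringIO("apple 0.1 0.2 0.3\npear 0.4 0.5 0.6\n")
vectors, length = read_text_vectors(sample)
```

Any object that yields lines of text when iterated would work in place of the in-memory buffer.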

Vocab.load_vectors_from_bin_loc

Load vectors from the location of a binary file.

Name | Type | Description
loc | unicode | The path of the binary file to load from.
return | int | The length of the vectors loaded.