Doc

A container for accessing linguistic annotations.

Attributes

Name | Type | Description
mem | Pool | The document's local memory heap, for all C data it owns.
vocab | Vocab | The store of lexical types.
user_data | - | A generic storage area for custom user data.
is_tagged | bool | A flag indicating that the document has been part-of-speech tagged.
is_parsed | bool | A flag indicating that the document has been syntactically parsed.
sentiment | float | The document's positivity/negativity score, if available.
user_hooks | dict | A dictionary that allows customisation of the Doc's properties.
user_token_hooks | dict | A dictionary that allows customisation of properties of Token children.
user_span_hooks | dict | A dictionary that allows customisation of properties of Span children.

Doc.__init__

Construct a Doc object.

Name | Type | Description
vocab | Vocab | A storage container for lexical types.
words | - | A list of strings to add to the container.
spaces | - | A list of boolean values indicating whether each word is followed by a space. If specified, it must have the same length as words. Defaults to a sequence of True.
return | Doc | The newly constructed object.
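
As a rough illustration of how words and spaces together determine the document text, here is a minimal pure-Python sketch. The helper `join_words` is hypothetical and not part of the library; it only models the rule described above, assuming each word is followed by exactly one space when its flag is True.

```python
def join_words(words, spaces=None):
    """Hypothetical helper: rebuild text from words and trailing-space flags."""
    if spaces is None:
        # Mirrors the documented default: every word is followed by a space.
        spaces = [True] * len(words)
    if len(spaces) != len(words):
        raise ValueError("spaces must have the same length as words")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


text = join_words(["Hello", "world", "!"], [True, False, False])
default_text = join_words(["Hello", "world"])
```

With explicit flags the punctuation attaches to the preceding word; with the default, every word, including the last, carries a trailing space.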

Doc.__getitem__

Get a Token object.

Name | Type | Description
i | int | The index of the token.
return | Token | The token at doc[i].

Get a Span object.

Name | Type | Description
start_end | tuple | The slice of the document to get.
return | Span | The span at doc[start : end].
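
The two access forms can be pictured with a small stand-in class. This is only a sketch: `TokenSequence` is a hypothetical name, tokens are plain strings, and a tuple plays the role of a Span. It relies on the fact that Python passes an int for `seq[i]` and a slice object for `seq[start:end]`.

```python
class TokenSequence:
    """Hypothetical stand-in for a Doc, holding plain strings as tokens."""

    def __init__(self, words):
        self.words = list(words)

    def __len__(self):
        return len(self.words)

    def __iter__(self):
        return iter(self.words)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # A slice yields a span-like object: here, a tuple of tokens.
            return tuple(self.words[key])
        # An integer yields a single token.
        return self.words[key]


seq = TokenSequence(["Give", "it", "back", "!"])
first = seq[0]    # single token
span = seq[1:3]   # span covering tokens 1 and 2
```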

Doc.__iter__

Iterate over Token objects.

Name | Type | Description
yield | Token | A Token object.

Doc.__len__

Get the number of tokens in the document.

Name | Type | Description
return | int | The number of tokens in the document.

Doc.similarity

Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.

Name | Type | Description
other | - | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects.
return | float | A scalar similarity score. Higher is more similar.
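
The default estimate described above, cosine similarity over averaged word vectors, can be sketched in plain Python. The vectors below are made up for illustration, and the function names are hypothetical; this is not the library's implementation, and the zero-vector handling in particular is an assumption.

```python
import math


def average(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 2D word vectors for three tiny "documents".
doc_a = [[1.0, 0.0], [1.0, 2.0]]  # averages to [1.0, 1.0]
doc_b = [[2.0, 2.0]]              # same direction as doc_a's average
doc_c = [[1.0, -1.0]]             # orthogonal to doc_a's average

score_ab = cosine(average(doc_a), average(doc_b))
score_ac = cosine(average(doc_a), average(doc_c))
```

Documents whose averaged vectors point in the same direction score close to 1, while orthogonal averages score 0.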

Doc.to_array

Export the document annotations to a numpy array of shape N*M where N is the length of the document and M is the number of attribute IDs to export. The values will be 32-bit integers.

Name | Type | Description
attr_ids | ints | A list of attribute ID ints.
return | numpy.ndarray[ndim=2, dtype='int32'] | The exported attributes as a 2D numpy array, with one row per token and one column per attribute.
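
To make the N*M layout concrete, the following sketch builds the same row-per-token, column-per-attribute structure with nested lists instead of a numpy array. The attribute IDs, the integer values and the helper `to_rows` are all hypothetical and chosen only for illustration.

```python
# Hypothetical attribute IDs; the real IDs are defined by the library.
ORTH, POS = 0, 1

# Each token is modelled as a mapping from attribute ID to an integer value.
tokens = [
    {ORTH: 101, POS: 7},
    {ORTH: 102, POS: 9},
    {ORTH: 103, POS: 7},
]


def to_rows(tokens, attr_ids):
    """Return one row per token and one column per requested attribute."""
    return [[token[attr_id] for attr_id in attr_ids] for token in tokens]


rows = to_rows(tokens, [ORTH, POS])
shape = (len(rows), len(rows[0]))  # (N tokens, M attributes)
```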

Doc.count_by

Count the frequencies of a given attribute.

Name | Type | Description
attr_id | int | The attribute ID.
return | dict | A dictionary mapping attributes to integer counts.
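
Conceptually, the result is a frequency table over one attribute's values. A minimal sketch using the standard library, assuming a made-up list holding one integer attribute value per token:

```python
from collections import Counter

# Hypothetical integer values of a single attribute (for example a
# part-of-speech ID), one entry per token in the document.
pos_ids = [7, 9, 7, 7, 4]

# Map each attribute value to the number of tokens that carry it.
counts = dict(Counter(pos_ids))
```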

Doc.from_array

Load attributes from a numpy array.

Name | Type | Description
attr_ids | ints | A list of attribute ID ints.
values | numpy.ndarray[ndim=2, dtype='int32'] | The attribute values to load.
return | None | -

Doc.to_bytes

Export the document contents to a binary string.

Name | Type | Description
return | bytes | A losslessly serialized copy of the Doc including all annotations.

Doc.from_bytes

Import the document contents from a binary string.

Name | Type | Description
byte_string | bytes | The string to load from.
return | Doc | The Doc object itself (self).

Doc.merge

Retokenize the document, such that the span at doc.text[start_idx : end_idx] is merged into a single token. If start_idx and end_idx do not mark start and end token boundaries, the document remains unchanged.

Name | Type | Description
start_idx | int | The character index of the start of the slice to merge.
end_idx | int | The character index after the end of the slice to merge.
**attributes | - | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span.
return | Token | The newly merged token, or None if the start and end indices did not fall at token boundaries.
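
The boundary rule is easiest to see on character offsets. The sketch below is a simplified model, not the library's code: tokens are (start, end) character offsets, `merge_by_chars` is a hypothetical helper, and it returns the new offset list rather than a Token. It returns None when the indices do not line up with token boundaries, leaving the input untouched.

```python
def merge_by_chars(tokens, start_idx, end_idx):
    """Hypothetical sketch: merge the tokens covering [start_idx, end_idx).

    tokens is a list of (start, end) character offsets in document order.
    Returns a new offset list, or None if the indices are not token boundaries.
    """
    starts = {start for start, _ in tokens}
    ends = {end for _, end in tokens}
    if start_idx >= end_idx or start_idx not in starts or end_idx not in ends:
        return None  # Not on token boundaries: nothing changes.
    merged = []
    for start, end in tokens:
        if start == start_idx:
            # The first covered token is replaced by the merged token.
            merged.append((start_idx, end_idx))
        elif start_idx < start < end_idx:
            continue  # Absorbed into the merged token.
        else:
            merged.append((start, end))
    return merged


text = "I like New York"
tokens = [(0, 1), (2, 6), (7, 10), (11, 15)]

merged = merge_by_chars(tokens, 7, 15)
merged_texts = [text[start:end] for start, end in merged]
rejected = merge_by_chars(tokens, 8, 15)  # 8 falls inside "New"
```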

Doc.read_bytes

A static method, used to read serialized Doc objects from a file.

Name | Type | Description
file | buffer | A binary buffer to read the serialized annotations from.
yield | bytes | Binary strings from which documents can be loaded.

Doc.text

A unicode representation of the document text.

Name | Type | Description
return | unicode | The original verbatim text of the document.

Doc.text_with_ws

An alias of Doc.text, provided for duck-type compatibility with Span and Token.

Name | Type | Description
return | unicode | The original verbatim text of the document.

Doc.sents

Iterate over the sentences in the document.

Name | Type | Description
yield | Span | Sentences in the document.

Doc.ents

Iterate over the entities in the document.

Name | Type | Description
yield | Span | Entities in the document.

Doc.noun_chunks

Iterate over the base noun phrases in the document. A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it.

Name | Type | Description
yield | Span | Noun chunks in the document.

Doc.vector

A real-valued meaning representation. Defaults to an average of the token vectors.

Name | Type | Description
return | numpy.ndarray[ndim=1, dtype='float32'] | A 1D numpy array representing the document's semantics.

Doc.has_vector

A boolean value indicating whether a word vector is associated with the object.

Name | Type | Description
return | bool | Whether the document has vector data attached.