Doc
class
A container for accessing linguistic annotations.

A Doc is a sequence of Token objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings. The Doc object holds an array of TokenC structs. The Python-level Token and Span objects are views of this array, i.e. they don't own the data themselves.

Doc.__init__
method

Construct a Doc object. The most common way to get a Doc object is via the nlp object.

| Name | Type | Description |
| --- | --- | --- |
| vocab | Vocab | A storage container for lexical types. |
| words | - | A list of strings to add to the container. |
| spaces | - | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as words, if specified. Defaults to a sequence of True. |
| returns | Doc | The newly constructed object. |
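
The relationship between words and spaces can be pictured with a small stand-in. This is only a sketch in plain Python, not spaCy's implementation: `join_words` is a hypothetical helper that rebuilds the text implied by the two lists, applying the documented default of True for every word when spaces is omitted.

```python
def join_words(words, spaces=None):
    """Rebuild the text implied by `words` and `spaces` (hypothetical helper)."""
    if spaces is None:
        # Mirrors the documented default: every word is followed by a space.
        spaces = [True] * len(words)
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]
text = join_words(words, spaces)
print(text)  # Hello, world!
```

Passing explicit spaces matters whenever punctuation attaches directly to a neighbouring word, as with the comma and exclamation mark here.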

Doc.__getitem__
method

Get a Token object at position i, where i is an integer. Negative indexing is supported, and follows the usual Python semantics, i.e. doc[-2] is doc[len(doc) - 2].

| Name | Type | Description |
| --- | --- | --- |
| i | int | The index of the token. |
| returns | Token | The token at doc[i]. |

Get a Span object, starting at position start (token index) and ending at position end (token index).

For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, as Span objects must be contiguous (cannot have gaps). You can use negative indices and open-ended ranges, which have their normal Python semantics.

| Name | Type | Description |
| --- | --- | --- |
| start_end | tuple | The slice of the document to get. |
| returns | Span | The span at doc[start : end]. |
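
The indexing rules above can be sketched in plain Python. The helpers `resolve_index` and `resolve_span` are hypothetical, a list of strings stands in for the tokens, and the exception raised for stepped slices is an assumption made for this sketch rather than documented behaviour.

```python
def resolve_index(i, length):
    """Map a possibly negative index onto 0..length-1 (hypothetical helper)."""
    if i < 0:
        i += length
    if not 0 <= i < length:
        raise IndexError("token index out of range")
    return i


def resolve_span(key, length):
    """Turn a slice into (start, end) token offsets, rejecting stepped slices."""
    if key.step not in (None, 1):
        # Assumption: the error type is chosen for illustration only.
        raise ValueError("stepped slices are not supported: spans must be contiguous")
    start, end, _ = key.indices(length)
    return start, max(start, end)


tokens = ["Give", "it", "back", "!", "He", "pleaded", "."]
n = len(tokens)
second_last = tokens[resolve_index(-2, n)]     # same as tokens[n - 2]
span_2_5 = resolve_span(slice(2, 5), n)        # covers tokens 2, 3 and 4
open_ended = resolve_span(slice(-3, None), n)  # last three tokens
```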

Doc.__iter__
method

Iterate over Token objects, from which the annotations can be easily accessed.

Iteration is the primary way to access Token objects, which in turn are the primary way to access annotations from Python. If faster-than-Python speeds are required, you can instead access the annotations as a numpy array, or access the underlying C data directly from Cython.

| Name | Type | Description |
| --- | --- | --- |
| yields | Token | A Token object. |

Doc.__len__
method

Get the number of tokens in the document.

| Name | Type | Description |
| --- | --- | --- |
| returns | int | The number of tokens in the document. |

Doc.set_extension
classmethod
New in spaCy v2.0.

Define a custom attribute on the Doc which becomes available via Doc._. For details, see the documentation on custom attributes.

| Name | Type | Description |
| --- | --- | --- |
| name | unicode | Name of the attribute to set by the extension. For example, 'my_attr' will be available as doc._.my_attr. |
| default | - | Optional default value of the attribute if no getter or method is defined. |
| method | callable | Set a custom method on the object, for example doc._.compare(other_doc). |
| getter | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the ._ attribute. |
| setter | callable | Setter function that takes the Doc and a value, and modifies the object. Is called when the user writes to the Doc._ attribute. |

Doc.get_extension
classmethod
New in spaCy v2.0.

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

| Name | Type | Description |
| --- | --- | --- |
| name | unicode | Name of the extension. |
| returns | tuple | A (default, method, getter, setter) tuple of the extension. |

Doc.has_extension
classmethod
New in spaCy v2.0.

Check whether an extension has been registered on the Doc class.

| Name | Type | Description |
| --- | --- | --- |
| name | unicode | Name of the extension to check. |
| returns | bool | Whether the extension has been registered. |

Doc.remove_extension
classmethod
New in spaCy v2.0.12.

Remove a previously registered extension.

| Name | Type | Description |
| --- | --- | --- |
| name | unicode | Name of the extension. |
| returns | tuple | A (default, method, getter, setter) tuple of the removed extension. |
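
The bookkeeping behind the four extension classmethods (set, get, has, remove) can be pictured as a dictionary of 4-tuples. The `ExtensionRegistry` class below is a hypothetical toy written for this page, not spaCy's implementation; it only mirrors the documented return values, and the KeyError on removing an unknown name is a property of this sketch.

```python
class ExtensionRegistry:
    """Toy stand-in for the extension bookkeeping; not spaCy's implementation."""

    def __init__(self):
        self._extensions = {}

    def set_extension(self, name, default=None, method=None, getter=None, setter=None):
        # Store the four slots in the order the documentation lists them.
        self._extensions[name] = (default, method, getter, setter)

    def get_extension(self, name):
        # A plain dict lookup raises KeyError for unknown names.
        return self._extensions[name]

    def has_extension(self, name):
        return name in self._extensions

    def remove_extension(self, name):
        # pop() returns the removed tuple and raises KeyError if it is missing.
        return self._extensions.pop(name)


registry = ExtensionRegistry()
registry.set_extension("my_attr", default=False)
registered = registry.has_extension("my_attr")
removed = registry.remove_extension("my_attr")
still_registered = registry.has_extension("my_attr")
```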

Doc.char_span
method
New in spaCy v2.0.

Create a Span object from the slice doc.text[start : end]. Returns None if the character indices don't map to a valid span.

| Name | Type | Description |
| --- | --- | --- |
| start | int | The index of the first character of the span. |
| end | int | The index of the first character after the span. |
| label | uint64 / unicode | A label to attach to the Span, e.g. for named entities. |
| vector | numpy.ndarray[ndim=1, dtype='float32'] | A meaning representation of the span. |
| returns | Span | The newly constructed object or None. |
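
Why some character ranges yield None can be illustrated with a plain-Python sketch. Both helpers are hypothetical and use lists of words and spaces in place of a real Doc; they only show the alignment check, under the assumption that a valid span must start and end exactly on token boundaries.

```python
def token_offsets(words, spaces):
    """Return (start_char, end_char) for each token (hypothetical helper)."""
    offsets, pos = [], 0
    for word, space in zip(words, spaces):
        offsets.append((pos, pos + len(word)))
        pos += len(word) + (1 if space else 0)
    return offsets


def char_to_token_span(offsets, start, end):
    """Map character offsets to (start_token, end_token), or None if misaligned."""
    starts = {s: i for i, (s, _) in enumerate(offsets)}
    ends = {e: i + 1 for i, (_, e) in enumerate(offsets)}
    if start not in starts or end not in ends:
        return None
    if ends[end] <= starts[start]:
        return None
    return starts[start], ends[end]


words = ["I", "like", "New", "York"]
spaces = [True, True, True, False]
offsets = token_offsets(words, spaces)           # text is "I like New York"
aligned = char_to_token_span(offsets, 7, 15)     # "New York"
misaligned = char_to_token_span(offsets, 8, 15)  # starts inside "New"
```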

Doc.similarity
method
Needs model: spaCy needs an installed model that supports vectors.

Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.

| Name | Type | Description |
| --- | --- | --- |
| other | - | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects. |
| returns | float | A scalar similarity score. Higher is more similar. |
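
The default estimate described above, cosine similarity over averaged word vectors, can be sketched without spaCy. The vectors below are made up for illustration, and returning 0.0 for a zero-norm vector is an assumption of this sketch.

```python
import math


def average(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = [[1.0, 0.0], [0.0, 1.0]]  # made-up 2-d word vectors, one per token
doc_b = [[2.0, 2.0]]
doc_c = [[1.0, -1.0]]
sim_ab = cosine(average(doc_a), average(doc_b))  # same direction
sim_ac = cosine(average(doc_a), average(doc_c))  # orthogonal
```

Because cosine similarity ignores vector length, doc_a and doc_b score as near-identical even though their averaged vectors differ in magnitude.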

Doc.count_by
method

Count the frequencies of a given attribute. Produces a dict of {attr (int): count (int)} frequencies, keyed by the values of the given attribute ID.

| Name | Type | Description |
| --- | --- | --- |
| attr_id | int | The attribute ID. |
| returns | dict | A dictionary mapping attributes to integer counts. |
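
The shape of the result can be illustrated with a plain-Python stand-in. The function below is hypothetical and takes a ready-made list of integer attribute values, one per token; the IDs are invented for the example.

```python
from collections import Counter


def count_by(attribute_values):
    """Count how often each attribute ID occurs (hypothetical stand-in)."""
    return dict(Counter(attribute_values))


# Made-up integer IDs standing in for one attribute value per token.
orth_ids = [101, 205, 101, 307, 101, 205]
counts = count_by(orth_ids)
```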

Doc.get_lca_matrix
method

Calculate the lowest common ancestor (LCA) matrix for the Doc. Each entry holds the integer index of the lowest common ancestor of a pair of tokens, or -1 if no common ancestor is found, e.g. if the span excludes a necessary ancestor.

| Name | Type | Description |
| --- | --- | --- |
| returns | numpy.ndarray[ndim=2, dtype='int32'] | The lowest common ancestor matrix of the Doc. |
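
What such a matrix contains can be sketched in plain Python. This is not spaCy's algorithm: it assumes the parse is given as a hypothetical `heads` list in which each entry is the index of that token's head and a root points to itself, and it uses nested lists instead of a numpy array.

```python
def ancestors(heads, i):
    """Return token i followed by its ancestors, ending at the root."""
    chain = [i]
    while heads[chain[-1]] != chain[-1]:
        chain.append(heads[chain[-1]])
    return chain


def lca_matrix(heads):
    """Build an N x N matrix of lowest-common-ancestor indices (-1 if none)."""
    n = len(heads)
    chains = [ancestors(heads, i) for i in range(n)]
    matrix = [[-1] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            seen = set(chains[j])
            # The first entry of i's chain that also dominates j is the lowest one.
            matrix[i][j] = next((k for k in chains[i] if k in seen), -1)
    return matrix


# heads[i] is the index of token i's head; a root points to itself.
# "This is a test": This -> is, is -> is (root), a -> test, test -> is
heads = [1, 1, 3, 1]
lca = lca_matrix(heads)
```

Each token is its own lowest common ancestor on the diagonal, and two tokens under different roots get -1.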

Doc.to_array
method

Export given token attributes to a numpy ndarray. If attr_ids is a sequence of M attributes, the output array will be of shape (N, M), where N is the length of the Doc (in tokens). If attr_ids is a single attribute, the output shape will be (N,). You can specify attributes by integer ID (e.g. spacy.attrs.LEMMA) or string name (e.g. 'LEMMA' or 'lemma'). The values will be 64-bit integers.

| Name | Type | Description |
| --- | --- | --- |
| attr_ids | list or int or string | A list of attributes (int IDs or string names) or a single attribute (int ID or string name). |
| returns | numpy.ndarray[ndim=2, dtype='uint64'] or numpy.ndarray[ndim=1, dtype='uint64'] | The exported attributes as a 2D numpy array, with one row per token and one column per attribute (when attr_ids is a list), or as a 1D numpy array, with one item per token (when attr_ids is a single value). |
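
The difference between the (N, M) and (N,) output shapes can be shown with a stand-in that uses nested lists instead of numpy arrays. The function, the dict-based tokens, and the integer values are all hypothetical and chosen only to make the shapes visible.

```python
def to_array(tokens, attr_ids):
    """Export attribute values per token (hypothetical stand-in using lists)."""
    if isinstance(attr_ids, (list, tuple)):
        # Sequence of M attributes: one row per token, one column per attribute.
        return [[token[attr] for attr in attr_ids] for token in tokens]
    # Single attribute: a flat sequence with one item per token.
    return [token[attr_ids] for token in tokens]


# Three tokens with made-up integer attribute values.
tokens = [
    {"LOWER": 11, "POS": 92},
    {"LOWER": 12, "POS": 100},
    {"LOWER": 13, "POS": 97},
]
table = to_array(tokens, ["LOWER", "POS"])  # 3 rows, 2 columns
column = to_array(tokens, "POS")            # 3 items
```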

Doc.from_array
method

Load attributes from a numpy array. Writes to the Doc object from an (N, M) array of attributes, with one row per token and one column per attribute.

| Name | Type | Description |
| --- | --- | --- |
| attrs | ints | A list of attribute ID ints. |
| array | numpy.ndarray[ndim=2, dtype='int32'] | The attribute values to load. |
| returns | Doc | Itself. |

Doc.to_disk
method
New in spaCy v2.0.

Save the current state to a directory.

| Name | Type | Description |
| --- | --- | --- |
| path | unicode or Path | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. |

Doc.from_disk
method
New in spaCy v2.0.

Loads state from a directory. Modifies the object in place and returns it.

| Name | Type | Description |
| --- | --- | --- |
| path | unicode or Path | A path to a directory. Paths may be either strings or Path-like objects. |
| returns | Doc | The modified Doc object. |

Doc.to_bytes
method

Serialize, i.e. export the document contents to a binary string.

| Name | Type | Description |
| --- | --- | --- |
| returns | bytes | A losslessly serialized copy of the Doc, including all annotations. |

Doc.from_bytes
method

Deserialize, i.e. import the document contents from a binary string.

| Name | Type | Description |
| --- | --- | --- |
| data | bytes | The string to load from. |
| returns | Doc | The Doc object. |

Doc.merge
method

Retokenize the document, such that the span at doc.text[start_idx : end_idx] is merged into a single token. If start_idx and end_idx do not mark start and end token boundaries, the document remains unchanged.

| Name | Type | Description |
| --- | --- | --- |
| start_idx | int | The character index of the start of the slice to merge. |
| end_idx | int | The character index after the end of the slice to merge. |
| **attributes | - | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span. |
| returns | Token | The newly merged token, or None if the start and end indices did not fall at token boundaries. |

Doc.print_tree
method

Returns the parse trees in JSON (dict) format. Especially useful for web applications.

| Name | Type | Description |
| --- | --- | --- |
| light | bool | Don't include lemmas or entities. |
| flat | bool | Don't include arcs or modifiers. |
| returns | dict | Parse tree as dict. |

Doc.ents
property
Needs model: spaCy needs an installed model that supports NER.

Iterate over the entities in the document. Yields named-entity Span objects, if the entity recognizer has been applied to the document.

| Name | Type | Description |
| --- | --- | --- |
| yields | Span | Entities in the document. |

Doc.noun_chunks
property
Needs model: spaCy needs an installed model that supports parse.

Iterate over the base noun phrases in the document. Yields base noun-phrase Span objects, if the document has been syntactically parsed. A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses.

| Name | Type | Description |
| --- | --- | --- |
| yields | Span | Noun chunks in the document. |

Doc.sents
property
Needs model: spaCy needs an installed model that supports parse.

Iterate over the sentences in the document. Sentence spans have no label. To improve accuracy on informal texts, spaCy calculates sentence boundaries from the syntactic dependency parse. If the parser is disabled, the sents iterator will be unavailable.

| Name | Type | Description |
| --- | --- | --- |
| yields | Span | Sentences in the document. |

Doc.has_vector
property
Needs model: spaCy needs an installed model that supports vectors.

A boolean value indicating whether a word vector is associated with the object.

| Name | Type | Description |
| --- | --- | --- |
| returns | bool | Whether the document has vector data attached. |

Doc.vector
property
Needs model: spaCy needs an installed model that supports vectors.

A real-valued meaning representation. Defaults to an average of the token vectors.

| Name | Type | Description |
| --- | --- | --- |
| returns | numpy.ndarray[ndim=1, dtype='float32'] | A 1D numpy array representing the document's semantics. |

Doc.vector_norm
property
Needs model: spaCy needs an installed model that supports vectors.

The L2 norm of the document's vector representation.

| Name | Type | Description |
| --- | --- | --- |
| returns | float | The L2 norm of the vector representation. |

Attributes

| Name | Type | Description |
| --- | --- | --- |
| text | unicode | A unicode representation of the document text. |
| text_with_ws | unicode | An alias of Doc.text, provided for duck-type compatibility with Span and Token. |
| mem | Pool | The document's local memory heap, for all C data it owns. |
| vocab | Vocab | The store of lexical types. |
| tensor (new in v2.0) | object | Container for dense vector representations. |
| cats (new in v2.0) | dictionary | Maps either a label to a score for categories applied to the whole document, or (start_char, end_char, label) to a score for categories applied to spans. start_char and end_char should be character offsets, label can be either a string or an integer ID, and score should be a float. |
| user_data | - | A generic storage area for custom user data. |
| is_tagged | bool | A flag indicating that the document has been part-of-speech tagged. |
| is_parsed | bool | A flag indicating that the document has been syntactically parsed. |
| is_sentenced | bool | A flag indicating that sentence boundaries have been applied to the document. |
| sentiment | float | The document's positivity/negativity score, if available. |
| user_hooks | dict | A dictionary that allows customisation of the Doc's properties. |
| user_token_hooks | dict | A dictionary that allows customisation of properties of Token children. |
| user_span_hooks | dict | A dictionary that allows customisation of properties of Span children. |
| _ | Underscore | User space for adding custom attribute extensions. |