Containers

Doc

class
A container for accessing linguistic annotations.

A Doc is a sequence of Token objects. You can access sentences and named entities, export annotations to numpy arrays, and losslessly serialize to compressed binary strings. The Doc object holds an array of TokenC structs. The Python-level Token and Span objects are views of this array, i.e. they don’t own the data themselves.

Doc.__init__ method

Construct a Doc object. The most common way to get a Doc object is via the nlp object.

Name | Type | Description
vocab | Vocab | A storage container for lexical types.
words | iterable | A list of strings to add to the container.
spaces | iterable | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as words, if specified. Defaults to a sequence of True.
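
As a minimal sketch of the constructor described above, the following builds a Doc by hand from words and spaces. It assumes spaCy is installed; the blank Vocab and the example words are placeholders chosen for illustration, and in practice you would usually pass nlp.vocab or simply call the nlp object.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# A blank Vocab is enough for illustration (assumption: no trained model is loaded).
vocab = Vocab()
words = ["Hello", "world", "!"]
# One flag per word: is the word followed by a space?
spaces = [True, False, False]

doc = Doc(vocab, words=words, spaces=spaces)
print(doc.text)  # Hello world!
```

Because spaces has one entry per word, the original text can be reconstructed exactly from the tokens.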

Doc.__getitem__ method

Get a Token object at position i, where i is an integer. Negative indexing is supported, and follows the usual Python semantics, i.e. doc[-2] is doc[len(doc) - 2].

Name | Type | Description
i | int | The index of the token.

Get a Span object, starting at position start (token index) and ending at position end (token index). For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, as Span objects must be contiguous (cannot have gaps). You can use negative indices and open-ended ranges, which have their normal Python semantics.

Name | Type | Description
start_end | tuple | The slice of the document to get.
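
The indexing and slicing rules above can be sketched as follows. The Doc is built by hand with a blank Vocab, and the words are made up for this example.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Give", "it", "back", "!", "He", "pleaded", "."])

first = doc[0]    # Token at position 0
last = doc[-1]    # same as doc[len(doc) - 1]
span = doc[1:3]   # Span covering tokens 1 and 2
tail = doc[-2:]   # open-ended slice with a negative start

print(first.text, last.text)
print([token.text for token in span])
print([token.text for token in tail])
```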

Doc.__iter__ method

Iterate over Token objects, from which the annotations can be easily accessed.

This is the main way of accessing Token objects, which are the main way annotations are accessed from Python. If faster-than-Python speeds are required, you can instead access the annotations as a numpy array, or access the underlying C data directly from Cython.

Doc.__len__ method

Get the number of tokens in the document.
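
A short sketch of iterating over a Doc and taking its length, again using a hand-built Doc with invented words:

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Give", "it", "back", "!"])

# Iterating yields Token objects in document order.
texts = [token.text for token in doc]
n_tokens = len(doc)
print(n_tokens, texts)
```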

Doc.set_extension classmethod (v2.0)

Define a custom attribute on the Doc which becomes available via Doc._. For details, see the documentation on custom attributes.

Name | Type | Description
name | unicode | Name of the attribute to set by the extension. For example, 'my_attr' will be available as doc._.my_attr.
default | - | Optional default value of the attribute if no getter or method is defined.
method | callable | Set a custom method on the object, for example doc._.compare(other_doc).
getter | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the ._ attribute.
setter | callable | Setter function that takes the Doc and a value, and modifies the object. Is called when the user writes to the Doc._ attribute.
force | bool | Force overwriting existing attribute.
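
The three kinds of extension (default value, getter, method) could be registered roughly like this. The attribute names reviewed, n_tokens and starts_with are hypothetical and exist only for this sketch; force=True is used so the snippet can be re-run without a "already registered" error.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Attribute with a default value; it can be overwritten per document.
Doc.set_extension("reviewed", default=False, force=True)
# Computed attribute: the getter receives the Doc.
Doc.set_extension("n_tokens", getter=lambda doc: len(doc), force=True)
# Method: the Doc is passed as the first argument.
Doc.set_extension(
    "starts_with", method=lambda doc, word: doc[0].text == word, force=True
)

doc = Doc(Vocab(), words=["Hello", "world"])
doc._.reviewed = True

print(doc._.reviewed, doc._.n_tokens, doc._.starts_with("Hello"))
```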

Doc.get_extension classmethod (v2.0)

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Name | Type | Description
name | unicode | Name of the extension.

Doc.has_extension classmethod (v2.0)

Check whether an extension has been registered on the Doc class.

Name | Type | Description
name | unicode | Name of the extension to check.

Doc.remove_extension classmethod (v2.0.12)

Remove a previously registered extension.

Name | Type | Description
name | unicode | Name of the extension.
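
A possible round trip through has_extension, get_extension and remove_extension is sketched below. The extension name source and its default value are invented for the example.

```python
from spacy.tokens import Doc

# Hypothetical extension, registered only for this sketch.
Doc.set_extension("source", default="unknown", force=True)

registered_before = Doc.has_extension("source")
# 4-tuple in the order (default, method, getter, setter).
extension = Doc.get_extension("source")

Doc.remove_extension("source")
registered_after = Doc.has_extension("source")

print(registered_before, extension[0], registered_after)
```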

Doc.char_span method (v2.0)

Create a Span object from the slice doc.text[start:end]. Returns None if the character indices don’t map to a valid span.

Name | Type | Description
start | int | The index of the first character of the span.
end | int | The index of the first character after the span.
label | uint64 / unicode | A label to attach to the Span, e.g. for named entities.
vector | numpy.ndarray[ndim=1, dtype='float32'] | A meaning representation of the span.
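
To illustrate how character offsets map to a Span, here is a small sketch with a hand-built Doc. The text and the "GPE" label are example values; the second call uses offsets that start in the middle of a token, so no valid span exists.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(
    Vocab(),
    words=["I", "like", "New", "York"],
    spaces=[True, True, True, False],
)
# doc.text is "I like New York"; characters 7 to 14 spell "New York".
span = doc.char_span(7, 15, label="GPE")

# Offset 8 falls inside the token "New", so the slice is not a valid span.
misaligned = doc.char_span(8, 15)

print(span.text, span.label_, misaligned)
```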

Doc.similarity method (needs model)

Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.

Name | Type | Description
other | - | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects.
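
The default estimate can be reproduced by hand as a cosine between averaged vectors. The sketch below does not load a model; instead it assumes that two toy 2-dimensional vectors, invented for this example, are registered on a blank Vocab via Vocab.set_vector.

```python
import numpy
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Toy vectors made up for illustration; a real model ships its own.
vocab.set_vector("cat", numpy.asarray([1.0, 0.0], dtype="float32"))
vocab.set_vector("dog", numpy.asarray([0.0, 1.0], dtype="float32"))

doc1 = Doc(vocab, words=["cat", "dog"])
doc2 = Doc(vocab, words=["cat"])

estimate = doc1.similarity(doc2)

# The same quantity computed manually from the document vectors.
manual = numpy.dot(doc1.vector, doc2.vector) / (
    doc1.vector_norm * doc2.vector_norm
)
print(float(estimate), float(manual))
```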

Doc.count_by method

Count the frequencies of a given attribute. Produces a dict of {attr (int): count (int)} frequencies, keyed by the values of the given attribute ID.

Name | Type | Description
attr_id | int | The attribute ID.
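
A short sketch of counting by the ORTH attribute. Since the keys are integer IDs, the example maps them back to strings through the vocab's string store; the words are invented.

```python
from spacy.attrs import ORTH
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["apple", "apple", "orange", "banana"])

# Keys are integer IDs of the attribute values, values are counts.
counts = doc.count_by(ORTH)

# Resolve the integer IDs back to strings for readability.
readable = {doc.vocab.strings[key]: count for key, count in counts.items()}
print(readable)
```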

Doc.get_lca_matrix method

Calculate the lowest common ancestor (LCA) matrix for a given Doc. Returns the LCA matrix, which contains the integer index of the ancestor, or -1 if no common ancestor is found, e.g. if the span excludes a necessary ancestor.

Doc.to_json method (v2.1)

Convert a Doc to JSON. The format it produces will be the new format for the spacy train command (not implemented yet). If custom underscore attributes are specified, their values need to be JSON-serializable. They’ll be added to an "_" key in the data, e.g. "_": {"foo": "bar"}.

Name | Type | Description
underscore | list | Optional list of string names of custom JSON-serializable doc._ attributes.
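
Exporting a custom underscore attribute could look like the sketch below. The extension name source and its value are hypothetical, and only the "text", "tokens" and "_" keys are inspected because the other keys depend on which annotations are set.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hypothetical custom attribute, registered only for this sketch.
Doc.set_extension("source", default="unknown", force=True)

doc = Doc(Vocab(), words=["Hello", "world"], spaces=[True, False])
doc._.source = "example"

data = doc.to_json(underscore=["source"])
print(data["text"])
print(data["_"])
print(len(data["tokens"]))
```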

Doc.to_array method

Export given token attributes to a numpy ndarray. If attr_ids is a sequence of M attributes, the output array will be of shape (N, M), where N is the length of the Doc (in tokens). If attr_ids is a single attribute, the output shape will be (N,). You can specify attributes by integer ID (e.g. spacy.attrs.LEMMA) or string name (e.g. 'LEMMA' or 'lemma'). The values will be 64-bit integers.

Returns a 2D array with one row per token and one column per attribute (when attr_ids is a list), or a 1D array with one item per token (when attr_ids is a single value).

Name | Type | Description
attr_ids | list or int or string | A list of attributes (int IDs or string names) or a single attribute (int ID or string name).
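
The two output shapes can be checked with a small sketch. ORTH and LENGTH are used here because they are filled in even for a blank Vocab; other lexical attributes may be empty without a language's attribute getters.

```python
from spacy.attrs import LENGTH, ORTH
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Hello", "world", "!"])

# Two attributes: one row per token, one column per attribute.
matrix = doc.to_array([ORTH, LENGTH])
# A single attribute given by string name: one item per token.
column = doc.to_array("ORTH")

print(matrix.shape, column.shape, matrix.dtype)
# ORTH values are integer IDs; the string store resolves them.
print(doc.vocab.strings[int(column[0])])
```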

Doc.from_array method

Load attributes from a numpy array. Writes to a Doc object from an (N, M) array of attribute values, where N is the number of tokens and M is the number of attributes.

Name | Type | Description
attrs | list | A list of attribute ID ints.
array | numpy.ndarray[ndim=2, dtype='int32'] | The attribute values to load.
exclude | list | String names of serialization fields to exclude.
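
One way to use from_array is to copy annotations between two Docs with the same tokens. In this sketch the lemmas and part-of-speech tags are set by hand (no model is involved), exported with to_array, and loaded into a fresh Doc that shares the vocab.

```python
from spacy.attrs import LEMMA, POS
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Dogs", "bark"])
# Hand-written annotations, standing in for a model's output.
doc[0].lemma_, doc[0].pos_ = "dog", "NOUN"
doc[1].lemma_, doc[1].pos_ = "bark", "VERB"

attrs = [LEMMA, POS]
array = doc.to_array(attrs)  # one row per token, one column per attribute

# A fresh Doc with the same words and the same vocab.
doc2 = Doc(doc.vocab, words=[token.text for token in doc])
doc2.from_array(attrs, array)

print([(token.text, token.lemma_, token.pos_) for token in doc2])
```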

Doc.to_disk method (v2.0)

Save the current state to a directory.

Name | Type | Description
path | unicode / Path | A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or Path-like objects.
exclude | list | String names of serialization fields to exclude.

Doc.from_disk method (v2.0)

Loads state from a directory. Modifies the object in place and returns it.

Name | Type | Description
path | unicode / Path | A path to a directory. Paths may be either strings or Path-like objects.
exclude | list | String names of serialization fields to exclude.
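
A save-and-load round trip might be sketched as follows. The temporary location and the name example.doc are arbitrary choices for this example, and the new Doc reuses the original vocab so that the stored string IDs can be resolved.

```python
import tempfile
from pathlib import Path

from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Hello", "world"], spaces=[True, False])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example.doc"  # hypothetical location
    doc.to_disk(path)
    # from_disk modifies the new Doc in place and returns it.
    restored = Doc(doc.vocab).from_disk(path)

print(restored.text)
```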

Doc.to_bytes method

Serialize, i.e. export the document contents to a binary string.

Name | Type | Description
exclude | list | String names of serialization fields to exclude.

Doc.from_bytes method

Deserialize, i.e. import the document contents from a binary string.

Name | Type | Description
data | bytes | The string to load from.
exclude | list | String names of serialization fields to exclude.
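
Serializing to bytes and back can be sketched like this. The example Doc is built by hand, and the receiving Doc is created with the same vocab.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Hello", "world"], spaces=[True, False])

doc_bytes = doc.to_bytes()
# Deserialize into a new, empty Doc that shares the vocab.
doc2 = Doc(doc.vocab).from_bytes(doc_bytes)

print(type(doc_bytes).__name__, doc2.text)
```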

Doc.retokenize contextmanager (v2.1)

Context manager to handle retokenization of the Doc. Modifications to the Doc’s tokenization are stored and then applied all at once when the context manager exits. This is much more efficient and less error-prone than applying each change as it is requested. All views of the Doc (Span and Token) created before the retokenization are invalidated, although they may accidentally continue to work.

Retokenizer.merge method

Mark a span for merging. The attrs will be applied to the resulting token (if they’re context-dependent token attributes like LEMMA or DEP) or to the underlying lexeme (if they’re context-independent lexical attributes like LOWER or IS_STOP). Writable custom extension attributes can be provided as a dictionary mapping attribute names to values as the "_" key.

Name | Type | Description
span | Span | The span to merge.
attrs | dict | Attributes to set on the merged token.
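
Merging a span inside the retokenize context manager could look like this sketch. The words and the lemma value are example data; the merge only takes effect once the with block exits.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["I", "like", "New", "York"])

with doc.retokenize() as retokenizer:
    # Mark tokens 2 and 3 for merging and set a lemma on the result.
    retokenizer.merge(doc[2:4], attrs={"LEMMA": "New York"})

# The merge has been applied now that the block has exited.
print([token.text for token in doc])
print(doc[2].lemma_)
```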

Retokenizer.split method

Mark a token for splitting into the specified orths. The heads are required to specify how the new subtokens should be integrated into the dependency tree. Each entry in the list of per-token heads can be either a token in the original document, e.g. doc[2], or a tuple consisting of the token in the original document and its subtoken index. For example, (doc[3], 1) will attach the subtoken to the second subtoken of doc[3].

This mechanism allows attaching subtokens to other newly created subtokens, without having to keep track of the changing token indices. If the specified head token will be split within the retokenizer block and no subtoken index is specified, it will default to 0. Attributes to set on subtokens can be provided as a list of values. They’ll be applied to the resulting token (if they’re context-dependent token attributes like LEMMA or DEP) or to the underlying lexeme (if they’re context-independent lexical attributes like LOWER or IS_STOP).

Name | Type | Description
token | Token | The token to split.
orths | list | The verbatim text of the split tokens. Needs to match the text of the original token.
heads | list | List of token or (token, subtoken) tuples specifying the tokens to attach the newly split subtokens to.
attrs | dict | Attributes to set on all split tokens. Attribute names mapped to list of per-token attribute values.
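
The head specification described above can be sketched with a single token that is split in two. The sentence is invented; "New" is attached to the second new subtoken via (doc[3], 1), and "York" is attached to the existing token doc[2].

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["I", "live", "in", "NewYork"])

with doc.retokenize() as retokenizer:
    # One head per new subtoken: "New" -> subtoken 1 of doc[3] ("York"),
    # "York" -> the original token doc[2] ("in").
    heads = [(doc[3], 1), doc[2]]
    # The orths must join to the original token text, "NewYork".
    retokenizer.split(doc[3], ["New", "York"], heads=heads)

print([token.text for token in doc])
print([(token.text, token.head.text) for token in doc[3:]])
```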

Doc.merge method

Retokenize the document, such that the span at doc.text[start_idx : end_idx] is merged into a single token. If start_idx and end_idx do not mark start and end token boundaries, the document remains unchanged.

Name | Type | Description
start_idx | int | The character index of the start of the slice to merge.
end_idx | int | The character index after the end of the slice to merge.
**attributes | - | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span.

Doc.ents property (needs model)

The named entities in the document. Returns a tuple of named entity Span objects, if the entity recognizer has been applied.
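
Normally the entity recognizer fills in Doc.ents. To show the shape of the returned value without a model, this sketch assigns one entity span by hand; the words and the "GPE" label are example values.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["I", "like", "New", "York"])
# Set an entity manually, standing in for the entity recognizer's output.
doc.ents = [Span(doc, 2, 4, label="GPE")]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```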

Doc.noun_chunks property (needs model)

Iterate over the base noun phrases in the document. Yields base noun-phrase Span objects, if the document has been syntactically parsed. A base noun phrase, or “NP chunk”, is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses.

Doc.sents property (needs model)

Iterate over the sentences in the document. Sentence spans have no label. To improve accuracy on informal texts, spaCy calculates sentence boundaries from the syntactic dependency parse. If the parser is disabled, the sents iterator will be unavailable.

Doc.has_vector property (needs model)

A boolean value indicating whether a word vector is associated with the object.

Doc.vector property (needs model)

A real-valued meaning representation. Defaults to an average of the token vectors.

Doc.vector_norm property (needs model)

The L2 norm of the document’s vector representation.

Attributes

Name | Type | Description
text | unicode | A unicode representation of the document text.
text_with_ws | unicode | An alias of Doc.text, provided for duck-type compatibility with Span and Token.
mem | Pool | The document’s local memory heap, for all C data it owns.
vocab | Vocab | The store of lexical types.
tensor (v2.0) | object | Container for dense vector representations.
cats (v2.0) | dictionary | Maps either a label to a score for categories applied to whole document, or (start_char, end_char, label) to score for categories applied to spans. start_char and end_char should be character offsets, label can be either a string or an integer ID, and score should be a float.
user_data | - | A generic storage area, for user custom data.
lang (v2.1) | int | Language of the document’s vocabulary.
lang_ (v2.1) | unicode | Language of the document’s vocabulary.
is_tagged | bool | A flag indicating that the document has been part-of-speech tagged.
is_parsed | bool | A flag indicating that the document has been syntactically parsed.
is_sentenced | bool | A flag indicating that sentence boundaries have been applied to the document.
is_nered (v2.1) | bool | A flag indicating that named entities have been set. Will return True if any of the tokens has an entity tag set, even if the others are unknown.
sentiment | float | The document’s positivity/negativity score, if available.
user_hooks | dict | A dictionary that allows customization of the Doc’s properties.
user_token_hooks | dict | A dictionary that allows customization of properties of Token children.
user_span_hooks | dict | A dictionary that allows customization of properties of Span children.
_ | Underscore | User space for adding custom attribute extensions.

Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Name | Description
text | The value of the Doc.text attribute.
sentiment | The value of the Doc.sentiment attribute.
tensor | The value of the Doc.tensor attribute.
user_data | The value of the Doc.user_data dictionary.
user_data_keys | The keys of the Doc.user_data dictionary.
user_data_values | The values of the Doc.user_data dictionary.
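
As a sketch of the exclude argument, the following serializes a Doc without its user_data and checks that the restored Doc no longer carries the custom entry. The key note and its value are invented for the example.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Hello", "world"], spaces=[True, False])
doc.user_data["note"] = "not for export"  # hypothetical custom data

# Leave the user_data field out of the serialized bytes.
slim_bytes = doc.to_bytes(exclude=["user_data"])
restored = Doc(doc.vocab).from_bytes(slim_bytes)

print(restored.text, "note" in restored.user_data)
```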