Containers

Doc

class
A container for accessing linguistic annotations.

A Doc is a sequence of Token objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings. The Doc object holds an array of TokenC structs. The Python-level Token and Span objects are views of this array, i.e. they don’t own the data themselves.

Doc.__init__ method

Construct a Doc object. The most common way to get a Doc object is via the nlp object.

Name | Description | Type
vocab | A storage container for lexical types. | Vocab
words | A list of strings or integer hash values to add to the document as words. | Optional[List[Union[str,int]]]
spaces | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as words, if specified. Defaults to a sequence of True. | Optional[List[bool]]
keyword-only
user_data | Optional extra data to attach to the Doc. | Dict
tags (v3.0) | A list of strings, of the same length as words, to assign as token.tag for each word. Defaults to None. | Optional[List[str]]
pos (v3.0) | A list of strings, of the same length as words, to assign as token.pos for each word. Defaults to None. | Optional[List[str]]
morphs (v3.0) | A list of strings, of the same length as words, to assign as token.morph for each word. Defaults to None. | Optional[List[str]]
lemmas (v3.0) | A list of strings, of the same length as words, to assign as token.lemma for each word. Defaults to None. | Optional[List[str]]
heads (v3.0) | A list of values, of the same length as words, to assign as the head for each word. Head indices are the absolute position of the head in the Doc. Defaults to None. | Optional[List[int]]
deps (v3.0) | A list of strings, of the same length as words, to assign as token.dep for each word. Defaults to None. | Optional[List[str]]
sent_starts (v3.0) | A list of values, of the same length as words, to assign as token.is_sent_start. Will be overridden by heads if heads is provided. Defaults to None. | Optional[List[Union[bool, int, None]]]
ents (v3.0) | A list of strings, of the same length as words, to assign as the token-based IOB tags. Defaults to None. | Optional[List[str]]
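
A minimal sketch of direct construction, assuming a fresh, empty Vocab and a hypothetical three-word input. The spaces list controls how the words are joined back into Doc.text.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hypothetical example data: a space follows only the first word.
words = ["hello", "world", "!"]
spaces = [True, False, False]

# Build the Doc directly instead of going through an nlp object.
doc = Doc(Vocab(), words=words, spaces=spaces)
text = doc.text
print(text)
```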

Doc.__getitem__ method

Get a Token object at position i, where i is an integer. Negative indexing is supported, and follows the usual Python semantics, i.e. doc[-2] is doc[len(doc) - 2].

Name | Description | Type
i | The index of the token. | int

Get a Span object, starting at position start (token index) and ending at position end (token index). For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, as Span objects must be contiguous (cannot have gaps). You can use negative indices and open-ended ranges, which have their normal Python semantics.

Name | Description | Type
start_end | The slice of the document to get. | Tuple[int, int]
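
The two access forms can be sketched as follows, assuming a hypothetical seven-word Doc built on an empty Vocab. An integer index gives a Token, a slice gives a Span.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hypothetical example words; any token sequence behaves the same way.
words = ["Give", "it", "back", "!", "He", "pleaded", "."]
doc = Doc(Vocab(), words=words)

token = doc[-2]   # negative index: same token as doc[len(doc) - 2]
span = doc[2:5]   # Span covering tokens 2, 3 and 4
tail = doc[4:]    # open-ended slice up to the end of the Doc

print(token.text, [t.text for t in span], [t.text for t in tail])
```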

Doc.__iter__ method

Iterate over Token objects, from which the annotations can be easily accessed.

This is the main way of accessing Token objects, which are the main way annotations are accessed from Python. If faster-than-Python speeds are required, you can instead access the annotations as a numpy array, or access the underlying C data directly from Cython.
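
A short sketch of plain Python iteration, assuming a hypothetical three-word Doc. The built-in len gives the token count.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Give", "it", "back"])

# Each item yielded by the Doc is a Token; collect its index and text.
pairs = [(token.i, token.text) for token in doc]
n_tokens = len(doc)
print(pairs, n_tokens)
```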

Doc.__len__ method

Get the number of tokens in the document.

Doc.set_extension classmethod

Define a custom attribute on the Doc which becomes available via Doc._. For details, see the documentation on custom attributes.

Name | Description | Type
name | Name of the attribute to set by the extension. For example, "my_attr" will be available as doc._.my_attr. | str
default | Optional default value of the attribute if no getter or method is defined. | Optional[Any]
method | Set a custom method on the object, for example doc._.compare(other_doc). | Optional[Callable[[Doc,], Any]]
getter | Getter function that takes the object and returns an attribute value. Is called when the user accesses the ._ attribute. | Optional[Callable[[Doc], Any]]
setter | Setter function that takes the Doc and a value, and modifies the object. Is called when the user writes to the Doc._ attribute. | Optional[Callable[[Doc, Any], None]]
force | Force overwriting existing attribute. | bool
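
The three kinds of extension (default value, getter, method) might be registered like this. The attribute names source, n_tokens and starts_with are hypothetical, and force=True is used only so the sketch can be re-run without a clash.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Attribute with a default value; it can be overwritten per Doc.
Doc.set_extension("source", default=None, force=True)
# Computed attribute: the getter runs on every access.
Doc.set_extension("n_tokens", getter=lambda doc: len(doc), force=True)
# Method extension: the Doc is passed as the first argument.
Doc.set_extension(
    "starts_with",
    method=lambda doc, prefix: doc.text.startswith(prefix),
    force=True,
)

doc = Doc(Vocab(), words=["hello", "world"])
doc._.source = "example"
print(doc._.source, doc._.n_tokens, doc._.starts_with("hello"))
```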

Doc.get_extension classmethod

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Name | Description | Type
name | Name of the extension. | str

Doc.has_extension classmethod

Check whether an extension has been registered on the Doc class.

Name | Description | Type
name | Name of the extension to check. | str

Doc.remove_extension classmethod

Remove a previously registered extension.

Name | Description | Type
name | Name of the extension. | str
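
A sketch of the registration life cycle, using a hypothetical has_city extension: check that it exists, look up its 4-tuple, then remove it.

```python
from spacy.tokens import Doc

# Register a hypothetical extension (force=True makes the sketch re-runnable).
Doc.set_extension("has_city", default=False, force=True)

registered = Doc.has_extension("has_city")

# The lookup gives the (default, method, getter, setter) tuple.
default, method, getter, setter = Doc.get_extension("has_city")

# Remove the extension again; afterwards it is no longer registered.
Doc.remove_extension("has_city")
still_registered = Doc.has_extension("has_city")
print(registered, default, still_registered)
```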

Doc.char_span method

Create a Span object from the slice doc.text[start_idx:end_idx]. Returns None if the character indices don’t map to a valid span using the default alignment mode "strict".

Name | Description | Type
start | The index of the first character of the span. | int
end | The index of the first character after the span. | int
label | A label to attach to the span, e.g. for named entities. | Union[int, str]
kb_id | An ID from a knowledge base to capture the meaning of a named entity. | Union[int, str]
vector | A meaning representation of the span. | numpy.ndarray[ndim=1, dtype=float32]
alignment_mode | How character indices snap to token boundaries. Options: "strict" (no snapping), "contract" (span of all tokens completely within the character span), "expand" (span of all tokens at least partially covered by the character span). Defaults to "strict". | str
span_id (v3.3.1) | An identifier to associate with the span. | Union[int, str]
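
The alignment modes can be compared on a small example. This sketch assumes a blank English pipeline (tokenizer only) and a hypothetical sentence; the offsets 8 to 15 deliberately start inside the token "New".

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp("I like New York")

# Offsets 7..15 line up exactly with the tokens "New" and "York".
aligned = doc.char_span(7, 15, label="GPE")

# Offsets 8..15 start in the middle of "New".
strict = doc.char_span(8, 15)                                # no valid span
expanded = doc.char_span(8, 15, alignment_mode="expand")     # grows to whole tokens
contracted = doc.char_span(8, 15, alignment_mode="contract") # shrinks to whole tokens

print(aligned.text, strict, expanded.text, contracted.text)
```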

Doc.set_ents method (v3.0)

Set the named entities in the document.

Name | Description | Type
entities | Spans with labels to set as entities. | List[Span]
keyword-only
blocked | Spans to set as “blocked” (never an entity) for spaCy’s built-in NER component. Other components may ignore this setting. | Optional[List[Span]]
missing | Spans with missing/unknown entity information. | Optional[List[Span]]
outside | Spans outside of entities (O in IOB). | Optional[List[Span]]
default | How to set entity annotation for tokens outside of any provided spans. Options: "blocked", "missing", "outside" and "unmodified" (preserve current state). Defaults to "outside". | str
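
A minimal sketch of setting entities by hand, assuming a hypothetical five-word Doc and made-up labels. With the default setting, tokens outside the given spans are marked as outside any entity.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Alice", "flew", "to", "New", "York"])

# Hypothetical entity spans, given as token offsets with a label.
entities = [
    Span(doc, 0, 1, label="PERSON"),
    Span(doc, 3, 5, label="GPE"),
]
doc.set_ents(entities)

ents = [(ent.text, ent.label_) for ent in doc.ents]
outside_tag = doc[1].ent_iob_  # IOB code of a token not covered by any span
print(ents, outside_tag)
```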

Doc.similarity method (needs model)

Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.

Name | Description | Type
other | The object to compare with. By default, accepts Doc, Span, Token and Lexeme objects. | Union[Doc,Span,Token,Lexeme]
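
The default estimate described above can be illustrated with plain NumPy. This is only a sketch of the arithmetic (average the word vectors, then take the cosine), not spaCy's own code; the helper names and the two-dimensional toy vectors are hypothetical.

```python
import numpy as np

def average_vector(token_vectors):
    """Average a list of word vectors into one document vector."""
    return np.mean(np.asarray(token_vectors, dtype="float32"), axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        return 0.0
    return float(np.dot(a, b) / norm)

# Toy 2-dimensional "word vectors" for two hypothetical documents.
doc_a = average_vector([[1.0, 0.0], [0.0, 1.0]])
doc_b = average_vector([[1.0, 1.0]])

score = cosine_similarity(doc_a, doc_b)
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(score, orthogonal)
```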

Doc.count_by method

Count the frequencies of a given attribute. Produces a dict of {attr (int): count (ints)} frequencies, keyed by the values of the given attribute ID.

Name | Description | Type
attr_id | The attribute ID. | int
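
A short sketch of counting by the ORTH attribute, assuming a hypothetical four-word Doc. The keys of the result are integer hash values, so the string store is used to look them up.

```python
from spacy.attrs import ORTH
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["apple", "apple", "orange", "banana"])

# Frequencies keyed by the integer value of the ORTH attribute.
counts = doc.count_by(ORTH)

# Map the strings back to their hash IDs to read the counts.
apple_count = counts[doc.vocab.strings["apple"]]
orange_count = counts[doc.vocab.strings["orange"]]
print(apple_count, orange_count)
```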

Doc.get_lca_matrix method

Calculates the lowest common ancestor (LCA) matrix for a given Doc. Returns an LCA matrix containing the integer index of the ancestor, or -1 if no common ancestor is found, e.g. if span excludes a necessary ancestor.

Doc.has_annotation method

Check whether the doc contains annotation on a Token attribute.

Name | Description | Type
attr | The attribute string name or int ID. | Union[int, str]
keyword-only
require_complete | Whether to check that the attribute is set on every token in the doc. Defaults to False. | bool
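
A sketch of the check, assuming two hypothetical Docs over the same words: one without tags and one constructed with made-up tag values for every token.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
words = ["hello", "world"]

untagged = Doc(vocab, words=words)
tagged = Doc(vocab, words=words, tags=["UH", "NN"])  # hypothetical tag values

untagged_has_tag = untagged.has_annotation("TAG")
tagged_has_tag = tagged.has_annotation("TAG")
# require_complete asks for the attribute on every token, not just some.
tagged_complete = tagged.has_annotation("TAG", require_complete=True)
print(untagged_has_tag, tagged_has_tag, tagged_complete)
```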

Doc.to_array method

Export given token attributes to a numpy ndarray. If attr_ids is a sequence of M attributes, the output array will be of shape (N, M), where N is the length of the Doc (in tokens). If attr_ids is a single attribute, the output shape will be (N,). You can specify attributes by integer ID (e.g. spacy.attrs.LEMMA) or string name (e.g. “LEMMA” or “lemma”). The values will be 64-bit integers.

Returns a 2D array with one row per token and one column per attribute (when attr_ids is a list), or a 1D numpy array with one item per token (when attr_ids is a single value).

Name | Description | Type
attr_ids | A list of attributes (int IDs or string names) or a single attribute (int ID or string name). | Union[int, str, List[Union[int, str]]]
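
The two output shapes can be sketched as follows, assuming a blank English pipeline so that lexical attributes such as LOWER and IS_ALPHA are available, and a hypothetical three-token text.

```python
import spacy
from spacy.attrs import IS_ALPHA, LOWER

nlp = spacy.blank("en")
doc = nlp("Hello world!")  # tokens: "Hello", "world", "!"

# Two attributes: one row per token, one column per attribute.
table = doc.to_array([LOWER, IS_ALPHA])

# A single attribute given by name: one value per token.
column = doc.to_array("LOWER")

# String-valued attributes are exported as integer hash IDs.
hello_id = nlp.vocab.strings["hello"]
print(table.shape, column.shape)
```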

Doc.from_array method

Load attributes from a numpy array. Write to a Doc object from an (N, M) array of attributes, with one row per token and one column per attribute, i.e. the layout produced by Doc.to_array.

Name | Description | Type
attrs | A list of attribute ID ints. | List[int]
array | The attribute values to load. | numpy.ndarray[ndim=2, dtype=int32]
exclude | String names of serialization fields to exclude. | Iterable[str]
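
A round trip through Doc.to_array might look like this. The sketch assumes a blank English vocabulary and hypothetical tag and lemma values; the second Doc starts with the same words but no annotations.

```python
import spacy
from spacy.attrs import LEMMA, TAG
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Hello", "world"]

# Source Doc with hypothetical tag and lemma values.
doc = Doc(nlp.vocab, words=words, tags=["UH", "NN"], lemmas=["hello", "world"])
array = doc.to_array([TAG, LEMMA])  # one row per token, one column per attribute

# Target Doc: same words, annotations loaded from the array.
doc2 = Doc(nlp.vocab, words=words)
doc2.from_array([TAG, LEMMA], array)

restored_tags = [token.tag_ for token in doc2]
restored_lemmas = [token.lemma_ for token in doc2]
print(restored_tags, restored_lemmas)
```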

Doc.from_docs staticmethod (v3.0)

Concatenate multiple Doc objects to form a new one. Raises an error if the Doc objects do not all share the same Vocab.

Name | Description | Type
docs | A list of Doc objects. | List[Doc]
ensure_whitespace | Insert a space between two adjacent docs whenever the first doc does not end in whitespace. | bool
attrs | Optional list of attribute ID ints or attribute name strings. | Optional[List[Union[str, int]]]
keyword-only
exclude (v3.3) | String names of Doc attributes to exclude. Supported: spans, tensor, user_data. | Iterable[str]
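
A small sketch of concatenation, assuming a blank English pipeline and two hypothetical texts. Both Docs come from the same nlp object and therefore share a Vocab.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
texts = ["Merge two", "short docs."]
docs = [nlp(text) for text in texts]

# With the default settings, a space is inserted between adjacent docs.
merged = Doc.from_docs(docs)
print(merged.text, len(merged))
```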

Doc.to_disk method

Save the current state to a directory.

Name | Description | Type
path | A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or Path-like objects. | Union[str,Path]
keyword-only
exclude | String names of serialization fields to exclude. | Iterable[str]

Doc.from_disk method

Loads state from a directory. Modifies the object in place and returns it.

Name | Description | Type
path | A path to a directory. Paths may be either strings or Path-like objects. | Union[str,Path]
keyword-only
exclude | String names of serialization fields to exclude. | Iterable[str]
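
A hedged sketch of a save and load round trip. It assumes a blank English pipeline and a hypothetical location named doc inside a temporary directory; the new Doc is created with the same Vocab before loading.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("This is a text.")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "doc"  # hypothetical location for this sketch
    doc.to_disk(path)
    # from_disk modifies the new Doc in place and returns it.
    restored = Doc(nlp.vocab).from_disk(path)

restored_text = restored.text
print(restored_text)
```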

Doc.to_bytes method

Serialize, i.e. export the document contents to a binary string.

Name | Description | Type
keyword-only
exclude | String names of serialization fields to exclude. | Iterable[str]

Doc.from_bytes method

Deserialize, i.e. import the document contents from a binary string.

Name | Description | Type
data | The binary string to load from. | bytes
keyword-only
exclude | String names of serialization fields to exclude. | Iterable[str]
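
Serialization to and from a binary string can be sketched as a round trip, assuming a blank English pipeline and a hypothetical text.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("Give it back! He pleaded.")

# Export the document contents to a binary string.
data = doc.to_bytes()

# Load them into a new Doc that shares the same Vocab.
doc2 = Doc(doc.vocab).from_bytes(data)
print(type(data).__name__, doc2.text)
```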

Doc.to_json method

Serializes a document to JSON. Note that this format differs from the deprecated JSON training format.

Name | Description | Type
underscore | Optional list of string names of custom Doc attributes. Attribute values need to be JSON-serializable. Values will be added to an "_" key in the data, e.g. "_": {"foo": "bar"}. | Optional[List[str]]
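
A sketch of the JSON export, assuming a blank English pipeline, a hypothetical two-word text and a hypothetical custom attribute named source that is passed through the underscore argument.

```python
import spacy
from spacy.tokens import Doc

# Hypothetical custom attribute with a JSON-serializable value.
Doc.set_extension("source", default=None, force=True)

nlp = spacy.blank("en")
doc = nlp("Hello world")
doc._.source = "example"

json_doc = doc.to_json(underscore=["source"])

# Each token entry carries its character offsets into the text.
second_token = json_doc["tokens"][1]
print(json_doc["text"], second_token["start"], second_token["end"])
```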

Doc.from_json method (v3.3.1)

Deserializes a document from JSON, i.e. generates a document from the provided JSON data as generated by Doc.to_json().

Name | Description | Type
doc_json | The Doc data in JSON format from Doc.to_json. | Dict[str, Any]
keyword-only
validate | Whether to validate the JSON input against the expected schema for detailed debugging. Defaults to False. | bool

Doc.retokenize contextmanager

Context manager to handle retokenization of the Doc. Modifications to the Doc’s tokenization are stored, and then made all at once when the context manager exits. Batching the changes this way is much more efficient, and less error-prone. All views of the Doc (Span and Token) created before the retokenization are invalidated, although they may accidentally continue to work.

Retokenizer.merge method

Mark a span for merging. The attrs will be applied to the resulting token (if they’re context-dependent token attributes like LEMMA or DEP) or to the underlying lexeme (if they’re context-independent lexical attributes like LOWER or IS_STOP). Writable custom extension attributes can be provided using the "_" key and specifying a dictionary that maps attribute names to values.

Name | Description | Type
span | The span to merge. | Span
attrs | Attributes to set on the merged token. | Dict[Union[str, int], Any]
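
A merge inside the retokenize context manager can be sketched as follows, assuming a blank English pipeline, a hypothetical sentence and a made-up lemma for the merged token.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like New York in Autumn.")

# The merge is only recorded here and applied when the block exits.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:4], attrs={"LEMMA": "new york"})

tokens = [token.text for token in doc]
merged_lemma = doc[2].lemma_
print(tokens, merged_lemma)
```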

Retokenizer.split method

Mark a token for splitting, into the specified orths. The heads are required to specify how the new subtokens should be integrated into the dependency tree. Each entry in the list of per-token heads can be either a token in the original document, e.g. doc[2], or a tuple consisting of the token in the original document and its subtoken index. For example, (doc[3], 1) will attach the subtoken to the second subtoken of doc[3].

This mechanism allows attaching subtokens to other newly created subtokens, without having to keep track of the changing token indices. If the specified head token will be split within the retokenizer block and no subtoken index is specified, it will default to 0. Attributes to set on subtokens can be provided as a list of values. They’ll be applied to the resulting token (if they’re context-dependent token attributes like LEMMA or DEP) or to the underlying lexeme (if they’re context-independent lexical attributes like LOWER or IS_STOP).

Name | Description | Type
token | The token to split. | Token
orths | The verbatim text of the split tokens. Needs to match the text of the original token. | List[str]
heads | List of token or (token, subtoken) tuples specifying the tokens to attach the newly split subtokens to. | List[Union[Token, Tuple[Token, int]]]
attrs | Attributes to set on all split tokens. Attribute names mapped to list of per-token attribute values. | Dict[Union[str, int], List[Any]]
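
A sketch of a split, assuming a blank English pipeline and a hypothetical text in which "NewYork" is one token. The first subtoken is attached to the second subtoken, and the second subtoken to the token "in"; the POS values are illustrative.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I live in NewYork")

with doc.retokenize() as retokenizer:
    # "New" attaches to subtoken 1 ("York"); "York" attaches to doc[2] ("in").
    heads = [(doc[3], 1), doc[2]]
    # One value per new subtoken for each attribute.
    attrs = {"POS": ["PROPN", "PROPN"]}
    retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)

tokens = [token.text for token in doc]
pos_tags = [doc[3].pos_, doc[4].pos_]
print(tokens, pos_tags)
```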

Doc.ents property (needs model)

The named entities in the document. Returns a tuple of named entity Span objects, if the entity recognizer has been applied.

Doc.spans property

A dictionary of named span groups, to store and access additional span annotations. You can write to it by assigning a list of Span objects or a SpanGroup to a given key.
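
Writing to a span group can be sketched like this, assuming a blank English pipeline, a hypothetical text with a typo split across two tokens, and a made-up group key errors.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Their goi ng home")

# Assign a list of Span objects to a key of your choice.
doc.spans["errors"] = [doc[1:3]]

group = doc.spans["errors"]
print(len(group), group[0].text)
```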

Doc.cats property (needs model)

Maps a label to a score for categories applied to the document. Typically set by the TextCategorizer.

Doc.noun_chunks property (needs model)

Iterate over the base noun phrases in the document. Yields base noun-phrase Span objects, if the document has been syntactically parsed. A base noun phrase, or “NP chunk”, is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses.

To customize the noun chunk iterator in a loaded pipeline, modify nlp.vocab.get_noun_chunks. If the noun_chunk syntax iterator has not been implemented for the given language, a NotImplementedError is raised.

Doc.sents property (needs model)

Iterate over the sentences in the document. Sentence spans have no label.

This property is only available when sentence boundaries have been set on the document by the parser, senter, sentencizer or some custom function. It will raise an error otherwise.
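
A sketch using the rule-based sentencizer to set sentence boundaries, assuming a blank English pipeline and a hypothetical two-sentence text. No trained model is involved here.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection

doc = nlp("This is a sentence. Here is another one.")

# Each item is a Span covering one sentence.
sentences = [sent.text for sent in doc.sents]
print(sentences)
```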

Doc.has_vector property (needs model)

A boolean value indicating whether a word vector is associated with the object.

Doc.vector property (needs model)

A real-valued meaning representation. Defaults to an average of the token vectors.

Doc.vector_norm property (needs model)

The L2 norm of the document’s vector representation.

Attributes

Name | Description | Type
text | A string representation of the document text. | str
text_with_ws | An alias of Doc.text, provided for duck-type compatibility with Span and Token. | str
mem | The document’s local memory heap, for all C data it owns. | cymem.Pool
vocab | The store of lexical types. | Vocab
tensor | Container for dense vector representations. | numpy.ndarray
user_data | A generic storage area, for user custom data. | Dict[str, Any]
lang | Language of the document’s vocabulary. | int
lang_ | Language of the document’s vocabulary. | str
sentiment | The document’s positivity/negativity score, if available. | float
user_hooks | A dictionary that allows customization of the Doc’s properties. | Dict[str, Callable]
user_token_hooks | A dictionary that allows customization of properties of Token children. | Dict[str, Callable]
user_span_hooks | A dictionary that allows customization of properties of Span children. | Dict[str, Callable]
has_unknown_spaces | Whether the document was constructed without known spacing between tokens (typically when created from gold tokenization). | bool
_ | User space for adding custom attribute extensions. | Underscore

Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Name | Description
text | The value of the Doc.text attribute.
sentiment | The value of the Doc.sentiment attribute.
tensor | The value of the Doc.tensor attribute.
user_data | The value of the Doc.user_data dictionary.
user_data_keys | The keys of the Doc.user_data dictionary.
user_data_values | The values of the Doc.user_data dictionary.
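
Excluding a field might look like this. The sketch assumes a blank English pipeline and a hypothetical note entry in Doc.user_data, which is left out of the serialized data.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("This is a text.")
doc.user_data["note"] = "draft"  # hypothetical custom entry

# Leave the user_data field out of the binary string.
data = doc.to_bytes(exclude=["user_data"])
restored = Doc(nlp.vocab).from_bytes(data)

note_restored = "note" in restored.user_data
print(restored.text, note_restored)
```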