DocBin class

Pack Doc objects for binary serialization.

The DocBin class lets you efficiently serialize the information from a collection of Doc objects. You can control which information is serialized by passing a list of attribute IDs, and optionally also specify whether the user data is serialized. The DocBin is faster and produces smaller data sizes than pickle, and allows you to deserialize without executing arbitrary Python code. A notable downside to this format is that you can’t easily extract just one document from the DocBin. The serialization format is gzipped msgpack, where the msgpack object has the following structure:

msgpack object structure

Strings for the words, tags, labels, etc. are represented by 64-bit hashes in the token data, and every string that occurs at least once is passed via the strings object. This means that storage becomes more efficient as you pack more documents together, because there is less duplication in the strings. For usage examples, see the docs on serializing Doc objects.
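To make the layout more concrete, the following sketch decodes a serialized DocBin by hand and looks at its top-level keys and its strings list. It assumes that the bytes returned by to_bytes are zlib-compressed msgpack that srsly (a spaCy dependency) can decode. That is an implementation detail rather than a documented API, so treat this as exploratory code, not as a supported way to read the format.

```python
import zlib

import spacy
import srsly
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # blank pipeline: tokenizer only, no model download
doc_bin = DocBin(docs=[nlp("hello world"), nlp("hello again")])

# Assumption: the payload is zlib-compressed msgpack (implementation detail).
msg = srsly.msgpack_loads(zlib.decompress(doc_bin.to_bytes()))

# Top-level keys of the msgpack object.
print(sorted(msg.keys()))

# "hello" occurs in both docs but should be stored only once in the strings.
hello_count = msg["strings"].count("hello")
print(hello_count)
```

Because the strings are shared across all packed documents, adding more documents with overlapping vocabulary grows the strings list only slowly.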

DocBin.__init__ method

Create a DocBin object to hold serialized annotations.

| Argument | Description | Type |
| --- | --- | --- |
| attrs | List of attributes to serialize. ORTH (hash of token text) and SPACY (whether the token is followed by whitespace) are always serialized, so they’re not required. Defaults to ("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "LEMMA", "MORPH", "POS"). | Iterable[str] |
| store_user_data | Whether to write the Doc.user_data and the values of custom extension attributes to file/bytes. Defaults to False. | bool |
| docs | Doc objects to add on initialization. | Iterable[Doc] |
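A minimal sketch of the constructor arguments described above. It uses a blank English pipeline so that no trained model is needed; the variable names and example texts are illustrative only.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only, no model download required

# Serialize only entity annotations (ORTH and SPACY are always included).
ner_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])

# Also keep Doc.user_data, and pre-populate the DocBin with existing docs.
docs = [nlp("Hello world!"), nlp("Another document.")]
full_bin = DocBin(store_user_data=True, docs=docs)

print(len(ner_bin), len(full_bin))
```

Restricting attrs to what a downstream task needs keeps the serialized data smaller.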

DocBin.__len__ method

Get the number of Doc objects that were added to the DocBin.

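| Argument | Description | Type |
| --- | --- | --- |
| RETURNS | The number of Doc objects added to the DocBin. | int |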
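As a quick illustration, the length can be checked before and after adding documents. The texts are placeholders.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

empty_count = len(doc_bin)  # nothing added yet

doc_bin.add(nlp("Hello world!"))
doc_bin.add(nlp("Second document."))
filled_count = len(doc_bin)  # one entry per added Doc

print(empty_count, filled_count)
```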

DocBin.add method

Add a Doc’s annotations to the DocBin for serialization.

| Argument | Description | Type |
| --- | --- | --- |
| doc | The Doc object to add. | Doc |
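A typical pattern, sketched here with placeholder texts, is to stream texts through nlp.pipe and add each resulting Doc to the DocBin as it is produced.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
texts = ["First text.", "Second text.", "Third text."]  # placeholder data

doc_bin = DocBin(attrs=["LEMMA"])
for doc in nlp.pipe(texts):
    # Only the annotations selected via attrs (plus ORTH and SPACY) are kept.
    doc_bin.add(doc)

print(len(doc_bin))
```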

DocBin.get_docs method

Recover Doc objects from the annotations, using the given vocab.

| Argument | Description | Type |
| --- | --- | --- |
| vocab | The shared vocab. | Vocab |
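The sketch below recovers the documents again and compares their texts. It assumes the same blank pipeline's vocab is used for both packing and unpacking; the example texts are placeholders.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin(docs=[nlp("Hello world!"), nlp("Give it back.")])

# get_docs yields Doc objects one at a time; list() materializes them all.
restored = list(doc_bin.get_docs(nlp.vocab))
restored_texts = [doc.text for doc in restored]

print(restored_texts)
```

Since ORTH and SPACY are always serialized, the original text of each document can be reconstructed regardless of which other attributes were stored.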

DocBin.merge method

Extend the annotations of this DocBin with the annotations from another. Raises an error if the pre-defined attrs of the two DocBins don’t match.

| Argument | Description | Type |
| --- | --- | --- |
| other | The DocBin to merge into the current bin. | DocBin |
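A short sketch of merging two DocBins with identical attrs, followed by a deliberately mismatched merge. The error is assumed here to be a ValueError, which the text above does not specify; adjust the except clause if your spaCy version raises something else.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

bin_a = DocBin(attrs=["LEMMA", "POS"])
bin_a.add(nlp("Hello world"))
bin_b = DocBin(attrs=["LEMMA", "POS"])
bin_b.add(nlp("This is a sentence"))

# Same attrs on both sides, so the merge succeeds and bin_a grows in place.
bin_a.merge(bin_b)
merged_count = len(bin_a)

# Different attrs: the merge is expected to be rejected.
mismatched = DocBin(attrs=["LEMMA"])
try:
    bin_a.merge(mismatched)
    merge_failed = False
except ValueError:  # assumption: the mismatch surfaces as a ValueError
    merge_failed = True

print(merged_count, merge_failed)
```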

DocBin.to_bytes method

Serialize the DocBin’s annotations to a bytestring.

| Argument | Description | Type |
| --- | --- | --- |
| RETURNS | The serialized DocBin. | bytes |

DocBin.from_bytes method

Deserialize the DocBin’s annotations from a bytestring.

| Argument | Description | Type |
| --- | --- | --- |
| bytes_data | The data to load from. | bytes |
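The following sketch shows a round trip through to_bytes and from_bytes. It assumes that from_bytes returns the loaded DocBin so the call can be chained onto a fresh instance; the example text is a placeholder.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin(docs=[nlp("Hello world!")])

# Serialize to a bytestring, e.g. for storing in a database or sending
# over the network.
data = doc_bin.to_bytes()

# Load the bytes into a new DocBin and recover the first Doc.
restored_bin = DocBin().from_bytes(data)
restored_doc = next(restored_bin.get_docs(nlp.vocab))

print(type(data).__name__, restored_doc.text)
```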

DocBin.to_disk method (v3.0)

Save the serialized DocBin to a file. The file typically uses the .spacy extension, and the result can be used as the input data for spacy train.

| Argument | Description | Type |
| --- | --- | --- |
| path | The file path, typically with the .spacy extension. | Union[str,Path] |

DocBin.from_disk method (v3.0)

Load a serialized DocBin from a file. The file typically uses the .spacy extension.

| Argument | Description | Type |
| --- | --- | --- |
| path | The file path, typically with the .spacy extension. | Union[str,Path] |
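A minimal sketch of saving a DocBin with to_disk and loading it again with from_disk. The file name train.spacy and the temporary directory are illustrative choices, and the example assumes that from_disk returns the loaded DocBin.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin(docs=[nlp("Hello world!"), nlp("Second document.")])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"  # hypothetical file name
    doc_bin.to_disk(path)
    file_written = path.exists()
    # Load the file into a fresh DocBin while it still exists on disk.
    loaded_bin = DocBin().from_disk(path)

loaded_texts = [doc.text for doc in loaded_bin.get_docs(nlp.vocab)]
print(file_written, loaded_texts)
```

In a real project the path would point to a persistent location, for example the file passed to spacy train as training data.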