DocBin
The DocBin class lets you efficiently serialize the information from a
collection of Doc objects. You can control which information is serialized by
passing a list of attribute IDs, and optionally also specify whether the user
data is serialized. The DocBin is faster and produces smaller data sizes than
pickle, and allows you to deserialize without executing arbitrary Python code. A
notable downside to this format is that you can’t easily extract just one
document from the DocBin. The serialization format is gzipped msgpack, where
the msgpack object has the following structure:
msgpack object structure
Strings for the words, tags, labels, etc. are represented by 64-bit hashes in the
token data, and every string that occurs at least once is passed via the strings
object. This means the storage is more efficient if you pack more documents
together, because you have less duplication in the strings. For usage examples,
see the docs on serializing Doc objects.
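The storage claim above can be checked with a rough size comparison. This is a minimal sketch, assuming a blank English pipeline and three made-up sentences; the exact byte counts depend on the spaCy version and the texts.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only, no trained model required
texts = ["I like coffee.", "You like coffee too.", "We all like coffee."]
docs = [nlp(text) for text in texts]

# One DocBin holding all documents: shared strings are stored once.
packed_together = len(DocBin(docs=docs).to_bytes())

# One DocBin per document: every bin repeats its own header and strings.
packed_separately = sum(len(DocBin(docs=[doc]).to_bytes()) for doc in docs)

print(packed_together, packed_separately)
```

On a corpus this small the difference is mostly per-bin overhead; the savings from shared strings grow as more documents with overlapping vocabulary are packed together.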
DocBin.__init__ method
Create a DocBin object to hold serialized annotations.
Argument | Description |
---|---|
attrs | List of attributes to serialize. ORTH (hash of token text) and SPACY (whether the token is followed by whitespace) are always serialized, so they’re not required. Defaults to ("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "LEMMA", "MORPH", "POS"). Iterable[str] |
store_user_data | Whether to write the Doc.user_data and the values of custom extension attributes to file/bytes. Defaults to False. bool |
docs | Doc objects to add on initialization. Iterable[Doc] |
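As an illustration of the arguments in the table, the sketch below creates an empty DocBin that keeps only entity annotations. The choice of attributes is an assumption made for the example, not a recommendation.

```python
from spacy.tokens import DocBin

# Keep only entity annotations; ORTH and SPACY are stored regardless.
# store_user_data=True also keeps Doc.user_data and custom extension values.
doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"], store_user_data=True)

print(len(doc_bin))  # no docs have been added yet
```

Restricting attrs to what a downstream component needs keeps the serialized data smaller than the default attribute set would.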
DocBin.__len__ method
Get the number of Doc objects that were added to the DocBin.
Argument | Description |
---|---|
RETURNS | The number of Doc objects added to the DocBin. int |
DocBin.add method
Add a Doc’s annotations to the DocBin for serialization.
Argument | Description |
---|---|
doc | The Doc object to add. Doc |
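A short sketch of adding documents one at a time and checking the count with len(). The blank English pipeline and the example sentences are assumptions for illustration.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only, no trained model required
doc_bin = DocBin()

# Each call appends one document's annotations to the bin.
doc_bin.add(nlp("This is the first document."))
doc_bin.add(nlp("And this is the second one."))

num_docs = len(doc_bin)
print(num_docs)
```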
DocBin.get_docs method
Recover Doc objects from the annotations, using the given vocab.
Argument | Description |
---|---|
vocab | The shared vocab. Vocab |
YIELDS | The Doc objects. Doc |
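The following sketch shows documents being rebuilt from a DocBin, including a manually set entity. The sentence, the ORG label and the blank pipeline are assumptions made for the example.

```python
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")  # tokenizer only, no trained model required
doc = nlp("Apple is based in Cupertino.")
doc.ents = [Span(doc, 0, 1, label="ORG")]  # mark the first token as an entity

doc_bin = DocBin(docs=[doc])

# get_docs is a generator; the vocab supplies the string store for the hashes.
restored = list(doc_bin.get_docs(nlp.vocab))
restored_text = restored[0].text
restored_ents = [(ent.text, ent.label_) for ent in restored[0].ents]
print(restored_text, restored_ents)
```

Because get_docs yields documents lazily, wrapping it in list() is only needed when all documents should be held in memory at once.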
DocBin.merge method
Extend the annotations of this DocBin with the annotations from another. Raises
an error if the pre-defined attrs of the two DocBin objects don’t match.
Argument | Description |
---|---|
other | The DocBin to merge into the current bin. DocBin |
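A minimal sketch of merging two bins, followed by a mismatched merge. The text above only says that an error is raised; catching ValueError here is an assumption about the exception type.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only, no trained model required

bin_a = DocBin(attrs=["LEMMA"], docs=[nlp("First document.")])
bin_b = DocBin(attrs=["LEMMA"], docs=[nlp("Second."), nlp("Third.")])

# Same attrs on both sides, so bin_b's annotations are appended to bin_a.
bin_a.merge(bin_b)
merged_count = len(bin_a)

# Different attrs: the merge is rejected (exception type assumed).
bin_c = DocBin(attrs=["POS"], docs=[nlp("Fourth.")])
mismatch_rejected = False
try:
    bin_a.merge(bin_c)
except ValueError:
    mismatch_rejected = True

print(merged_count, mismatch_rejected)
```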
DocBin.to_bytes method
Serialize the DocBin’s annotations to a bytestring.
Argument | Description |
---|---|
RETURNS | The serialized DocBin. bytes |
DocBin.from_bytes method
Deserialize the DocBin’s annotations from a bytestring.
Argument | Description |
---|---|
bytes_data | The data to load from. bytes |
RETURNS | The loaded DocBin. DocBin |
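A round trip through to_bytes and from_bytes might look like the sketch below. The blank pipeline and the sentences are assumptions for illustration.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only, no trained model required
doc_bin = DocBin(docs=[nlp("Serialize me."), nlp("Me too.")])

# Serialize to an in-memory bytestring, e.g. to send over a network.
data = doc_bin.to_bytes()

# from_bytes fills a fresh DocBin and returns it.
restored_bin = DocBin().from_bytes(data)
restored_texts = [doc.text for doc in restored_bin.get_docs(nlp.vocab)]
print(restored_texts)
```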
DocBin.to_disk method v3.0
Save the serialized DocBin to a file. Typically uses the .spacy extension,
and the result can be used as the input data for spacy train.
Argument | Description |
---|---|
path | The file path, typically with the .spacy extension. Union[str,Path] |
DocBin.from_disk method v3.0
Load a serialized DocBin from a file. Typically uses the .spacy extension.
Argument | Description |
---|---|
path | The file path, typically with the .spacy extension. Union[str,Path] |
RETURNS | The loaded DocBin. DocBin |
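The sketch below writes a DocBin to a temporary directory and loads it back. The file name train.spacy and the temporary location are assumptions for the example; in practice the path would point at your corpus directory.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only, no trained model required
doc_bin = DocBin(docs=[nlp("Save me to disk."), nlp("Me as well.")])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"  # hypothetical file name
    doc_bin.to_disk(path)

    # from_disk fills a fresh DocBin and returns it.
    loaded_bin = DocBin().from_disk(path)
    loaded_texts = [doc.text for doc in loaded_bin.get_docs(nlp.vocab)]

print(loaded_texts)
```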