Corpus

class, v3

An annotated corpus

This class manages annotated corpora and can be used for training and development datasets in the DocBin (.spacy) format. To customize the data loading during training, you can register your own data readers and batchers. Also see the usage guide on data utilities for more details and examples.

Config and implementation

spacy.Corpus.v1 is a registered function that creates a Corpus of training or evaluation data. It takes the same arguments as the Corpus class and returns a callable that yields Example objects. You can replace it with your own registered function in the @readers registry to customize the data loading and streaming.

Name	Description
`path`	The directory or filename to read from. Expects data in spaCy’s binary `.spacy` format. Path
`gold_preproc`	Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See `Corpus` for details. bool
`max_length`	Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. int
`limit`	Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. int
`augmenter`	Apply some simple data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don’t have smart quotes, or only have smart quotes, etc. Defaults to `None`. Optional[Callable]

explosion/spaCy/master/spacy/training/corpus.py
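As a minimal sketch of the replacement mechanism described above, the following registers a custom reader in the `readers` registry. The registry name `example_text_reader.v1` and the function `create_text_reader` are hypothetical and chosen only for illustration. The reader wraps raw strings in unannotated `Example` objects rather than loading `.spacy` files.

```python
from typing import Callable, Iterator, List

import spacy
from spacy.language import Language
from spacy.training import Example


# Hypothetical registry name and function, for illustration only.
@spacy.registry.readers("example_text_reader.v1")
def create_text_reader(texts: List[str]) -> Callable[[Language], Iterator[Example]]:
    def read_examples(nlp: Language) -> Iterator[Example]:
        for text in texts:
            doc = nlp.make_doc(text)
            # No gold annotations here: the reference is a copy of the prediction.
            yield Example(doc, doc.copy())

    return read_examples


# Look the function up by name, as the config system would.
reader_factory = spacy.registry.readers.get("example_text_reader.v1")
reader = reader_factory(["A first text.", "A second text."])
examples = list(reader(spacy.blank("en")))
```

In a training config, a function registered this way would be referenced by name with `@readers` in a corpus block, with its arguments given as settings in the same block.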

Corpus.__init__ method

Create a Corpus for iterating Example objects from a file or directory of .spacy data files. The gold_preproc setting lets you specify whether to set up the Example object with gold-standard sentences and tokens for the predictions. Gold preprocessing helps the annotations align to the tokenization, and may result in sequences of more consistent length. However, it may reduce runtime accuracy due to train/test skew.

Name	Description
`path`	The directory or filename to read from. Union[str,Path]
keyword-only
`gold_preproc`	Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. bool
`max_length`	Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. int
`limit`	Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. int
`augmenter`	Optional data augmentation callback. Callable[[Language,Example], Iterable[Example]]
`shuffle`	Whether to shuffle the examples. Defaults to `False`. bool
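A short sketch of constructing a `Corpus` with the keyword-only settings from the table above. The path `./corpus/train.spacy` and the chosen values are placeholders; as far as this sketch assumes, no data is read until the corpus is called with an `nlp` object, so the file does not need to exist yet.

```python
from spacy.training import Corpus

# Placeholder path and values, for illustration only.
train_corpus = Corpus(
    "./corpus/train.spacy",
    gold_preproc=False,  # use the pipeline's own tokenization for predictions
    max_length=500,      # longer documents are split into sentences if possible
    limit=1000,          # cap the number of examples, e.g. while debugging
    shuffle=True,        # shuffle the examples
)
```

Everything after `path` has to be passed by keyword, which keeps call sites readable when several numeric settings are combined.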

Corpus.__call__ method

Yield examples from the data.

Name	Description
`nlp`	The current `nlp` object. Language
YIELDS	The examples. Example
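To illustrate the call, here is a small round trip: a few unannotated documents are written to a temporary `.spacy` file with `DocBin`, then read back as `Example` objects. The texts and the file name `train.spacy` are made up for this sketch; a real training corpus would carry gold annotations.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin
from spacy.training import Corpus

nlp = spacy.blank("en")
texts = ["This is a sentence.", "Another short one."]
docs = [nlp.make_doc(text) for text in texts]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Illustrative file name inside a throwaway directory.
    corpus_path = Path(tmp_dir) / "train.spacy"
    DocBin(docs=docs).to_disk(corpus_path)

    corpus = Corpus(corpus_path)
    # Calling the corpus with the nlp object yields Example objects.
    examples = list(corpus(nlp))

reference_texts = [eg.reference.text for eg in examples]
```

Each `Example` pairs a `predicted` document with a `reference` document loaded from the file.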

JsonlCorpus class

Iterate Doc objects from a file or directory of JSONL (newline-delimited JSON) formatted raw text files. Can be used to read the raw text corpus for language model pretraining from a JSONL file.

Example
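A minimal sketch of reading raw text with `JsonlCorpus`. The records and the file name `texts.jsonl` are invented for illustration; the file is written to a temporary directory so the snippet is self-contained.

```python
import json
import tempfile
from pathlib import Path

import spacy
from spacy.training import JsonlCorpus

# Made-up records; each line of the file is one JSON object with a "text" key.
records = [
    {"text": "Can I ask where you work now?"},
    {"text": "They may just pull out of the market completely."},
]

nlp = spacy.blank("en")
with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = Path(tmp_dir) / "texts.jsonl"
    with jsonl_path.open("w", encoding="utf-8") as file_:
        for record in records:
            file_.write(json.dumps(record) + "\n")

    corpus = JsonlCorpus(jsonl_path)
    jsonl_texts = [eg.predicted.text for eg in corpus(nlp)]
```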

JsonlCorpus.__init__ method

Initialize the reader.

Name	Description
`path`	The directory or filename to read from. Expects newline-delimited JSON with a key `"text"` for each record. Union[str,Path]
keyword-only
`min_length`	Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. int
`max_length`	Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. int
`limit`	Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. int
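The length settings can be sketched as follows, assuming a blank English tokenizer and made-up records. With `min_length=3`, the one-token record should be skipped and the longer one kept.

```python
import json
import tempfile
from pathlib import Path

import spacy
from spacy.training import JsonlCorpus

nlp = spacy.blank("en")
# Illustrative records: one single-token text, one longer text.
records = [{"text": "Hi"}, {"text": "This record has several tokens in it."}]

with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = Path(tmp_dir) / "texts.jsonl"
    with jsonl_path.open("w", encoding="utf-8") as file_:
        for record in records:
            file_.write(json.dumps(record) + "\n")

    # Skip documents with fewer than three tokens; no upper limit.
    corpus = JsonlCorpus(jsonl_path, min_length=3, max_length=0)
    kept_texts = [eg.predicted.text for eg in corpus(nlp)]
```

Lengths are counted in tokens after tokenization, so the same text may pass or fail the filter depending on the language's tokenizer.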

JsonlCorpus.__call__ method

Yield examples from the data.

Name	Description
`nlp`	The current `nlp` object. Language
YIELDS	The examples. Example

PlainTextCorpus class, v3.5.1

Iterate over documents from a plain text file. Can be used to read the raw text corpus for language model pretraining. The expected file format is:

UTF-8 encoding
One document per line
Blank lines are ignored.

Example
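A minimal sketch of the file format listed above, assuming spaCy v3.5.1 or later. The file name `texts.txt` and its contents are invented for illustration; the blank line between the two documents is expected to be ignored.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.training.corpus import PlainTextCorpus

nlp = spacy.blank("en")
with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = Path(tmp_dir) / "texts.txt"
    # One document per line, UTF-8 encoded, with a blank line in between.
    text_path.write_text(
        "First document.\n\nSecond document.\n", encoding="utf-8"
    )

    corpus = PlainTextCorpus(text_path)
    plain_texts = [eg.predicted.text for eg in corpus(nlp)]
```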

PlainTextCorpus.__init__ method

Initialize the reader.

Name	Description
`path`	The directory or filename to read from. Expects newline-delimited documents in UTF-8 encoding. Union[str,Path]
keyword-only
`min_length`	Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. int
`max_length`	Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. int

PlainTextCorpus.__call__ method

Yield examples from the data.

Name	Description
`nlp`	The current `nlp` object. Language
YIELDS	The examples. Example
