Pipeline

Sentencizer

class
String name: sentencizer
Trainable: no
Pipeline component for rule-based sentence boundary detection

A simple pipeline component to allow custom sentence boundary detection logic that doesn’t require the dependency parse. By default, sentence segmentation is performed by the DependencyParser, so the Sentencizer lets you implement a simpler, rule-based strategy that doesn’t require a statistical model to be loaded.
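As a minimal sketch of the rule-based strategy described above, the following adds the component to a blank English pipeline, so no statistical model is loaded. The sample text and variable names are illustrative assumptions.

```python
import spacy

# A blank pipeline has a tokenizer but no parser or other trained components.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is a sentence. This is another sentence.")

# doc.sents is available because the sentencizer set the sentence boundaries.
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Because the boundaries come from punctuation alone, this runs without downloading any model package.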

Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training.

Setting | Description | Type
punct_chars | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to None. | Optional[List[str]]
explosion/spaCy/master/spacy/pipeline/sentencizer.pyx
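A short sketch of overriding punct_chars through the config argument on nlp.add_pipe. The choice of ";" as the only sentence-final character and the sample text are assumptions made purely for illustration.

```python
import spacy

nlp = spacy.blank("en")

# Override the default setting: only ";" marks a sentence end here.
nlp.add_pipe("sentencizer", config={"punct_chars": [";"]})

doc = nlp("Rinse the rice; boil the water. Serve hot")

# The "." no longer ends a sentence, because it is not in punct_chars.
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

With this setting the text is split once, after the semicolon, and the period is treated like any other token.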

Sentencizer.__init__ method

Initialize the sentencizer.

Name | Description | Type
punct_chars (keyword-only) | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. | Optional[List[str]]
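The component can also be constructed directly from its class rather than through nlp.add_pipe. The sketch below assumes the instance exposes the configured characters as a punct_chars attribute; the variable names are illustrative.

```python
from spacy.pipeline import Sentencizer

# Default construction uses the built-in list of sentence-final characters.
default_sentencizer = Sentencizer()

# punct_chars is keyword-only and replaces the defaults entirely.
custom_sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])

print(sorted(custom_sentencizer.punct_chars))
```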

punct_chars defaults

['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈', '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '！', '．', '？', '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '｡', '。']
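Because the defaults cover sentence-final characters from many scripts, the component can split non-Latin text without extra configuration. The sketch below uses the Devanagari danda ('।') from the list above. The Hindi sample text is an assumption, and the danda is written with surrounding spaces so that it is certain to be its own token.

```python
import spacy

# Blank Hindi pipeline; only the tokenizer and the sentencizer are involved.
nlp = spacy.blank("hi")
nlp.add_pipe("sentencizer")

# Two sentences, each ending in a danda, which is in the default punct_chars.
doc = nlp("यह पहला वाक्य है । यह दूसरा वाक्य है ।")

sentence_count = len(list(doc.sents))
print(sentence_count)
```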

Sentencizer.__call__ method

Apply the sentencizer on a Doc. Typically, this happens automatically after the component has been added to the pipeline using nlp.add_pipe.

Name | Description | Type
doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc
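To illustrate what happens when the component is applied, this sketch calls a standalone Sentencizer on a tokenized Doc instead of going through the pipeline. Using nlp.make_doc to tokenize without running any components is a choice made for this example.

```python
import spacy
from spacy.pipeline import Sentencizer

nlp = spacy.blank("en")
sentencizer = Sentencizer()

# make_doc only tokenizes, so no sentence boundaries are set yet.
doc = nlp.make_doc("This is a sentence. This is another sentence.")
had_boundaries_before = doc.has_annotation("SENT_START")

# Calling the component sets the boundaries and returns the Doc.
doc = sentencizer(doc)
sentence_count = len(list(doc.sents))

print(had_boundaries_before, sentence_count)
```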

Sentencizer.pipe method

Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order.

Name | Description | Type
stream | A stream of documents. | Iterable[Doc]
batch_size (keyword-only) | The number of documents to buffer. Defaults to 128. | int
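A small sketch of streaming documents through the component directly. The texts and the batch size are arbitrary assumptions; in normal use nlp.pipe handles this for every component in the pipeline.

```python
import spacy
from spacy.pipeline import Sentencizer

nlp = spacy.blank("en")
sentencizer = Sentencizer()

texts = ["One. Two.", "Three! Four?"]

# A generator of tokenized Docs serves as the input stream.
docs = (nlp.make_doc(text) for text in texts)

# pipe yields each Doc after the sentence boundaries have been set.
counts = [len(list(doc.sents)) for doc in sentencizer.pipe(docs, batch_size=2)]
print(counts)
```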

Sentencizer.score method (v3.0)

Score a batch of examples.

Name | Description | Type
examples | The examples to score. | Iterable[Example]
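The following sketch scores a single hand-built example. It assumes the gold sentence starts are supplied through the sent_starts key of Example.from_dict and that the returned dictionary contains a sents_f entry, as in recent spaCy v3 releases. The words and gold annotations are made up for illustration.

```python
import spacy
from spacy.pipeline import Sentencizer
from spacy.tokens import Doc
from spacy.training import Example

nlp = spacy.blank("en")
sentencizer = Sentencizer()

# Pre-tokenized input so the gold annotations line up one-to-one with tokens.
words = ["This", "is", "one", ".", "This", "is", "two", "."]
gold_sent_starts = [True, False, False, False, True, False, False, False]

# The predicted Doc carries the sentencizer's boundaries.
predicted = sentencizer(Doc(nlp.vocab, words=words))

# The reference Doc is built from the gold sentence starts.
example = Example.from_dict(predicted, {"sent_starts": gold_sent_starts})

scores = sentencizer.score([example])
print(scores)
```

Since the predicted boundaries match the gold boundaries here, the reported F-score should be 1.0.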

Sentencizer.to_disk method

Save the sentencizer settings (punctuation characters) to a JSON file. This also happens automatically when you save an nlp object with a sentencizer added to its pipeline, which creates a file named sentencizer.json.

Name | Description | Type
path | A path to a JSON file, which will be created if it doesn’t exist. Paths may be either strings or Path-like objects. | Union[str, Path]

Sentencizer.from_disk method

Load the sentencizer settings from a file. Expects a JSON file. This also happens automatically when you load an nlp object or model with a sentencizer added to its pipeline.

Name | Description | Type
path | A path to a JSON file. Paths may be either strings or Path-like objects. | Union[str, Path]
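A round-trip sketch covering both to_disk and from_disk. The temporary directory, the file name, and the custom characters are assumptions for the example. It also assumes that the settings are readable from the instance's punct_chars attribute after loading.

```python
import tempfile
from pathlib import Path

from spacy.pipeline import Sentencizer

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sentencizer.json"

    # Save custom settings to the JSON file.
    Sentencizer(punct_chars=[".", "|"]).to_disk(path)
    file_created = path.exists()

    # Load them into a fresh component, replacing its defaults.
    restored = Sentencizer().from_disk(path)

print(file_created, sorted(restored.punct_chars))
```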

Sentencizer.to_bytes method

Serialize the sentencizer settings to a bytestring.


Sentencizer.from_bytes method

Load the pipe from a bytestring. Modifies the object in place and returns it.

Name | Description | Type
bytes_data | The bytestring to load. | bytes
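The byte serialization can be sketched as a round trip between two components. The custom characters and variable names are illustrative assumptions.

```python
from spacy.pipeline import Sentencizer

source = Sentencizer(punct_chars=[".", "!"])

# Serialize the settings to a bytestring.
data = source.to_bytes()

# from_bytes modifies the receiving object in place and returns it.
target = Sentencizer()
restored = target.from_bytes(data)

print(isinstance(data, bytes), sorted(restored.punct_chars))
```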