Pipeline

Sentencizer

class
String name:sentencizerTrainable:
Pipeline component for rule-based sentence boundary detection

A simple pipeline component to allow custom sentence boundary detection logic that doesn’t require the dependency parse. By default, sentence segmentation is performed by the DependencyParser, so the Sentencizer lets you implement a simpler, rule-based strategy that doesn’t require a statistical model to be loaded.

Assigned Attributes

Calculated values will be assigned to Token.is_sent_start. The resulting sentences can be accessed using Doc.sents.

LocationValue
Token.is_sent_startA boolean value indicating whether the token starts a sentence. This will be either True or False for all tokens. bool
Doc.sentsAn iterator over sentences in the Doc, determined by Token.is_sent_start values. Iterator[Span]

Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training.

SettingDescription
punct_charsOptional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to None. Optional[List[str]]
explosion/spaCy/master/spacy/pipeline/sentencizer.pyx
Can't fetch code example from GitHub :( Please use the link below to view the example. If you've come across a broken link, we always appreciate a pull request to the repository, or a report on the issue tracker. Thanks!

Sentencizer.__init__ method

Initialize the sentencizer.

NameDescription
keyword-only
punct_charsOptional custom list of punctuation characters that mark sentence ends. See below for defaults. Optional[List[str]]

punct_chars defaults

['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈', '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '!', '.', '?', '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '。', '。']

Sentencizer.__call__ method

Apply the sentencizer on a Doc. Typically, this happens automatically after the component has been added to the pipeline using nlp.add_pipe.

NameDescription
docThe Doc object to process, e.g. the Doc in the pipeline. Doc

Sentencizer.pipe method

Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order.

NameDescription
streamA stream of documents. Iterable[Doc]
keyword-only
batch_sizeThe number of documents to buffer. Defaults to 128. int

Sentencizer.score methodv3.0

Score a batch of examples.

NameDescription
examplesThe examples to score. Iterable[Example]

Sentencizer.to_disk method

Save the sentencizer settings (punctuation characters) to a directory. Will create a file sentencizer.json. This also happens automatically when you save an nlp object with a sentencizer added to its pipeline.

NameDescription
pathA path to a JSON file, which will be created if it doesn’t exist. Paths may be either strings or Path-like objects. Union[str, Path]

Sentencizer.from_disk method

Load the sentencizer settings from a file. Expects a JSON file. This also happens automatically when you load an nlp object or model with a sentencizer added to its pipeline.

NameDescription
pathA path to a JSON file. Paths may be either strings or Path-like objects. Union[str, Path]

Sentencizer.to_bytes method

Serialize the sentencizer settings to a bytestring.

NameDescription

Sentencizer.from_bytes method

Load the pipe from a bytestring. Modifies the object in place and returns it.

NameDescription
bytes_dataThe bytestring to load. bytes