Scorer

class

Compute evaluation scores

The Scorer computes evaluation scores. It is typically created by Language.evaluate. The Scorer also provides a number of methods for scoring Token and Doc attributes.

Scorer.__init__ method

Create a new Scorer.

Name	Description
`nlp`	The pipeline to use for scoring, where each pipeline component may provide a scoring method. If none is provided, then a default pipeline is constructed using the `default_lang` and `default_pipeline` settings. Optional[Language]
`default_lang`	The language to use for a default pipeline if `nlp` is not provided. Defaults to `xx`. str
`default_pipeline`	The pipeline components to use for a default pipeline if `nlp` is not provided. Defaults to `("senter", "tagger", "morphologizer", "parser", "ner", "textcat")`. Iterable[str]
keyword-only
`**kwargs`	Any additional settings to pass on to the individual scoring methods. Any

Scorer.score method

Calculate the scores for a list of Example objects using the scoring methods provided by the components in the pipeline.

The returned Dict contains the scores provided by the individual pipeline components. For the scoring methods provided by the Scorer and used by the core pipeline components, the individual score names start with the Token or Doc attribute being scored:

token_acc, token_p, token_r, token_f
sents_p, sents_r, sents_f
tag_acc
pos_acc
morph_acc, morph_micro_p, morph_micro_r, morph_micro_f, morph_per_feat
lemma_acc
dep_uas, dep_las, dep_las_per_type
ents_p, ents_r, ents_f, ents_per_type
spans_sc_p, spans_sc_r, spans_sc_f
cats_score (depends on config, description provided in cats_score_desc), cats_micro_p, cats_micro_r, cats_micro_f, cats_macro_p, cats_macro_r, cats_macro_f, cats_macro_auc, cats_f_per_type, cats_auc_per_type

Name	Description
`examples`	The `Example` objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
keyword-only
`per_component` v3.6	Whether to return the scores keyed by component name. Defaults to `False`. bool
RETURNS	A dictionary of scores. Dict[str, Union[float, Dict[str, float]]]

Scorer.score_tokenization staticmethod v3.0

Scores the tokenization:

token_acc: number of correct tokens / number of predicted tokens
token_p, token_r, token_f: precision, recall and F-score for token character spans

Docs with has_unknown_spaces are skipped during scoring.

Name	Description
`examples`	The `Example` objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
RETURNS	A dictionary containing the scores `token_acc`, `token_p`, `token_r`, `token_f`. Dict[str, float]
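
The span-based precision, recall and F-score described above can be illustrated without spaCy. The following is a minimal sketch of the arithmetic only, not spaCy's implementation: the helper names `char_spans` and `prf` are hypothetical, and tokens are located in the text by a simple left-to-right search.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def char_spans(tokens: List[str], text: str) -> Set[Span]:
    """Find each token in `text`, left to right, and return its character span."""
    spans: Set[Span] = set()
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)
        spans.add((start, start + len(tok)))
        pos = start + len(tok)
    return spans


def prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F-score of predicted spans against gold spans."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


text = "Don't stop."
gold = char_spans(["Do", "n't", "stop", "."], text)  # reference tokenization
pred = char_spans(["Don't", "stop", "."], text)      # predicted tokenization
token_p, token_r, token_f = prf(gold, pred)
print(token_p, token_r, token_f)
```

In this toy case two of the three predicted spans match a gold span (precision 2/3), and two of the four gold spans are recovered (recall 1/2).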

Scorer.score_token_attr staticmethod v3.0

Scores a single token attribute. Tokens with missing values in the reference doc are skipped during scoring.

Name	Description
`examples`	The `Example` objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
`attr`	The attribute to score. str
keyword-only
`getter`	Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. Callable[[Token, str], Any]
`missing_values`	Attribute values to treat as missing annotation in the reference annotation. Defaults to `{0, None, ""}`. Set[Any]
RETURNS	A dictionary containing the score `{attr}_acc`. Dict[str, float]
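
The effect of `missing_values` can be sketched in plain Python. This is an illustration under simplifying assumptions (gold and predicted values are already aligned one to one), not spaCy's code; `attr_accuracy` is a hypothetical helper, and returning `None` when nothing can be scored is a choice made for this sketch.

```python
from typing import AbstractSet, Any, Iterable, Optional, Tuple

# Same default as documented above: values treated as "no gold annotation".
MISSING_VALUES = frozenset({0, None, ""})


def attr_accuracy(
    pairs: Iterable[Tuple[Any, Any]],
    missing_values: AbstractSet[Any] = MISSING_VALUES,
) -> Optional[float]:
    """Accuracy over (gold, predicted) pairs, skipping tokens without gold annotation."""
    correct = 0
    total = 0
    for gold, pred in pairs:
        if gold in missing_values:
            continue  # no reference value: the token does not count either way
        total += 1
        if gold == pred:
            correct += 1
    return correct / total if total else None


pairs = [("NOUN", "NOUN"), ("VERB", "NOUN"), ("", "ADJ"), (None, "DET"), ("ADP", "ADP")]
pos_acc = attr_accuracy(pairs)
print(pos_acc)
```

Only three of the five tokens have a gold value, so the accuracy is computed over those three; the predictions for the two unannotated tokens are neither rewarded nor penalized.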

Scorer.score_token_attr_per_feat staticmethod v3.0

Scores a single token attribute feature by feature, where the attribute value is in the Universal Dependencies FEATS format. Tokens with missing values in the reference doc are skipped during scoring.

Name	Description
`examples`	The `Example` objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
`attr`	The attribute to score. str
keyword-only
`getter`	Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. Callable[[Token, str], Any]
`missing_values`	Attribute values to treat as missing annotation in the reference annotation. Defaults to `{0, None, ""}`. Set[Any]
RETURNS	A dictionary containing the micro PRF scores under the key `{attr}_micro_p/r/f` and the per-feature PRF scores under `{attr}_per_feat`. Dict[str, Dict[str, float]]
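
To make the per-feature scoring concrete, here is a minimal sketch that works on raw FEATS strings such as `Case=Nom|Number=Sing`. It is not spaCy's implementation: `parse_feats`, `prf` and `score_feats` are hypothetical helpers, the tokens are assumed to be aligned one to one, and predicted values are assumed to be strings.

```python
from collections import defaultdict
from typing import Dict, List


def parse_feats(feats: str) -> Dict[str, str]:
    """Split a FEATS string such as "Case=Nom|Number=Sing" into {field: value}."""
    return dict(item.split("=", 1) for item in feats.split("|") if "=" in item)


def prf(tp: int, fp: int, fn: int) -> Dict[str, float]:
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


def score_feats(gold: List[str], pred: List[str], attr: str = "morph",
                missing_values=frozenset({0, None, ""})) -> Dict[str, object]:
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gold_feats, pred_feats in zip(gold, pred):
        if gold_feats in missing_values:
            continue  # token has no gold annotation: skip it entirely
        g, p = parse_feats(gold_feats), parse_feats(pred_feats)
        for field in set(g) | set(p):
            if field in g and g[field] == p.get(field):
                counts[field]["tp"] += 1
            else:
                if field in p:
                    counts[field]["fp"] += 1  # predicted a wrong or spurious value
                if field in g:
                    counts[field]["fn"] += 1  # gold value was not recovered
    # Micro scores pool the counts of all features before computing PRF.
    total = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
    micro = prf(**total)
    return {
        f"{attr}_micro_p": micro["p"],
        f"{attr}_micro_r": micro["r"],
        f"{attr}_micro_f": micro["f"],
        f"{attr}_per_feat": {field: prf(**c) for field, c in counts.items()},
    }


gold = ["Case=Nom|Number=Sing", "Number=Plur", ""]
pred = ["Case=Acc|Number=Sing", "Number=Plur", "Case=Nom"]
scores = score_feats(gold, pred)
print(scores)
```

The third token is skipped because its gold value is empty, so the spurious `Case=Nom` prediction there is not penalized. `Number` is correct on both scored tokens, while the wrong `Case` value counts as one false positive and one false negative.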

Scorer.score_spans staticmethod v3.0

Returns PRF scores for labeled or unlabeled spans.

Name	Description
`examples`	The `Example` objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
`attr`	The attribute to score. str
keyword-only
`getter`	Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. Callable[[Doc, str], Iterable[Span]]
`has_annotation`	Defaults to `None`. If provided, `has_annotation(doc)` should return whether a `Doc` has annotation for this `attr`. Docs without annotation are skipped for scoring purposes. Optional[Callable[[Doc], bool]]
`labeled`	Defaults to `True`. If set to `False`, two spans will be considered equal if their start and end match, irrespective of their label. bool
`allow_overlap`	Defaults to `False`. Whether or not to allow overlapping spans. If set to `False`, the alignment will automatically resolve conflicts. bool
RETURNS	A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. Dict[str, Union[float, Dict[str, float]]]
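
The difference between `labeled=True` and `labeled=False` can be sketched with spans represented as `(start, end, label)` tuples. This is an illustrative simplification with a hypothetical helper `span_prf`; it ignores alignment and overlap handling.

```python
from typing import Set, Tuple

LabeledSpan = Tuple[int, int, str]  # (start, end, label)


def span_prf(gold: Set[LabeledSpan], pred: Set[LabeledSpan],
             labeled: bool = True) -> Tuple[float, float, float]:
    """PRF over spans; with labeled=False only the boundaries have to match."""
    if labeled:
        g, p = set(gold), set(pred)
    else:
        g = {(start, end) for start, end, _ in gold}
        p = {(start, end) for start, end, _ in pred}
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f


gold = {(0, 2, "PERSON"), (5, 6, "ORG")}
pred = {(0, 2, "PERSON"), (5, 6, "GPE"), (8, 9, "DATE")}
labeled_scores = span_prf(gold, pred, labeled=True)
unlabeled_scores = span_prf(gold, pred, labeled=False)
print(labeled_scores, unlabeled_scores)
```

The span `(5, 6)` has the right boundaries but the wrong label, so it only counts as a match in the unlabeled setting.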

Scorer.score_deps staticmethod v3.0

Calculate the UAS, LAS, and LAS per type scores for dependency parses. Tokens with missing values for the attr (typically dep) are skipped during scoring.

Name	Description
`examples`	The `Example` objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
`attr`	The attribute to score. str
keyword-only
`getter`	Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. Callable[[Token, str], Any]
`head_attr`	The attribute containing the head token. str
`head_getter`	Defaults to `getattr`. If provided, `head_getter(token, attr)` should return the head for an individual `Token`. Callable[[Token, str], Token]
`ignore_labels`	Labels to ignore while scoring (e.g. `"punct"`). Iterable[str]
`missing_values`	Attribute values to treat as missing annotation in the reference annotation. Defaults to `{0, None, ""}`. Set[Any]
RETURNS	A dictionary containing the scores: `{attr}_uas`, `{attr}_las`, and `{attr}_las_per_type`. Dict[str, Union[float, Dict[str, float]]]
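
As a rough illustration of UAS and LAS, the sketch below scores one sentence given as `(head_index, dep_label)` pairs. It makes simplifying assumptions that go beyond the text above: gold and predicted tokens are identical, `ignore_labels` is applied to the gold label only, and the scores are plain accuracies. The helper `uas_las` is hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

Arc = Tuple[int, str]  # (index of the head token, dependency label)


def uas_las(gold: List[Arc], pred: List[Arc],
            ignore_labels: Iterable[str] = (),
            missing_values=frozenset({0, None, ""})
            ) -> Tuple[Optional[float], Optional[float]]:
    """Unlabeled and labeled attachment scores over aligned tokens."""
    ignore = set(ignore_labels)
    scored = head_ok = both_ok = 0
    for (g_head, g_dep), (p_head, p_dep) in zip(gold, pred):
        if g_dep in missing_values or g_dep in ignore:
            continue  # unannotated or ignored token
        scored += 1
        if g_head == p_head:
            head_ok += 1          # correct head: counts towards UAS
            if g_dep == p_dep:
                both_ok += 1      # correct head and label: counts towards LAS
    if not scored:
        return None, None
    return head_ok / scored, both_ok / scored


# "She eats fish ." with token indices 0..3; the root points to itself.
gold = [(1, "nsubj"), (1, "ROOT"), (1, "obj"), (1, "punct")]
pred = [(1, "nsubj"), (1, "ROOT"), (1, "nmod"), (2, "punct")]
dep_uas, dep_las = uas_las(gold, pred, ignore_labels=["punct"])
print(dep_uas, dep_las)
```

With `punct` ignored, the misattached full stop does not affect the result: all three remaining heads are correct, and two of them also carry the correct label.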

Scorer.score_cats staticmethod v3.0

Calculate PRF and ROC AUC scores for a doc-level attribute that is a dict containing scores for each label like Doc.cats. The returned dictionary contains the following scores:

{attr}_micro_p, {attr}_micro_r and {attr}_micro_f: each instance across each label is weighted equally
{attr}_macro_p, {attr}_macro_r and {attr}_macro_f: the average values across evaluations per label
{attr}_f_per_type and {attr}_auc_per_type: each contains a dictionary of scores, keyed by label
A final {attr}_score and corresponding {attr}_score_desc (text description)
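
The difference between micro and macro averaging can be shown with a few lines of arithmetic. The sketch below starts from per-label `(tp, fp, fn)` counts; the counts are made up purely for illustration, and `micro_macro` is a hypothetical helper rather than part of spaCy.

```python
from typing import Dict, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro(counts: Dict[str, Tuple[int, int, int]],
                attr: str = "cats") -> Dict[str, float]:
    """Micro scores pool all counts; macro scores average the per-label scores."""
    if not counts:
        raise ValueError("need at least one label")
    per_label = {label: prf(*c) for label, c in counts.items()}
    n = len(per_label)
    macro = [sum(s[i] for s in per_label.values()) / n for i in range(3)]
    micro = prf(*(sum(c[i] for c in counts.values()) for i in range(3)))
    return {
        f"{attr}_micro_p": micro[0], f"{attr}_micro_r": micro[1], f"{attr}_micro_f": micro[2],
        f"{attr}_macro_p": macro[0], f"{attr}_macro_r": macro[1], f"{attr}_macro_f": macro[2],
    }


# Made-up counts: a frequent label "A" and a rare, poorly recalled label "B".
counts = {"A": (8, 2, 0), "B": (1, 0, 3)}
scores = micro_macro(counts)
print(scores)
```

Because label `A` contributes most of the instances, the micro scores stay close to its performance, while the macro scores give the rare label `B` equal weight and drop accordingly.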

The reported {attr}_score depends on the classification properties:

binary exclusive with positive label: {attr}_score is set to the F-score of the positive label
3+ exclusive classes, macro-averaged F-score: {attr}_score = {attr}_macro_f
multilabel, macro-averaged AUC: {attr}_score = {attr}_macro_auc
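
The selection rule above can be written as a small function. This is a sketch under stated assumptions, not spaCy's code: `select_score` is hypothetical, the per-label entries are assumed to be dicts with an `"f"` key, the description strings are placeholders, and the fallback to the macro F-score for a binary exclusive task without a positive label is an assumption, since that case is not covered by the list.

```python
from typing import Any, Dict, Optional, Sequence, Tuple


def select_score(scores: Dict[str, Any], labels: Sequence[str], multi_label: bool,
                 positive_label: Optional[str] = None,
                 attr: str = "cats") -> Tuple[float, str]:
    """Pick the headline score and a short description of what it is."""
    if not multi_label and len(labels) == 2 and positive_label is not None:
        # Binary exclusive task with a positive label: F-score of that label.
        return scores[f"{attr}_f_per_type"][positive_label]["f"], f"F ({positive_label})"
    if not multi_label:
        # Exclusive classes (3+ per the rule above; also used here as a fallback).
        return scores[f"{attr}_macro_f"], "macro F"
    # Multilabel task: macro-averaged AUC.
    return scores[f"{attr}_macro_auc"], "macro AUC"


# Made-up score values, for illustration only.
scores = {
    "cats_macro_f": 0.70,
    "cats_macro_auc": 0.90,
    "cats_f_per_type": {"POS": {"f": 0.80}, "NEG": {"f": 0.60}},
}
binary = select_score(scores, ["POS", "NEG"], multi_label=False, positive_label="POS")
exclusive = select_score(scores, ["A", "B", "C"], multi_label=False)
multilabel = select_score(scores, ["A", "B", "C"], multi_label=True)
print(binary, exclusive, multilabel)
```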

Name	Description
`examples`	The `Example` objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
`attr`	The attribute to score. str
keyword-only
`getter`	Defaults to `getattr`. If provided, `getter(doc, attr)` should return the cats for an individual `Doc`. Callable[[Doc, str], Dict[str, float]]
`labels`	The set of possible labels. Defaults to `[]`. Iterable[str]
`multi_label`	Whether the attribute allows multiple labels. Defaults to `True`. When set to `False` (exclusive labels), missing gold labels are interpreted as `0.0` and the threshold is set to `0.0`. bool
`positive_label`	The positive label for a binary task with exclusive classes. Defaults to `None`. Optional[str]
`threshold`	Cutoff to consider a prediction “positive”. Defaults to `0.5` for multi-label, and `0.0` (i.e. whatever’s highest scoring) otherwise. float
RETURNS	A dictionary containing the scores, with inapplicable scores as `None`. Dict[str, Optional[float]]

Scorer.score_links staticmethod v3.0

Returns PRF for predicted links on the entity level. To disentangle the performance of the NEL from the NER, this method only evaluates NEL links for entities that overlap between the gold reference and the predictions.

Name	Description
`examples`	The `Example` objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
keyword-only
`negative_labels`	The string values that refer to no annotation (e.g. `"NIL"`). Iterable[str]
RETURNS	A dictionary containing the scores. Dict[str, Optional[float]]
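
The idea of scoring links only for entities found by both the reference and the prediction can be sketched as follows. Entities are represented as a mapping from `(start, end)` to a knowledge base ID. The helper `nel_prf` is hypothetical, and the way negative labels are counted here (true negatives ignored, a wrong ID counted as both a false positive and a false negative) is an assumption of this sketch.

```python
from typing import Dict, Iterable, Tuple

Span = Tuple[int, int]


def nel_prf(gold_links: Dict[Span, str], pred_links: Dict[Span, str],
            negative_labels: Iterable[str] = ("NIL",)) -> Tuple[float, float, float]:
    """PRF for entity links, restricted to entity spans present on both sides."""
    negative = set(negative_labels)
    tp = fp = fn = 0
    for span, gold_kb in gold_links.items():
        if span not in pred_links:
            continue  # entity missed by the NER: not held against the linker
        pred_kb = pred_links[span]
        if gold_kb in negative:
            if pred_kb not in negative:
                fp += 1           # linked an entity that should have no link
        elif pred_kb in negative:
            fn += 1               # failed to link an entity that has a link
        elif pred_kb != gold_kb:
            fp += 1               # wrong ID predicted ...
            fn += 1               # ... and the right one missed
        else:
            tp += 1
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = {(0, 1): "Q1", (3, 5): "Q2", (7, 8): "NIL", (10, 11): "Q4"}
pred = {(0, 1): "Q1", (3, 5): "Q9", (7, 8): "Q3", (12, 13): "Q5"}
nel_p, nel_r, nel_f = nel_prf(gold, pred)
print(nel_p, nel_r, nel_f)
```

The gold entity at `(10, 11)` and the predicted entity at `(12, 13)` have no counterpart on the other side, so neither influences the linking scores.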

get_ner_prf v3.0

Compute micro-PRF and per-entity PRF scores.

Name	Description
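`examples`	The `Example` objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
RETURNS	A dictionary containing the scores `ents_p`, `ents_r`, `ents_f` and `ents_per_type`. Dict[str, Any]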
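
The relationship between the micro scores and the per-entity-type scores can be sketched on entities given as `(start, end, label)` tuples. The helper `ner_prf` is hypothetical and skips the alignment that real `Example` objects require; the key names follow the `ents_*` scores listed for `Scorer.score`.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Ent = Tuple[int, int, str]  # (start, end, label)


def prf(tp: int, fp: int, fn: int) -> Dict[str, float]:
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"p": p, "r": r, "f": f}


def ner_prf(gold: Set[Ent], pred: Set[Ent]) -> Dict[str, object]:
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for ent in pred:
        # A predicted entity is correct only if span and label both match.
        counts[ent[2]]["tp" if ent in gold else "fp"] += 1
    for ent in gold - pred:
        counts[ent[2]]["fn"] += 1
    # Micro scores pool the counts of all entity types.
    total = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
    micro = prf(**total)
    return {
        "ents_p": micro["p"],
        "ents_r": micro["r"],
        "ents_f": micro["f"],
        "ents_per_type": {label: prf(**c) for label, c in counts.items()},
    }


gold = {(0, 2, "PERSON"), (5, 6, "ORG"), (9, 10, "ORG")}
pred = {(0, 2, "PERSON"), (5, 6, "GPE"), (9, 10, "ORG")}
scores = ner_prf(gold, pred)
print(scores)
```

The mislabeled span `(5, 6)` shows up twice in the per-type breakdown: as a false positive for `GPE` and as a false negative for `ORG`.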

score_coref_clusters experimental

Returns LEA (Moosavi and Strube, 2016) PRF scores for coreference clusters.

Name	Description
`examples`	The `Example` objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
keyword-only
`span_cluster_prefix`	The prefix used for spans representing coreference clusters. str
RETURNS	A dictionary containing the scores. Dict[str, Optional[float]]
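
For readers unfamiliar with LEA, the sketch below computes it on clusters given as sets of mention identifiers. In LEA, each cluster is weighted by its size and scored by the fraction of its coreference links that are reproduced on the other side. The helpers `n_links`, `lea_side` and `lea` are hypothetical, mentions are plain strings, and singleton clusters are simply ignored, which is a simplification of this sketch.

```python
from typing import FrozenSet, List, Tuple

Cluster = FrozenSet[str]


def n_links(n: int) -> int:
    """Number of coreference links among n mentions."""
    return n * (n - 1) // 2


def lea_side(keys: List[Cluster], responses: List[Cluster]) -> float:
    """Size-weighted fraction of links in `keys` that are found in `responses`."""
    numerator = 0.0
    denominator = 0
    for key in keys:
        if len(key) < 2:
            continue  # simplification: singletons carry no links in this sketch
        resolved = sum(n_links(len(key & resp)) for resp in responses)
        numerator += len(key) * resolved / n_links(len(key))
        denominator += len(key)
    return numerator / denominator if denominator else 0.0


def lea(gold: List[Cluster], pred: List[Cluster]) -> Tuple[float, float, float]:
    recall = lea_side(gold, pred)      # gold links recovered by the prediction
    precision = lea_side(pred, gold)   # predicted links confirmed by the gold
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f


gold = [frozenset("abc"), frozenset("defg")]
pred = [frozenset("ab"), frozenset("cd"), frozenset("fghi")]
lea_p, lea_r, lea_f = lea(gold, pred)
print(lea_p, lea_r, lea_f)
```

In this toy example the prediction recovers one of the three links in `{a, b, c}` and one of the six links in `{d, e, f, g}`, which gives a recall of (3 · 1/3 + 4 · 1/6) / 7 = 5/21.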

score_span_predictions experimental

Returns accuracy for reconstructions of spans from single tokens. Only exactly correct predictions are counted as correct; there is no partial credit for near answers. Used by the SpanResolver.

Name	Description
`examples`	The `Example` objects holding both the predictions and the correct gold-standard annotations. Iterable[Example]
keyword-only
`output_prefix`	The prefix used for spans representing the final predicted spans. str
RETURNS	A dictionary containing the scores. Dict[str, Optional[float]]
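
The "no partial credit" rule amounts to exact-match accuracy. A minimal sketch, assuming gold and predicted spans are already paired up by position and given as `(start, end)` tuples; `exact_match_accuracy` is a hypothetical helper:

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def exact_match_accuracy(gold: List[Span], pred: List[Span]) -> Optional[float]:
    """Share of predicted spans whose start and end both equal the gold span."""
    if not gold:
        return None
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


gold = [(0, 3), (5, 6), (8, 12)]
pred = [(0, 3), (5, 7), (8, 12)]
accuracy = exact_match_accuracy(gold, pred)
print(accuracy)
```

The second prediction overlaps the gold span but ends one token too late, so it counts as wrong.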
