GoldParse
class

Collection for training annotations.

GoldParse.__init__
method

Create a GoldParse.

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document the annotations refer to. |
| words | iterable | A sequence of unicode word strings. |
| tags | iterable | A sequence of strings, representing tag annotations. |
| heads | iterable | A sequence of integers, representing syntactic head offsets. |
| deps | iterable | A sequence of strings, representing the syntactic relation types. |
| entities | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as (start_char, end_char, label) tuples, representing the entity positions. |
| returns | GoldParse | The newly constructed object. |

GoldParse.__len__
method

Get the number of gold-standard tokens.

| Name | Type | Description |
| --- | --- | --- |
| returns | int | The number of gold-standard tokens. |

GoldParse.is_projective
property

Whether the provided syntactic annotations form a projective dependency tree.

| Name | Type | Description |
| --- | --- | --- |
| returns | bool | Whether the annotations form a projective tree. |
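
To make the notion of projectivity concrete, here is a minimal pure-Python sketch of a projectivity test. It is not spaCy's implementation. It assumes that `heads[i]` holds the absolute index of token `i`'s head and that the root points to itself; the function name `is_projective_tree` is hypothetical. An arc is treated as projective when every token strictly between the head and the dependent is a descendant of the head.

```python
def is_projective_tree(heads):
    """Return True if every arc in the tree is projective.

    Assumption: heads[i] is the absolute index of token i's head,
    and the root token points to itself.
    """
    def is_descendant(token, ancestor):
        # Walk up the head chain until we reach the ancestor or the root.
        seen = set()
        while token not in seen:
            if token == ancestor:
                return True
            seen.add(token)
            token = heads[token]
        return False

    for dep, head in enumerate(heads):
        if dep == head:
            continue  # the root has no incoming arc
        lo, hi = sorted((dep, head))
        for between in range(lo + 1, hi):
            # A token under the arc that is not dominated by the head
            # means another arc must cross this one.
            if not is_descendant(between, head):
                return False
    return True


print(is_projective_tree([1, 1, 1]))     # True: both arcs attach to the root
print(is_projective_tree([2, 3, 2, 2]))  # False: arcs 0-2 and 1-3 cross
```

The check is quadratic in the worst case, which is acceptable for sentence-length inputs.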

Attributes

| Name | Type | Description |
| --- | --- | --- |
| tags | list | The part-of-speech tag annotations. |
| heads | list | The syntactic head annotations. |
| labels | list | The syntactic relation-type annotations. |
| ents | list | The named entity annotations. |
| cand_to_gold | list | The alignment from candidate tokenization to gold tokenization. |
| gold_to_cand | list | The alignment from gold tokenization to candidate tokenization. |
| cats | list | New in v2.0. Entries in the list should be either a label, or a (start, end, label) triple. The tuple form is used for categories applied to spans of the document. |
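
The two alignment attributes, `cand_to_gold` and `gold_to_cand`, can be pictured with a small sketch. The code below is an illustration only, not spaCy's algorithm. It assumes both tokenizations cover the same characters once whitespace is dropped, and it uses `None` for tokens with no one-to-one counterpart, which is an assumption of this sketch. The function name `align_tokens` is hypothetical.

```python
def align_tokens(cand_words, gold_words):
    """Align two tokenizations of the same text by character span.

    Returns (cand_to_gold, gold_to_cand). A token maps to the index of the
    token covering exactly the same characters in the other tokenization,
    or to None (an assumption of this sketch) when there is no such token.
    """
    def char_spans(words):
        # Map each (start, end) span, counted without whitespace, to its index.
        spans, pos = {}, 0
        for i, word in enumerate(words):
            spans[(pos, pos + len(word))] = i
            pos += len(word)
        return spans

    cand_spans = char_spans(cand_words)
    gold_spans = char_spans(gold_words)
    cand_to_gold = [None] * len(cand_words)
    gold_to_cand = [None] * len(gold_words)
    for span, i in cand_spans.items():
        j = gold_spans.get(span)
        if j is not None:
            cand_to_gold[i] = j
            gold_to_cand[j] = i
    return cand_to_gold, gold_to_cand


cand = ["New", "York-based", "firm"]
gold = ["New", "York", "-", "based", "firm"]
print(align_tokens(cand, gold))
# ([0, None, 2 + 2], ...) is written out in full below:
# ([0, None, 4], [0, None, None, None, 2])
```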

Utilities

gold.biluo_tags_from_offsets
function

Encode labelled spans into per-token tags, using the BILUO scheme (Begin/In/Last/Unit/Out).

Returns a list of unicode strings describing the tags. Each tag string takes one of the forms "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". The string "-" is used where the entity offsets don't align with the tokenization in the Doc object; the training algorithm treats these tokens as missing values.

| Tag | Meaning |
| --- | --- |
| B | The beginning of a multi-token entity. |
| I | The inside of an entity of three or more tokens. |
| L | The end of an entity of two or more tokens. |
| U | A single-token entity. |
| O | A non-entity token. |
| "-" | A token whose boundaries don't align with the entity offsets (missing value). |

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. |
| entities | iterable | A sequence of (start, end, label) triples. start and end should be character-offset integers denoting the slice into the original string. |
| returns | list | Unicode strings, describing the BILUO tags. |
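
The encoding can be illustrated without a Doc object. The sketch below is a simplified stand-in, not the library function: it assumes whitespace tokenization, represents each token by its (start, end) character span, and uses the hypothetical helper names `token_spans` and `biluo_from_offsets`.

```python
def token_spans(text):
    """Return (start, end) character spans for whitespace-separated tokens."""
    spans, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        pos = start + len(word)
        spans.append((start, pos))
    return spans


def biluo_from_offsets(spans, entities):
    """Encode (start, end, label) character offsets as per-token BILUO tags."""
    starts = {start: i for i, (start, _) in enumerate(spans)}
    ends = {end: i for i, (_, end) in enumerate(spans)}
    tags = ["O"] * len(spans)
    for ent_start, ent_end, label in entities:
        first, last = starts.get(ent_start), ends.get(ent_end)
        if first is None or last is None or first > last:
            # The offsets cut through a token: mark every overlapped
            # token with "-" so it can be treated as a missing value.
            for i, (start, end) in enumerate(spans):
                if start < ent_end and end > ent_start:
                    tags[i] = "-"
            continue
        if first == last:
            tags[first] = "U-" + label
        else:
            tags[first] = "B-" + label
            for i in range(first + 1, last):
                tags[i] = "I-" + label
            tags[last] = "L-" + label
    return tags


text = "I like London and New York City"
spans = token_spans(text)
print(biluo_from_offsets(spans, [(7, 13, "LOC"), (18, 31, "LOC")]))
# ['O', 'O', 'U-LOC', 'O', 'B-LOC', 'I-LOC', 'L-LOC']
print(biluo_from_offsets(spans, [(7, 12, "LOC")]))
# ['O', 'O', '-', 'O', 'O', 'O', 'O']
```

The second call shows the misaligned case: the offset 12 falls inside "London", so that token receives "-" rather than an entity tag.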

gold.offsets_from_biluo_tags
function

Convert per-token tags following the BILUO scheme into entity offsets.

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the BILUO tags refer to. |
| entities | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
| returns | list | A sequence of (start, end, label) triples. start and end will be character-offset integers denoting the slice into the original string. |
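
The reverse direction can be sketched in the same spirit. This is an illustrative pure-Python version, not the library function: tokens are given as (start, end) character spans instead of a Doc, the name `offsets_from_biluo` is hypothetical, and tags of the form "", "O" or "-" are simply skipped.

```python
def offsets_from_biluo(spans, tags):
    """Decode per-token BILUO tags into (start, end, label) character offsets.

    `spans` holds one (start_char, end_char) pair per token.
    """
    entities = []
    open_start = None  # index of the token that opened the current entity
    for i, tag in enumerate(tags):
        if tag in ("", "O", "-"):
            open_start = None
            continue
        action, label = tag.split("-", 1)
        if action == "U":
            entities.append((spans[i][0], spans[i][1], label))
            open_start = None
        elif action == "B":
            open_start = i
        elif action == "L" and open_start is not None:
            entities.append((spans[open_start][0], spans[i][1], label))
            open_start = None
        # "I" tags continue the open entity and need no action.
    return entities


# Character spans for the tokens of "I like London and New York City".
spans = [(0, 1), (2, 6), (7, 13), (14, 17), (18, 21), (22, 26), (27, 31)]
tags = ["O", "O", "U-LOC", "O", "B-LOC", "I-LOC", "L-LOC"]
print(offsets_from_biluo(spans, tags))
# [(7, 13, 'LOC'), (18, 31, 'LOC')]
```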