
skweak
skweak brings the power of weak supervision to NLP tasks, in particular sequence labelling and text classification. Instead of annotating documents by hand, skweak allows you to define labelling functions to automatically label your documents, and then aggregate their results using a statistical model that estimates the accuracy and confusions of each labelling function.
Example
import spacy, re
from skweak import heuristics, gazetteers, aggregation, utils

# LF 1: heuristic to detect occurrences of MONEY entities
def money_detector(doc):
    for tok in doc[1:]:
        if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
            yield tok.i-1, tok.i+1, 'MONEY'
lf1 = heuristics.FunctionAnnotator('money', money_detector)

# LF 2: detection of years with a regex
lf2 = heuristics.TokenConstraintAnnotator('years', lambda tok: re.match(r'(19|20)\d{2}$', tok.text), 'DATE')

# LF 3: a gazetteer with a few names
NAMES = [('Barack', 'Obama'), ('Donald', 'Trump'), ('Joe', 'Biden')]
trie = gazetteers.Trie(NAMES)
lf3 = gazetteers.GazetteerAnnotator('presidents', {'PERSON': trie})

# We create a corpus (here with a single text)
nlp = spacy.load('en_core_web_sm')
doc = nlp('Donald Trump paid $750 in federal income taxes in 2016')

# apply the labelling functions
doc = lf3(lf2(lf1(doc)))

# and aggregate them
hmm = aggregation.HMM('hmm', ['PERSON', 'DATE', 'MONEY'])
hmm.fit_and_aggregate([doc])

# we can then visualise the final result (in Jupyter)
utils.display_entities(doc, 'hmm')
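To build intuition for the aggregation step, here is a minimal, dependency-free sketch that combines the outputs of several labelling functions with a plain majority vote. This is a simplified stand-in, not skweak's HMM: the statistical model described above estimates accuracies and confusions, whereas this sketch only counts votes and then reports how often each function agrees with the vote. The `digits` and `amounts` functions, the span lists, and all helper names are hypothetical and chosen for illustration only.

```python
from collections import Counter

tokens = ["Donald", "Trump", "paid", "$", "750", "in",
          "federal", "income", "taxes", "in", "2016"]

# Hypothetical labelling-function outputs as (start, end, label) spans,
# with an exclusive end index, mirroring the tuples yielded by money_detector.
lf_spans = {
    "money": [(3, 5, "MONEY")],
    "years": [(10, 11, "DATE")],
    "presidents": [(0, 2, "PERSON")],
    "digits": [(4, 5, "DATE"), (10, 11, "DATE")],  # deliberately noisy
    "amounts": [(4, 5, "MONEY")],
}

def spans_to_labels(spans, n_tokens):
    """Expand spans into one label per token; "O" means no label."""
    labels = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            labels[i] = label
    return labels

def majority_vote(lf_labels, n_tokens):
    """Pick the most frequent non-"O" label per token.

    Ties go to the label that was seen first (Counter keeps insertion order).
    """
    aggregated = []
    for i in range(n_tokens):
        votes = Counter(
            labels[i] for labels in lf_labels.values() if labels[i] != "O"
        )
        aggregated.append(votes.most_common(1)[0][0] if votes else "O")
    return aggregated

def agreement_rates(lf_labels, aggregated):
    """Fraction of each function's non-"O" votes that match the aggregate."""
    rates = {}
    for name, labels in lf_labels.items():
        voted = [i for i, label in enumerate(labels) if label != "O"]
        matches = sum(labels[i] == aggregated[i] for i in voted)
        rates[name] = matches / len(voted) if voted else 0.0
    return rates

lf_labels = {
    name: spans_to_labels(spans, len(tokens))
    for name, spans in lf_spans.items()
}
aggregated = majority_vote(lf_labels, len(tokens))
rates = agreement_rates(lf_labels, aggregated)

for token, label in zip(tokens, aggregated):
    print(f"{token:10s} {label}")
print(rates)
```

In this toy setup the noisy `digits` function is outvoted on "750", so its agreement rate falls below that of the other functions. A learned aggregation model can go further by weighting each function according to its estimated reliability rather than treating all votes equally.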
GitHub: NorskRegnesentral/skweak
Categories: pipeline, standalone, research, training