
spaCy-SetFit
spaCy-SetFit is a Python library that extends spaCy’s text categorization capabilities by incorporating SetFit for few-shot classification. It lets you train a text categorizer from a simple dictionary that maps each category label to a list of example texts.
The library plugs into spaCy’s pipeline architecture, so the text categorizer is added and configured like any other pipeline component. You provide a training dataset of labeled examples (inlier and outlier in the example below), and spaCy-SetFit trains the categorizer with SetFit on top of a pretrained sentence-embedding model such as paraphrase-MiniLM-L3-v2. Once trained, the categorizer classifies new text and exposes the category probabilities through doc.cats.
Example
import spacy

# Create some example data
train_dataset = {
    "inlier": [
        "Text about furniture",
        "Couches, benches and televisions.",
        "I really need to get a new sofa."
    ],
    "outlier": [
        "Text about kitchen equipment",
        "This text is about politics",
        "Comments about AI and stuff."
    ]
}

# Load the spaCy language model:
nlp = spacy.load("en_core_web_sm")

# Add the "spacy_setfit" pipeline component to the spaCy model,
# and configure it with SetFit parameters:
nlp.add_pipe("spacy_setfit", config={
    "pretrained_model_name_or_path": "paraphrase-MiniLM-L3-v2",
    "setfit_trainer_args": {
        "train_dataset": train_dataset
    }
})

doc = nlp("I really need to get a new sofa.")
doc.cats
# {'inlier': 0.902350975129, 'outlier': 0.097649024871}
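Because doc.cats is a plain dictionary mapping each label to a score, turning it into a single predicted label is a small post-processing step. The sketch below is not part of spaCy-SetFit: the helper name top_category and its threshold parameter are hypothetical, introduced only for illustration, and the scores are copied from the example output above rather than computed by a model.

```python
from typing import Dict, Optional


def top_category(cats: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the highest-scoring label, or None if no score reaches the threshold.

    Hypothetical helper for illustration; not part of spaCy or spaCy-SetFit.
    """
    if not cats:
        return None
    # Pick the label whose score is largest.
    label = max(cats, key=cats.get)
    # Treat low-confidence predictions as "no decision".
    return label if cats[label] >= threshold else None


# Scores copied from the example output above, standing in for doc.cats.
example_cats = {"inlier": 0.902350975129, "outlier": 0.097649024871}
print(top_category(example_cats))  # prints: inlier
```

In a real pipeline you would pass doc.cats instead of example_cats. The threshold is a design choice: returning None for low scores lets calling code handle uncertain texts separately instead of forcing them into a category.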
Categories: pipeline