Emoji handling and metadata as a spaCy pipeline component

spaCy v2.0 extension and pipeline component for adding emoji metadata to Doc objects. It detects emoji consisting of one or more Unicode characters and can optionally merge multi-character emoji (combined pictures, emoji with skin tone modifiers) into one token. Human-readable emoji descriptions are added as a custom attribute, and an optional lookup table can be provided for your own descriptions. The extension sets the custom Doc, Token and Span attributes ._.is_emoji, ._.emoji_desc, ._.has_emoji and ._.emoji.


import spacy
from spacymoji import Emoji

nlp = spacy.load('en')
emoji = Emoji(nlp)
nlp.add_pipe(emoji, first=True)

doc = nlp(u'This is a test 😻 👍🏿')
assert doc._.has_emoji == True
assert doc[2:5]._.has_emoji == True
assert doc[0]._.is_emoji == False
assert doc[4]._.is_emoji == True
assert doc[5]._.emoji_desc == u'thumbs up dark skin tone'
assert len(doc._.emoji) == 2
assert doc._.emoji[1] == (u'👍🏿', 5, u'thumbs up dark skin tone')

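To make the idea of detection plus an optional description lookup more concrete, here is a minimal standard-library sketch. It is not how spacymoji is implemented and does not use spaCy: the helper names `is_emoji_token` and `find_emoji` are hypothetical, tokens are split on whitespace only, and the Unicode category check is a rough heuristic that will also match non-emoji symbols. The default descriptions come from Unicode character names, so they differ from the descriptions the extension provides.

```python
import unicodedata

# Assumption: treat U+1F3FB..U+1F3FF (the five skin tone modifiers) as
# characters that may follow a base emoji within the same token.
SKIN_TONES = {chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)}


def is_emoji_token(token):
    """Rough heuristic: every character is an 'other symbol' (So) or a skin tone modifier."""
    return bool(token) and all(
        unicodedata.category(ch) == "So" or ch in SKIN_TONES for ch in token
    )


def find_emoji(text, lookup=None):
    """Return (token, index, description) tuples for whitespace-separated tokens."""
    lookup = lookup or {}
    results = []
    for index, token in enumerate(text.split()):
        if not is_emoji_token(token):
            continue
        # Prefer a user-supplied description; otherwise fall back to the
        # Unicode name of the first (base) character, lowercased.
        default = unicodedata.name(token[0], "").lower()
        results.append((token, index, lookup.get(token, default)))
    return results
```

Calling `find_emoji(u'This is a test 😻 👍🏿', lookup={u'👍🏿': u'my thumbs up'})` yields one tuple per emoji token, with the second description taken from the lookup table instead of the default.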
Author info

Ines Montani


Categories: pipeline

Submit your project

If you have a project that you want the spaCy community to make use of, you can suggest it by submitting a pull request to the spaCy website repository. The Universe database is open-source and collected in a simple JSON file. For more details on the formats and available fields, see the documentation. Looking for inspiration for your own spaCy plugin or extension? Check out the project idea label on the issue tracker.
