Crosslingual Coreference

One multi-lingual coreference model to rule them all!

spaCy v3

Coreference is amazing, but the data required for training a model is very scarce. In our case, the available training data for non-English languages also proved to be poorly annotated. Crosslingual Coreference therefore assumes that a model trained on English data with cross-lingual embeddings should also work for other languages with a similar sentence structure. It has been verified to work quite well for at least EN, NL, DK, FR and DE.


```python
import spacy
import crosslingual_coreference  # importing makes the "xx_coref" component available

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. "
    "At that location, Nissin was founded. Many students survived by eating "
    "these noodles, but they don't even know him."
)

# use any model that has internal spacy embeddings
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)

print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]

print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
```
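The cluster output is easier to read once each span is mapped back to its words. The sketch below assumes that every `[start, end]` pair is an inclusive token range, which is what the example output suggests. It is not part of the library: `cluster_mentions` is a hypothetical helper, and the token list is written out by hand (an assumed tokenization of the example text) so that the snippet runs without loading a model.

```python
from typing import List

# Assumed tokenization of the example text, written out by hand so that
# this sketch does not need spaCy or a trained model to run.
tokens: List[str] = [
    "Do", "not", "forget", "about", "Momofuku", "Ando", "!", "He", "created",
    "instant", "noodles", "in", "Osaka", ".", "At", "that", "location", ",",
    "Nissin", "was", "founded", ".", "Many", "students", "survived", "by",
    "eating", "these", "noodles", ",", "but", "they", "do", "n't", "even",
    "know", "him", ".",
]

# Clusters copied from the example output of doc._.coref_clusters.
clusters: List[List[List[int]]] = [
    [[4, 5], [7, 7], [27, 27], [36, 36]],
    [[12, 12], [15, 16]],
    [[9, 10], [27, 28]],
    [[22, 23], [31, 31]],
]


def cluster_mentions(
    tokens: List[str], clusters: List[List[List[int]]]
) -> List[List[str]]:
    """Hypothetical helper: turn inclusive [start, end] token spans into text."""
    return [
        [" ".join(tokens[start : end + 1]) for start, end in cluster]
        for cluster in clusters
    ]


mentions = cluster_mentions(tokens, clusters)
for group in mentions:
    print(group)
```

With a real `doc`, the same lookup could be written as `doc[start : end + 1].text`, again assuming the end index is inclusive.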

Author info

David Berenstein


Categories: pipeline, standalone
