Industrial-strength Natural Language Processing

Thousands of researchers are trying to make computers understand text. They're succeeding. We help you get their work out of papers and into production.

Install spaCy
Latest Release: v0.100.6

Built for Production

Most AI software is built for research. Over the last ten years, we've used a lot of that software, and built some of it ourselves, especially for natural language processing (NLP). But the faster the research has moved, the more impatient we've become. We want to see advanced NLP technologies get out into great products, as the basis of great businesses. We built spaCy to make that happen.

Easy and Powerful

For any NLP task, there are always lots of competing algorithms. We don't believe in implementing them all and letting you choose. Instead, we just implement one – the best one. When better algorithms are developed, we can update the library without breaking your code or bloating the API. This approach makes spaCy both easier and more powerful than a pluggable architecture.

spaCy also features a unique whole-document design. Where other NLP libraries rely on sentence detection as a pre-process, spaCy reads the whole document at once, making it much more robust to informal and poorly formatted text.
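Because sentence boundaries are read off the parse rather than guessed up front, you can hand spaCy raw, messy text directly. Here's a rough sketch of that workflow (the example text is made up, and it assumes you've downloaded the English model):

import spacy

nlp = spacy.load('en')
# No sentence-splitting pre-process: pass the raw text straight in
doc = nlp(u'this is some badly punctuated text it mostly parses fine. honest')
# Sentence boundaries are derived from the syntactic parse
for sent in doc.sents:
    print(sent.text)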

Permissive open-source license (MIT)

We think spaCy is valuable software, so we made it free, to raise its value even higher. Making spaCy open-source puts us on the same side – we can tell you everything about how it works, and let you run it however you like. We think the software would be much less valuable as a service, which could disappear at any point.

lightning_tour.py
# pip install spacy && python -m spacy.en.download
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en')
# Process a document of any size; spaCy expects unicode text
text = open('war_and_peace.txt', encoding='utf8').read()
doc = nlp(text)

from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
# All strings mapped to integers, for easy export to numpy
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])

from reddit_corpus import RedditComments
reddit = RedditComments('/path/to/reddit/corpus')
# Parse a stream of documents, with multi-threading (no GIL!)
# Processes over 100,000 tokens per second.
for doc in nlp.pipe(reddit.texts, batch_size=10000, n_threads=4):
    # Multi-word expressions, such as names and dates,
    # can be merged into single tokens
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, ent.ent_type_)
    # Efficient, lossless serialization: all annotations are saved,
    # at roughly the same size as the uncompressed text
    byte_string = doc.to_bytes()


About spaCy
What we do

spaCy helps you write programs that do clever things with text. You give it a string of characters, and it gives you an object that provides multiple useful views of the text's meaning and linguistic structure. Specifically, spaCy features a high-performance tokenizer, part-of-speech tagger, named entity recognizer and syntactic dependency parser, with built-in support for word vectors. All of the functionality is united behind a clean, high-level Python API that makes it easy to use the different annotations together.
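As a rough sketch of how those views combine on a single object (the sentence is illustrative, and the snippet assumes the English model is installed):

import spacy

nlp = spacy.load('en')
doc = nlp(u'Google bought a London startup for $400 million.')

for token in doc:
    # Tagger and parser annotations live on the same Token objects
    print(token.text, token.pos_, token.dep_, token.head.text)

for ent in doc.ents:
    # Named entities recognized in the same pass over the document
    print(ent.text, ent.label_)

# Each token also carries a word vector
print(doc[0].vector.shape)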

To make spaCy as fast and easy to install as possible, we built it from the ground up with custom components, custom implementations and, where necessary, custom algorithms. It's written in clean but efficient Cython code, which lets us manage both the low-level details and the high-level Python API in a single codebase.


Benchmarks
State-of-the-art speed and accuracy

spaCy is committed to rigorous evaluation under standard methodology. Two peer-reviewed papers in 2015 confirm that it offers the fastest syntactic parser in the world and that its accuracy is within 1% of the best available. The few systems that are more accurate are 20× slower or more.

The first of the evaluations was published by Yahoo! Labs and Emory University, as part of a survey of current parsing technologies (Choi et al., 2015). Their results and subsequent discussions helped us develop a novel psychologically-motivated technique to improve spaCy's accuracy, which we published in joint work with Macquarie University (Honnibal and Johnson, 2015).

System    Language  Accuracy  Speed (WPS)
spaCy     Cython    91.8      13,963
ClearNLP  Java      91.7      10,271
CoreNLP   Java      89.6      8,602
MATE      Java      92.5      550
Turbo     C++       92.4      349

Latest Blog Posts
Read more about NLP

Statistical NLP in the Ten Hundred Most Common English Words

When I was little, my favorite TV shows all had talking computers. Now I’m big and there are still no talking computers, so I’m trying to make some myself. Well, we can make computers say things. But when we say things back, they don’t really understand. Why not?

Rebuilding a Website with Modular Markup Components

In a small team, everyone should be able to contribute content to the website and make use of the full set of visual components, without having to worry about design or write complex HTML. To help us write docs, tutorials and blog posts about spaCy, we've developed a powerful set of modularized markup components, implemented using Jade.

Sense2vec with spaCy and Gensim

If you were doing text analytics in 2015, you were probably using word2vec. Sense2vec (Trask et al., 2015) is a new twist on word2vec that lets you learn more interesting, detailed and context-sensitive word vectors. This post motivates the idea, explains our implementation, and comes with an interactive demo that we've found surprisingly addictive.

Sign up for the spaCy newsletter

Stay in the loop!

Receive updates about new releases, tutorials and more.