Industrial-strength
Natural Language
Processing

Thousands of researchers are trying to make
computers understand text. They're succeeding.
We help you get their work out of papers and
into production.

Install spaCy
Latest Release: v0.101.0

Built for Production

Most AI software is built for research. Over the last ten years, we've used a lot of that software, and built some of it ourselves, especially for natural language processing (NLP). But the faster the research has moved, the more impatient we've become. We want to see advanced NLP technologies get out into great products, as the basis of great businesses. We built spaCy to make that happen.

Easy and Powerful

For any NLP task, there are always lots of competing algorithms. We don't believe in implementing them all and letting you choose. Instead, we just implement one – the best one. When better algorithms are developed, we can update the library without breaking your code or bloating the API. This approach makes spaCy both easier and more powerful than a pluggable architecture. spaCy also features a unique whole-document design. Where other NLP libraries rely on sentence detection as a pre-process, spaCy reads the whole document at once, making it much more robust to informal and poorly formatted text.

Permissive open-source license (MIT)

We think spaCy is valuable software, so we made it free, to raise its value even higher. Making spaCy open-source puts us on the same side – we can tell you everything about how it works, and let you run it however you like. We think the software would be much less valuable as a service, which could disappear at any point.

lightning_tour.py
# pip install spacy && python -m spacy.en.download
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en')
# Process a document, of any size
text = open('war_and_peace.txt').read()
doc = nlp(text)

from spacy.attrs import *
# All strings mapped to integers, for easy export to numpy
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])

from reddit_corpus import RedditComments
reddit = RedditComments('/path/to/reddit/corpus')
# Parse a stream of documents, with multi-threading (no GIL!)
# Processes over 100,000 tokens per second.
for doc in nlp.pipe(reddit.texts, batch_size=10000, n_threads=4):
    # Multi-word expressions, such as names, dates etc
    # can be merged into single tokens
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, ent.ent_type_)
    # Efficient, lossless serialization --- all annotations
    # saved, same size as uncompressed text
    byte_string = doc.to_bytes()

spaCy is trusted by

About spaCy
What we do

spaCy helps you write programs that do clever things with text. You give it a string of characters, it gives you an object that provides multiple useful views of its meaning and linguistic structure. Specifically, spaCy features a high performance tokenizer, part-of-speech tagger, named entity recognizer and syntactic dependency parser, with built-in support for word vectors. All of the functionality is united behind a clean high-level Python API, that makes it easy to use the different annotations together.

To make spaCy as fast and easy to install as we could, we built it from the ground up from custom components, with custom implementations, and sometimes custom algorithms. It's written in clean but efficient Cython code, which allows us to manage both low level details and the high-level Python API in a single codebase.

What our users say...

Benchmarks
State-of-the-art speed and accuracy

spaCy is committed to rigorous evaluation under standard methodology. Two peer-reviewed papers in 2015 confirm that it offers the fastest syntactic parser in the world and that its accuracy is within 1% of the best available. The few systems that are more accurate are 20× slower or more.

The first of the evaluations was published by Yahoo! Labs and Emory University, as part of a survey of current parsing technologies (Choi et al., 2015). Their results and subsequent discussions helped us develop a novel psychologically-motivated technique to improve spaCy's accuracy, which we published in joint work with Macquarie University (Honnibal and Johnson, 2015).

SystemLanguageAccuracySpeed (WPS)
Cython91.813,963
ClearNLPJava91.710,271
CoreNLPJava89.68,602
MATEJava92.5550
TurboC++92.4349

Latest Blog Posts
Read more about NLP

SyntaxNet in context: Understanding Google's new TensorFlow NLP model

Yesterday, Google open sourced their Tensorflow-based dependency parsing library, SyntaxNet. The library gives access to a line of neural network parsing models published by Google researchers over the last two years. I've been following this work closely, and have been looking forward to the software being released. This post tries to provide some context around the release — what's new here, and how important is it?

spaCy now speaks German

Many people have asked us to make spaCy available for their language. Being based in Berlin, German was an obvious choice for our first second language. Now spaCy can do all the cool things you use for processing English on German text too. But more importantly, teaching spaCy to speak German required us to drop some comfortable but English-specific assumptions about how language works and made spaCy fit to learn more languages in the future.

Multi-threading spaCy's parser and named entity recogniser

In v0.100.3, we quietly rolled out support for GIL-free multi-threading for spaCy's syntactic dependency parsing and named entity recognition models. Because these models take up a lot of memory, we've wanted to release the global interpretter lock (GIL) around them for a long time. When we finally did, it seemed a little too good to be true, so we delayed celebration — and then quickly moved on to other things. It's now past time for a write-up.

Sign up for the spaCy newsletter

Stay in the loop!

Receive updates about new releases, tutorials and more.