Industrial-Strength Natural Language Processing in Python

Get things done

spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it. It's easy to install, and its API is simple and productive. We like to think of spaCy as the Ruby on Rails of Natural Language Processing.

Blazing fast

spaCy excels at large-scale information extraction tasks. It's written from the ground up in carefully memory-managed Cython. Independent research in 2015 found spaCy's syntactic parser to be the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using.

Deep learning

spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, scikit-learn, Gensim and the rest of Python's awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.

Edit the code & try spaCy
# pip install spacy
# python -m spacy download en_core_web_sm

import spacy

# Load the small English pipeline: tokenizer, tagger, parser and NER
# (the "sm" package does not ship with pretrained word vectors)
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Features

Non-destructive tokenization
Named entity recognition
Support for 63+ languages
46 statistical models for 16 languages
Pretrained word vectors
State-of-the-art speed
Easy deep learning integration
Part-of-speech tagging
Labelled dependency parsing
Syntax-driven sentence segmentation
Built-in visualizers for syntax and NER
Convenient string-to-hash mapping
Export to numpy data arrays
Efficient binary serialization
Easy model packaging and deployment
Robust, rigorously evaluated accuracy
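
To make the first feature in the list concrete, here is a minimal standard-library sketch of what "non-destructive tokenization" means: every token keeps its trailing whitespace, so the original text can always be rebuilt exactly. The `tokenize` helper and its regular expression are illustrative assumptions, not spaCy's tokenizer, which applies language-specific rules.

```python
import re

# Hypothetical toy tokenizer (not spaCy's): a token is a run of word
# characters, a single punctuation mark, or leading whitespace at the very
# start of the text. Each token also records the whitespace that follows it.
TOKEN_PATTERN = re.compile(r"(\w+|[^\w\s]|^\s+)(\s*)")


def tokenize(text):
    """Return (token, trailing_whitespace) pairs that cover every character."""
    return [(m.group(1), m.group(2)) for m in TOKEN_PATTERN.finditer(text)]


text = "Hello, world!  Let's go."
tokens = tokenize(text)

# Non-destructive: joining each token with its trailing whitespace
# reproduces the input exactly, including the double space.
restored = "".join(token + space for token, space in tokens)

print([token for token, _ in tokens])
print(restored == text)
```

Because nothing is thrown away, character offsets into the original text stay valid for every token, which is what lets annotations be mapped back onto the raw input.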

Out now
spaCy v3.0: Transformer-based pipelines, new training system, project templates & more

spaCy v3.0 features all-new transformer-based pipelines that bring spaCy's accuracy right up to the current state of the art. You can use any pretrained transformer to train your own pipelines, and even share one transformer between multiple components with multi-task learning. Training is now fully configurable and extensible, and you can define your own custom models using PyTorch, TensorFlow and other frameworks. The new spaCy projects system lets you describe whole end-to-end workflows in a single file, giving you an easy path from prototype to production, and making it easy to clone and adapt best-practice projects for your own use cases.

From the makers of spaCy
Prodigy: Radically efficient machine teaching

Prodigy is an annotation tool so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration. Whether you're working on entity recognition, intent detection or image classification, Prodigy can help you train and evaluate your models faster.

In this free and interactive online course you’ll learn how to use spaCy to build advanced natural language understanding systems, using both rule-based and machine learning approaches. It includes 55 exercises featuring videos, slide decks, multiple-choice questions and interactive coding practice in the browser.

New in v2.1
BERT-style language model pretraining

Learn more from small training corpora by initializing your models with knowledge from raw text. The new pretrain command teaches spaCy's CNN model to predict words based on their context, producing context-sensitive word representations. If you've seen Google's BERT system or fast.ai's ULMFiT, spaCy's pretraining is similar – but much more efficient. It's still experimental, but users are already reporting good results, so give it a try!

Benchmarks

In 2015, independent researchers from Emory University and Yahoo! Labs showed that spaCy offered the fastest syntactic parser in the world and that its accuracy was within 1% of the best available (Choi et al., 2015). spaCy v2.0, released in 2017, is more accurate than any of the systems Choi et al. evaluated.

System	Year	Language	Accuracy	Speed (wps)
spaCy v2.x	2017	Python / Cython	92.6	n/a
spaCy v1.x	2015	Python / Cython	91.8	13,963
ClearNLP	2015	Java	91.7	10,271
CoreNLP	2015	Java	89.6	8,602
MATE	2015	Java	92.5	550
Turbo	2015	C++	92.4	349
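
To put the speed column in practical terms, the following sketch converts words per second into wall-clock time for a corpus and checks the "within 1%" accuracy claim against the 2015 rows. The figures are copied from the table above; the corpus size of one million words is a hypothetical value chosen only for illustration.

```python
# Accuracy and speed (words per second) for the 2015 systems in the table.
# spaCy v2.x is left out because its speed is listed as n/a.
systems = {
    "spaCy v1.x": {"accuracy": 91.8, "wps": 13963},
    "ClearNLP": {"accuracy": 91.7, "wps": 10271},
    "CoreNLP": {"accuracy": 89.6, "wps": 8602},
    "MATE": {"accuracy": 92.5, "wps": 550},
    "Turbo": {"accuracy": 92.4, "wps": 349},
}

corpus_words = 1_000_000  # hypothetical corpus size, not from the source

# Time to parse the corpus at each system's reported speed.
seconds = {name: corpus_words / row["wps"] for name, row in systems.items()}
for name, secs in sorted(seconds.items(), key=lambda item: item[1]):
    print(f"{name:<12}{secs:>10.1f} s")

# Gap between spaCy v1.x and the most accurate 2015 system, in points.
best_2015 = max(row["accuracy"] for row in systems.values())
gap = best_2015 - systems["spaCy v1.x"]["accuracy"]
print(f"Gap to best 2015 accuracy: {gap:.1f} points")
```

At these rates the fastest and slowest systems differ by a factor of about 40, while the accuracy gap between spaCy v1.x and the best 2015 system stays under one point.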
