SQuAD

The Stanford Question Answering Dataset

What is SQuAD?

Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets.
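For reference, each entry in the JSON release pairs a reading passage with crowdsourced questions and character-offset answer spans. Below is a minimal Python sketch of reading one example; it assumes the train-v1.1.json file from the download section and follows the dataset's published field names (data, paragraphs, context, qas, answers, answer_start).

    import json

    # Load the training split (train-v1.1.json from the download section below).
    with open("train-v1.1.json") as f:
        squad = json.load(f)["data"]

    article = squad[0]                    # one Wikipedia article
    paragraph = article["paragraphs"][0]  # one reading passage
    qa = paragraph["qas"][0]              # one crowdsourced question

    answer = qa["answers"][0]
    start = answer["answer_start"]        # character offset into the passage
    end = start + len(answer["text"])

    print(qa["question"])
    print(answer["text"])
    # Every answer is a span of the passage, so the slice reproduces it exactly.
    assert paragraph["context"][start:end] == answer["text"]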

Explore SQuAD and model predictions

Getting Started

We've built a few resources to help you get started with the dataset.

Download a copy of the dataset (distributed under the CC BY-SA 4.0 license): the training set (train-v1.1.json) and the development set (dev-v1.1.json).

To evaluate your models, we have also made available the evaluation script we will use for official evaluation, along with a sample prediction file that the script will take as input. To run the evaluation, use python evaluate-v1.1.py <path_to_dev-v1.1> <path_to_predictions>.
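The prediction file is a JSON object mapping each question id to a single predicted answer string, matching the shape of the sample prediction file. Here is a minimal sketch of writing such a file and scoring it; the placeholder "model" simply guesses the first word of each passage and is only meant to show the plumbing.

    import json

    # Build a prediction file in the format evaluate-v1.1.py expects:
    # a JSON object mapping question id -> predicted answer string.
    with open("dev-v1.1.json") as f:
        dataset = json.load(f)["data"]

    predictions = {}
    for article in dataset:
        for paragraph in article["paragraphs"]:
            guess = paragraph["context"].split()[0]  # placeholder prediction
            for qa in paragraph["qas"]:
                predictions[qa["id"]] = guess

    with open("predictions.json", "w") as f:
        json.dump(predictions, f)

    # Then score it with the official script:
    #   python evaluate-v1.1.py dev-v1.1.json predictions.json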

Once you have built a model that works to your expectations on the dev set, you can submit it to get official scores on the dev set and a hidden test set. To preserve the integrity of test results, we do not release the test set to the public. Instead, we require you to submit your model so that we can run it on the test set for you. Here's a tutorial walking you through official evaluation of your model:

Submission Tutorial

Because SQuAD is an ongoing effort, we expect the dataset to evolve.

To keep up to date with major changes to the dataset, please subscribe:

Have Questions?

Ask us questions at our Google Group or at pranavsr@stanford.edu.


Test Set Leaderboard

Since the release of our dataset (and paper), the community has made rapid progress! Here are the ExactMatch (EM) and F1 scores of the best models evaluated on the test and development sets of v1.1.
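For reference, Exact Match checks whether a prediction equals any of the ground-truth answers after light normalization, and F1 measures token overlap with the best-matching answer. The sketch below mirrors the logic of the official evaluate-v1.1.py (lowercasing, stripping punctuation and articles, collapsing whitespace); treat the script itself as authoritative.

    import re
    import string
    from collections import Counter

    def normalize(text):
        # Lowercase, drop punctuation and articles, collapse whitespace.
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())

    def exact_match(prediction, ground_truths):
        # 1.0 if the normalized prediction equals any normalized gold answer.
        return max(float(normalize(prediction) == normalize(gt)) for gt in ground_truths)

    def f1(prediction, ground_truths):
        # Token-level F1 against the best-matching gold answer.
        best = 0.0
        pred_tokens = normalize(prediction).split()
        for gt in ground_truths:
            gt_tokens = normalize(gt).split()
            overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
            if overlap == 0:
                continue
            precision = overlap / len(pred_tokens)
            recall = overlap / len(gt_tokens)
            best = max(best, 2 * precision * recall / (precision + recall))
        return best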

Rank | Model | Test EM | Test F1
1 | r-net (ensemble), Microsoft Research Asia | 74.5 | 82.0
2 | BiDAF (ensemble), Allen Institute for AI & University of Washington (Seo et al. '16) | 73.3 | 81.1
3 | Dynamic Coattention Networks (ensemble), Salesforce Research (Xiong & Zhong et al. '16) | 71.6 | 80.4
4 | r-net (single model), Microsoft Research Asia | 69.5 | 77.9
5 | BiDAF (single model), Allen Institute for AI & University of Washington (Seo et al. '16) | 68.0 | 77.3
5 | Multi-Perspective Matching (ensemble), IBM Research | 68.2 | 77.2
7 | Match-LSTM with Ans-Ptr Boundary (ensemble), Singapore Management University | 67.9 | 77.0
8 | Dynamic Coattention Networks (single model), Salesforce Research (Xiong & Zhong et al. '16) | 66.2 | 75.9
9 | Multi-Perspective Matching (single model), IBM Research | 65.5 | 75.1
10 | Match-LSTM with Bi-Ans-Ptr Boundary (single model), Singapore Management University | 64.7 | 73.7
11 | Fine-Grained Gating, Carnegie Mellon University (Yang et al. '16) | 62.5 | 73.3
12 | Dynamic Chunk Reader, IBM (Yu & Zhang et al. '16) | 62.5 | 71.0
13 | Match-LSTM with Ans-Ptr (Boundary), Singapore Management University (Wang & Jiang '16) | 60.5 | 70.7
14 | Match-LSTM with Ans-Ptr (Sequence), Singapore Management University (Wang & Jiang '16) | 54.5 | 67.7
15 | Logistic Regression Baseline, Stanford University (Rajpurkar et al. '16) | 40.4 | 51.0

Will your model outperform humans on the QA task?

Human Performance, Stanford University (Rajpurkar et al. '16) | 82.3 | 91.2

Development Set Leaderboard

While you are iterating on your models, use the development set to get an indication of their performance.

Model | Dev EM | Dev F1
RaSoR (ensemble), Google NY (Lee et al. '16) | 68.2 | 76.7
RaSoR, Google NY (Lee et al. '16) | 66.4 | 74.9
Dynamic Chunk Ranker with Convolution layer, IBM | 66.3 | 74.7
Attentive Chunker, IBM | 48.0 | 64.5