What is CheXpert?

CheXpert is a large dataset of chest X-rays and a competition for automated chest X-ray interpretation, featuring uncertainty labels and radiologist-labeled reference standard evaluation sets.

Read the Paper (Irvin & Rajpurkar et al.)

Why CheXpert?

Chest radiography is the most common imaging examination globally, critical for screening, diagnosis, and management of many life-threatening diseases. Automated chest radiograph interpretation at the level of practicing radiologists could provide substantial benefit in many medical settings, from improved workflow prioritization and clinical decision support to large-scale screening and global population health initiatives. For progress in both development and validation of automated algorithms, we realized there was a need for a labeled dataset that (1) was large, (2) had strong reference standards, and (3) provided expert human performance metrics for comparison.

Leaderboard

Will your model perform as well as radiologists in detecting different pathologies in chest X-rays?

Rank | Date | Model | AUC | Num Rads Below Curve
1 | Aug 31, 2020 | SuperCNN ensemble | 0.930 | 2.8
2 | Sep 01, 2019 | Hierarchical-Learning-V1 (ensemble) Vingroup Big Data Institute https://arxiv.org/abs/1911.06475 | 0.930 | 2.6
3 | Oct 15, 2019 | Conditional-Training-LSR ensemble | 0.929 | 2.6
4 | Dec 04, 2019 | Hierarchical-Learning-V4 (ensemble) Vingroup Big Data Institute https://arxiv.org/abs/1911.06475 | 0.929 | 2.6
5 | Oct 10, 2019 | YWW (ensemble) JF&NNU https://github.com/jfhealthcare/Chexpert | 0.929 | 2.8
6 | Oct 17, 2019 | Conditional-Training-LSR-V1 ensemble | 0.929 | 2.6
7 | Nov 17, 2019 | Hierarchical-Learning-V0 (ensemble) Vingroup Big Data Institute | 0.929 | 2.6
8 | Sep 09, 2019 | Multi-Stage-Learning-CNN-V3 (ensemble) VINBDI Medical Imaging Team | 0.928 | 2.6
9 | Dec 30, 2019 | DeepCNNsGM (ensemble) HUST | 0.928 | 2.6
10 | Dec 30, 2019 | DeepCNNs (ensemble) HUST | 0.927 | 2.6
11 | Dec 23, 2019 | inisis https://github.com/inisis/chexpert | 0.927 | 3.0
11 | Dec 16, 2019 | desmond https://github.com/inisis/chexpert | 0.927 | 3.0
12 | Sep 19, 2019 | SenseXDR ensemble | 0.927 | 2.6
13 | Sep 18, 2019 | ihil (ensemble) UESTC | 0.927 | 2.6
14 | Jul 31, 2019 | bedineze https://github.com/deadpoppy/CheXpert-Challenge | 0.926 | 3.0
15 | Sep 01, 2019 | yww | 0.926 | 2.6
16 | Sep 16, 2019 | DRNet (ensemble) UESTC and SenseTime | 0.926 | 2.6
17 | Dec 26, 2019 | Hoang_VB_1 | 0.925 | 2.4
17 | Feb 11, 2020 | alimebkovkz | 0.925 | 2.4
18 | Dec 26, 2019 | hoang_VB_2 | 0.925 | 2.4
19 | Dec 12, 2019 | hoangnguyen_VB | 0.924 | 2.4
20 | Dec 17, 2019 | tdh512194 | 0.924 | 2.4
21 | Sep 04, 2019 | uestc | 0.924 | 2.6
22 | Dec 09, 2019 | hoanganhbka | 0.924 | 2.4
23 | Dec 04, 2019 | as-hust-v3 ensemble | 0.924 | 2.4
24 | Jan 10, 2020 | hoanganh_VB_VN3 | 0.924 | 2.4
25 | Sep 14, 2019 | Hierarchical-CNN-Ensemble-V1 (ensemble) Vingroup Big Data Institute | 0.924 | 2.4
26 | Apr 25, 2020 | ltts-mumbai | 0.923 | 2.6
27 | Apr 25, 2020 | CIO_Mumbai | 0.923 | 2.6
28 | Dec 10, 2019 | hoanganhcnu1 | 0.923 | 2.4
29 | Aug 22, 2019 | Multi-Stage-Learning-CNN-V2 (ensemble) VINBDI Medical Imaging Team | 0.923 | 2.6
30 | Dec 16, 2019 | Weighted-CNN (ensemble) HUST | 0.923 | 2.6
31 | Dec 10, 2019 | hoanganhcnu2 | 0.923 | 2.4
32 | Aug 17, 2019 | yww211 https://github.com/inisis/chexpert | 0.923 | 2.4
33 | Dec 04, 2019 | as-hust-v1 ensemble | 0.923 | 2.4
34 | Dec 16, 2019 | Average-CNN (ensemble) HUST | 0.922 | 2.4
35 | Dec 04, 2019 | as-hust-v2 ensemble | 0.922 | 2.8
36 | Aug 04, 2020 | MaxAUC ensemble | 0.922 | 2.4
37 | Nov 21, 2019 | hoangnguyenkcv17 | 0.921 | 2.4
38 | Sep 04, 2019 | null | 0.922 | 2.2
39 | Sep 02, 2020 | SuperCNNv3 ensemble | 0.921 | 2.4
40 | Aug 13, 2019 | null | 0.921 | 2.2
41 | Aug 15, 2019 | zjr (ensembel) CSU | 0.921 | 2.6
42 | Aug 15, 2019 | hyc ensemble | 0.921 | 2.4
43 | Jan 10, 2020 | HOANG_VB_VN_2 ensemble | 0.920 | 2.4
44 | Aug 18, 2019 | BDNB ensemble | 0.919 | 2.6
45 | Dec 20, 2019 | thangsnu | 0.919 | 2.4
46 | Jul 16, 2019 | JF Coolver ensemble (ensemble model) | 0.919 | 2.6
47 | Nov 21, 2019 | hoangnguyenkcv17 | 0.919 | 2.4
48 | Jul 27, 2019 | bedineze https://github.com/deadpoppy/CheXpert-Challenge | 0.919 | 2.4
49 | Nov 07, 2019 | brian-baseline-v2 ensemble | 0.919 | 2.2
50 | Jun 22, 2020 | ltts-mumbai | 0.918 | 2.6
51 | Jun 22, 2019 | Mehdi_You (ensemble) IPM_HPC | 0.918 | 2.6
52 | Aug 15, 2019 | Deep-CNNs-V1 ensemble | 0.917 | 2.2
53 | Nov 22, 2019 | thangbk (ensemble) SNU | 0.917 | 2.0
54 | Jul 18, 2019 | Ensemble_v2 Ian, Wingspan https://github.com/Ien001/CheXpert_challenge_2019 | 0.917 | 2.4
55 | Jan 21, 2020 | ltts-mumbai | 0.917 | 2.2
56 | Jun 22, 2020 | ltts-mumbai | 0.916 | 2.4
57 | Nov 25, 2019 | ATT-AW-v1 ensemble | 0.916 | 2.4
58 | Dec 14, 2019 | desmond https://github.com/inisis/chexpert | 0.916 | 2.6
58 | Jun 22, 2020 | CIO_Mumbai | 0.916 | 2.6
59 | Oct 10, 2019 | desmond https://github.com/inisis/chexpert | 0.916 | 2.6
60 | Aug 25, 2019 | Multi-Stage-Learning-CNN-V0 ensemble | 0.916 | 2.2
61 | Aug 19, 2019 | TGNB ensemble | 0.915 | 2.6
62 | Jul 15, 2019 | Deadpoppy Ensemble (ensemble model) | 0.915 | 2.2
63 | Nov 18, 2019 | hoangnguyenkcv-ensemble28 ensemble | 0.915 | 2.2
64 | Aug 05, 2019 | zhangjingyang | 0.915 | 2.4
65 | Dec 10, 2019 | thangbk | 0.915 | 2.4
66 | Jun 22, 2020 | CIO_Mumbai | 0.914 | 2.6
67 | Aug 16, 2019 | GRNB ensemble | 0.914 | 2.4
68 | Jul 31, 2019 | Deep-CNNs (ensemble) Vingroup Big Data Institute | 0.914 | 2.0
69 | Dec 02, 2019 | Sky-Model ensemble | 0.913 | 2.2
70 | Jul 23, 2019 | JF Deadpoppy ensemble | 0.913 | 2.2
71 | Aug 14, 2019 | yww https://github.com/inisis/chexpert | 0.913 | 2.0
72 | Aug 17, 2019 | zjy ensemble | 0.912 | 2.2
73 | Aug 01, 2019 | songtao | 0.911 | 2.2
74 | Oct 25, 2019 | bhtrung | 0.911 | 2.2
74 | Oct 27, 2019 | hoangnguyenkcv | 0.911 | 2.2
75 | Apr 25, 2020 | CIO_Mumbai | 0.911 | 2.0
76 | Apr 25, 2020 | ltts-mumbai | 0.911 | 2.0
77 | Oct 24, 2019 | brian-baseline ensemble | 0.911 | 2.0
78 | Dec 10, 2019 | thangsnu | 0.910 | 2.2
79 | Dec 09, 2019 | HinaNetV2 (ensemble) VietAI http://vietai.org | 0.909 | 2.2
80 | Jan 12, 2020 | vdnnn (ensemble) LTTS | 0.908 | 1.8
81 | Jul 31, 2019 | guran_rib | 0.908 | 2.0
82 | Jan 12, 2020 | vbn (single model) LTTS | 0.907 | 1.6
83 | Apr 06, 2019 | muti_base (ensemble) SCU_MILAB | 0.907 | 1.6
84 | Jan 23, 2019 | Stanford Baseline (ensemble) Stanford University https://arxiv.org/abs/1901.07031 | 0.907 | 1.8
85 | Sep 11, 2019 | {ForwardModelEnsembleCorrected} (ensemble) Stanford | 0.906 | 1.6
86 | Jun 14, 2019 | Multi-CNN ensemble | 0.905 | 2.4
87 | Jul 25, 2019 | huangyechong | 0.905 | 1.8
88 | Oct 02, 2019 | ForwardMECorrectedFull (ensemble) Institution | 0.905 | 1.6
89 | Jul 21, 2019 | Multi-CNN ensemble | 0.905 | 2.0
90 | Jun 22, 2019 | JustAnotherDensenet single model | 0.904 | 1.2
91 | Mar 03, 2020 | DeepLungsEnsemble Alimbekov R. & Vassilenko I. | 0.902 | 1.8
92 | Sep 29, 2019 | Nakajima_ayase | 0.901 | 1.4
93 | Jul 04, 2019 | Ensemble_v1 Ian, Wingspan https://github.com/Ien001/CheXpert_challenge_2019 | 0.901 | 1.6
94 | May 24, 2019 | MLC11 NotDense (single-model) Leibniz University Hannover | 0.900 | 1.6
95 | Jan 23, 2020 | ltts-mumbai | 0.900 | 1.2
96 | Jul 22, 2019 | adoudou | 0.899 | 1.6
97 | Jul 22, 2019 | {AVG_MAX} (ensemble) NNU | 0.899 | 2.0
98 | Aug 01, 2019 | null | 0.899 | 1.6
99 | Jul 24, 2019 | llllldz single model | 0.899 | 1.6
100 | Nov 09, 2019 | taolei | 0.898 | 1.8
101 | Jun 03, 2019 | ForwardModelEnsemble (ensemble) Stanford | 0.897 | 1.6
102 | Feb 05, 2020 | ylz-v01 single model | 0.896 | 1.6
103 | Jun 25, 2019 | ldz single model | 0.896 | 1.4
104 | Feb 17, 2020 | Densenet single | 0.896 | 1.4
105 | Jul 13, 2019 | Deadpoppy Single single model | 0.895 | 1.8
106 | Aug 02, 2019 | adoudou | 0.895 | 1.6
107 | Dec 26, 2019 | MM1 ensemble | 0.894 | 1.6
108 | Jul 21, 2019 | hust (single model) HUST | 0.895 | 1.0
109 | Jul 29, 2019 | zhujieru | 0.894 | 1.6
109 | Jul 28, 2019 | huangyechong | 0.894 | 1.6
110 | Jul 04, 2019 | U-Random-Ind (single) BHSB | 0.894 | 1.0
111 | Jun 03, 2019 | HybridModelEnsemble (ensemble) Stanford | 0.892 | 1.6
112 | Feb 05, 2020 | ylz-v02 single model | 0.891 | 1.0
113 | Jul 31, 2019 | pause single model | 0.890 | 1.0
114 | Jul 22, 2019 | Haruka_Hamasaki | 0.890 | 0.80
114 | Aug 27, 2019 | Haruka_Hamasaki | 0.890 | 0.80
115 | Aug 05, 2019 | DenseNet169 at 320x320 (single model) Lafayette | 0.889 | 1.4
116 | Apr 12, 2020 | LR-baseline (ensemble) IITB | 0.889 | 1.4
117 | Jul 05, 2019 | DataAugFTW (single model) University Hannover | 0.888 | 1.0
118 | Aug 25, 2020 | {koala} (single model) SJTU | 0.888 | 1.0
119 | Oct 22, 2019 | baseline3 single model | 0.886 | 1.2
120 | Jun 04, 2019 | Coolver XH single model | 0.884 | 0.80
121 | Mar 24, 2019 | Naive Densenet single model https://github.com/simongrest/chexpert-entries | 0.883 | 1.2
122 | May 04, 2019 | {densenet} (single model) Microsoft | 0.880 | 1.2
123 | May 14, 2019 | MLC11 Baseline (single-model) Leibniz University Hannover | 0.878 | 0.60
124 | May 27, 2019 | null | 0.878 | 0.80
125 | Jul 07, 2019 | DenseNet single | 0.876 | 1.2
126 | Nov 22, 2019 | HCL1 (single model) LTTS | 0.876 | 1.0
127 | Oct 28, 2019 | hoangnguyenkcv | 0.875 | 1.0
128 | Jul 11, 2019 | MLGCN (single model) sensetime | 0.875 | 1.2
129 | Jun 03, 2019 | Multi-CNN (ensemble) VinGroup Big Data Institute | 0.873 | 0.40
130 | Oct 14, 2019 | baseline1 (single model) Endimension | 0.868 | 0.80
131 | May 03, 2020 | DSENet single model | 0.865 | 0.60
132 | Apr 26, 2019 | aliraza1944 | 0.863 | 0.80
133 | May 27, 2019 | null | 0.862 | 0.40
134 | Apr 25, 2019 | {GoDense} (single model) UPenn | 0.861 | 1.0
135 | May 16, 2019 | yww | 0.861 | 0.40
136 | Jun 29, 2020 | BASELINE Acorn single model | 0.860 | 0.60
137 | Oct 22, 2019 | baseline2 single model | 0.858 | 1.0
138 | Apr 28, 2019 | rayOfLightSingle (Single Model) GeorgiaTech CSE6250 Team58 | 0.851 | 0.40
139 | Apr 24, 2019 | Model_Team_34 (single model) Gatech | 0.850 | 0.60
140 | Apr 27, 2019 | habbes | 0.850 | 0.40
141 | Apr 27, 2019 | Baseline Ensemble | 0.848 | 0.20
142 | Nov 01, 2019 | HinaNet (single model) VietAI http://vietai.org | 0.844 | 0.40
143 | Apr 24, 2019 | singlehead_models (single model combined) Gatech CSE6250 Team30 | 0.842 | 0.20
144 | Apr 24, 2019 | multihead_model (one model for all pathologies) Gatech CSE6250 Team30 | 0.838 | 0.40
145 | Jun 13, 2019 | MLC9_Densenet (single model) Leibniz University Hannover | 0.834 | 0.40
146 | Feb 16, 2020 | Grp12v2USup2OSamp (ensemble) AITD | 0.830 | 0.20
147 | Feb 15, 2020 | Grp12v2USup2USamp (ensemble) AITD | 0.829 | 0.20
148 | May 16, 2019 | DNET121-single Ian, Wingspan http://www.wingspan.cn/ | 0.822 | 0.0
149 | Jan 14, 2020 | Grp12v1UShuff (ensemble) AITD | 0.821 | 0.40
150 | Jul 02, 2020 | 12ASLv2 (single) AITD | 0.769 | 0.0
151 | May 11, 2019 | DenseNet121 (single model) hemil10 | 0.760 | 0.0
152 | Jul 02, 2020 | 12ASLv1 (single) AITD | 0.736 | 0.0
153 | Jun 23, 2020 | Grp12v2HDCNN AITD | 0.732 | 0.0
154 | Apr 26, 2019 | rayOfLight (ensemble) GeorgiaTech CSE6250 Team58 | 0.727 | 0.0
155 | Nov 25, 2019 | BASELINE DenseNet121 single model | 0.724 | 0.0
156 | Jun 23, 2020 | Grp12v1HDCNN AITD | 0.701 | 0.0

How can I participate?

CheXpert uses a hidden test set for official evaluation of models. Teams submit their executable code on Codalab, which is then run on a test set that is not publicly readable. Such a setup preserves the integrity of the test results.

Here's a tutorial walking you through official evaluation of your model. Once your model has been evaluated officially, your scores will be added to the leaderboard.

How did we collect and label CheXpert?

CheXpert is a large public dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients. We retrospectively collected the chest radiographic examinations from Stanford Hospital, performed between October 2002 and July 2017 in both inpatient and outpatient centers, along with their associated radiology reports.

Label Extraction from Radiology Reports

Each report was labeled for the presence of 14 observations as positive, negative, or uncertain. We decided on the 14 observations based on the prevalence in the reports and clinical relevance, conforming to the Fleischner Society’s recommended glossary whenever applicable. We then developed an automated rule-based labeler to extract observations from the free text radiology reports to be used as structured labels for the images.

Our labeler is set up in three distinct stages: mention extraction, mention classification, and mention aggregation. In the mention extraction stage, the labeler extracts mentions of the observations from the Impression section of the radiology report, which summarizes the key findings of the radiographic study. In the mention classification stage, each mention of an observation is classified as negative, uncertain, or positive. In the mention aggregation stage, we aggregate the classifications of the mentions to arrive at a final label for each of the 14 observations (blank for unmentioned, 0 for negative, -1 for uncertain, and 1 for positive).
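
To make the pipeline concrete, here is a minimal, hypothetical sketch of the three stages in Python. The observation list, negation and uncertainty cue words, and matching rules below are illustrative stand-ins only; the actual CheXpert labeler uses a far richer rule set.

```python
import re

# Hypothetical vocabulary for illustration; not the CheXpert labeler's rules.
OBSERVATIONS = ["cardiomegaly", "edema", "pleural effusion"]
NEGATION_CUES = ["no ", "without ", "negative for "]
UNCERTAINTY_CUES = ["may ", "possible ", "cannot exclude"]

def extract_mentions(impression: str):
    """Stage 1: find observation mentions in the Impression section."""
    text = impression.lower()
    return [obs for obs in OBSERVATIONS if obs in text]

def classify_mention(impression: str, obs: str):
    """Stage 2: classify a mention as 1 (positive), 0 (negative), or -1 (uncertain)."""
    text = impression.lower()
    # Look at the sentence that contains the mention.
    sentence = next(s for s in re.split(r"[.;]", text) if obs in s)
    if any(cue in sentence for cue in NEGATION_CUES):
        return 0
    if any(cue in sentence for cue in UNCERTAINTY_CUES):
        return -1
    return 1

def label_report(impression: str):
    """Stage 3: aggregate mention classes into final labels.

    Returns a dict mapping each observation to 1, 0, -1, or None (blank)
    when the observation is not mentioned at all.
    """
    labels = {obs: None for obs in OBSERVATIONS}
    for obs in extract_mentions(impression):
        labels[obs] = classify_mention(impression, obs)
    return labels

print(label_report("Mild cardiomegaly. No pleural effusion. Possible edema."))
# {'cardiomegaly': 1, 'edema': -1, 'pleural effusion': 0}
```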

Use the labeling tool

What is our baseline model?

We train models that take as input a single-view chest radiograph and output the probability of each of the 14 observations. When more than one view is available, the models output the maximum probability of the observations across the views.
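
As an illustration of this multi-view aggregation, the sketch below assumes a hypothetical `model` function that maps a single preprocessed view to a vector of 14 probabilities; the element-wise maximum across views then gives the study-level prediction described above.

```python
import numpy as np

def predict_study(model, views):
    """Study-level prediction: element-wise max of per-view probabilities.

    `model` is a placeholder for any single-view network returning a
    length-14 probability vector; `views` is a list of preprocessed images.
    """
    per_view = np.stack([model(view) for view in views])  # shape (n_views, 14)
    return per_view.max(axis=0)                           # shape (14,)
```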

Leveraging Uncertainty Labels

The training labels in the dataset for each observation are either 0 (negative), 1 (positive), or u (uncertain). We explore different approaches to using the uncertainty labels during model training (a small sketch of the simplest mappings follows the list below).

  • U-Ignore: We ignore the uncertain labels during training.
  • U-Zeroes: We map all instances of the uncertain label to 0.
  • U-Ones: We map all instances of the uncertain label to 1.
  • U-SelfTrained: We first train a model using the U-Ignore approach to convergence, then use that model's predicted probabilities to re-label each of the uncertain labels.
  • U-MultiClass: We treat the uncertainty label as its own class.
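
The three simplest policies amount to relabeling (or masking) the training targets. Below is a small illustrative sketch, assuming uncertain labels are encoded as -1 as in the labeling scheme above; U-SelfTrained and U-MultiClass additionally require changes to the model or training loop and are not shown.

```python
import numpy as np

# Labels follow the encoding above: 1 = positive, 0 = negative, -1 = uncertain.
labels = np.array([[1., -1., 0.],
                   [-1., 0., 1.]])

def u_zeros(y):
    """U-Zeroes: treat every uncertain label as negative."""
    return np.where(y == -1, 0., y)

def u_ones(y):
    """U-Ones: treat every uncertain label as positive."""
    return np.where(y == -1, 1., y)

def u_ignore_mask(y):
    """U-Ignore: mask selecting labels that contribute to the training loss."""
    return y != -1

print(u_zeros(labels))        # uncertain entries become 0
print(u_ones(labels))         # uncertain entries become 1
print(u_ignore_mask(labels))  # False where the label was uncertain
```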

We focus on the evaluation of 5 observations which we call the competition tasks, selected based on clinical importance and prevalence: (a) Atelectasis, (b) Cardiomegaly, (c) Consolidation, (d) Edema, and (e) Pleural Effusion. We compare the performance of the different uncertainty approaches on a validation set of 200 studies, on which the consensus of three radiologist annotations serves as ground truth. Our baseline model is selected based on the best performing approach on each competition task on the validation set: U-Ones for Atelectasis and Edema, U-MultiClass for Cardiomegaly and Pleural Effusion, and U-SelfTrained for Consolidation.

How is the test designed?

The test set consists of 500 studies from 500 unseen patients. Eight board-certified radiologists individually annotated each of the studies in the test set, classifying each observation into one of present, uncertain likely, uncertain unlikely, and absent. Their annotations were binarized such that all present and uncertain likely cases are treated as positive and all absent and uncertain unlikely cases are treated as negative. The majority vote of 5 radiologist annotations serves as a strong ground truth; the remaining 3 radiologist annotations were used to benchmark radiologist performance.
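
A minimal sketch of this binarization and majority vote is shown below; the four categories are assumed here to be given as strings, which is an illustrative choice rather than the actual annotation format.

```python
# Binarization described above: present / uncertain likely -> positive,
# absent / uncertain unlikely -> negative.
BINARIZE = {"present": 1, "uncertain likely": 1,
            "uncertain unlikely": 0, "absent": 0}

def majority_ground_truth(annotations):
    """Ground truth for one observation from the 5 binarized radiologist reads."""
    votes = [BINARIZE[a] for a in annotations]
    return int(sum(votes) > len(votes) / 2)  # majority positive -> 1

print(majority_ground_truth(
    ["present", "uncertain likely", "absent", "uncertain unlikely", "present"]))  # 1
```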

For each of the 3 individual radiologists and for their majority vote, we compute sensitivity (recall), specificity, and precision against the test set ground truth. To compare the model to radiologists, we plot the radiologist operating points with the model on both the ROC and Precision-Recall (PR) space. We examine whether the radiologist operating points lie below the curves to determine if the model is superior to the radiologists.
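
The sketch below shows one way such a comparison could be computed with scikit-learn: the radiologist's operating point is derived from binary predictions, and the model's ROC curve is interpolated at the radiologist's false positive rate. This is an illustration under those assumptions; the exact evaluation code behind the leaderboard's "Num Rads Below Curve" column may differ.

```python
import numpy as np
from sklearn.metrics import roc_curve

def radiologist_operating_point(y_true, y_rad):
    """Sensitivity (recall), specificity, and precision of one radiologist
    against the majority-vote ground truth (all labels binary 0/1)."""
    y_true, y_rad = np.asarray(y_true), np.asarray(y_rad)
    tp = np.sum((y_rad == 1) & (y_true == 1))
    fp = np.sum((y_rad == 1) & (y_true == 0))
    fn = np.sum((y_rad == 0) & (y_true == 1))
    tn = np.sum((y_rad == 0) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp), tp / (tp + fp)

def lies_below_roc(y_true, model_probs, sensitivity, specificity):
    """True if the radiologist's ROC operating point falls below the model curve."""
    fpr, tpr, _ = roc_curve(y_true, model_probs)
    # Model sensitivity at the radiologist's false positive rate.
    model_tpr = np.interp(1.0 - specificity, fpr, tpr)
    return sensitivity < model_tpr
```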

How well does the baseline model do on the test set?

The model achieves the best AUC on Pleural Effusion (0.97) and the worst on Atelectasis (0.85). The AUCs of all other observations are at least 0.9. On Cardiomegaly, Edema, and Pleural Effusion, the model achieves higher performance than all 3 radiologists but not their majority vote. On Consolidation, model performance exceeds 2 of the 3 radiologists, and on Atelectasis, all 3 radiologists perform better than the model.

Bonus: Extra Dataset From MIT

We're co-releasing our dataset with MIMIC-CXR, a large dataset of 371,920 chest x-rays associated with 227,943 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 and 2016. Each imaging study can contain one or more images, but most studies are associated with two images: a frontal view and a lateral view. Images are provided with 14 labels derived from a natural language processing tool applied to the corresponding free-text radiology reports.

Both our dataset and MIMIC-CXR share a common labeler, the CheXpert labeler, for deriving the same set of labels from free-text radiology reports.

Read the MIMIC-CXR paper by Alistair E. W. Johnson, Tom J. Pollard, Seth Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark, Steven Horng

Author Notes

One of the main obstacles in the development of chest radiograph interpretation models has been the lack of datasets with strong radiologist-annotated ground truth and expert scores against which researchers can compare their models. We hope that CheXpert will address that gap, making it easy to track the progress of models over time on a clinically important task.

Furthermore, we have developed and open-sourced the CheXpert labeler, an automated rule-based labeler that extracts observations from free-text radiology reports for use as structured labels for the images. We hope this makes it easier for other institutions to extract structured labels from their reports and to release other large repositories of data that will allow for cross-institutional testing of medical imaging models.

Finally, we hope that the dataset will help development and validation of chest radiograph interpretation models towards improving healthcare access and delivery worldwide.

In the U.S., about half of all radiology studies are x-rays, mostly of the chest. Chest x-ray studies are even more common around the world. Chest x-ray interpretation is a “bread and butter” problem for radiologists with vital public health implications. Chest x-rays can stop the spread of tuberculosis, detect lung cancer early, and support the responsible use of antibiotics.

Ground truth is critical in evaluating deep learning models in medical imaging and provides the foundation for clinical relevance when interpreting results in this field. This is why we focus much of our effort on establishing the best available ground truth, via a panel of medical subspecialist experts, to best understand the clinical implications of our model results.

Downloading the Dataset (v1.0)

Please read the Stanford University School of Medicine CheXpert Dataset Research Use Agreement. Once you register to download the CheXpert dataset, you will receive a link to the download over email. Note that you may not share the link to download the dataset with others.

Stanford University School of Medicine CheXpert Dataset Research Use Agreement

By registering for downloads from the CheXpert Dataset, you are agreeing to this Research Use Agreement, as well as to the Terms of Use of the Stanford University School of Medicine website as posted and updated periodically at http://www.stanford.edu/site/terms/.

1. Permission is granted to view and use the CheXpert Dataset without charge for personal, non-commercial research purposes only. Any commercial use, sale, or other monetization is prohibited.

2. Other than the rights granted herein, the Stanford University School of Medicine (“School of Medicine”) retains all rights, title, and interest in the CheXpert Dataset.

3. You may make a verbatim copy of the CheXpert Dataset for personal, non-commercial research use as permitted in this Research Use Agreement. If another user within your organization wishes to use the CheXpert Dataset, they must register as an individual user and comply with all the terms of this Research Use Agreement.

4. YOU MAY NOT DISTRIBUTE, PUBLISH, OR REPRODUCE A COPY of any portion or all of the CheXpert Dataset to others without specific prior written permission from the School of Medicine.

5. YOU MAY NOT SHARE THE DOWNLOAD LINK to the CheXpert dataset to others. If another user within your organization wishes to use the CheXpert Dataset, they must register as an individual user and comply with all the terms of this Research Use Agreement.

6. You must not modify, reverse engineer, decompile, or create derivative works from the CheXpert Dataset. You must not remove or alter any copyright or other proprietary notices in the CheXpert Dataset.

7. The CheXpert Dataset has not been reviewed or approved by the Food and Drug Administration, and is for non-clinical, Research Use Only. In no event shall data or images generated through the use of the CheXpert Dataset be used or relied upon in the diagnosis or provision of patient care.

8. THE CheXpert DATASET IS PROVIDED "AS IS," AND STANFORD UNIVERSITY AND ITS COLLABORATORS DO NOT MAKE ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, NOR DO THEY ASSUME ANY LIABILITY OR RESPONSIBILITY FOR THE USE OF THIS CheXpert DATASET.

9. You will not make any attempt to re-identify any of the individual data subjects. Re-identification of individuals is strictly prohibited. Any re-identification of any individual data subject shall be immediately reported to the School of Medicine.

10. Any violation of this Research Use Agreement or other impermissible use shall be grounds for immediate termination of use of this CheXpert Dataset. In the event that the School of Medicine determines that the recipient has violated this Research Use Agreement or other impermissible use has been made, the School of Medicine may direct that the undersigned data recipient immediately return all copies of the CheXpert Dataset and retain no copies thereof even if you did not cause the violation or impermissible use.

In consideration for your agreement to the terms and conditions contained here, Stanford grants you permission to view and use the CheXpert Dataset for personal, non-commercial research. You may not otherwise copy, reproduce, retransmit, distribute, publish, commercially exploit or otherwise transfer any material.

Limitation of Use

You may use the CheXpert Dataset for legal purposes only.

You agree to indemnify and hold Stanford harmless from any claims, losses or damages, including legal fees, arising out of or resulting from your use of the CheXpert Dataset or your violation or role in violation of these Terms. You agree to fully cooperate in Stanford’s defense against any such claims. These Terms shall be governed by and interpreted in accordance with the laws of California.

CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.

Jeremy Irvin *, Pranav Rajpurkar *, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, Andrew Y. Ng

If you have questions about our work, contact us at our Google group.

Read the Paper