What is CheXpert?

CheXpert is a large dataset of chest X-rays and a competition for automated chest X-ray interpretation, featuring uncertainty labels and radiologist-labeled reference standard evaluation sets.

Read the Paper (Irvin & Rajpurkar et al.)

Why CheXpert?

Chest radiography is the most common imaging examination globally, critical for the screening, diagnosis, and management of many life-threatening diseases. Automated chest radiograph interpretation at the level of practicing radiologists could provide substantial benefit in many medical settings, from improved workflow prioritization and clinical decision support to large-scale screening and global population health initiatives. For progress in both the development and validation of automated algorithms, we realized there was a need for a labeled dataset that (1) was large, (2) had strong reference standards, and (3) provided expert human performance metrics for comparison.

Leaderboard

Will your model perform as well as radiologists in detecting different pathologies in chest X-rays?

Rank | Date | Model | AUC | Num Rads Below Curve
1 | Sep 01, 2019 | Hierarchical-Learning-V1 (ensemble) Vingroup Big Data Institute https://arxiv.org/abs/1911.06475 | 0.930 | 2.6
2 | Oct 15, 2019 | Conditional-Training-LSR ensemble | 0.929 | 2.6
3 | Oct 10, 2019 | YWW (ensemble) JF&NNU https://github.com/jfhealthcare/Chexpert | 0.929 | 2.8
4 | Oct 17, 2019 | Conditional-Training-LSR-V1 ensemble | 0.929 | 2.6
5 | Nov 17, 2019 | Hierarchical-Learning-V0 (ensemble) Vingroup Big Data Institute | 0.929 | 2.6
6 | Sep 09, 2019 | Multi-Stage-Learning-CNN-V3 (ensemble) VINBDI Medical Imaging Team | 0.928 | 2.6
7 | Sep 19, 2019 | SenseXDR ensemble | 0.927 | 2.6
8 | Sep 18, 2019 | ihil (ensemble) UESTC | 0.927 | 2.6
9 | Jul 31, 2019 | bedineze https://github.com/deadpoppy/CheXpert-Challenge | 0.926 | 3.0
10 | Sep 01, 2019 | yww | 0.926 | 2.6
11 | Sep 16, 2019 | DRNet (ensemble) UESTC and SenseTime | 0.926 | 2.6
12 | Sep 04, 2019 | uestc | 0.924 | 2.6
13 | Sep 14, 2019 | Hierarchical-CNN-Ensemble-V1 (ensemble) Vingroup Big Data Institute | 0.924 | 2.4
14 | Sep 17, 2019 | null | 0.923 | 2.8
15 | Aug 22, 2019 | Multi-Stage-Learning-CNN-V2 (ensemble) VINBDI Medical Imaging Team | 0.923 | 2.6
16 | Aug 17, 2019 | yww211 https://github.com/inisis/chexpert | 0.923 | 2.4
17 | Nov 21, 2019 | hoangnguyenkcv17 | 0.921 | 2.4
18 | Sep 04, 2019 | null | 0.922 | 2.2
19 | Aug 13, 2019 | null | 0.921 | 2.2
20 | Aug 15, 2019 | zjr (ensemble) CSU | 0.921 | 2.6
21 | Aug 15, 2019 | hyc ensemble | 0.921 | 2.4
22 | Nov 22, 2019 | WA-v1 ensemble | 0.920 | 2.6
23 | Aug 18, 2019 | BDNB ensemble | 0.919 | 2.6
24 | Nov 30, 2019 | VB-v1 ensemble | 0.919 | 2.2
25 | Nov 21, 2019 | hoangnguyenkcv17 | 0.919 | 2.4
26 | Jul 27, 2019 | bedineze https://github.com/deadpoppy/CheXpert-Challenge | 0.919 | 2.4
27 | Nov 07, 2019 | brian-baseline-v2 ensemble | 0.919 | 2.2
28 | Jun 22, 2019 | Mehdi_You (ensemble) IPM_HPC | 0.918 | 2.6
29 | Aug 15, 2019 | Deep-CNNs-V1 ensemble | 0.917 | 2.2
30 | Nov 22, 2019 | thangbk (ensemble) SNU | 0.917 | 2.0
31 | Jul 18, 2019 | Ensemble_v2 Ian, Wingspan https://github.com/Ien001/CheXpert_challenge_2019 | 0.917 | 2.4
32 | Nov 25, 2019 | WA-v2 ensemble | 0.916 | 2.4
33 | Nov 25, 2019 | ATT-AW-v1 ensemble | 0.916 | 2.4
34 | Nov 13, 2019 | brian-baseline-v3 ensemble | 0.916 | 2.2
35 | Oct 10, 2019 | desmond https://github.com/inisis/chexpert | 0.916 | 2.6
36 | Aug 25, 2019 | Multi-Stage-Learning-CNN-V0 ensemble | 0.916 | 2.2
37 | Aug 19, 2019 | TGNB ensemble | 0.915 | 2.6
38 | Jul 15, 2019 | Deadpoppy Ensemble ensemble model | 0.915 | 2.2
39 | Nov 18, 2019 | hoangnguyenkcv-ensemble28 ensemble | 0.915 | 2.2
40 | Aug 05, 2019 | zhangjingyang | 0.915 | 2.4
41 | Aug 16, 2019 | GRNB ensemble | 0.914 | 2.4
42 | Jul 31, 2019 | Deep-CNNs (ensemble) Vingroup Big Data Institute | 0.914 | 2.0
43 | Dec 02, 2019 | Sky-Model ensemble | 0.913 | 2.2
44 | Jul 23, 2019 | JF Deadpoppy ensemble | 0.913 | 2.2
45 | Aug 14, 2019 | yww https://github.com/inisis/chexpert | 0.913 | 2.0
46 | Aug 17, 2019 | zjy ensemble | 0.912 | 2.2
47 | Aug 01, 2019 | songtao | 0.911 | 2.2
48 | Oct 25, 2019 | bhtrung | 0.911 | 2.2
48 | Oct 27, 2019 | hoangnguyenkcv | 0.911 | 2.2
49 | Oct 24, 2019 | brian-baseline ensemble | 0.911 | 2.0
50 | Jul 31, 2019 | guran_rib | 0.908 | 2.0
51 | Apr 06, 2019 | muti_base (ensemble) SCU_MILAB | 0.907 | 1.6
52 | Jan 23, 2019 | Stanford Baseline (ensemble) Stanford University https://arxiv.org/abs/1901.07031 | 0.907 | 1.8
53 | Sep 11, 2019 | {ForwardModelEnsembleCorrected} (ensemble) Stanford | 0.906 | 1.6
54 | Jun 14, 2019 | Multi-CNN ensemble | 0.905 | 2.4
55 | Jul 25, 2019 | huangyechong | 0.905 | 1.8
56 | Oct 02, 2019 | ForwardMECorrectedFull (ensemble) Institution | 0.905 | 1.6
57 | Jul 21, 2019 | Multi-CNN ensemble | 0.905 | 2.0
58 | Jun 22, 2019 | JustAnotherDensenet single model | 0.904 | 1.2
59 | Sep 29, 2019 | Nakajima_ayase | 0.901 | 1.4
60 | Jul 04, 2019 | Ensemble_v1 Ian, Wingspan https://github.com/Ien001/CheXpert_challenge_2019 | 0.901 | 1.6
61 | May 24, 2019 | MLC11 NotDense (single-model) Leibniz University Hannover | 0.900 | 1.6
62 | Jul 22, 2019 | adoudou | 0.899 | 1.6
63 | Jul 22, 2019 | {AVG_MAX} (ensemble) NNU | 0.899 | 2.0
64 | Aug 01, 2019 | null | 0.899 | 1.6
65 | Jul 24, 2019 | llllldz single model | 0.899 | 1.6
66 | Nov 09, 2019 | taolei | 0.898 | 1.8
67 | Jun 03, 2019 | ForwardModelEnsemble (ensemble) Stanford | 0.897 | 1.6
68 | Jun 25, 2019 | ldz single model | 0.896 | 1.4
69 | Jul 13, 2019 | Deadpoppy Single single model | 0.895 | 1.8
70 | Aug 02, 2019 | adoudou | 0.895 | 1.6
71 | Jul 21, 2019 | hust (single model) HUST | 0.895 | 1.0
72 | Jul 28, 2019 | huangyechong | 0.894 | 1.6
72 | Jul 29, 2019 | zhujieru | 0.894 | 1.6
73 | Jul 04, 2019 | U-Random-Ind (single) BHSB | 0.894 | 1.0
74 | Jun 03, 2019 | HybridModelEnsemble (ensemble) Stanford | 0.892 | 1.6
75 | Jul 31, 2019 | pause single model | 0.890 | 1.0
76 | Jul 22, 2019 | Haruka_Hamasaki | 0.890 | 0.80
76 | Aug 27, 2019 | Haruka_Hamasaki | 0.890 | 0.80
77 | Aug 05, 2019 | DenseNet169 at 320x320 (single model) Lafayette | 0.889 | 1.4
78 | Jul 05, 2019 | DataAugFTW (single model) University Hannover | 0.888 | 1.0
79 | Oct 22, 2019 | baseline3 single model | 0.886 | 1.2
80 | Jun 04, 2019 | Coolver XH single model | 0.884 | 0.80
81 | Mar 24, 2019 | Naive Densenet single model https://github.com/simongrest/chexpert-entries | 0.883 | 1.2
82 | May 04, 2019 | {densenet} (single model) Microsoft | 0.880 | 1.2
83 | May 14, 2019 | MLC11 Baseline (single-model) Leibniz University Hannover | 0.878 | 0.60
84 | May 27, 2019 | null | 0.878 | 0.80
85 | Jul 07, 2019 | DenseNet single | 0.876 | 1.2
86 | Nov 22, 2019 | HCL1 (single model) LTTS | 0.876 | 1.0
87 | Oct 28, 2019 | hoangnguyenkcv | 0.875 | 1.0
88 | Jul 11, 2019 | MLGCN (single model) sensetime | 0.875 | 1.2
89 | Jun 03, 2019 | Multi-CNN (ensemble) VinGroup Big Data Institute | 0.873 | 0.40
90 | Oct 14, 2019 | baseline1 (single model) Endimension | 0.868 | 0.80
91 | Apr 26, 2019 | aliraza1944 | 0.863 | 0.80
92 | May 27, 2019 | null | 0.862 | 0.40
93 | Apr 25, 2019 | {GoDense} (single model) UPenn | 0.861 | 1.0
94 | May 16, 2019 | yww | 0.861 | 0.40
95 | Oct 22, 2019 | baseline2 single model | 0.858 | 1.0
96 | Apr 28, 2019 | rayOfLightSingle (Single Model) GeorgiaTech CSE6250 Team58 | 0.851 | 0.40
97 | Apr 24, 2019 | Model_Team_34 (single model) Gatech | 0.850 | 0.60
98 | Apr 27, 2019 | habbes | 0.850 | 0.40
99 | Apr 27, 2019 | Baseline Ensemble | 0.848 | 0.20
100 | Nov 01, 2019 | HinaNet (single model) VietAI http://vietai.org | 0.844 | 0.40
101 | Apr 24, 2019 | singlehead_models (single model combined) Gatech CSE6250 Team30 | 0.842 | 0.20
102 | Apr 24, 2019 | multihead_model (one model for all pathologies) Gatech CSE6250 Team30 | 0.838 | 0.40
103 | Jun 13, 2019 | MLC9_Densenet (single model) Leibniz University Hannover | 0.834 | 0.40
104 | May 16, 2019 | DNET121-single Ian, Wingspan http://www.wingspan.cn/ | 0.822 | 0.0
105 | May 11, 2019 | DenseNet121 (single model) hemil10 | 0.760 | 0.0
106 | Apr 26, 2019 | rayOfLight (ensemble) GeorgiaTech CSE6250 Team58 | 0.727 | 0.0
107 | Nov 25, 2019 | BASELINE DenseNet121 single model | 0.724 | 0.0

How can I participate?

CheXpert uses a hidden test set for official evaluation of models. Teams submit their executable code on Codalab, which is then run on a test set that is not publicly readable. Such a setup preserves the integrity of the test results.

Here's a tutorial walking you through official evaluation of your model. Once your model has been evaluated officially, your scores will be added to the leaderboard.

How did we collect and label CheXpert?

CheXpert is a large public dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients. We retrospectively collected the chest radiographic examinations from Stanford Hospital, performed between October 2002 and July 2017 in both inpatient and outpatient centers, along with their associated radiology reports.

Label Extraction from Radiology Reports

Each report was labeled for the presence of 14 observations as positive, negative, or uncertain. We decided on the 14 observations based on their prevalence in the reports and their clinical relevance, conforming to the Fleischner Society’s recommended glossary whenever applicable. We then developed an automated rule-based labeler to extract observations from the free-text radiology reports to be used as structured labels for the images.

Our labeler operates in three distinct stages: mention extraction, mention classification, and mention aggregation. In the mention extraction stage, the labeler extracts mentions of the observations from the Impression section of the radiology report, which summarizes the key findings of the radiographic study. In the mention classification stage, each mention of an observation is classified as negative, uncertain, or positive. In the mention aggregation stage, we use the classifications of each observation's mentions to arrive at a final label for each of the 14 observations (blank for unmentioned, 0 for negative, -1 for uncertain, and 1 for positive).
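
As an illustration of the aggregation stage, the sketch below (hypothetical Python, not the released labeler code; the resolution rule for repeated mentions is an assumption made for illustration) maps per-mention classifications to the label encoding described above:

```python
# Hypothetical sketch of the mention-aggregation stage; not the released CheXpert labeler.
# Final labels use the encoding above: blank (None) = unmentioned, 0 = negative,
# -1 = uncertain, 1 = positive.

OBSERVATIONS = [
    "No Finding", "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity",
    "Lung Lesion", "Edema", "Consolidation", "Pneumonia", "Atelectasis",
    "Pneumothorax", "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices",
]

def aggregate_mentions(classified_mentions):
    """classified_mentions: list of (observation, classification) pairs, where
    classification is 'positive', 'negative', or 'uncertain'."""
    labels = {obs: None for obs in OBSERVATIONS}              # blank = unmentioned
    priority = {"negative": 0, "uncertain": 1, "positive": 2}
    encoding = {"negative": 0, "uncertain": -1, "positive": 1}
    strongest = {}
    for obs, cls in classified_mentions:
        # Assumed resolution rule for repeated mentions of the same observation:
        # positive beats uncertain beats negative (illustration only).
        if obs not in strongest or priority[cls] > priority[strongest[obs]]:
            strongest[obs] = cls
    for obs, cls in strongest.items():
        labels[obs] = encoding[cls]
    return labels

# Example: an Impression section mentioning uncertain edema and a definite effusion.
print(aggregate_mentions([("Edema", "uncertain"), ("Pleural Effusion", "positive")]))
```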

Use the labeling tool

What is our baseline model?

We train models that take as input a single-view chest radiograph and output the probability of each of the 14 observations. When more than one view is available, the models output the maximum probability of the observations across the views.
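
A minimal sketch of this multi-view rule, assuming per-view probability vectors are already available from the classifier (illustrative only, not the released training code):

```python
import numpy as np

def study_probabilities(per_view_probs):
    """per_view_probs: array of shape (num_views, 14) holding per-view probabilities
    for the 14 observations. Returns the element-wise maximum across views, as
    described above for studies with more than one view."""
    return np.asarray(per_view_probs).max(axis=0)

# Example: a study with a frontal and a lateral view (2 x 14 probabilities).
frontal, lateral = np.random.rand(14), np.random.rand(14)
print(study_probabilities([frontal, lateral]))
```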

Leveraging Uncertainty Labels

The training label for each observation is either 0 (negative), 1 (positive), or u (uncertain). We explore several approaches to using the uncertainty labels during model training; a label-mapping sketch follows the list below.

  • U-Ignore: We ignore the uncertain labels during training.
  • U-Zeroes: We map all instances of the uncertain label to 0.
  • U-Ones: We map all instances of the uncertain label to 1.
  • U-SelfTrained: We first train a model to convergence using the U-Ignore approach, and then use that model's predicted probabilities to re-label each of the uncertain labels.
  • U-MultiClass: We treat the uncertainty label as its own class.
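
The label-mapping sketch below illustrates how these policies could be applied to a vector of training labels (hypothetical Python, not the official training pipeline; U-Ignore is realized here by masking the loss rather than rewriting labels):

```python
import numpy as np

def map_uncertain(labels, policy):
    """labels: array with values 1 (positive), 0 (negative), -1 (uncertain).
    Returns (mapped_labels, loss_mask) under the given uncertainty policy."""
    labels = np.asarray(labels, dtype=float)
    uncertain = labels == -1
    mask = np.ones(labels.shape, dtype=bool)

    if policy == "U-Ignore":        # exclude uncertain labels from the loss
        mask = ~uncertain
    elif policy == "U-Zeroes":      # map uncertain labels to 0 (negative)
        labels = np.where(uncertain, 0.0, labels)
    elif policy == "U-Ones":        # map uncertain labels to 1 (positive)
        labels = np.where(uncertain, 1.0, labels)
    elif policy == "U-MultiClass":  # keep -1 as its own class for a 3-way output
        pass
    else:
        raise ValueError(f"unknown policy: {policy}")
    return labels, mask

print(map_uncertain([1, 0, -1, -1, 1], "U-Ones"))
```

U-SelfTrained is not shown above: it would first require training a U-Ignore model and then substituting that model's predicted probabilities for the uncertain entries.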

We focus evaluation on 5 observations, which we call the competition tasks, selected based on clinical importance and prevalence: (a) Atelectasis, (b) Cardiomegaly, (c) Consolidation, (d) Edema, and (e) Pleural Effusion. We compare the performance of the different uncertainty approaches on a validation set of 200 studies, on which the consensus of three radiologist annotations serves as ground truth. Our baseline model is selected based on the best-performing approach for each competition task on the validation set: U-Ones for Atelectasis and Edema, U-MultiClass for Cardiomegaly and Pleural Effusion, and U-SelfTrained for Consolidation.

How is the test designed?

The test set consists of 500 studies from 500 unseen patients. Eight board-certified radiologists individually annotated each of the studies in the test set, classifying each observation into one of present, uncertain likely, uncertain unlikely, and absent. Their annotations were binarized such that all present and uncertain likely cases are treated as positive and all absent and uncertain unlikely cases are treated as negative. The majority vote of 5 radiologist annotations serves as a strong ground truth; the remaining 3 radiologist annotations were used to benchmark radiologist performance.
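
As a small sketch of this procedure (illustrative only; the category strings are assumptions), the ground truth for one observation in one study could be derived as follows:

```python
from collections import Counter

# Binarization rule from the text: present / uncertain likely -> positive (1),
# absent / uncertain unlikely -> negative (0).
BINARIZE = {"present": 1, "uncertain likely": 1, "uncertain unlikely": 0, "absent": 0}

def ground_truth(annotations):
    """annotations: the five ground-truth radiologists' categories for one
    observation in one study. Returns the majority vote after binarization."""
    votes = [BINARIZE[a] for a in annotations]
    return Counter(votes).most_common(1)[0][0]

print(ground_truth(["present", "uncertain likely", "absent", "present", "uncertain unlikely"]))  # -> 1
```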

For each of the 3 individual radiologists and for their majority vote, we compute sensitivity (recall), specificity, and precision against the test set ground truth. To compare the model to radiologists, we plot the radiologist operating points with the model on both the ROC and Precision-Recall (PR) space. We examine whether the radiologist operating points lie below the curves to determine if the model is superior to the radiologists.
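
A sketch of this check using scikit-learn (an assumption; the function and variable names are hypothetical): the radiologist's operating point lies below the ROC curve if the curve reaches a higher sensitivity at the radiologist's false positive rate.

```python
import numpy as np
from sklearn.metrics import roc_curve

def radiologist_below_roc(y_true, model_scores, rad_preds):
    """y_true: binary ground truth; model_scores: model probabilities;
    rad_preds: one radiologist's binary predictions for the same studies.
    Returns True if the radiologist's (FPR, TPR) point lies below the model's ROC curve."""
    y_true, rad_preds = np.asarray(y_true), np.asarray(rad_preds)
    fpr, tpr, _ = roc_curve(y_true, model_scores)
    rad_tpr = (rad_preds[y_true == 1] == 1).mean()        # sensitivity (recall)
    rad_fpr = (rad_preds[y_true == 0] == 1).mean()        # 1 - specificity
    model_tpr_at_rad_fpr = np.interp(rad_fpr, fpr, tpr)   # curve TPR at the same FPR
    return rad_tpr < model_tpr_at_rad_fpr

# Synthetic example with 200 studies.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
scores = np.clip(0.6 * y + rng.normal(0.2, 0.3, size=200), 0, 1)
rads = (scores + rng.normal(0, 0.3, size=200)) > 0.5
print(radiologist_below_roc(y, scores, rads))
```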

How well does the baseline model do on the test set?

The model achieves the best AUC on Pleural Effusion (0.97) and the worst on Atelectasis (0.85). The AUCs of all other observations are at least 0.9. On Cardiomegaly, Edema, and Pleural Effusion, the model achieves higher performance than all 3 radiologists but not their majority vote. On Consolidation, model performance exceeds 2 of the 3 radiologists, and on Atelectasis, all 3 radiologists perform better than the model.

Bonus: Extra Dataset From MIT

We're co-releasing our dataset with MIMIC-CXR, a large dataset of 371,920 chest X-rays associated with 227,943 imaging studies sourced from the Beth Israel Deaconess Medical Center between 2011 and 2016. Each imaging study can contain one or more images, but is most often associated with two images: a frontal view and a lateral view. Images are provided with 14 labels derived from a natural language processing tool applied to the corresponding free-text radiology reports.

Both our dataset and MIMIC-CXR share a common labeler, the CheXpert labeler, for deriving the same set of labels from free-text radiology reports.

Read the MIMIC-CXR paper by Alistair E. W. Johnson, Tom J. Pollard, Seth Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark, and Steven Horng.

Author Notes

One of the main obstacles in the development of chest radiograph interpretation models has been the lack of datasets with strong radiologist-annotated ground truth and expert scores against which researchers can compare their models. We hope that CheXpert will address that gap, making it easy to track the progress of models over time on a clinically important task.

Furthermore, we have developed and open-sourced the CheXpert labeler, an automated rule-based labeler that extracts observations from free-text radiology reports for use as structured labels for the images. We hope that this will help other institutions extract structured labels from their own reports and release other large repositories of data, enabling cross-institutional testing of medical imaging models.

Finally, we hope that the dataset will aid the development and validation of chest radiograph interpretation models, ultimately improving healthcare access and delivery worldwide.

In the U.S., about half of all radiology studies are x-rays, mostly of the chest. Chest x-ray studies are even more common around the world. Chest x-ray interpretation is a “bread and butter” problem for radiologists with vital public health implications. Chest x-rays can stop the spread of tuberculosis, detect lung cancer early, and support the responsible use of antibiotics.

Ground truth is critical when evaluating deep learning models in medical imaging: it provides the foundation for clinical relevance when interpreting results in this field. That is why we focused much of our effort on obtaining the best available ground truth, via a panel of medical subspecialist experts, to best understand the clinical implications of our model results.

Downloading the Dataset (v1.0)

Please read the Stanford University School of Medicine CheXpert Dataset Research Use Agreement. Once you register to download the CheXpert dataset, you will receive a link to the download over email. Note that you may not share the link to download the dataset with others.

Stanford University School of Medicine CheXpert Dataset Research Use Agreement

By registering for downloads from the CheXpert Dataset, you are agreeing to this Research Use Agreement, as well as to the Terms of Use of the Stanford University School of Medicine website as posted and updated periodically at http://www.stanford.edu/site/terms/.

1. Permission is granted to view and use the CheXpert Dataset without charge for personal, non-commercial research purposes only. Any commercial use, sale, or other monetization is prohibited.

2. Other than the rights granted herein, the Stanford University School of Medicine (“School of Medicine”) retains all rights, title, and interest in the CheXpert Dataset.

3. You may make a verbatim copy of the CheXpert Dataset for personal, non-commercial research use as permitted in this Research Use Agreement. If another user within your organization wishes to use the CheXpert Dataset, they must register as an individual user and comply with all the terms of this Research Use Agreement.

4. YOU MAY NOT DISTRIBUTE, PUBLISH, OR REPRODUCE A COPY of any portion or all of the CheXpert Dataset to others without specific prior written permission from the School of Medicine.

5. YOU MAY NOT SHARE THE DOWNLOAD LINK to the CheXpert dataset to others. If another user within your organization wishes to use the CheXpert Dataset, they must register as an individual user and comply with all the terms of this Research Use Agreement.

6. You must not modify, reverse engineer, decompile, or create derivative works from the CheXpert Dataset. You must not remove or alter any copyright or other proprietary notices in the CheXpert Dataset.

7. The CheXpert Dataset has not been reviewed or approved by the Food and Drug Administration, and is for non-clinical, Research Use Only. In no event shall data or images generated through the use of the CheXpert Dataset be used or relied upon in the diagnosis or provision of patient care.

8. THE CheXpert DATASET IS PROVIDED "AS IS," AND STANFORD UNIVERSITY AND ITS COLLABORATORS DO NOT MAKE ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, NOR DO THEY ASSUME ANY LIABILITY OR RESPONSIBILITY FOR THE USE OF THIS CheXpert DATASET.

9. You will not make any attempt to re-identify any of the individual data subjects. Re-identification of individuals is strictly prohibited. Any re-identification of any individual data subject shall be immediately reported to the School of Medicine.

10. Any violation of this Research Use Agreement or other impermissible use shall be grounds for immediate termination of use of this CheXpert Dataset. In the event that the School of Medicine determines that the recipient has violated this Research Use Agreement or other impermissible use has been made, the School of Medicine may direct that the undersigned data recipient immediately return all copies of the CheXpert Dataset and retain no copies thereof even if you did not cause the violation or impermissible use.

In consideration for your agreement to the terms and conditions contained here, Stanford grants you permission to view and use the CheXpert Dataset for personal, non-commercial research. You may not otherwise copy, reproduce, retransmit, distribute, publish, commercially exploit or otherwise transfer any material.

Limitation of Use

You may use the CheXpert Dataset for legal purposes only.

You agree to indemnify and hold Stanford harmless from any claims, losses or damages, including legal fees, arising out of or resulting from your use of the CheXpert Dataset or your violation or role in violation of these Terms. You agree to fully cooperate in Stanford’s defense against any such claims. These Terms shall be governed by and interpreted in accordance with the laws of California.

CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.

Jeremy Irvin *, Pranav Rajpurkar *, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, Andrew Y. Ng

If you have questions about our work, contact us at our Google Group.

Read the Paper