We developed CheXNeXt, a deep learning algorithm to concurrently detect 14 clinically important diseases in chest radiographs.

Chest radiograph interpretation is critical for the detection of acute thoracic diseases, including tuberculosis and lung cancer, which affect millions of people worldwide each year. This time-consuming task typically requires expert radiologists to read the images, leading to fatigue-based diagnostic error, and leaving regions of the world without radiologists with little access to diagnostic expertise.

CheXNeXt is trained to predict diseases on x-ray images and highlight parts of an image most indicative of each predicted disease.

CheXNeXt is trained on the ChestX-ray14 dataset, one of the largest public repositories of radiographs, containing 112,120 frontal-view chest radiographs of 30,805 unique patients. Each image in ChestX-ray14 was labeled using an automatic extraction method on radiology reports.

CheXNeXt's training process consists of 2 consecutive stages to account for the partially incorrect labels in the ChestX-ray14 dataset. First, an ensemble of networks is trained on the training set to predict the probability that each of the 14 pathologies is present in the image. The predictions of this ensemble are used to relabel the training and tuning sets. A new ensemble of networks is then trained on this relabeled training set.
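The relabeling step can be sketched as follows. This is a minimal illustration, not the CheXNeXt codebase: each "network" is stood in for by a prediction function, and the threshold value is an assumption.

```python
import numpy as np

NUM_CLASSES = 14  # the 14 pathologies in ChestX-ray14

def ensemble_predict(models, images):
    """Average per-pathology probabilities across the ensemble."""
    preds = np.stack([m(images) for m in models])  # (n_models, n_images, NUM_CLASSES)
    return preds.mean(axis=0)

def relabel(models, images, threshold=0.5):
    """Replace the noisy automatic labels with binarized ensemble predictions."""
    probs = ensemble_predict(models, images)
    return (probs > threshold).astype(np.float32)

# Toy stand-ins for trained networks (the real system uses deep CNNs).
rng = np.random.default_rng(0)
fake_models = [(lambda x: rng.random((len(x), NUM_CLASSES))) for _ in range(3)]
images = np.zeros((5, 224, 224))
new_labels = relabel(fake_models, images)  # cleaner labels for stage-two training
```

The second-stage ensemble is then trained against `new_labels` instead of the original extracted labels.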

Without any additional supervision, CheXNeXt produces heat maps that identify locations in the chest radiograph that contribute most to the network’s classification using class activation mappings (CAMs).
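A CAM is computed by weighting each channel of the final convolutional feature maps by the classifier weight for the predicted class and summing spatially. A minimal sketch with illustrative shapes (8 channels of 7×7 maps, a 14-class head):

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Weight each final-conv channel (C, H, W) by the fully connected
    weight for the chosen class, sum over channels, and rescale to [0, 1]."""
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=1)  # (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam

# Toy example: random feature maps and classifier weights.
rng = np.random.default_rng(1)
fmaps = rng.random((8, 7, 7))
weights = rng.random((14, 8))
cam = class_activation_map(fmaps, weights, class_idx=3)
```

The resulting low-resolution map is upsampled and overlaid on the radiograph as a heat map; no localization labels are needed during training.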

We compared CheXNeXt’s performance to the performance of 9 radiologists on a validation set.

A validation set of 420 frontal-view chest radiographs was selected from ChestX-ray14 for radiologist annotation. The set was curated to contain at least 50 cases of each pathology according to the original labels provided in the dataset.

The majority vote of the annotations of 3 cardiothoracic specialist radiologists serves as the consensus reference standard for each image. To compare to the algorithm, 6 board-certified radiologists from 3 academic institutions (average experience 12 years) and 3 senior radiology residents also annotated the validation set of 420 radiographs for all 14 labels. The ROC curve of the algorithm is generated by varying the discrimination threshold (used to convert the output probabilities to binary predictions). The radiologist ROC curve is estimated by fitting an increasing concave curve to the radiologist operating points.
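The threshold sweep that produces the algorithm's ROC curve can be illustrated directly; this is a generic sketch of the standard procedure, with a hand-checkable toy example, not code from the study:

```python
import numpy as np

def roc_points(y_true, y_score):
    """Sweep the discrimination threshold over the unique scores,
    recording (false positive rate, true positive rate) at each setting."""
    pts = [(0.0, 0.0)]
    for t in np.unique(y_score)[::-1]:
        pred = y_score >= t
        tpr = np.sum(pred & (y_true == 1)) / np.sum(y_true == 1)
        fpr = np.sum(pred & (y_true == 0)) / np.sum(y_true == 0)
        pts.append((fpr, tpr))
    return pts

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
pts = roc_points(y_true, y_score)
# AUC is the area under this curve (trapezoidal rule).
auc = np.trapz([p[1] for p in pts], [p[0] for p in pts])
```

Each radiologist, by contrast, contributes a single operating point (one sensitivity/specificity pair per pathology), which is why a curve must be fitted to the radiologist points.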

On the AUC metric, the algorithm performed equivalently to the practicing radiologists on 10 pathologies, better on 1 pathology, and worse on 3 pathologies.

We found that CheXNeXt can diagnose certain pathologies in chest radiographs at a level comparable to practicing radiologists.

CheXNeXt detected nodules with a sensitivity of 0.690 (95% CI 0.581–0.797), higher than the micro-average sensitivity of board-certified radiologists at 0.573 (95% CI 0.525–0.619), while maintaining a high specificity of 0.900 (95% CI 0.867–0.931) compared with the radiologist specificity of 0.937 (95% CI 0.927–0.947). The algorithm detected effusion with a specificity of 0.921 (95% CI 0.889–0.951), higher than the micro-average board-certified radiologist specificity of 0.883 (95% CI 0.868–0.898), while achieving a sensitivity of 0.674 (95% CI 0.592–0.754), comparable to the micro-average board-certified radiologist sensitivity of 0.761 (95% CI 0.731–0.790).
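Sensitivity and specificity follow directly from the confusion matrix on the 420-image set. A minimal sketch; the counts below are hypothetical, chosen only so the arithmetic reproduces the reported nodule point estimates, and are not the study's actual counts:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical counts summing to 420 images (illustration only).
sens, spec = sensitivity_specificity(tp=69, fn=31, tn=288, fp=32)
```

Micro-averaged radiologist metrics pool these counts across all nine readers before computing the ratios, rather than averaging each reader's individual sensitivity and specificity.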

The average time for radiologists to complete labeling of 420 chest radiographs was 240 minutes (range 180–300 minutes). The deep learning algorithm labeled the same 420 chest radiographs in 1.5 minutes.

We are currently looking for research partnerships with healthcare providers that are interested in working with us to validate the technology.

We hope that this technology may have the potential to improve healthcare delivery and increase access to chest radiograph expertise globally. Towards this goal, the future of this research will depend on obtaining access to more sources of data for training and improving the model, as well as testing it on new populations and diagnoses. This additional data will help improve the accuracy and robustness of the model, making it safer and more effective.

Partner with us

To learn more, read our publication in PLOS Medicine.

Read Paper

If you have questions about our work, contact us at:

pranavsr@cs.stanford.edu and jirvin16@cs.stanford.edu