We analyzed the top CheXpert model submissions on their ability to generalize to challenges seen in real-world deployment.

Scaling automated chest radiography interpretation introduces novel challenges for deep learning models, such as out-of-distribution data, diseases not seen during training, and the need for deployment strategies practical enough to work in any healthcare system.


CheXpert models generalize to tuberculosis detection, a task not seen during training.

In global healthcare settings, models are likely to encounter diseases that were not seen during training. We evaluated how well CheXpert models generalize to tuberculosis detection, using the models' consolidation predictions as a proxy for tuberculosis.
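In practice, this evaluation reduces to scoring each tuberculosis test image with the model's predicted consolidation probability and computing an ROC AUC against the tuberculosis labels. Below is a minimal sketch; the file and column names are hypothetical:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical file: one row per Shenzhen/Montgomery image, with the model's
# predicted consolidation probability and the ground-truth TB label (0/1).
preds = pd.read_csv("tb_test_predictions.csv")

# The consolidation probability is used directly as the TB score:
# no retraining or fine-tuning on TB data is involved.
auc = roc_auc_score(preds["tb_label"], preds["consolidation_prob"])
print(f"TB detection AUC (consolidation proxy): {auc:.3f}")
```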

The models achieved average AUCs of 0.815 and 0.893 on two public TB test datasets (NIH Shenzhen and Montgomery, respectively), competitive with results reported in the literature for models trained directly on those datasets.

We also found that a model's average performance across tasks was a stronger predictor of its performance on the tuberculosis datasets than its performance on any individual task. This suggests that training models to perform well across tasks may allow them to perform better on tasks not seen during training than models that optimize for a single task.
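One simple way to test such a claim is to correlate, across the submitted models, each candidate predictor (a single-task AUC versus the cross-task average AUC) with the tuberculosis AUC. A sketch, assuming a hypothetical table of per-model AUCs:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical table: one row per submitted model, with per-task AUCs on the
# CheXpert test set and the model's AUC on the TB test sets.
df = pd.read_csv("model_aucs.csv")
tasks = ["atelectasis", "cardiomegaly", "consolidation", "edema", "pleural_effusion"]

# Correlation of each individual task AUC with TB AUC...
for task in tasks:
    rho, _ = spearmanr(df[task], df["tb_auc"])
    print(f"{task}: rho = {rho:.2f}")

# ...versus the correlation of the cross-task average AUC with TB AUC.
rho_avg, _ = spearmanr(df[tasks].mean(axis=1), df["tb_auc"])
print(f"average across tasks: rho = {rho_avg:.2f}")
```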

CheXpert models maintain high performance on photos of chest radiographs.

Scaled deployment demands a solution that can navigate a wide variety of medical imaging and IT infrastructures. Using smartphones as the tool for automated chest radiograph analysis maximizes accessibility. While prior literature indicated poor generalizability of deep learning models on photos of radiographs, CheXpert-trained models achieved a mean AUC of 0.916 on photos of the CheXpert test set, compared with an AUC of 0.924 on the original CheXpert test set. All of the models had mean AUCs higher than 0.9 and were within 0.01 AUC of their performance on the original images.
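This comparison amounts to evaluating the same models on two versions of the same test set and averaging AUC over the five competition tasks. A minimal sketch, assuming hypothetical per-image prediction files for the original radiographs and for smartphone photos of them:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def mean_auc(path):
    # Expected columns (hypothetical): image_id, task, label, prob.
    df = pd.read_csv(path)
    aucs = [roc_auc_score(g["label"], g["prob"]) for _, g in df.groupby("task")]
    return sum(aucs) / len(aucs)

auc_original = mean_auc("preds_original.csv")
auc_photo = mean_auc("preds_photo.csv")
# The reported gap was small: 0.924 on originals vs 0.916 on photos.
print(f"original: {auc_original:.3f}, photos: {auc_photo:.3f}")
```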

Using photos of chest x-rays as input to chest x-ray algorithms could enable any physician with a smartphone to get instant assistance from an AI algorithm.

CheXpert models generalize to data from an external institution.

Chest x-ray algorithms developed on data from one institution have not shown sustained performance when externally validated on data from a different, unrelated institution, and such generalization is critical for the safe deployment of these algorithms across healthcare systems. We therefore evaluated the performance of the top CheXpert models on a test set from an external institution (NIH).

The models achieved an average AUC of 0.897 across the 5 CheXpert competition tasks on the test set from the external institution. On Atelectasis, Cardiomegaly, Edema, and Pleural Effusion, the models' mean sensitivities of 0.750, 0.617, 0.712, and 0.806 respectively exceed the mean radiologist sensitivities of 0.646, 0.485, 0.710, and 0.761 (at the mean radiologist specificities of 0.806, 0.924, 0.925, and 0.883 respectively). On Consolidation, the models' mean sensitivity of 0.443 is slightly lower than the mean radiologist sensitivity of 0.456 (at the mean radiologist specificity of 0.935).
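The comparison above fixes each model's operating point at the radiologists' specificity and reads off the model's sensitivity there. A minimal sketch of that calculation, assuming arrays of per-image binary labels and model probabilities for a single task:

```python
from sklearn.metrics import roc_curve

def sensitivity_at_specificity(labels, probs, target_specificity):
    """Pick the threshold where the model matches the radiologists'
    specificity, then report the model's sensitivity there."""
    fpr, tpr, _ = roc_curve(labels, probs)
    specificity = 1 - fpr
    # roc_curve orders points by increasing FPR (decreasing specificity);
    # take the most sensitive point that still meets the target specificity.
    valid = specificity >= target_specificity
    return tpr[valid].max() if valid.any() else 0.0

# Hypothetical usage for Cardiomegaly at the radiologists' specificity of 0.924:
# sens = sensitivity_at_specificity(labels, probs, 0.924)
```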
