We introduce MURA, a large dataset of musculoskeletal radiographs, where each study is manually labeled by radiologists as either normal or abnormal.

Large, high-quality datasets have played a critical role in driving progress in fields that apply deep learning methods. To this end, we introduce one of the largest public radiographic image datasets.

The MURA abnormality detection task is a binary classification task, where the input is an upper extremity radiograph study, with each study containing one or more views (images), and the expected output is a binary label indicating whether the study is normal or abnormal.

Instructions for downloading the data and submitting your model for evaluation on the hidden test set are coming soon.

We assemble a dataset of musculoskeletal radiographs consisting of 14,982 studies from 12,251 patients, with a total of 40,895 multi-view radiographic images.

Each study belongs to one of seven standard upper extremity radiographic study types: elbow, finger, forearm, hand, humerus, shoulder, and wrist.

Test Set

To evaluate models and get a robust estimate of radiologist performance, we collected additional labels from six board-certified Stanford radiologists on the test set, consisting of 209 musculoskeletal studies. Each radiologist individually and retrospectively reviewed the DICOM files for each study in the test set and labeled the study as normal or abnormal in the clinical reading room environment using the PACS system. The radiologists had an average of 8.83 years of experience, ranging from 2 to 25 years.

On MURA, we develop a model that uses a 169-layer convolutional neural network to detect and localize abnormalities.

The model takes as input one or more views for a study of an upper extremity. On each view, our 169-layer convolutional neural network predicts the probability of abnormality. We compute the overall probability of abnormality for the study by taking the arithmetic mean of the abnormality probabilities output by the network for each image. The model makes the binary prediction of abnormal if the probability of abnormality for the study is greater than 0.5.
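Below is a minimal sketch of this study-level aggregation, assuming `model` is a per-view network that outputs an abnormality probability for a single image tensor; the function name and preprocessing conventions are illustrative, not the released code.

```python
# Hedged sketch: aggregate per-view probabilities into a study-level prediction.
import torch

def predict_study(model, views, threshold=0.5):
    """views: list of preprocessed image tensors, each of shape (3, H, W)."""
    model.eval()
    with torch.no_grad():
        # One abnormality probability per view (the network ends in a sigmoid).
        probs = [model(v.unsqueeze(0)).item() for v in views]
    study_prob = sum(probs) / len(probs)        # arithmetic mean over the views
    return study_prob, study_prob > threshold   # abnormal if mean probability > 0.5
```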

The network uses a Dense Convolutional Network architecture, which connects each layer to every other layer in a feed-forward fashion to make the optimization of deep networks tractable. We replace the final fully connected layer with one that has a single output, after which we apply a sigmoid nonlinearity. We use Class Activation Maps to visualize the parts of the radiograph which contribute most to the model's prediction of abnormality.
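The following sketch shows one way to realize these architectural changes, using torchvision's DenseNet-169 as the backbone; the layer names (`features`, `classifier`) follow torchvision's implementation and are assumptions, not the authors' code.

```python
# Hedged sketch: DenseNet-169 with a single sigmoid output and a simple
# Class Activation Map, assuming the torchvision DenseNet implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class AbnormalityNet(nn.Module):
    def __init__(self):
        super().__init__()
        densenet = models.densenet169(weights="IMAGENET1K_V1")
        self.features = densenet.features                                # convolutional backbone
        self.classifier = nn.Linear(densenet.classifier.in_features, 1)  # single-output layer

    def forward(self, x):
        fmap = F.relu(self.features(x))                      # (N, C, h, w) feature maps
        pooled = F.adaptive_avg_pool2d(fmap, 1).flatten(1)   # global average pooling
        return torch.sigmoid(self.classifier(pooled))        # probability of abnormality

    def class_activation_map(self, x):
        # Weight each feature map by the classifier weight for the single
        # abnormality output and sum over channels to get a heat map.
        fmap = F.relu(self.features(x))                      # (N, C, h, w)
        weights = self.classifier.weight.view(1, -1, 1, 1)   # (1, C, 1, 1)
        return (weights * fmap).sum(dim=1)                   # (N, h, w)
```

The heat map can be upsampled to the input resolution and overlaid on the radiograph to visualize which regions drive the prediction.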

We find that our model achieves performance comparable to that of radiologists.

Model performance is higher than the best radiologist performance in detecting abnormalities on finger studies and equivalent on wrist studies. However, model performance is lower than best radiologist performance in detecting abnormalities on elbow, forearm, hand, humerus, and shoulder studies, indicating that the task is a good challenge for future research.

Recall that for each study in the test set, we collected additional normal/abnormal labels from 6 board-certified radiologists. We randomly chose 3 of these radiologists to create a gold standard, defined as the majority vote of labels of the radiologists. We used the other 3 radiologists to get estimates of radiologist performance on the task.

We compared radiologists and our model on the Cohen’s kappa statistic, which expresses the agreement of each radiologist / model with the gold standard, defined as the majority vote of a disjoint group of radiologists. Here, we highlight the best (green) and worst (red) performances on each of the study types and in aggregate.
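As an illustration of this evaluation protocol, the snippet below forms a gold standard from the majority vote of three radiologists and computes Cohen's kappa for a rater (or the model) against it; the toy labels and variable names are placeholders, not our evaluation code.

```python
# Hedged sketch: majority-vote gold standard and Cohen's kappa agreement.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def majority_vote(labels):
    """labels: array of shape (3, n_studies) with 0 = normal, 1 = abnormal."""
    return (np.asarray(labels).sum(axis=0) >= 2).astype(int)

# Toy example with four studies.
gold_raters = np.array([[1, 0, 1, 1],
                        [1, 0, 0, 1],
                        [0, 0, 1, 1]])
model_preds = np.array([1, 0, 1, 0])

gold = majority_vote(gold_raters)
kappa = cohen_kappa_score(model_preds, gold)   # agreement beyond chance
print(f"Cohen's kappa vs. gold standard: {kappa:.3f}")
```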

Musculoskeletal conditions affect more than 1.7 billion people worldwide, and are the most common cause of severe, long-term pain and disability, with 30 million emergency department visits annually and increasing. We hope that our dataset can lead to significant advances in medical imaging technologies which can diagnose at the level of experts, towards improving healthcare access in parts of the world where access to skilled radiologists is limited.

Read our paper

Will your model close the gap to radiologist-level performance?

MURA Dataset, Leaderboard and Instructions coming soon!

If you have questions about our work, contact us at:

pranavsr@cs.stanford.edu and jirvin16@cs.stanford.edu

We will be making our code public by February 2018.