Large, high-quality datasets have played a critical role in driving progress in fields that use deep learning methods. To this end, we introduce MURA, one of the largest public radiographic image datasets.
The MURA abnormality detection task is a binary classification task, where the input is an upper extremity radiograph study (each study contains one or more views, i.e. images) and the expected output is a binary label indicating whether the study is normal or abnormal.
Instructions to download the data and submit your model for evaluation on the hidden test set are coming soon.
Each study belongs to one of seven standard upper extremity radiographic study types: elbow, finger, forearm, hand, humerus, shoulder, and wrist.
To evaluate models and obtain a robust estimate of radiologist performance, we collected additional labels from six board-certified Stanford radiologists on the test set, which consists of 209 musculoskeletal studies. Each radiologist individually and retrospectively reviewed every study in the test set as DICOM files and labeled it as normal or abnormal in the clinical reading room environment using PACS. The radiologists have 8.83 years of experience on average, ranging from 2 to 25 years.
The model takes as input one or more views for a study of an upper extremity. On each view, our 169-layer convolutional neural network predicts the probability of abnormality. We compute the overall probability of abnormality for the study by taking the arithmetic mean of the abnormality probabilities output by the network for each image. The model makes the binary prediction of abnormal if the probability of abnormality for the study is greater than 0.5.
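The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the released code: the function names and the example probabilities are hypothetical, and the per-view probabilities are assumed to have already been produced by the network.

```python
from statistics import mean
from typing import Sequence


def study_abnormality_probability(view_probs: Sequence[float]) -> float:
    """Arithmetic mean of the per-view abnormality probabilities."""
    if not view_probs:
        raise ValueError("a study must contain at least one view")
    return mean(view_probs)


def predict_abnormal(view_probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Abnormal only if the study-level probability is strictly above the threshold."""
    return study_abnormality_probability(view_probs) > threshold


# Hypothetical per-view outputs for a three-view study (made-up numbers).
example_views = [0.9, 0.4, 0.5]
study_prob = study_abnormality_probability(example_views)
print(f"study probability: {study_prob:.2f}, abnormal: {predict_abnormal(example_views)}")
```

Note that the comparison is strict, so a study whose mean probability is exactly 0.5 is predicted normal.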
The network uses a Dense Convolutional Network architecture, which connects each layer to every other layer in a feed-forward fashion to make the optimization of deep networks tractable. We replace the final fully connected layer with one that has a single output, after which we apply a sigmoid nonlinearity. We use Class Activation Maps to visualize the parts of the radiograph which contribute most to the model's prediction of abnormality.
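As a rough illustration of how a Class Activation Map relates to the single-output head, the sketch below uses plain Python lists. It assumes, as in the standard DenseNet design, that the final feature maps are global-average-pooled before the single-output layer. The function names, the tiny 2x2 feature maps, and the weights are all hypothetical, and this is not the authors' implementation.

```python
import math
from typing import List

FeatureMap = List[List[float]]  # one channel, indexed [row][column]


def class_activation_map(feature_maps: List[FeatureMap], weights: List[float]) -> FeatureMap:
    """Weighted sum of the final feature maps, using the output layer's weights."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    return cam


def abnormality_probability(feature_maps: List[FeatureMap], weights: List[float], bias: float) -> float:
    """Global average pooling, a single-output linear layer, then a sigmoid."""
    pooled = [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0])) for fmap in feature_maps]
    logit = sum(w * p for w, p in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Two hypothetical 2x2 feature maps and made-up output-layer weights.
maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
head_weights = [2.0, -1.0]
print(class_activation_map(maps, head_weights))
print(abnormality_probability(maps, head_weights, bias=0.0))
```

Under this assumption, the logit equals the spatial mean of the map plus the bias, which is why large positive regions of the map can be read as the areas pushing the prediction toward abnormal.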
Model performance is higher than the best radiologist performance in detecting abnormalities on finger studies and equivalent on wrist studies. However, model performance is lower than the best radiologist performance in detecting abnormalities on elbow, forearm, hand, humerus, and shoulder studies, indicating that the task is a good challenge for future research.
Recall that for each study in the test set, we collected additional normal/abnormal labels from 6 board-certified radiologists. We randomly chose 3 of these radiologists to create a gold standard, defined as the majority vote of labels of the radiologists. We used the other 3 radiologists to get estimates of radiologist performance on the task.
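The majority-vote construction of the gold standard can be sketched as follows. The encoding (1 for abnormal, 0 for normal), the function names, and the example votes are assumptions made for illustration only.

```python
from typing import List, Sequence


def majority_vote(labels: Sequence[int]) -> int:
    """Majority label among binary votes (1 = abnormal, 0 = normal)."""
    if len(labels) % 2 == 0:
        raise ValueError("an odd number of raters is required to avoid ties")
    return 1 if sum(labels) > len(labels) / 2 else 0


def gold_standard(votes_per_study: Sequence[Sequence[int]]) -> List[int]:
    """Apply the majority vote to each study's set of radiologist labels."""
    return [majority_vote(votes) for votes in votes_per_study]


# Hypothetical labels from three radiologists on three studies.
votes = [(1, 0, 1), (0, 0, 1), (1, 1, 1)]
print(gold_standard(votes))
```

Using three raters guarantees that a binary vote never ties.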
We compared radiologists and our model using Cohen’s kappa statistic, which expresses the agreement of each radiologist or the model with the gold standard, defined as the majority vote of a disjoint group of radiologists. Here, we highlight the best (green) and worst (red) performances on each of the study types and in aggregate.
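For reference, Cohen's kappa compares the observed agreement p_o between two label sequences with the agreement p_e expected by chance from their label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal stdlib-only version for illustration. The function name and the example labels are hypothetical, and it is not the evaluation script used for the reported results.

```python
from typing import Sequence


def cohens_kappa(rater: Sequence[int], reference: Sequence[int]) -> float:
    """Cohen's kappa between two equally long sequences of categorical labels."""
    if len(rater) != len(reference) or not rater:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater)
    # Observed agreement: fraction of items on which the two labelings match.
    p_observed = sum(a == b for a, b in zip(rater, reference)) / n
    # Chance agreement: product of marginal frequencies, summed over classes.
    classes = set(rater) | set(reference)
    p_expected = sum((rater.count(c) / n) * (reference.count(c) / n) for c in classes)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when both labelings are constant and identical")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels for eight studies (1 = abnormal, 0 = normal).
rater_labels = [1, 1, 0, 0, 1, 0, 1, 0]
gold_labels = [1, 1, 0, 0, 0, 0, 1, 1]
print(cohens_kappa(rater_labels, gold_labels))
```

In this example the two labelings agree on 6 of 8 studies (p_o = 0.75) and each labels half the studies abnormal (p_e = 0.5), so kappa is 0.5.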
MURA Dataset, Leaderboard and Instructions coming soon!
We will be making our code public by February 2018.