It is well known that the area under the receiver operating characteristic (ROC) curve can be a misleading measure of model performance under extreme class imbalance. Alternative performance metrics, such as the area under the precision-recall curve and the F1 score, have been recommended instead. But is there a case of extreme class imbalance for which all model performance metrics are misleading? We present such a case, based on the learning task of predicting severe sepsis at emergency department (ED) triage.
The goal is to build a gradient boosting model from ED encounters for which a triage form was completed between November 2014 and February 2019, using patient demographics, history, and vital signs, while preserving the model's ability to produce true (predicted) probabilities.
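As a brief illustration of why resampling can conflict with producing true probabilities, the sketch below uses the standard prior-shift argument: replicating every positive case k times multiplies the odds learned by a well-calibrated model by k, so the original-scale probability is recovered by dividing the odds by k. The function name and the example values are hypothetical. This is not the method used in this work, and a gradient boosting model may satisfy the calibration assumption only approximately.

```python
def correct_oversampled_probability(p_resampled: float, k: float) -> float:
    """Map a probability learned on oversampled data back to the original scale.

    Assumption (not from the abstract): every positive case was replicated
    k times and the model is calibrated on the resampled data.
    Dividing the odds by k is algebraically p / (p + k * (1 - p)).
    """
    if not 0.0 <= p_resampled <= 1.0:
        raise ValueError("p_resampled must lie in [0, 1]")
    if k <= 0:
        raise ValueError("k must be positive")
    return p_resampled / (p_resampled + k * (1.0 - p_resampled))


if __name__ == "__main__":
    # Hypothetical values: a score of 0.5 after 100-fold oversampling.
    print(correct_oversampled_probability(0.5, 100.0))
```

In this hypothetical case, a score of 0.5 on the resampled data corresponds to a probability of about 0.01 on the original data, which is why scores from a resampled model cannot be read directly as risks.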
Approximately 370,000 ED encounters were observed, of which 334 were diagnosed with severe sepsis. Of the 41 variables considered (age, heart rate, temperature, systolic and diastolic blood pressure, oxygen saturation, immunization record, alternative medical treatments, level of consciousness, Glasgow coma scale, capillary refill, musculoskeletal systems, etc.), 25 were selected for the final model. Even setting aside the predicted probabilities, experiments indicated that oversampling did not improve model performance in this case.
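To make the scale of the imbalance concrete, the following sketch computes the prevalence implied by the encounter counts above and the positive predictive value (PPV) that a classifier would achieve at that prevalence. The sensitivity and specificity values are hypothetical, chosen only for illustration. They are not results from this work, and the function name is likewise an assumption.

```python
TOTAL_ENCOUNTERS = 370_000  # approximate count reported in the abstract
SEPSIS_ENCOUNTERS = 334     # encounters diagnosed with severe sepsis


def positive_predictive_value(sensitivity: float, specificity: float,
                              positives: int, total: int) -> float:
    """PPV = expected true positives / expected predicted positives."""
    negatives = total - positives
    true_positives = sensitivity * positives
    false_positives = (1.0 - specificity) * negatives
    return true_positives / (true_positives + false_positives)


prevalence = SEPSIS_ENCOUNTERS / TOTAL_ENCOUNTERS

# Hypothetical operating point, NOT a result from this work.
ppv = positive_predictive_value(sensitivity=0.80, specificity=0.95,
                                positives=SEPSIS_ENCOUNTERS,
                                total=TOTAL_ENCOUNTERS)

if __name__ == "__main__":
    print(f"prevalence: {prevalence:.4%}")
    print(f"PPV at 80% sensitivity, 95% specificity: {ppv:.2%}")
```

With these counts the prevalence is roughly 0.09%, so even the hypothetical classifier above would have a PPV of only about 1.4%. Because a random classifier's precision equals the prevalence, the chance-level baseline for the area under the precision-recall curve is also about 0.0009.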
In this presentation, we discuss the performance of the gradient boosting model under extreme class imbalance and the solution we developed for presenting model performance in a way that makes clinical decisions about adopting the model easier.