Collecting data when missingness is unknown: a method for improving model performance given under-reporting in patient populations

Kevin Wu* (Stanford University and Optum Labs), Dominik Dahlem (Optum Labs), Christopher Hane (Optum Labs), Eran Halperin (Optum Labs), James Zou (Stanford University)

Abstract: Machine learning models for healthcare commonly use binary indicator variables to represent the diagnosis of specific health conditions in medical records. However, in populations with significant under-reporting, the absence of a recorded diagnosis does not rule out the presence of a condition, making it difficult to distinguish negative values from missing ones. This effect, which we refer to as latent missingness, may degrade model performance and perpetuate existing biases in healthcare. To address this issue, we propose that healthcare providers and payers allocate a budget towards data collection (e.g., subsidies for check-ups or lab tests). Given finite resources, however, only a subset of data points can be collected. Additionally, most models cannot be retrained after deployment. In this paper, we propose a method for efficient data collection that aims to maximize a fixed model's performance on a given population. Using simulated and real-world data, we demonstrate the potential value of targeted data collection in addressing model degradation.