Active Learning for Medical Code Assignment

Martha Ferreira (Dalhousie University); Michal Malyska and Nicola Sahar (Semantic Health); Riccardo Miotto (Icahn School of Medicine at Mount Sinai); Fernando Paulovich (Dalhousie University); Evangelos Milios (Dalhousie University, Faculty of Computer Science)

Abstract: Machine Learning (ML) is widely used to automatically extract meaningful information from Electronic Health Records (EHRs) to support operational, clinical, and financial decision making. However, ML models require a large number of annotated examples to provide satisfactory results, which is infeasible in most healthcare scenarios because clinician-labeled data is expensive to obtain. Active Learning (AL) is the process of selecting the most informative instances to be labeled by an expert and then using them to further train a supervised algorithm. We demonstrate the effectiveness of AL for multi-label text classification in the clinical domain. In this context, we apply a set of well-known AL methods to help automatically assign ICD-9 codes on the MIMIC-III dataset. Our results show that selecting informative instances provides satisfactory classification with a significantly reduced training set (8.3% of the total instances). We conclude that AL methods can significantly reduce the manual annotation cost while preserving model performance.
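
To make the idea of "selecting the most informative instances" concrete, the following is a minimal sketch of one common AL query strategy, uncertainty sampling, adapted to a multi-label setting. It is an illustrative assumption, not necessarily one of the methods evaluated in this work: the function names, the mean-entropy scoring rule, and the example probabilities are all hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def uncertainty_scores(probabilities):
    """Mean per-label entropy for each instance (one row of label probabilities each)."""
    return [sum(binary_entropy(p) for p in row) / len(row) for row in probabilities]


def select_for_annotation(probabilities, batch_size):
    """Return the indices of the batch_size most uncertain unlabeled instances."""
    scores = uncertainty_scores(probabilities)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


# Hypothetical model outputs: 4 unlabeled notes, 3 candidate codes each.
pool_probabilities = [
    [0.99, 0.01, 0.02],  # model is confident about every code
    [0.55, 0.48, 0.60],  # model is unsure about every code
    [0.90, 0.50, 0.10],  # model is unsure about one code
    [0.05, 0.97, 0.03],  # model is confident about every code
]

# Instances the expert would be asked to label next.
selected = select_for_annotation(pool_probabilities, batch_size=2)
print(selected)
```

In a full AL loop, the selected instances would be labeled by the expert, moved from the unlabeled pool to the training set, and the classifier retrained before the next round of selection.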