Mining the Clinical Narrative: All Text Are Not Equal.
Over the past decade, the application of data science techniques to clinical data has allowed practitioners and researchers to develop a wide variety of analytical models. These models have traditionally relied on structured data drawn from Electronic Medical Records (EMR). Yet a large portion of EMR data remains unstructured, held primarily within clinical notes. While recent work has produced techniques for extracting structured features from unstructured text, this work generally operates under the untested assumption that all clinical text can be processed in a similar manner. This paper provides what we believe to be the first comprehensive evaluation of the differences between four major sources of clinical text, examining the structural, linguistic, and topical differences among notes of each category. Our conclusions support the premise that tools designed to extract structured data from clinical text must account for the categories of text they process.