Title of Presentation
“CRIP.CodEx: Knowledge extraction from free text medical records”
Date and Place
Oliver Gros heads the Metabiobanks CRIP unit at the Fraunhofer Institute for Cell Therapy and Immunology, Branch Bioanalytics and Bioprocesses (IZI-BB), in Potsdam, Germany. The unit develops and implements infrastructure for networked biomedical research consortia, based on the CRIP Privacy-Regime and approved by German data protection authorities. By integrating biobanks into so-called metabiobanks, the unit enables web-based, case-by-case and sample-by-sample searches for human biospecimens and associated data, supporting networked research across institutional and national borders.
Biobanks are key resources for translational research and personalized medicine. To be valuable for translational biomedical research, samples have to be annotated with clinical data, which are often available only in free text records. To enable access to these biospecimens and data via dynamic, stratified, parameterized project queries, e.g. over platforms like Arevir (Roomp et al., 2006), CRIP (Schröder et al., 2011) or the Fraunhofer Metabiobank (www.metabiobank.fraunhofer.de), it is necessary to integrate knowledge about the stored human biospecimens from various sources, including free text records, into harmonized and structured data (Ambert and Cohen, 2009). The automated knowledge extraction software CRIP.CodEx was designed to identify and extract diagnostic information from free text medical records using text mining technologies, enrich the parameterized annotation of cases and specimens, and assign corresponding codes (e.g. ICD, ICD-O, TNM). CRIP.CodEx efficiently identifies word relations and negation and handles extended negation scopes (Gros and Stede, 2013), without needing access to databases or other external resources. Because it uses unsupervised learning, it requires no manual entry of rules and no training of classifiers, which gives it a head start when applied to different subdomains or new knowledge corpora, or when deployed in new settings. By tapping complementary data sources, e.g. free text pathology reports or diagnoses, we delivered a system that provides biobanks with information from previously unstructured, and therefore hidden, data. We have thereby enriched the parameterized annotation of stored biospecimens, increasing the visibility of the samples and data and enhancing their availability for translational research.
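To make the task concrete, the following minimal Python sketch illustrates what negation-aware code assignment on a free text diagnosis might look like. It is not the CRIP.CodEx method: CRIP.CodEx uses unsupervised learning and no external resources, whereas this sketch relies on a small hand-written term list and fixed negation cues. The names TERM_TO_CODE, negated_spans and extract_codes, the cue list, and the rule that a negation scope runs to the next clause boundary are all assumptions made for illustration only.

```python
import re

# Hypothetical mini-lexicon (illustration only): diagnostic term -> ICD-10 code.
TERM_TO_CODE = {
    "adenocarcinoma of the colon": "C18.9",
    "liver metastasis": "C78.7",
}

# Assumed negation cues; longer cues are listed first so they match first.
CUE_RE = re.compile(r"\b(?:no evidence of|without|no|not)\b")

# Assumed scope rule: a negation extends to the next clause boundary.
SCOPE_END_RE = re.compile(r"[.;]|\bbut\b")


def negated_spans(text):
    """Return (start, end) character spans that fall under a negation cue."""
    lower = text.lower()
    spans = []
    for cue in CUE_RE.finditer(lower):
        boundary = SCOPE_END_RE.search(lower, cue.end())
        end = boundary.start() if boundary else len(lower)
        spans.append((cue.end(), end))
    return spans


def extract_codes(text):
    """Return (term, code, negated) for every lexicon term found in the text."""
    lower = text.lower()
    spans = negated_spans(text)
    results = []
    for term, code in TERM_TO_CODE.items():
        for match in re.finditer(re.escape(term), lower):
            negated = any(s <= match.start() < e for s, e in spans)
            results.append((term, code, negated))
    return results


if __name__ == "__main__":
    report = "Adenocarcinoma of the colon. No evidence of liver metastasis."
    for term, code, negated in extract_codes(report):
        status = "negated" if negated else "affirmed"
        print(f"{code}\t{term}\t{status}")
```

In this sketch, only affirmed findings would be used to annotate a specimen; a negated finding such as the liver metastasis above would be recorded as absent rather than coded as a diagnosis.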