The Metastatic streaming engine was developed to identify inpatient encounters at high risk of metastatic disease (cancer that has spread from where it started to a distant part of the body) and to optimize the capture rate. It was deployed at Mount Sinai Hospital in April 2022.
Clinical Documentation Integrity (CDI)
At the core of every patient encounter is clinical documentation that accurately reflects the patient’s disease burden and the scope of services provided.
Clinical documentation must be:
CDI facilitates the accurate translation of a patient’s clinical status into coded data, which in turn drives quality reporting, physician report cards, reimbursement, public health data, disease tracking and trending, and medical research.
The convergence of clinical care, documentation, and the coding process is crucial for appropriate reimbursement, accurate quality scores, and informed decision-making to support high-quality patient care.
High volume of clinical notes and the cost of processing
Manual chart review of cancer patients to identify new metastatic disease is inefficient due to:
time required
limited number of patients assessed
difficulty identifying these patients prior to treatment
Information to quickly and accurately identify patients with metastatic disease is typically available only in clinical text documents (particularly radiology reports)
Complex language and inconclusive wording used to express uncertain or negated conditions make the NLP task very challenging 1
Building an exhaustive list of terms and rules to model language and extract domain concepts is extremely time-consuming
High class imbalance => low productivity
Clinical documentation improvement opportunity based on benchmarking reports
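The difficulty with negated and uncertain wording noted in the list above can be illustrated with a minimal rule-based sketch. The cue lists, the five-word window, and the function name assertion_status are assumptions made for illustration and are not part of the deployed engine; the point is only that a bare keyword hit does not say whether the condition is present.

```python
import re

# Hypothetical cue phrases; a real lexicon would be far larger.
NEGATION_CUES = ("no evidence of", "without", "negative for", "no")
UNCERTAINTY_CUES = ("possible", "probably", "cannot exclude", "suspicious for")
TARGET = re.compile(r"\bmetasta(?:tic|sis|ses)\b", re.IGNORECASE)


def assertion_status(sentence: str, window: int = 5) -> str:
    """Classify the first target mention as absent, negated, uncertain, or affirmed."""
    match = TARGET.search(sentence)
    if match is None:
        return "absent"
    # Look only at the few words immediately preceding the mention.
    preceding = sentence[: match.start()].lower().split()[-window:]
    context = " " + " ".join(preceding) + " "
    if any(f" {cue} " in context for cue in NEGATION_CUES):
        return "negated"
    if any(f" {cue} " in context for cue in UNCERTAINTY_CUES):
        return "uncertain"
    return "affirmed"
```

Cues that follow the mention ("metastasis is unlikely") or sit outside the window are missed by this sketch, which is one reason hand-written rules are hard to make exhaustive.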
Current Approach and Solution
Search Algorithms
Term/string matching and document indexing (“metastatic”, “metastasis”, “metastases” and “carcinomatosis”)
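As a minimal sketch of the term-matching approach, the four search terms can be compiled into one case-insensitive pattern and run over a set of documents. The function name find_term_hits and the document-id keys are hypothetical; only the term list comes from the text.

```python
import re

# The four search terms listed above.
TERMS = ("metastatic", "metastasis", "metastases", "carcinomatosis")
PATTERN = re.compile(r"\b(" + "|".join(TERMS) + r")\b", re.IGNORECASE)


def find_term_hits(documents: dict[str, str]) -> dict[str, list[str]]:
    """Map each document id to the search terms it contains, in order of appearance."""
    hits = {}
    for doc_id, text in documents.items():
        found = [m.group(1).lower() for m in PATTERN.finditer(text)]
        if found:
            hits[doc_id] = found
    return hits
```

For example, find_term_hits({"r1": "Multiple hepatic Metastases; peritoneal carcinomatosis."}) returns {"r1": ["metastases", "carcinomatosis"]}. A report stating "no evidence of metastatic disease" would be returned as well, since plain matching ignores context.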
DNNs for medical NLP (Language models: embeddings)
Relation Extraction (REX) 2
Lexicon Mediated Entropy Reduction (LEXIMER) system
Medical Language Extraction and Encoding System (MEDLEE)
It uses a controlled vocabulary and grammatical rules to translate text to a structured database format
Low generalizability 3,4
Radiology Analysis tool (RADA) 5
Mayo Clinic’s Clinical Text Analysis and Knowledge Extraction System (cTAKES)
A dictionary-based named-entity recognizer that highlights Unified Medical Language System (UMLS) Metathesaurus terms in text, in addition to other NLP functionalities such as tokenizing, part-of-speech tagging, and parsing 6
Health Information Text Extraction (HITEx) from Brigham and Women’s Hospital and Harvard Medical School
It creates tags for principal diagnoses 7
Named Entity Recognition (NER) 2
dictionary-based method
conditional Markov model (CMM) 13
Sequence classifier
Probabilities in CMMs are normalized locally for each state in the sequence
conditional random field model (CRF)
Sequence classifier
Probabilities in CRFs are normalized globally for a sequence
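The difference between local and global normalization can be written out explicitly. The notation below (label sequence y, input x, feature function f, weight vector w, sequence length T) is introduced here for illustration and follows the standard textbook formulation; it is not taken from this project's code.

```latex
% CMM: every transition is normalized over the candidate labels at that position.
P_{\mathrm{CMM}}(y \mid x) = \prod_{t=1}^{T}
  \frac{\exp\big(w^\top f(y_{t-1}, y_t, x, t)\big)}
       {\sum_{y'} \exp\big(w^\top f(y_{t-1}, y', x, t)\big)}

% Linear-chain CRF: a single partition function Z(x) normalizes over whole label sequences.
P_{\mathrm{CRF}}(y \mid x) = \frac{1}{Z(x)}
  \exp\Big(\sum_{t=1}^{T} w^\top f(y_{t-1}, y_t, x, t)\Big),
\qquad
Z(x) = \sum_{y'_{1:T}} \exp\Big(\sum_{t=1}^{T} w^\top f(y'_{t-1}, y'_t, x, t)\Big)
```

In the CMM each factor sums to one on its own, whereas in the CRF only the full sequence distribution does.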
Information model 8,9
anatomy: “Right upper lobe”
anatomy modifier: “Anterior”
observation: “Mass”
observation modifier: “Calcified”, “1 cm”
uncertainty: “Probably is present”
This information model has a hierarchical structure
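One way to picture the hierarchical information model is as nested records. The sketch below is an assumption about how the five concept types could nest (modifiers under their parent concept, anatomy and uncertainty attached to the observation); the source states only that the model is hierarchical, so the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Anatomy:
    text: str                                            # e.g. "Right upper lobe"
    modifiers: list[str] = field(default_factory=list)   # e.g. ["Anterior"]


@dataclass
class Observation:
    text: str                                            # e.g. "Mass"
    modifiers: list[str] = field(default_factory=list)   # e.g. ["Calcified", "1 cm"]
    anatomy: Optional[Anatomy] = None
    uncertainty: Optional[str] = None                    # e.g. "Probably is present"


# The example values listed above, assembled into one finding.
finding = Observation(
    text="Mass",
    modifiers=["Calcified", "1 cm"],
    anatomy=Anatomy("Right upper lobe", ["Anterior"]),
    uncertainty="Probably is present",
)
```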
The annotation tool => eHOST:
eHOST: the Extensible Human Oracle Suite of Tools, an open-source annotation tool
Stanford Part of Speech Tagger
RadLex: the RadLex lexicon is organized in a hierarchical structure and is available in Web Ontology Language (OWL) format. 10
The cTAKES dictionary-based named-entity recognition methodology is used in this work 12
The CMM and CRF training infrastructure of the Stanford Named-Entity Recognizer toolkit is used 11
Optimization opportunity and Goal
Opportunity: Identify inpatient encounters with high risk of metastatic disease and optimize the capture rate
Goal: Develop an ML-based CDI tool that flags inpatient encounters with high risk of metastatic disease on the day of discharge and sends a notification to the CDI specialist
Expected Impacts
Improve coding accuracy
Improve reimbursement opportunities
Improve comorbidity score => Improve Elixhauser Comorbidity Index
Improve PSI monitoring
Proposed Solution
This tool automatically screens each patient’s clinical notes (Care Notes and Progress Notes) and reports (Radiology and Pathology) at discharge time to rapidly identify patients with metastatic disease
The machine learning information extraction approach automatically annotates and extracts clinically significant information from a large collection of free text; an ML classifier then uses that information to identify patients at high risk of new metastasis
High Level Operationalization Workflow
Batch Computational Flow
We use discriminative sequence classifiers for named-entity recognition to extract and organize clinically significant terms and phrases consistent with the information model.
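Sequence classifiers typically emit one label per token, so a post-processing step is needed to turn token labels into phrases of the information model. The sketch below assumes a BIO tagging scheme (B- begins an entity, I- continues it, O is outside); the scheme, the tag names, and the function decode_bio are assumptions for illustration, not a description of the deployed pipeline.

```python
def decode_bio(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, phrase) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current_type
        if not continues and current_tokens:
            # The open entity ends here; store it before starting anything new.
            entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix == "B" or continues:
            current_type = label
            current_tokens.append(token)
        # An I- tag that does not continue the open entity is ignored.
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities
```

Given the tokens "Calcified mass in the right upper lobe" tagged B-observation_modifier, B-observation, O, O, B-anatomy, I-anatomy, I-anatomy, the function returns the three phrases "Calcified", "mass", and "right upper lobe" with their entity types.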
Feature Engineering Flow
Labeling Logic
Proposed Key Performance Indicators (KPIs)
Chart review rate ==> will be captured by REDCap response
Query rate
Provider response rate ==> will be captured by REDCap response
Provider agreement rate ==> will be captured by REDCap response
Unable to determine rate
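The KPIs above could be computed from per-encounter review records along the following lines. This is a sketch under stated assumptions: the source does not define the denominators, so the choices here (reviews over flagged encounters, queries over reviews, responses over queries, agreements over responses) are one plausible reading, and the record fields are hypothetical rather than actual REDCap field names.

```python
from dataclasses import dataclass


@dataclass
class ReviewRecord:
    """One flagged encounter; all field names are hypothetical."""
    reviewed: bool = False
    queried: bool = False
    provider_responded: bool = False
    provider_agreed: bool = False
    unable_to_determine: bool = False


def _rate(numerator: int, denominator: int) -> float:
    # Report 0.0 instead of failing when a stage has no records yet.
    return numerator / denominator if denominator else 0.0


def compute_kpis(records: list[ReviewRecord]) -> dict[str, float]:
    reviewed = [r for r in records if r.reviewed]
    queried = [r for r in reviewed if r.queried]
    responded = [r for r in queried if r.provider_responded]
    agreed = sum(r.provider_agreed for r in responded)
    undetermined = sum(r.unable_to_determine for r in reviewed)
    return {
        "chart_review_rate": _rate(len(reviewed), len(records)),
        "query_rate": _rate(len(queried), len(reviewed)),
        "provider_response_rate": _rate(len(responded), len(queried)),
        "provider_agreement_rate": _rate(agreed, len(responded)),
        "unable_to_determine_rate": _rate(undetermined, len(reviewed)),
    }
```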
Active Pilot Workflow
There are two types of metastatic patients:
documented and captured by 3M software
undocumented, to be captured by the CDI team review ==> only this category will be sent to REDCap to be scanned by the NLP application
Lakhani Chirag M, Manrai Arjun K, Yang Jian, Visscher Peter M, Patel Chirag J. HHS Public Access. Physiol Behav 2019;176:139–48.
Hahn U, Oleynik M. Medical Information Extraction in the Age of Deep Learning. Yearb Med Inform 2020;29:208–20.
Hripcsak George, Kuperman Gilad J, Friedman Carol. Extracting findings from narrative reports: software transferability and sources of physician disagreement. Methods Inf Med 1998;37(1):1–7.
Elkins Jacob S, Friedman Carol, Boden-Albala Bernadette, Sacco Ralph L, Hripcsak George. Coding neuroradiology reports for the Northern Manhattan Stroke Study: a comparison of natural language processing and manual review. Comput Biomed Res 2000;33(1):1–10.
Johnson David B, Taira Ricky K, Cardenas Alfonso F, Aberle Denise R. Extracting information from free text radiology reports. Int J Digit Libr 1997;1(3):297–308.
Savova Guergana K, Masanz James J, Ogren Philip V, Zheng Jiaping, Sohn Sunghwan, Kipper-Schuler Karin C, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inf Assoc 2010;17(5):507–13.
Goryachev Sergey, Sordo Margarita, Zeng Qing T. A suite of natural language processing tools developed for the I2B2 project. In: Bates David W, editor. Proceedings of the AMIA symposium, vol. 2. Washington DC: American Medical Informatics Association; 2006. p. 931.
Langlotz Curtis P, Lee Meininger. Enhancing the expressiveness and usability of structured image reporting systems. In: Marc Overhage J, editor. Proceedings of the AMIA symposium. Los Angeles, CA: American Medical Informatics Association; 2000. p. 467.
Hassanpour S, Langlotz CP. Information extraction from multi-institutional radiology reports. Artif Intell Med 2016;66:29–39.
Langlotz Curtis P. RadLex: a new method for indexing online educational materials. Radiographics 2006;26(6):1595–7.
Finkel Jenny Rose, Grenager Trond, Manning Christopher. Incorporating non-local information into information extraction systems by Gibbs sampling. In: Darwish Kareem, Diab Mona, Habash Nizar, editors. Proceedings of the 43rd annual meeting on association for computational linguistics. Ann Arbor, MI: Association for Computational Linguistics; 2005. p. 363–70.
Savova Guergana K, Masanz James J, Ogren Philip V, Zheng Jiaping, Sohn Sunghwan, Kipper-Schuler Karin C, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inf Assoc 2010;17(5):507–13.
Ratnaparkhi Adwait. A maximum entropy model for part-of-speech tagging. In: Brill Eric, Church Kenneth, editors. Proceedings of the conference on empirical methods in natural language processing, vol. 1. Philadelphia, PA: Association for Computational Linguistics; 1996. p. 133–42.