Bottom Line:
Unfortunately, most of the valuable information on risk factor data is buried in the form of unstructured clinical notes in electronic health records. The hybrid approach employs both machine learning and rule-based clinical text mining techniques. The developed system achieved an overall microaveraged F-score of 0.8302.

Affiliation: School of Public Health and Community Medicine, University of New South Wales, Sydney, NSW 2052, Australia ; Asia-Pacific Ubiquitous Healthcare Research Centre, University of New South Wales, Sydney, NSW 2052, Australia ; Prince of Wales Clinical School, University of New South Wales, Sydney, NSW 2052, Australia.

ABSTRACT
Heart disease is the leading cause of death worldwide. Therefore, assessing the risk of its occurrence is a crucial step in predicting serious cardiac events. Identifying heart disease risk factors and tracking their progression is a preliminary step in heart disease risk assessment. A large number of studies have reported the use of risk factor data collected prospectively. Electronic health record systems are a rich source of the required risk factor data. Unfortunately, most of the valuable information on risk factors is buried in unstructured clinical notes in electronic health records. In this study, we present an information extraction system that extracts information on heart disease risk factors from unstructured clinical notes using a hybrid approach. The hybrid approach employs both machine learning and rule-based clinical text mining techniques. The developed system achieved an overall microaveraged F-score of 0.8302.
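For readers unfamiliar with the metric, a microaveraged F-score pools true positives, false positives, and false negatives across all classes before computing precision and recall, so frequent classes weigh more than rare ones. The Python sketch below illustrates the calculation only. The class names and counts are hypothetical and are not results from this study.

```python
from typing import Dict, Tuple


def micro_f_score(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Return the microaveraged F-score.

    counts maps a class name to (true positives, false positives,
    false negatives). Counts are pooled over classes first.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        return 0.0  # no correct predictions: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-class counts, for illustration only.
example_counts = {
    "medication": (80, 10, 20),
    "hypertension": (40, 10, 5),
    "smoker": (10, 5, 5),
}
score = micro_f_score(example_counts)
print(f"microaveraged F-score: {score:.4f}")
```

Because the counts are pooled, the result equals 2TP / (2TP + FP + FN) computed over all classes together.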

Mentions:
The heart disease risk factors system (HDRFSystem) in its current form includes three modules: (i) a core NLP module, (ii) a risk factor recognition module, and (iii) an attribute assignment module (Figure 2). The core NLP module identifies sentence boundaries (sentence detector), breaks sentences into tokens (tokenizer), assigns part-of-speech tags (POS-tagger), and identifies noun phrases (chunker). The core NLP module adopts components from the OpenNLP package (v1.5.3), available at https://opennlp.apache.org/. Processed information from the core NLP module is then passed to the risk factor recognition module, where medications, disease/disorder mentions, family history, and smoking history are identified. The risk factor recognition module is responsible for identifying all the heart disease risk factors. All the identified risk factors (except family history and smoking history) are then assigned indicator and time attributes by the components in the attribute assignment module. The components of the risk factor recognition module and the time attribute assignment module are explained in more detail in the following sections.
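The three-module flow described above can be sketched in Python as follows. This is a toy illustration of the data flow only, not the HDRFSystem implementation: regular expressions stand in for the OpenNLP components, and the lexicon, the "history of" rule, and the attribute values ("mention", "past", "current") are all hypothetical placeholders.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical toy lexicon; the real system's resources are not shown here.
LEXICON = {
    "metformin": "MEDICATION",
    "aspirin": "MEDICATION",
    "hypertension": "DISEASE",
    "diabetes": "DISEASE",
    "smoker": "SMOKING_HISTORY",
}
# Risk factor types that receive no indicator or time attributes.
NO_ATTRIBUTES = {"FAMILY_HISTORY", "SMOKING_HISTORY"}


@dataclass
class RiskFactor:
    text: str
    kind: str
    sentence: str
    indicator: Optional[str] = None
    time: Optional[str] = None


def core_nlp(note: str) -> List[Tuple[str, List[str]]]:
    """Module (i) stand-in: sentence detection and tokenization."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", note.strip()) if s]
    return [(s, re.findall(r"\w+", s.lower())) for s in sentences]


def recognize_risk_factors(
    sentences: List[Tuple[str, List[str]]]
) -> List[RiskFactor]:
    """Module (ii) stand-in: dictionary lookup over tokens."""
    found = []
    for sentence, tokens in sentences:
        for token in tokens:
            if token in LEXICON:
                found.append(RiskFactor(token, LEXICON[token], sentence))
    return found


def assign_attributes(factors: List[RiskFactor]) -> List[RiskFactor]:
    """Module (iii) stand-in: one hand-written rule for the time attribute."""
    for factor in factors:
        if factor.kind in NO_ATTRIBUTES:
            continue  # family and smoking history are skipped
        factor.indicator = "mention"  # placeholder value
        in_history = re.search(r"\bhistory of\b", factor.sentence.lower())
        factor.time = "past" if in_history else "current"  # placeholder values
    return factors


def run_pipeline(note: str) -> List[RiskFactor]:
    return assign_attributes(recognize_risk_factors(core_nlp(note)))


note = "Patient has a history of hypertension. Started metformin today. He is a smoker."
for factor in run_pipeline(note):
    print(factor.text, factor.kind, factor.indicator, factor.time)
```

Keeping each module as a separate function mirrors the staged design: the output of one stage is the only input to the next, so a stand-in can be swapped for a trained model without touching the other stages.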
