Comparison of Three Information Sources for Smoking Information in Electronic Health Records

the ONA take:

Narrative text may be the most reliable and comprehensive source for obtaining smoking-related information, while patient-provided information (PPI) could be used as a complementary source for more comprehensive patient data, according to a study published in Cancer Informatics.

For the study, investigators reviewed chart data from a lung cancer cohort of 561 patients aged 15 to 45 years who were diagnosed with 1 of 21 lung cancer subtypes. For NLP-based identification of smoking status, researchers extracted smoking-related information from the Mayo Clinic electronic medical record (EMR). Patient-provided smoking information was obtained from structured PPI in EMRs, and diagnosis codes, extracted from hospital billing information, were used to group patients as ever smokers or never smokers.

Results showed that NLP alone had the best overall performance for extracting smoking status information, but combining PPI with NLP further enhanced patient coverage. Investigators found that adding ICD-9 codes to NLP, with or without PPI, did not improve extraction. For smoking strength, combining NLP with PPI was slightly better than using NLP alone.
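One way to picture how a second source can raise coverage is a simple fallback rule: use the NLP-derived status when one exists, and fall back to PPI otherwise. The sketch below is illustrative only. The summary does not say how the investigators merged sources, so the fallback order, the function names (`combine_status`, `coverage`), and the sample values are all assumptions.

```python
from typing import Optional, Sequence

# True = ever smoker, False = never smoker, None = source could not determine status.
Status = Optional[bool]


def combine_status(nlp: Status, ppi: Status) -> Status:
    """Hypothetical merge rule: prefer the NLP result, fall back to PPI."""
    return nlp if nlp is not None else ppi


def coverage(statuses: Sequence[Status]) -> float:
    """Proportion of patients whose smoking status could be determined."""
    return sum(s is not None for s in statuses) / len(statuses)


# Made-up values for four patients, for illustration only.
nlp_status = [True, None, False, None]
ppi_status = [None, True, True, None]
combined = [combine_status(n, p) for n, p in zip(nlp_status, ppi_status)]
```

With these made-up values, NLP alone determines a status for two of the four patients, and the combined source does so for three. Where the two sources disagree (the third patient), this rule keeps the NLP value.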

Smoking status for people aged 13 years or older is one of the core criteria for meaningful use of electronic medical records.

ABSTRACT

Objective: The primary aim was to compare independent and joint performance of retrieving smoking status through different sources, including narrative text processed by natural language processing (NLP), patient-provided information (PPI), and diagnosis codes (ie, International Classification of Diseases, Ninth Revision [ICD-9]). We also compared the performance of retrieving smoking strength information (ie, heavy/light smoker) from narrative text and PPI.

Materials and methods: Our study leveraged an existing lung cancer cohort for smoking status, amount, and strength information, which was manually chart-reviewed. On the NLP side, smoking-related electronic medical record (EMR) data were retrieved first. A pattern-based smoking information extraction module was then implemented to extract smoking-related information. After that, heuristic rules were used to obtain smoking status-related information. Smoking information was also obtained from structured data sources based on diagnosis codes and PPI. Sensitivity, specificity, and accuracy were measured using patients with coverage (ie, the proportion of patients whose smoking status/strength can be effectively determined).
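The evaluation described above can be sketched in a few lines, assuming binary ever/never smoker labels with the chart review as the reference standard. The function name `evaluate`, the use of `None` for patients a source cannot classify, and the sample data are assumptions made for illustration; this is not the authors' code.

```python
from typing import Dict, Optional, Sequence


def evaluate(gold: Sequence[bool], predicted: Sequence[Optional[bool]]) -> Dict[str, float]:
    """Compute coverage over all patients and the other metrics over covered patients.

    gold: chart-reviewed status (True = ever smoker, False = never smoker).
    predicted: status from one source; None means the source could not determine it.
    """
    covered = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    tp = sum(1 for g, p in covered if g and p)          # ever smokers correctly found
    tn = sum(1 for g, p in covered if not g and not p)  # never smokers correctly found
    fp = sum(1 for g, p in covered if not g and p)
    fn = sum(1 for g, p in covered if g and not p)

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    return {
        "coverage": ratio(len(covered), len(gold)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(covered)),
    }


# Made-up labels for five patients, for illustration only.
gold = [True, True, False, False, True]
predicted = [True, False, False, None, True]
metrics = evaluate(gold, predicted)
```

Because sensitivity, specificity, and accuracy are computed only over covered patients, a source can score well on them while leaving many patients unclassified, which is why coverage is reported alongside them.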

Conclusion: These findings suggest that narrative text could serve as a more reliable and comprehensive source for obtaining smoking-related information than structured data sources. PPI, a readily available structured data source, could be used as a complementary source for more comprehensive patient coverage.

Funding: The authors gratefully acknowledge the support from the National Institutes of Health (NIH) grants R01GM102282-03 and R01 LM011934-02. The authors confirm that the funder had no influence over the study design, content of the article, or selection of this journal.