Abstract

Genome-wide association studies have identified genetic variants for thousands of diseases and traits. We evaluated the relationships between specific risk factors (for example, blood cholesterol level) and diseases on the basis of their shared genetic architecture in a comprehensive human disease-single-nucleotide polymorphism association database (VARIMED), analyzing the findings from 8962 published association studies. Similarity between traits and diseases was statistically evaluated on the basis of their association with shared gene variants. We identified 120 disease-trait pairs that were statistically similar, and of these, we tested and validated five previously unknown disease-trait associations by searching electronic medical records (EMRs) from three independent medical centers for evidence of the trait appearing in patients within 1 year of first diagnosis of the disease. We validated that the mean corpuscular volume is elevated before diagnosis of acute lymphoblastic leukemia; both have associated variants in the gene IKZF1. Platelet count is decreased before diagnosis of alcohol dependence; both are associated with variants in the gene C12orf51. Alkaline phosphatase level is elevated in patients with venous thromboembolism; both share variants in ABO. Similarly, we found that prostate-specific antigen and serum magnesium levels were altered before the diagnosis of lung cancer and gastric cancer, respectively. Disease-trait associations identify traits that could serve as future prognostics, if validated through EMR and subsequent prospective trials.
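
The abstract does not name the statistic used to score similarity from shared variants. As a minimal sketch, assuming a one-sided hypergeometric enrichment test (a common choice for overlap between two sets, not necessarily the authors' method), the significance of a disease-trait overlap could be estimated as below. The function name and all counts are hypothetical.

```python
from math import comb


def overlap_p_value(n_universe, n_trait, n_disease, n_shared):
    """P(X >= n_shared) for X ~ Hypergeometric(n_universe, n_trait, n_disease).

    n_universe: variants (or genes) considered in total (assumed known)
    n_trait:    variants associated with the trait
    n_disease:  variants associated with the disease
    n_shared:   variants associated with both
    """
    total = comb(n_universe, n_disease)
    upper = min(n_trait, n_disease)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(n_trait, k) * comb(n_universe - n_trait, n_disease - k)
        for k in range(n_shared, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 variants overall, 5 per phenotype, all 5 shared.
p = overlap_p_value(10, 5, 5, 5)
```

In practice a p-value like this would be computed for every disease-trait pair and corrected for multiple testing before any pair is called similar.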

Abstract

Understanding of cancer outcomes is limited by data fragmentation. In the current study, the authors analyzed the information yielded by integrating breast cancer data from 3 sources: electronic medical records (EMRs) from 2 health care systems and the state registry. Diagnostic test and treatment data were extracted from the EMRs of all patients with breast cancer treated between 2000 and 2010 in 2 independent California institutions: a community-based practice (Palo Alto Medical Foundation; "Community") and an academic medical center (Stanford University; "University"). The authors incorporated records from the population-based California Cancer Registry and then linked EMR-California Cancer Registry data sets of Community and University patients. The authors initially identified 8210 University patients and 5770 Community patients; linked data sets revealed a 16% patient overlap, yielding 12,109 unique patients. The percentage of all Community patients, but not University patients, treated at both institutions increased with worsening cancer prognostic factors. Before linking the data sets, Community patients appeared to receive less intervention than University patients (mastectomy: 37.6% vs 43.2%; chemotherapy: 35% vs 41.7%; magnetic resonance imaging: 10% vs 29.3%; and genetic testing: 2.5% vs 9.2%). Linked Community and University data sets revealed that patients treated at both institutions received substantially more interventions (mastectomy: 55.8%; chemotherapy: 47.2%; magnetic resonance imaging: 38.9%; and genetic testing: 10.9% [P < .001 for each 3-way institutional comparison]). Data linkage identified 16% of patients who were treated in 2 health care systems and who, despite comparable prognostic factors, received far more intensive treatment than others. By integrating complementary data from EMRs and population-based registries, a more comprehensive understanding of breast cancer care and factors that drive treatment use was obtained.

Abstract

Alzheimer's disease (AD) is one of the leading causes of death for older people in the US, with rapidly increasing incidence. AD irreversibly and progressively damages the brain, but there are treatments in clinical trials that may slow the development of AD. We hypothesize that the presence of clinical traits that share common genetic variants with AD could be used as a non-invasive means to predict AD or to trigger administration of preventative therapeutics. We developed a method to compare the genetic architecture between AD and traits from prior GWAS. Six clinical traits were significantly associated with AD, capturing 5 known risk factors and 1 novel association: erythrocyte sedimentation rate (ESR). The association of ESR with AD was then validated using Electronic Medical Records (EMR) collected from Stanford Hospital and Clinics. We found that female patients with abnormally elevated ESR had a significantly higher risk of AD diagnosis (OR: 1.85 [1.32-2.61], p=0.003), within 1 year prior to AD diagnosis (OR: 2.31 [1.06-5.01], p=0.032), and within 1 year after AD diagnosis (OR: 3.49 [1.93-6.31], p<0.0001). Additionally, significantly higher ESR values persist for all time courses analyzed. Our results suggest that ESR should be tested in a specific longitudinal study for association with AD diagnosis, and if positive, could be used as a prognostic marker.
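
The odds ratios above are reported with bracketed intervals, but the abstract does not say how they were estimated. As a minimal sketch, assuming a plain 2x2 table and the standard log-scale (Woolf) 95% confidence interval, an odds ratio can be computed as follows. The function name and the counts are hypothetical and are not the study's data.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf confidence interval from a 2x2 table.

    a: exposed cases      (e.g. elevated ESR, AD)
    b: exposed controls   (elevated ESR, no AD)
    c: unexposed cases    (normal ESR, AD)
    d: unexposed controls (normal ESR, no AD)
    All four cells must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only.
or_, lo, hi = odds_ratio_ci(20, 10, 10, 20)
```

An interval whose lower bound stays above 1, as in each of the three reported comparisons, indicates an association that is significant at the corresponding level.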

Abstract

To address the challenge of balancing privacy with the need to create cross-site research registry records on individual patients, while matching the data for a given patient as he or she moves between participating sites. To evaluate the strategy of generating anonymous identifiers based on real identifiers in such a way that the chances of a shared patient being accurately identified were maximized, and the chances of incorrectly joining two records belonging to different people were minimized. Our hypothesis was that most variation in names occurs after the first two letters, and that date of birth is highly reliable, so a single match variable consisting of a hashed string built from the first two letters of the patient's first and last names plus their date of birth would have the desired characteristics. We compared and contrasted the match algorithm characteristics (rate of false positives vs. rate of false negatives) for our chosen variable against both Social Security Numbers and full names. In a data set of 19 000 records, a derived match variable consisting of a 2-character prefix from both first and last names combined with date of birth has a sensitivity of 97%; by contrast, an anonymized identifier based on the patient's full names and date of birth has a sensitivity of only 87%, and SSN has a sensitivity of 86%. The approach we describe is most useful in situations where privacy policies preclude the full exchange of the identifiers required by more sophisticated and sensitive linkage algorithms. For data sets of sufficiently high quality, this effective approach, while producing a lower rate of matching than more complex algorithms, has the merit of being easy to explain to institutional review boards, adheres to the minimum necessary rule of the HIPAA privacy rule, and is faster and less cumbersome to implement than a full probabilistic linkage.
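
The derived match variable described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the hash function, name normalization, date format, or any salting, so SHA-256, upper-casing, ISO dates (YYYY-MM-DD), and the optional `salt` parameter are all hypothetical choices.

```python
import hashlib


def match_key(first_name, last_name, dob, salt=""):
    """Anonymous match key: hash of 2-letter name prefixes plus date of birth.

    dob is assumed to be an ISO date string such as "1970-01-31".
    salt is a hypothetical shared secret agreed on by participating sites.
    """
    def prefix(name):
        # Keep letters only, so "O'Brien" and "OBrien" produce the same prefix.
        letters = [ch for ch in name.upper() if ch.isalpha()]
        return "".join(letters[:2])

    raw = prefix(first_name) + prefix(last_name) + dob
    return hashlib.sha256((salt + raw).encode("utf-8")).hexdigest()


# Spelling variation after the first two letters does not change the key.
key_a = match_key("Jonathan", "Smith", "1970-01-31")
key_b = match_key("Jon", "Smyth", "1970-01-31")
same_person_matched = key_a == key_b
```

The same property that tolerates spelling variants also produces false positives: two different people who share a date of birth and both name prefixes receive the same key, which is the trade-off between false positives and false negatives that the abstract evaluates.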

Abstract

Use of terminology standards facilitates aggregating data from multiple sources for information retrieval, exchange and analysis. However, medical vocabularies are continuously updated, and incorporating those changes consistently into clinical data warehouses requires rigorous methodology. To integrate pharmacy data from two hospital pharmacy information systems, the Stanford Translational Research Integrated Database Environment (STRIDE) project mapped medication orders to RxNorm content using the RxNorm drug model. In order to keep the data relevant and up-to-date, we developed a strategy for updating to RxNorm, while preserving the original meaning and mapping of the legacy data. This case study discusses managing the vocabulary update by following the RxNorm content maintenance strategy and supplementing it with operations to retain access to its drug model information.

Abstract

The Stanford Translational Research Integrated Database Environment (STRIDE) clinical data warehouse integrates medication information from two Stanford hospitals that use different drug representation systems. To merge these pharmacy data into a single, standards-based model supporting research, we developed an algorithm to map HL7 pharmacy orders to RxNorm concepts. A formal evaluation of this algorithm on 1.5 million pharmacy orders showed that the system could accurately assign pharmacy orders in over 96% of cases. This paper describes the algorithm and discusses some of the causes of failures in mapping to RxNorm.

Abstract

STRIDE (Stanford Translational Research Integrated Database Environment) is a research and development project at Stanford University to create a standards-based informatics platform supporting clinical and translational research. STRIDE consists of three integrated components: a clinical data warehouse, based on the HL7 Reference Information Model (RIM), containing clinical information on over 1.3 million pediatric and adult patients cared for at Stanford University Medical Center since 1995; an application development framework for building research data management applications on the STRIDE platform; and a biospecimen data management system. STRIDE's semantic model uses standardized terminologies, such as SNOMED, RxNorm, ICD and CPT, to represent important biomedical concepts and their relationships. The system is in daily use at Stanford and is an important component of Stanford University's CTSA (Clinical and Translational Science Award) Informatics Program.

Abstract

Traditionally, genes involved in maturation and aging have been studied in a temporal fashion, by examining gene expression at different time points in an organism's life, as well as by knocking out, knocking in, and mutating genes thought to be involved. Here, we propose an in silico method to combine clinical electronic medical record (EMR) data and gene expression measurements in the context of disease to identify genes that may be involved in the process of human maturation and aging. First, we show that absolute lymphocyte count may serve as a biomarker for maturation by using statistical methods to compare trends among different clinical laboratory tests in response to an increase in age. We then propose using the rate of decay for absolute lymphocyte count across 12 diseases as a proxy for differences in aging. We correlate the differing rates with gene expression across the same diseases to find maturation/aging related genes. Among the 53 genes with strongest correlations between expression profile and change in rate of decay, we found genes previously implicated in the process of aging, including MGMT (DNA repair), TERF2 (telomere stability), POLD1 (DNA replication and repair), and POLG (mtDNA replication).
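
The two-step analysis described above (estimate a per-disease rate of decay, then correlate those rates with gene expression) can be sketched as follows. This is a minimal illustration under assumptions the abstract does not confirm: the decay rate is approximated by an ordinary least-squares slope of lymphocyte count on age, the correlation is Pearson's, and all disease names and values are invented toy data.

```python
import math


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Toy data (hypothetical): lymphocyte counts at four ages, per disease.
ages = [10, 20, 30, 40]
counts = {
    "disease_a": [4.0, 3.0, 2.0, 1.0],
    "disease_b": [4.0, 3.5, 3.0, 2.5],
    "disease_c": [4.0, 3.8, 3.6, 3.4],
}
# Toy expression level of one gene in each disease (hypothetical).
expression = {"disease_a": 1.0, "disease_b": 2.0, "disease_c": 2.6}

diseases = sorted(counts)
rates = [ols_slope(ages, counts[d]) for d in diseases]
r = pearson(rates, [expression[d] for d in diseases])
```

Repeating the last step for every measured gene and ranking genes by the strength of the correlation would yield a candidate list comparable in spirit to the 53 genes reported.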

Abstract

The severity of diseases has often been assigned by direct observation of a patient and by pathological examination after symptoms have appeared. As we move into the genomic era, the ability to predict disease severity prior to manifestation has improved dramatically due to genomic sequencing and analysis of gene expression microarrays. However, as the severity of diseases can be exacerbated by nongenetic factors, examining gene expression alone may be inadequate to predict disease severity. We propose the creation of a "clinarray" to examine phenotypic expression in the form of clinical laboratory measurements. We demonstrate that the clinarray can be used to distinguish between the severities of patients with cystic fibrosis and those with Crohn's disease by applying unsupervised clustering methods that have been previously applied to microarrays.
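
A clinarray can be pictured as a patients-by-laboratory-tests matrix that is treated like a microarray. The abstract does not name the clustering method, so the sketch below assumes per-test z-score normalization followed by average-linkage agglomerative clustering, one of the methods commonly applied to microarrays. The function names and the four-patient matrix are hypothetical.

```python
import math


def zscore_columns(rows):
    """Standardize each column (lab test) to mean 0 and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def average_linkage(rows, k):
    """Merge the closest clusters (average Euclidean distance) until k remain."""
    clusters = [[i] for i in range(len(rows))]

    def dist(a, b):
        total = sum(math.dist(rows[i], rows[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy clinarray (hypothetical): 4 patients x 3 lab tests.
clinarray = [
    [1.0, 10.0, 100.0],
    [1.1, 11.0, 105.0],
    [3.0, 30.0, 300.0],
    [3.2, 29.0, 310.0],
]
groups = average_linkage(zscore_columns(clinarray), k=2)
```

Normalizing each test first keeps measurements with large numeric ranges from dominating the distance, which plays the same role as per-gene scaling in microarray analysis.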