MEDI — Indications extracted from RxNorm, SIDER 2, MedlinePlus, and Wikipedia were integrated into a single resource. The high-precision subset (indications in RxNorm or two other resources) includes 13,304 unique indications for 2,136 medications [2]. Further work added indication prevalence information [3]. MEDI compares favorably to SemRep for extracting indications from clinical text [4].
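The high-precision subset criterion can be sketched as a simple membership test; the function and exact resource names below are illustrative, not taken from the MEDI release:

```python
def in_high_precision_subset(sources):
    """Return True if an indication qualifies for MEDI-HPS.

    `sources` is the set of resources reporting the indication.
    Rule (as described in the paper): present in RxNorm, or in at
    least two of the other resources. Illustrative reconstruction.
    """
    others = {"SIDER 2", "MedlinePlus", "Wikipedia"}
    return "RxNorm" in sources or len(sources & others) >= 2

# Hypothetical indications:
print(in_high_precision_subset({"RxNorm"}))                  # True
print(in_high_precision_subset({"Wikipedia"}))               # False
print(in_high_precision_subset({"SIDER 2", "MedlinePlus"}))  # True
```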

SemRep — "SemRep is a program that extracts semantic predications (subject-relation-object triples) from biomedical free text" [5]. SemRep has been used to extract TREAT relations from MeSH scope notes, DailyMed, DrugBank, and AHFS Consumer Medication Information [6]. SemRep has also been used to identify TREAT relations from Medline abstracts [7]. A project called SemMedDB provides the SemRep results from mining PubMed [8].

SPL-X — Structured Product Labels eXtractor — Using MetaMap, this project extracted indications from DailyMed drug labels that were available as XML [9]. Data does not appear to be available.

SIDER 2 — In addition to extracting side effects from drug labels, SIDER also extracts indications [11]. Since the approach is automated, some side effects may be extracted as indications and vice versa. This approach would only provide information for drugs with labels from the US FDA or Canada.

The initial LabeledIn [1] resource used expert curators. The team behind this project tested crowdsourced curation using Amazon Mechanical Turk workers [2]. They found that the majority vote of workers on whether a disease within a label was an indication had high accuracy (96%).

They assessed 3004 indications not already in LabeledIn corresponding to 706 new drug labels. We are looking to increase the coverage of the initial LabeledIn dataset by adding these crowdsourced indications.
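The majority-vote aggregation used to accept crowd judgments can be sketched as follows; the worker responses and the tie-breaking choice are illustrative, not from the paper:

```python
from collections import Counter

def majority_vote(judgments):
    """Aggregate boolean worker judgments on whether a disease
    mention in a label is an indication. Ties count as False
    (a conservative choice made here, not necessarily the paper's)."""
    counts = Counter(judgments)
    return counts[True] > counts[False]

# Three workers review one (label, disease) pair:
print(majority_vote([True, True, False]))  # True
print(majority_vote([True, False]))        # False (tie -> conservative)
```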

Hey @b_good, thanks for the suggestion [1] and tracking down the data supplement, which I cannot find on the article's JAMIA page. Hereon, I will refer to this resource as ehrlink, unless anyone can find a previously-used or author-preferred nickname.

This resource is noteworthy because it will capture off-label usages better than LabeledIn (which is explicitly on-label) and MEDI (whose inclusion criteria likely favor on-label indications).

You can access SemRep-extracted semantic relations (e.g. treats, causes) based on all PubMed abstracts (updated bi-annually) via the Semantic MEDLINE database. With a UMLS login, you can get the complete MySQL dump via http://skr3.nlm.nih.gov/SemMedDB/ . The main challenge here is ensuring quality (as with any NLP output).
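As a sketch of how TREATS triples might be pulled from the SemMedDB dump, here is a toy query against an in-memory SQLite stand-in; the PREDICATION column names follow the SemMedDB documentation, but verify them against the release you download:

```python
import sqlite3

# SemMedDB ships as a MySQL dump; its PREDICATION table stores one row
# per extracted triple. We mimic a tiny slice of that schema in SQLite
# to show the query shape.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE PREDICATION (
    PMID TEXT, PREDICATE TEXT,
    SUBJECT_CUI TEXT, SUBJECT_NAME TEXT,
    OBJECT_CUI TEXT, OBJECT_NAME TEXT)""")
con.executemany(
    "INSERT INTO PREDICATION VALUES (?, ?, ?, ?, ?, ?)",
    [  # toy rows, not real SemMedDB content
        ("100001", "TREATS", "C0004057", "aspirin",
         "C0027051", "myocardial infarction"),
        ("100002", "CAUSES", "C0004057", "aspirin",
         "C0017181", "GI hemorrhage"),
    ])

# Pull candidate indications: drug-disease pairs linked by TREATS,
# with a crude support count (number of distinct PMIDs).
rows = con.execute("""
    SELECT SUBJECT_NAME, OBJECT_NAME, COUNT(DISTINCT PMID) AS n_pmids
    FROM PREDICATION
    WHERE PREDICATE = 'TREATS'
    GROUP BY SUBJECT_CUI, OBJECT_CUI
""").fetchall()
print(rows)  # [('aspirin', 'myocardial infarction', 1)]
```

Counting supporting PMIDs per pair gives one simple handle on the quality problem mentioned above: pairs asserted in many abstracts are less likely to be extraction noise.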

Daniel Himmelstein: Added the reference to my initial post. Given the quality issues, I do not plan to include this resource in our gold standard set of indications. It could be helpful later as a literature-derived set of potential indications.

ehrlink problem and medication vocabularies

We have extracted the ehrlink [1] indication data (see above). Unfortunately, I am unfamiliar with the identifiers used for problems (diseases) and medications (drugs). I've posted a sampling below in case anyone can figure them out.

| problem_definition_id | problem |
|---|---|
| 63645 | Complete D-transposition Of The Great Vessels |
| 258894 | Acromegaly |
| 275590 | Organic REM Sleep Behavior Disorder |
| 62983 | Arteriosclerotic Cardiovascular Disease (ASCVD) |
| 75090 | Cerebral Palsy |

| medication_definition_id | medication |
|---|---|
| 17938 | Sodium Polystyrene Sulfonate Oral Powder |
| 21707 | Clotrimazole Anti-Fungal 1 % External Cream |
| 18805 | Niacin CR 1000 MG Oral Tablet Extended Release |
| 19598 | ClonazePAM 0.5 MG Oral Tablet |
| 136143 | AmLODIPine Besylate 2.5 MG Oral Tablet |

My worry is that these identifiers may not correspond to a standardized vocabulary that we can access and easily map to. I will contact the authors for clarification.

Just to let you guys know that, at UNM, Oleg Ursu and I have been constructing such a catalog for nearly eight years. Unfortunately, nobody funds this type of activity - or at least nobody has funded it so far - thus resources are somewhat limited.

Briefly, we manually curated all the active pharmaceutical ingredients (APIs, over 4,400; includes biologics), and mapped them to FDA approved drug labels (over 50,000 ADLs). From the ADLs one can extract/map indications, contra-indications, off-label indications... and to each API we mapped RxNorm [CUI], NPC, ATC, INN, plus targets, including numeric bioactivity & type [MoA related; non-MoA assigned; as well as non-human targets]. We also mapped all our diseases to DOIDs - however, there are about 800 or so left that will take us a while to map.

A few pointers:

1. If you want to extract the data yourselves, you're in for a treat. There are diseases in "indications" that do NOT exist anywhere else [e.g., cancer XYZ with mutation A3999B; in other words, it's not enough to have the disease, you need the right genotype!].
2. You also have to deal with indications that are "fringe" (pregnancy is not a disease; neither is contraception).
3. Indications etc. are not from PubMed - so please pay attention to approved labels.
4. Disease modifying is far from trivial - you need epi to show you that, X years after the Dx/Rx event, there was no recurrence [are steroids in anti-allergy disease modifying? probably not; are antibiotics in sinusitis disease-modifying? yes, and no if it's chronic!].

In the JAMIA paper mentioned above, we used what we called a crowdsourcing approach to get this data. We have recently validated that approach at another site, and that publication is coming out in ACI soon. Unfortunately, in the original version, as you suspected, our medications and problems were not mapped to any standardized terminology. The identifiers are local to the EHR, and while we have made some attempts to map them to RxNorm and SNOMED-CT, we were never able to get a really accurate set. However, the validation uses data from a different EHR, which I believe can be more easily mapped. Once the paper is out, I'll see if I can share that data.

I find crowdsourcing useful when you use a team of experts. So, for example, a carefully selected team of experts, when working on the same problem, can give surprisingly interesting feedback on an otherwise difficult problem: http://www.nature.com/nchembio/journal/v5/n7/abs/nchembio0709-441.html

Please note that this paper is not about data entry, but about polling experts for their opinion.

I professionally supervised data entry for chemical structures, chemical bioactivities, as well as controlled vocabulary descriptions for assays, indexing medicinal chemistry literature. The average trained person loading data had an error rate of 5-10% - errors varied with period (e.g., the closer to the deadline, usually Christmas, the worse the quality). We used a 3-layer quality control system. And even so, we had a 1-2% error in our database, as revealed by comparison with two other systems. See this paper http://pubs.acs.org/doi/abs/10.1021/ci400099q for details (mine is the WOMBAT database).

With this in mind, I want to point out that crowdsourcing problem–medication pairs by clinicians is an intriguing effort, and if the data is publicly available I would like to learn more. There are risks because a) verification of data entry was probably not done at the entry level (was the clinician familiar with both the drug and the disease?); b) the person determining the problem would require training in pharmacovigilance, understanding of known side-effects, etc. I assume you have done that, and that you compared the sets? I apologize that I do not have time to access your papers right now.

To clarify the crowdsourcing approach, in our study the clinicians are completing the task because it is required during routine care, not solely for the purpose of creating a knowledge base. They are entering the data into the EHR because they are prescribing a medication to a patient and are often required to link it to one or more of the patient's problems for billing purposes. We did not ask them to do any additional work outside of their own routine clinical practice.

My colleagues and I have worked on multiple approaches to create this knowledge in the papers [1, 2, 3].

@allisonmccoy, thanks for the references. I like your approach because it captures what clinicians are actually using to treat diseases (and can provide indication prevalence — what percent of patients with problem X receive medication Y). Too bad that the identifiers are local. We would definitely appreciate the validation data when available, especially if it can be mapped to standard terminologies.

In terms of the mappings from the aforementioned study [2], we still may be able to extract some utility: for example, we could manually map indications for diseases where our indications were lacking. @allisonmccoy, did any of the other papers you highlighted release data that could add value here?

@TIOprea mentioned the difficulty of identifying disease-modifying indications, even in a carefully hand-curated database. @allisonmccoy, does your method favor disease-modifying links? For example, if modafinil were prescribed to treat MS-induced fatigue, would the clinicians link modafinil to multiple sclerosis or fatigue?

The 2nd reference uses RxNorm, SNOMED-CT, and NDF-RT, all of which are freely available, so that knowledge base could easily be regenerated by another party.

Does your method favor disease-modifying links? For example, if modafinil were prescribed to treat MS-induced fatigue, would the clinicians link modafinil to multiple sclerosis or fatigue?

It could be either, but more than likely it would be linked to MS, because that's what would be on the problem list already and easily linked during e-prescribing, but in our evaluation, we would have counted either as correct. We actually had a lot of discussion about this while doing the evaluations, because it did occur frequently.

@TIOprea, thanks for your insights. You touch on important points. In general our method may not require a perfect indication catalog to succeed, so I am hopeful despite the difficulties you mention. Specifically,

There are diseases in "indications" that do NOT exist anywhere else [e.g., cancer XYZ with mutation A3999B, in other words it's not enough to have the disease, you need the right genotype!]

In this case, "cancer XYZ with mutation A3999B" would likely not be in the Disease Ontology, and if it were, it would probably lack cross-references. However, if the disease did map to the DO, we would propagate the indication to "cancer XYZ".

you also have to deal with indications that are "fringe" (pregnancy is not a disease; neither is contraception)

These indications would not make it into the network because they do not relate to an included disease term. Information loss is ); but we'll get over it (;

indications etc. are not from PubMed - so please pay attention to approved labels

Thanks for the perspective. We won't include these as part of our gold standard.

disease modifying is far from trivial - you need epi to show you that

This I think will be the biggest difficulty. One option could be to exclude drugs that mostly treat symptoms. We noticed that drugs with many indications tended to be of this category. For multiple sclerosis, disease modifying is an established concept with currently 12 drugs. Unfortunately, the MS indications we've extracted from MEDI and LabeledIn are predominantly symptomatic. And to make matters worse, for most other diseases the DM status seems much more poorly defined.

The 2nd reference [1] uses RxNorm, SNOMED-CT, and NDF-RT, all of which are freely available, so that knowledge base could easily be regenerated by another party.

@allisonmccoy, I believe that when MEDI [2] extracts RxNorm indications, it is taking information from the NDF-RT. My belief is based on the introduction, where they state:

The integration of RxNorm with the National Drug File–Reference Terminology (NDF-RT) from the Veterans Health Administration has added significant indication information between single-ingredient medications and diseases through ‘may_treat’ and ‘may_prevent’ therapeutic relationships. NDF-RT includes both on-label and off-label indications, but its performance on indications has not been previously reported. Preliminary work with earlier versions of RxNorm and NDF-RT demonstrated that a number of medications were lacking indications.

Then in the methods they state:

To obtain indications of a medication from RxNorm, we retrieved all diseases that connect with the medication through either ‘may_be_treated_by’ or ‘may_be_prevented_by’ relationships.

Do you know whether the RxNorm portion of MEDI relied on the same underlying NDF-RT data that you collected for the 2011 AMIA Proceedings Paper [1]?

Do you know whether the RxNorm portion of MEDI relied on the same underlying NDF-RT data that you collected for the 2011 AMIA Proceedings Paper [1]?

@dhimmel We only used the may_treat relationship, but we also took advantage of the is_a hierarchy for problems and ingredient_of relationships between medications and expanded the original set of pairs. So there is some overlap between the two, but likely some pairs that exist in only one or the other.

Daniel Himmelstein: Thanks for the clarification. We also plan to perform some indication propagation on the Disease Ontology hierarchy.
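Such propagation up an is_a hierarchy can be sketched as follows; the terms and edges below are toys, and a real run would parse the Disease Ontology's OBO release:

```python
# Minimal sketch of propagating indications up an is_a hierarchy.
PARENTS = {  # child term -> parent terms (toy Disease-Ontology-style edges)
    "relapsing-remitting MS": ["multiple sclerosis"],
    "multiple sclerosis": ["demyelinating disease"],
    "demyelinating disease": [],
}

def ancestors(term):
    """All terms reachable by following is_a edges upward."""
    seen, stack = set(), [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(indications):
    """Copy each (drug, disease) pair onto every ancestor disease."""
    out = set(indications)
    for drug, disease in indications:
        out |= {(drug, anc) for anc in ancestors(disease)}
    return out

pairs = {("interferon beta-1a", "relapsing-remitting MS")}
print(sorted(propagate(pairs)))  # 3 pairs: the original plus 2 ancestors
```

Whether to propagate all the way to very general terms (here, "demyelinating disease") is a design choice; a depth cutoff may be warranted.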

PREDICT Indications

An existing computational repurposing approach called PREDICT [1] compiled indications for its analysis. They describe their approach as:

The associations between drugs and UMLS disease concepts were integrated from four different sources using three different methods: (i) direct mapping to drugs, exploiting embedded UMLS links between concepts and drugs; (ii) drug–condition associations downloaded from http://drugs.com, where conditions were mapped to UMLS concepts using MetaMap; and (iii) indication‐based mapping. For the latter, we extracted UMLS concepts using the MetaMap tool from textual drug indications downloaded from FDA package inserts (available in the DailyMed website, http://dailymed.nlm.nih.gov) and DrugBank. In addition, we manually added 44 associations occurring in phase IV (post‐marketing) clinical trials.

... Finally, performing a manual curation of the extracted UMLS concepts from textual description of drug indications, we observed that they are more prone to false positives. We thus required that associations extracted from drug indications appear also in at least one more source.

Compounds are from DrugBank and diseases are from OMIM and the UMLS, which are both cross-referenced by the DO. The study does not report the precision of its indications, making it difficult to assess how their quality compares with MEDI-HPS and LabeledIn.

We combined the supplementary datasets from the study to create a table of PREDICT indications (notebook, download). We will further investigate including these indications.

We anticipate constructing our gold standard of indications from MEDI-HPS, LabeledIn, and PREDICT while omitting MEDI-LPS, which has a lower precision. We did not include ehrlink [1] because the vocabularies were not mapped. However, we would happily reward anyone who contributes a mapping of the problems to the DO and the medications to DrugBank.

I spent some time on your problem of mapping the drug names from an arbitrary local system to a known ontology. As a matter of fact, RxNorm provides an API with an endpoint that can be queried directly for fuzzy matching - so that's useful. It will be helpful to look into the API's other endpoints down the road; they provide many useful features (though they are poorly documented).

I wrote a script (in R) to match all the medication names in your file and get the related properties of the retrieved Rx concepts. It took over an hour to run because of the pauses needed to avoid exceeding API rate limits. The fuzzy-matching API returns several rxcui matches for each medication. A score, ranging from 0 to 100, is attached to every match.
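For anyone repeating this in Python, the matching step might look like the sketch below. The URL targets RxNav's documented approximateTerm endpoint, and the mocked response mirrors its JSON shape, but both the response structure and the toy rxcui values should be verified against a live call:

```python
from urllib.parse import urlencode

def approximate_term_url(name, max_entries=3):
    """Build a query URL for RxNav's approximate-match endpoint."""
    query = urlencode({"term": name, "maxEntries": max_entries})
    return f"https://rxnav.nlm.nih.gov/REST/approximateTerm.json?{query}"

def best_candidates(response):
    """Keep only the top-scoring rxcui matches from a parsed response."""
    candidates = response["approximateGroup"].get("candidate", [])
    if not candidates:
        return []
    top = max(int(c["score"]) for c in candidates)
    # De-duplicate: the API can return several entries per rxcui.
    return sorted({c["rxcui"] for c in candidates if int(c["score"]) == top})

print(approximate_term_url("ClonazePAM 0.5 MG Oral Tablet"))

# A mocked response with the documented structure (toy values):
mock = {"approximateGroup": {"candidate": [
    {"rxcui": "197527", "score": "100"},
    {"rxcui": "197527", "score": "100"},
    {"rxcui": "308047", "score": "67"},
]}}
print(best_candidates(mock))  # ['197527']
```

A medication is unambiguous in this scheme when `best_candidates` returns exactly one rxcui.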

The main output file can have several concepts per medication name if (i) there is ambiguity, i.e. there is more than one best match for a medication, or (ii) the best match is imperfect, i.e. the best score is not 100 (then the first three are reported). The final output is a subset of this file with only the huge majority of unambiguous hits (and we thus have one concept per medication string).

Here are some numbers:

1. Only 2,353 medications out of the initial 2,537 got matched with a valid concept. Some names don't correspond to any medication and are filtered out.
2. Of these 2,353 matched medications, 2,281 (97%) have an unambiguous first match. These are in the final output.
3. These 2,281 unambiguous hits match to a total of 2,148 different rxcuis.
4. 1,490 (63%) medications have at least one perfect match, with 1,471 (63%) being unambiguous.
5. These 1,471 unambiguous perfect hits match to a total of 1,442 different rxcuis.

QC is straightforward: compare the original medication names against the retrieved name of each matched rxcui. I quickly checked and even the non-perfect matches (with a score below 100) seem on point, with the exception of the "therapies", which have very few equivalents in RxNorm and definitely match to the wrong concepts.

Potential future directions:

1. Assess the quality of the matches through a systematic check based on the QC file mentioned above.
2. Enrich the final dataset by resolving ambiguity from the term types reported in the rxcui properties.

UPDATE: I went forward with resolving the ambiguity, using the term source and type, and then the number of "atoms" matching each medication name.

This brings the number of remaining ambiguous medications down from 72 to 11 (0.5%).

I understand you want to extract the ingredients from these concepts, so it doesn't necessarily matter that there are two "top" matches for one medication after trying to resolve ambiguity (both will likely lead to the same components). As a result, I created both a file of the successfully resolved matches and a file of all best matches after attempted resolution. The latter covers 100% of the medications, including the 11 ambiguous ones, for which I arbitrarily took one of the top concepts. This is the file you'll want to work from in the future.

Expert curation of the indication catalog

We have decided to filter our catalog for disease-modifying indications and are seeking an expert curator to assist with this task. We started a new discussion for this next step.

@allisonmccoy, have you thought more about releasing the data from your recent publication [1]? If you can do this in the next week or two, we would be thrilled to include this data. Otherwise we will have to move ahead with only the ehrlink data from your initial study [2].

Therapeutic Target Database

The Therapeutic Target Database (TTD) is a target-focused resource with pharmacological relationships. @janispi suggested we check out TTD as a source of drug–disease therapies.

Specifically, TTD has a dataset of indications, which range from approved to investigational, available online (drug-disease_TTD2016.txt). I couldn't find how these indications were constructed from their publications [1, 2, 3, 4, 5], although I may have missed it. I emailed Professor Yu Zong (csccyz@nus.edu.sg), who indicated their drug-disease relationships were human curated.

Just wanted to note this information, so we remember to keep TTD in mind.

The mapping between diseases and drugs was done manually. We searched different sources of literature such as pharmacology textbooks, review articles, and research papers. The methods to extract the related drug target and disease information from literature were described in the 2012 version of the TTD update paper [3]. We mapped the disease information to ICD codes in the 2014 update of TTD [4].

Cheng et al 2014

A 2014 study titled "Systematic evaluation of connectivity map for disease indications" compiled 890 indications between 152 drugs and 145 diseases [1]. They compiled the indications from FAERS and Pharmaprojects. The indications are available as free text in Table S2 of the supplementary word document. I copied Table S2 into a TSV available here.

repoDB

repoDB "contains a standard set of drug repositioning successes and failures that can be used to fairly and reproducibly benchmark computational repositioning methods." repoDB data was extracted from DrugCentral and ClinicalTrials.gov.

The data is available on figshare [2] under a CC BY 4.0 license and contains "1,571 drugs and 2,051 UMLS disease concepts, accounting for 6,677 approved and 4,123 failed drug-indication pairs." This dataset will be useful for distinguishing clinical trial indications that result in the following statuses: "Approve, Program Terminated, Not Approved, or Trial Halted". Unfortunately, there is no machine-readable way to determine whether the failures resulted from lack of efficacy.
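Splitting the repoDB pairs into approved and failed sets could look like the sketch below; the column names and status values here are assumptions for illustration, so check the header of the figshare download before reusing them:

```python
import csv
import io

# Toy stand-in for the repoDB CSV; real column names may differ.
sample = io.StringIO("""\
drug_name,ind_name,status
sildenafil,pulmonary hypertension,Approved
sildenafil,angina pectoris,Terminated
""")

approved, failed = set(), set()
for row in csv.DictReader(sample):
    pair = (row["drug_name"], row["ind_name"])
    (approved if row["status"] == "Approved" else failed).add(pair)

print(sorted(approved))  # [('sildenafil', 'pulmonary hypertension')]
print(sorted(failed))    # [('sildenafil', 'angina pectoris')]
```

As noted above, the failed set conflates lack of efficacy with other reasons for termination, so it should be treated as "not approved for this indication" rather than "ineffective".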

RepurposeDB

The RepurposeDB database was recently released by the Dudley Lab [3]. RepurposeDB aims to catalog historical instances of drug repurposing in a machine readable and standardized format.

The database is, at its core, a collection of triples, each consisting of a:

drug

primary indication

secondary indication

where primary/secondary indication is defined as:

Primary indication refers to the original disease indication for which the drug is targeted, and secondary indication indicates any subsequent indications

The first version (v1) of the resource (dated March 30, 2016) contains 253 drugs (188 small molecules & 65 biologics), 1,125 indications, and 3,660 data triples. The code is available on Bitbucket and some data is available on figshare [4]. Unfortunately, the actual catalog of triples is not available on figshare, but the authors provided us with the datasets without any restrictions attached (i.e. licensed under CC0).

Drug-Indication Database

Scientist(s) at Merck created the drug-indication database (DID) by integrating 12 resources [5]. While some of this resource is proprietary, much of it has been released via figshare under a CC BY license [6]. The dataset with indications looks a bit complex, but with some munging there are likely many good indications in there.