International Clinical Trials, Winter 2012

Connecting Up

Christopher Southan of ChrisDS Consulting and Hilary Stephenson of Sigma Consulting Solutions Limited take a close look at the name space and molecular mappings of the drug interventions recorded in the database, and find that not all the dots are joined up correctly

The trial registry, launched more than a decade ago, currently lists 117,641 trials, is being populated at the rate of around 150 new trials each week, and receives 65,000 visitors per day. This is clearly a landmark achievement, not only for public access to efficacy, comparisons and safety evaluations of new medicines, but also in advancing new treatments for existing drugs and their combinations. Initially a US initiative, it now encompasses trials from 178 countries and has also seeded the emergence of other resources along the same lines. While a recent review has highlighted general data issues, this article focuses on the features and connectivity of the web interface specifically for identifying the names and molecular details of the drugs specified as interventions (1). It also considers these findings in the wider context of the challenge of resolving the different types of drug names against chemical structures and their associated clinical data in the public domain.


This article can only give a brief introduction to drug names, but it is precisely this information space, and its associated ambiguities, that users have to navigate in the trial records. Three basic types of name are used. Company codes are the first external names assigned to candidate drugs as they progress towards clinical trials, for example Parke-Davis CI-981 in the 1980s. This eventually had the WHO International Non-Proprietary Name (INN) approved as ‘atorvastatin’, which later became known by the trade name Lipitor®. However, the US and UK non-proprietary names (USAN and BAN) are actually ‘atorvastatin calcium’. Querying with any of these four names returns (the same) 480 entries, since they execute a synonym look-up at the NLM Drug Information Portal (DIP). Links to this are also assigned as the primary mappings of drug names in the trial records. The consequences of this will be expanded on below, but it immediately raises the molecular mapping dilemma of ‘what Lipitor is’ in the choices between atorvastatin parent monomer (PubChem CID 60823), calcium salt (CID 11227182), hemi-calcium (CID 60822) or, what is actually depicted on the FDA-approved packet insert, hemi-calcium trihydrate (CID 656846). The wider manifestation of this problem of drug molecular identity has been reported in a large comparative database study that found a significant degree of non-overlap in the structures for approved drugs represented in different databases (2).
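The name-to-structure dilemma can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how DIP is implemented: the `SYNONYMS` table and the `resolve` and `is_ambiguous` helpers are hypothetical, while the names and PubChem CIDs are those quoted in the paragraph above.

```python
# Hypothetical synonym table: each of the four names collapses to one concept.
SYNONYMS = {
    "ci-981": "atorvastatin",
    "atorvastatin": "atorvastatin",
    "atorvastatin calcium": "atorvastatin",
    "lipitor": "atorvastatin",
}

# Candidate PubChem CIDs for the concept, as quoted in the text.
CANDIDATE_CIDS = {
    "atorvastatin": {
        "parent": 60823,
        "calcium salt": 11227182,
        "hemi-calcium": 60822,
        "hemi-calcium trihydrate": 656846,
    }
}


def resolve(name):
    """Return (concept, candidate structures) for a drug name, or (None, {})."""
    concept = SYNONYMS.get(name.strip().lower())
    if concept is None:
        return None, {}
    return concept, CANDIDATE_CIDS.get(concept, {})


def is_ambiguous(name):
    """True when a name resolves to more than one candidate structure."""
    _, cids = resolve(name)
    return len(cids) > 1
```

In this sketch all four names resolve to the same concept, which matches the identical query results, yet the concept still offers four candidate structures. The synonym look-up answers 'which trials' but not 'which molecule'.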


Before drilling down to their names, the first challenge for drug identification is to separate out authentic pharmacological drug trials from a top-level ‘intervention’ category that includes surveys, education and interviews. The site also indexes 32 trials with ‘pomegranate’ in various forms and combinations. This is classified as a drug in seven of these, the rest being dietary supplements, but there is no record match in DIP. Using the advanced query interface, the following triage was performed: ‘interventional’ = 81 per cent, followed by inclusion of the term ‘drug’ = 62 per cent, Phases 1-4 = 52 per cent, and ‘from industry’ = 29 per cent (34,600 studies from a search of drug interventional studies Phase 1, 2, 3, 4, industry).

To produce a shorter list for examining the molecular identity of the interventions by inspecting individual records, we selected Phase 2 or 3, received on or after 26/11/2011. Checking just the first 20 gave the following crop of company code names for the drug intervention arms: EP-100, Lu AA21004, LDE225, TC-5619, TPI 287, ACT-129968, XEN402, RO4995855 and FX006. The problem here is that (although this is not a representative sampling of the whole database) around 45 per cent of these names cannot be mapped to a molecular structure. While DIP-to-PubChem will lag behind the most recent code name-to-structure pairings in the literature (while the medical subject headings (MeSH) and PubChem catch up), the main cause of these false-negative mappings is competitive obfuscation by delaying (these had already reached Phase 2 or 3) the public declaration of the link between drug candidate and the chemical structure already published in patents. This traditional practice seems increasingly anachronistic, not only in the context of the imperative to ‘open up’ clinical and translational data, but also in that trial authorisations (and inclusion in the database) appear not to be predicated on the approval of an INN. Such an approval would include a depiction of the new chemical entity (NCE) in the WHO-issued PDF as a molecular structure image linked to a systematic chemical name. The results of any drug trial are of little external use without this mapping.


The entry NCT01483014 is a Phase 2 study of imatinib mesylate for the neoadjuvant treatment of patients with gastrointestinal stromal tumours. The key primary mappings are name links to DIP (see Figure 1). Although the trade name Gleevec is not specified in the record, it can be found under the ‘show more names’ tab in DIP. There are 14 outlinks for imatinib including the molecular mappings to PubChem.

Inspection of the URL for the PubChem link shows that this is expedited via a CAS number look-up query against PubChem Substance – the submissions to PubChem Compound. This raises two issues. The first is the use of a molecular identifier in a public database that can only be resolved against an extrinsic commercial source. The second is that this is also used by many independent submitting sources to PubChem with no guarantee their derived assignments are correct. Indeed, we can track a double mismapping in this case because the 10 substances collapse to two compound records, for both the parent (CID 5291 = CAS number 152459-95-5) and the mesylate salt (CID 123596 = 220127-57-1). The reason is that two substance records use the wrong (parent) CAS number for the mesylate. In addition, the DIP link for the imatinib parent actually links to the mesylate via the same CAS number. This might seem pedantic, but it should be noted that salt forms have a major influence on both in vitro and in vivo drug properties (3). Because the primary DIP mapping defaults to both parent and salt forms, this type of molecular naming ambiguity is widespread in the database. To be fair, this is partly an extrinsic problem arising from the not uncommon practice of formally assigning the INN to a parent structure, while the USAN or BAN may be assigned to one or more salts.
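A CAS-based look-up that collapses to more than one compound can be detected mechanically. The following Python sketch groups substance records by CAS number and flags any number that resolves to several CIDs. It is a minimal illustration under stated assumptions: the SID labels are hypothetical placeholders and the record list is not the actual set of 10 submissions, although the CAS numbers and CIDs are those given in the paragraph above.

```python
from collections import defaultdict

# CAS numbers and CIDs as quoted in the text.
PARENT_CAS, MESYLATE_CAS = "152459-95-5", "220127-57-1"
PARENT_CID, MESYLATE_CID = 5291, 123596

# Illustrative substance records: (SID label, CAS number, CID).
# The SID labels are hypothetical placeholders, not real PubChem SIDs.
SUBSTANCES = [
    ("SID-A", PARENT_CAS, PARENT_CID),
    ("SID-B", MESYLATE_CAS, MESYLATE_CID),
    ("SID-C", PARENT_CAS, MESYLATE_CID),  # parent CAS wrongly given for the salt
    ("SID-D", PARENT_CAS, MESYLATE_CID),  # second record with the same error
]


def cids_by_cas(records):
    """Map each CAS number to the set of compound IDs it resolves to."""
    mapping = defaultdict(set)
    for _sid, cas, cid in records:
        mapping[cas].add(cid)
    return dict(mapping)


def ambiguous_cas(records):
    """Return the CAS numbers that resolve to more than one compound."""
    return {
        cas: sorted(cids)
        for cas, cids in cids_by_cas(records).items()
        if len(cids) > 1
    }
```

With these illustrative records, only the parent CAS number is flagged, because it reaches both the parent and the mesylate compound.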

This next study example (NCT00648401 single-dose fed bioequivalence study of verapamil HCL versus Verelan®) is an older trial, but was updated on December 11 2011. This shows not only salt multiplexing, but also the more serious issue of false positives. By definition, a bioequivalence study compares the same chemical entities, but we have no fewer than six DIP-linked drug names in this entry, including dexverapamil, diltiazem, diltiazem hydrochloride and diltiazem malate. While dexverapamil can be resolved as a synonym for verapamil, the DIP-to-PubChem links to this (CID 2520) and the hydrochloride were missing. However, while diltiazem is also a calcium antagonist, it was not used in this study. Not only is the structure different (CID 39186) but also, because of an error from one PubChem deposition (SID 103820682), the CAS number 42399-41-7 is linked to a second (wrong) structure as CID 292486. Leaving aside the two salt forms, the question arises as to how diltiazem, as a false positive, became linked to this study at all. From the DIP link ‘back’ to the registry we see 83 studies from a search with diltiazem. Some of these include mentions of verapamil and further checking establishes they actually all co-retrieve as names but are not true alternative or combination studies. The origin of this erroneous coupling is MeSH term sets in common between these entries.
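One way to surface this kind of false positive is to compare the drug names linked to a record against the synonym set of the intervention actually declared. The Python sketch below is an assumption-laden illustration, not the registry's linking logic: `SYNONYM_SETS` and `false_positive_links` are hypothetical, and the synonym set is a hand-built stand-in for a curated vocabulary.

```python
# Hypothetical synonym sets; a real system would draw these from a
# curated vocabulary rather than a hand-written dictionary.
SYNONYM_SETS = {
    "verapamil": {
        "verapamil",
        "verapamil hydrochloride",
        "dexverapamil",
        "verelan",
    },
}


def false_positive_links(declared, linked_names):
    """Return linked names that match no declared intervention's synonyms."""
    allowed = set()
    for drug in declared:
        key = drug.lower()
        allowed |= SYNONYM_SETS.get(key, {key})
    return sorted(n for n in linked_names if n.lower() not in allowed)


# Linked names named in the text, plus 'verapamil' itself (assumed).
LINKED = [
    "verapamil",
    "dexverapamil",
    "diltiazem",
    "diltiazem hydrochloride",
    "diltiazem malate",
]
```

Calling `false_positive_links(["verapamil"], LINKED)` returns the three diltiazem names, which is the mismatch a curator would then need to review.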


As indicated, this resource is embedded in what could be termed a US Government ‘health hub’ that includes the Department of Health and Human Services, National Library of Medicine and National Institutes of Health, FDA Resources on Drugs and Devices, PubChem, PubMed and others. While the integration and cross-referencing between these resources generally works, as we illustrate in the examples above, the specificity degrades because of error-prone or fuzzy linking rules. This can result in name-to-identifier-to-structure mapping spaghetti that is difficult even for an expert to unravel. In addition, we encountered some obvious gaps. For example, it was possible to spawn the list of 11,730 PubMed abstracts for publications related to clinical trial entries and link these to 3,424 PubChem compounds via MeSH. However, there is no query that currently can be performed in PubChem to answer the important question ‘which compounds have (direct) links to clinical trial drug entries’. This is likely to be significantly below 3,424, not only because of parent-salt duplicates, but also because some links are coming through TOXLINE.

The extension of cross-references to databases outside the US Government constellation (or ‘linked data’ as we might refer to it in the semantic web context) is both good news and bad. The good news is obviously the valuable extensions of other databases. For example, the ChEMBL database of activities extracted predominantly from the medicinal chemistry literature can make the important connection, via the trial identifier, between candidate and approved drugs, their in vitro assay data, protein targets, in vivo pharmacology and translation into clinical trial results. Thus CHEMBL1487 for atorvastatin has 116 assay results and links through to the 480 clinical trials. Interestingly, the Wikipedia entry has no direct link from atorvastatin to the registry but at least two indirect ones via ChEMBL and DIP. Paradoxically, the entry for pomegranate actually refers to the 32 clinical trials but also without a link. The bad news with database linking, as we all know, is that errors propagate instantly, relentlessly and globally. Thus, diltiazem has four ChEMBL compound name matches, CHEMBL23, CHEMBL1697 and CHEMBL1200805, as the parent, hydrochloride and malate respectively, each of which includes the same 88 clinical trials links as verapamil from CHEMBL197 (for the record the fourth match is to N-desmethyldiltiazem as CHEMBL1743343). As another route of error propagation, it is certain that the registry data is not only incorporated into many commercial clinical information products, but also that some pharmaceutical companies will add outlinks from their own databases. Depending on exactly what rules are applied during these data integration efforts, it is possible that some of the problems we describe may go unnoticed and simply be imported.

As mentioned, the success of the registry has inspired a range of analogous databases extending to national health authorities and company-specific sites. The largest of these is the WHO International Clinical Trials Registry Platform, with 145,323 entries from 13 national registries, which incorporates both a download from the registry and outlinks to it. A difference in the data model is that the drug entry fields appear to be filled out manually, as evidenced by typos such as ‘atrorvastatin’. Nevertheless, while these queries also include synonym set cross-matches, we now find 38 trials for verapamil, 27 for diltiazem (now uncoupled) and 462 for atorvastatin.


In our experience, high-value biomedical resources without data issues don’t exist. Consequently, we would like to emphasise that our analysis should be taken as a criticism neither of the database team nor of those who submit entries. As has been pointed out, the usefulness of the registry depends on the diligence with which complete, timely, accurate and informative data is submitted (1). We would add the equal importance of the underlying database model, occupancy and cross-mapping rules.

While our observations are gleaned from interrogating the web front end rather than the underlying schema, we have discerned a range of problems related to the precise molecular identities of the drug interventions. It should be emphasised that these are by no means restricted to just this database. As clinical data integration and mining efforts are being intensively pursued, particularly to enable translational research, we can expect the informaticians involved to explore technical options to ameliorate some of the shortcomings described. However, with non-specialist users, for whom this front end is also designed, the possibility remains that errors and ambiguities could have consequences. While hesitant to make prescriptive recommendations and acknowledging this to be a counsel of perfection, we do suggest a ‘three-way iteration’. What we mean here is the use not only of a controlled vocabulary (as part of the data input format) but also that a specialist on the database side iterates directly with the principal investigator in filling out the entry. For example, an expert-to-expert curatorial handshake could choose the correct PubChem CID, and the corresponding InChI key, to be specified in the drug field. This would provide not only a globally open and Google-indexable chemical structure cross-reference, via InChI key look-up, but could also be linked as a direct PubChem source (‘these CIDs have trial data’). In all respects the news that the Clinical Trials Transformation Initiative (CTTI) is developing a restructured and reformatted database to expedite analysis of the aggregate data from the registry is clearly welcome (4).
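A drug field carrying an explicit structure reference could be validated at data entry. The Python sketch below shows one possible shape for such a field. It is an assumption throughout: the `DrugField` class is hypothetical and does not reflect any registry schema, and the check only confirms that an InChI key has the standard 27-character layout, not that it matches the CID.

```python
import re
from dataclasses import dataclass

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


@dataclass(frozen=True)
class DrugField:
    """Hypothetical curated drug entry with an explicit structure reference."""

    name: str
    pubchem_cid: int
    inchikey: str

    def __post_init__(self):
        # Reject obviously malformed identifiers at the point of entry.
        if self.pubchem_cid <= 0:
            raise ValueError("PubChem CID must be a positive integer")
        if not INCHIKEY_RE.fullmatch(self.inchikey):
            raise ValueError(f"malformed InChIKey: {self.inchikey!r}")


# The key below is a syntactically valid placeholder, not a real InChIKey.
example = DrugField("imatinib mesylate", 123596, "AAAAAAAAAAAAAA-BBBBBBBBSA-N")
```

A format check of this kind catches typing errors only. Confirming that the key and the CID describe the same structure would still need the curatorial step suggested above.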

References
  1. Zarin DA, Tse T, Williams RJ, Califf RM and Ide NC, The results database – update and key issues, N Engl J Med 364(9): pp852-860, 2011
  2. Southan C, Várkonyi P and Muresan S, Quantitative assessment of the expanding complementarity between public and commercial databases of bioactive compounds, J Cheminform 1(1): p10, 2009
  3. Serajuddin AT, Salt formation to improve drug solubility, Advanced Drug Delivery Reviews 59(7): pp603-616, 2007
  4. Visit:

Christopher Southan is a senior scientist who engages across multiple expertise domains, including bioactive chemistry databases and drug targets. As a freelance consultant he is completing an assignment in a global Knowledge Engineering Programme, for enterprise application testing and exploitation documentation. Previous positions include the ELIXIR project at the EBI, Principal Scientist and Team Leader at AstraZeneca and senior bioinformatics positions in Oxford Glycosciences, Gemini Genomics and SmithKline Beecham. He has a PhD from the LM-University of Munich, an MSc in Virology from Reading University and a BSc Hons in Biochemistry from Dundee University.

Hilary Stephenson is the Managing Director for Sigma Consulting Solutions Limited, a team of usability and user experience professionals specialising in the design and development of information-rich, intelligent, enterprise-level websites, intranets and applications. Hilary has a background in usability, information architecture and technical documentation, working for clients including Nokia, ThermoFisher and AstraZeneca on designing customer product and service information solutions, web-based portals and corporate applications. Hilary has a BA Hons in English Language and Literature from the University of Salford and an MA in Technical and Business Communication from Sheffield Hallam University.
©2000-2011 Samedan Ltd.