How London’s UCL Institute of Health Informatics created an open-source database and a ‘GitHub’ of phenotyping


“A 45-year-old woman came to the clinic with her blood pressure recorded as 178/78,” says Professor Harry Hemingway, Professor of Clinical Epidemiology and Institute Director at London’s UCL Institute of Health Informatics. “She had previously received radiotherapy for her left-sided breast tumor and her mother died of a brain hemorrhage. Her GP is uncertain, given the family and medical histories, whether her blood pressure should be kept low. In fact, there are no clinical guidelines or recommendations in the world that would help the GP decide how best to manage this patient. This is when we need to examine past data and look at how the health system has handled such patients.”

Professor Hemingway cited a study involving 500,000 UK National Health Service (NHS) patients showing a strong relationship between high blood pressure, left-sided breast tumor and the risk of subarachnoid hemorrhage. Another study involving two million individuals also found a relationship between blood pressure and another serious cardiovascular disease – abdominal aortic aneurysm. “All these studies were done on a diverse population of 45-year-old women, encapsulating their text records, ECGs, imaging, mobile wearables data, and drug and genomic information,” Professor Hemingway continues.

However, researchers can’t always conduct such large-scale studies, mainly because patient data are siloed and locked away within the healthcare system. The UCL Institute of Health Informatics was founded to address this: it uses data to better understand disease prevention and to improve patient outcomes on a national and international scale. An academic department within the Faculty of Population Health Sciences at the School of Life & Medical Sciences, the Institute believes large datasets can increase the detail of research findings and make them more reflective of the population by limiting the biases often associated with more traditional trials.

“There are three benefits of mapping the patient I mentioned at the beginning with past data,” Professor Hemingway explains. “First, the patient could identify whether there are randomized trials in which she may enroll. Second, her GP may note down the uncertainties and arrange for her to join future clinical trials. Lastly, those observational data – drugs being administered, admission and readmission details, and even deaths of patients – are valuable insights to inform the GP how best to manage this patient. Why can’t we make the process of ordering an informatics consult as simple as ordering an investigation like an MRI or CT scan? We should make data in the wild more accessible to clinicians and researchers to generate a range of insights. That’s how healthcare should progress.”

Indeed, the UCL Institute of Health Informatics created CALIBER for exactly this purpose: to give clinicians access to past data, to initiate studies that add scientific value, to recreate patients’ longitudinal pathways through healthcare settings, to examine disease onset and progression, and to support collaboration among academia, the NHS, and commercial partners. The platform provides “research-ready” information taken primarily from the Clinical Practice Research Datalink (CPRD), which collects data on primary care activity and on secondary care, including Hospital Episode Statistics inpatient, outpatient, A&E, and diagnostic imaging datasets, as well as details on national registries, social deprivation, and mortality from the Office for National Statistics. The data cover 15 million individuals up to March 2016.

The Institute also runs the CALIBER Phenotype Library, funded by Health Data Research (HDR) UK, to store and share over 350 phenotyping algorithms, more than 900 codelists, metadata, and tools that improve research reproducibility in the community. “The analogy I often use is that data are a bunch of Lego blocks,” Professor Hemingway adds. “We are combining these blocks in standardized ways that can be shared and used by other scientists in an online, open platform, to define the fundamentals of diseases and health. It’s something some people might think, ‘Haven’t we already done that?’ But in the context of large-scale data, absolutely, ‘No, we haven’t’.”

Right now, Professor Hemingway is harnessing patients’ genome data. The 45-year-old female patient has had her genome sequenced, as in the 100,000 Genomes Project. From there, Professor Hemingway and his team can tell whether her blood pressure differs from that of others and whether the difference has actionable implications, such as a change in diet or daily routine or a need for more exercise. More importantly, Professor Hemingway believes linking genome data with medical records marks part of “a medical revolution in how we develop new drugs as genome information validates the efficiency of the drugs.”

“We are not just converging care and research, but we also seek to replicate science and return value to clinical care,” says Professor Hemingway. “We wish to support the research community to go on and answer more interesting questions around the relationship between diseases and systematically define what we mean by diseases related to age, gender, and other demographics.”