In one of the recent AIMed webinars, Dr. Tanuj K. Gupta, Vice-President at Cerner Intelligence, described big data in healthcare in terms of three “V”s: the volume of data across different sources and venues of care; the variety of that data; and the velocity of access to that data. Ideally, the quicker the access to a large pool of data from multiple sources, the better.
In reality, healthcare data is often locked in silos. A lack of interoperability between electronic health records (EHRs) and concerns around patient privacy limit the amount of information professionals can responsibly leverage to develop or validate artificial intelligence (AI) driven solutions.
On top of actively calling for data democratization, some researchers are also exploring alternatives, so that medical progress and data protection need not be mutually exclusive. Here we take a closer look at three alternatives (data distillation, synthetic data and federated learning) and the pitfalls of each.
Data distillation and kNN model
Early on, researchers at the Massachusetts Institute of Technology (MIT) distilled MNIST, a popular computer vision dataset of 60,000 handwritten digits from zero to nine, into just 10 images optimized so that an AI model trained on them can be almost as accurate as one trained on all 60,000 images. Researchers at the University of Waterloo in Ontario took the data distillation technique a step further with a process called “less than one”-shot (LO-shot) learning.
Here, researchers fed the AI model soft labels, which express as percentages how strongly an image resembles each digit. For example, rather than being told that an image is simply the digit 3, the AI model learns that the image is 60% the digit 3, 30% the digit 8, and 10% the digit 0. With LO-shot learning, the researchers found that as long as the soft labels are carefully engineered, the machine can learn a lot from even a very tiny sample of data. They demonstrated the concept using k-nearest neighbors (kNN), a simple machine learning algorithm that classifies an object according to the labeled examples closest to it in feature space.
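The soft-label idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Waterloo team's code: the two prototype points, their soft labels and the inverse-distance weighting are all hypothetical choices made for the example. It shows how two data points can end up separating three classes.

```python
import math

# Hypothetical prototypes: only TWO points, each carrying a soft label
# (a share per class) over THREE classes. All values are made up.
PROTOTYPES = [
    ((0.0, 0.0), [0.6, 0.4, 0.0]),
    ((1.0, 0.0), [0.0, 0.4, 0.6]),
]


def classify(point, prototypes=PROTOTYPES, eps=1e-9):
    """Return the class index with the largest distance-weighted soft-label score."""
    n_classes = len(prototypes[0][1])
    scores = [0.0] * n_classes
    for location, soft_label in prototypes:
        # Closer prototypes contribute more; eps avoids division by zero.
        weight = 1.0 / (math.dist(point, location) + eps)
        for cls, share in enumerate(soft_label):
            scores[cls] += weight * share
    return max(range(n_classes), key=scores.__getitem__)


if __name__ == "__main__":
    for x in (0.1, 0.5, 0.9):
        print(x, "-> class", classify((x, 0.0)))
```

Points near the first prototype fall into class 0, points near the second into class 2, and points roughly halfway between them into class 1, a class that no prototype has as its largest share.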
Pitfalls of data distillation and kNN model
To train a kNN model to tell an apple from an orange, one first has to select features that represent the fruits and use them as the axes. For example, fruit color can be plotted on the x-axis and fruit weight on the y-axis. The kNN model then places the known fruits on this 2D chart, which in effect draws boundary lines between apples and oranges. When new data comes in, the machine assigns it to a class according to where it falls relative to these boundaries.
The researchers created a small set of synthetic data points, engineered their soft labels, and let the kNN model plot the boundary lines. They discovered that the kNN model was able to split the 2D chart into more classes than there were data points, and that they had more control over where the boundary lines fell. Nevertheless, the method has major shortcomings. Engineering soft labels becomes extremely complex when the goal is to shrink a giant dataset.
Also, deep learning models may not be as transparent as a kNN model. The University of Waterloo researchers are now working on ways to address this. They believe that although the method is still in its infancy, it has the potential to make AI models just as efficient and accurate without the need for large quantities of data.
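As a rough illustration of the apple-and-orange example, here is a minimal kNN classifier in plain Python. The feature values (a color score between 0 and 1, and weight in hundreds of grams) are invented for the sketch, and the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter

# Hypothetical training data: (color score, weight in hundreds of grams) -> label.
# The numbers are made up purely for illustration.
TRAINING = [
    ((0.90, 1.5), "apple"),
    ((0.80, 1.7), "apple"),
    ((0.85, 1.6), "apple"),
    ((0.20, 1.4), "orange"),
    ((0.30, 1.3), "orange"),
    ((0.25, 1.5), "orange"),
]


def knn_predict(query, training=TRAINING, k=3):
    """Label a new point by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_predict((0.7, 1.6)))
    print(knn_predict((0.3, 1.4)))
```

In practice the features would usually be rescaled to comparable ranges first, since kNN relies on raw distances and a feature with a large numeric range would otherwise dominate.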
Synthetic data
According to Professor Mihaela van der Schaar, John Humphrey Plummer Professor of Machine Learning, AI and Medicine at the University of Cambridge, there are several types of synthetic data, but in general the term refers to fabricated information that reproduces the statistical properties of the original dataset. Data can be partially synthetic (i.e., fabricated data used alongside authentic data) or fully synthetic, and can be used to train or validate AI models while minimizing the sharing of sensitive details.
Synthetic data can be created using generative adversarial networks (GANs), in which two neural networks work against each other. The first network (i.e., the generator) is responsible for generating artificial outputs that imitate the training examples. The second network (i.e., the discriminator) then judges whether each output it is shown is real or fabricated. Whenever the discriminator rejects an output produced by the generator, the generator adjusts and tries again. The process repeats until the discriminator can no longer tell fabricated outputs from genuine training examples.
Another way of creating synthetic data, according to Robin Röhm, co-founder of the Berlin-based startup Apheris, is via Bayesian networks. A Bayesian network is a graph that models the conditional probability distributions of a set of attributes. Synthetic samples can then be drawn according to the structure laid out by the graph.
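The generator-versus-discriminator loop can be sketched with a deliberately tiny example. The code below is a toy under heavy assumptions, not a realistic GAN: the "real" data is a one-dimensional Gaussian, the generator is a single linear function, the discriminator is a single logistic unit, and the gradients are written out by hand. All names and hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.01, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: x_fake = a * z + b
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        x_real = rng.gauss(4.0, 1.0)  # assumed "real" data distribution
        z = rng.gauss(0.0, 1.0)       # random noise fed to the generator
        x_fake = a * z + b

        # Discriminator step: minimize -log D(x_real) - log(1 - D(x_fake)).
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w -= lr * ((d_real - 1.0) * x_real + d_fake * x_fake)
        c -= lr * ((d_real - 1.0) + d_fake)

        # Generator step: minimize -log D(x_fake), i.e. try to fool D.
        d_fake = sigmoid(w * x_fake + c)
        grad_x = (d_fake - 1.0) * w  # gradient of the loss w.r.t. x_fake
        a -= lr * grad_x * z
        b -= lr * grad_x
    return (a, b), (w, c)


def generate(generator, n, seed=1):
    """Draw n synthetic values from a trained generator."""
    a, b = generator
    rng = random.Random(seed)
    return [a * rng.gauss(0.0, 1.0) + b for _ in range(n)]


if __name__ == "__main__":
    gen, disc = train_toy_gan()
    print("generator parameters:", gen)
    print("synthetic samples:", generate(gen, 5))
```

Even in a toy like this, the two players can oscillate rather than settle, so the sketch makes no claim about how closely the generated values match the real distribution after training.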
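Drawing samples from such a graph can be illustrated with a short sketch. The three-attribute network and every probability in it are invented for the example and carry no clinical meaning; the sampling simply visits each attribute after its parents and draws a value from the matching conditional probability.

```python
import random

# Hypothetical network, listed in topological order (parents before children):
# (attribute, parent attributes, {parent values: P(attribute is True)}).
# Structure and probabilities are made up for illustration only.
NETWORK = [
    ("smoker", (), {(): 0.30}),
    ("copd", ("smoker",), {(True,): 0.20, (False,): 0.02}),
    ("cough", ("copd",), {(True,): 0.80, (False,): 0.10}),
]


def sample_record(rng):
    """Draw one synthetic record, attribute by attribute, following the graph."""
    record = {}
    for name, parents, cpt in NETWORK:
        p_true = cpt[tuple(record[parent] for parent in parents)]
        record[name] = rng.random() < p_true
    return record


def sample_records(n, seed=0):
    rng = random.Random(seed)
    return [sample_record(rng) for _ in range(n)]


if __name__ == "__main__":
    for row in sample_records(5):
        print(row)
```

In a real setting the graph structure and the probability tables would have to be learned from the original patient data, which is where most of the effort lies.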
Pitfalls of synthetic data
It is computationally expensive to model all the correlations between the various attributes. Bayesian networks are also time-consuming and do not adapt easily to images. Overall, the high-dimensional and heterogeneous nature of patient records makes it challenging to generate realistic synthetic data. This is especially so for rare diseases, where sample sizes tend to be small.
Methods like GANs are not good at handling outliers, and because two neural networks are trained simultaneously, it is hard to determine whether both have been sufficiently trained. At the moment, there is no consensus on how to define or measure the quality of synthetic data. It therefore becomes crucial to address how well the fabricated data reflects the actual data and how to determine whether it is fit for purpose.
Federated learning
Typical machine learning approaches require all training data to be on the same machine or in the same data center. Healthcare institutions, however, often refuse to hand over patient data to third parties in order to safeguard privacy. This not only limits the amount and quality of information that developers can get hold of, but also means an AI model may not generalize to institutions other than the one that supplied its training data.
Federated learning, on the other hand, keeps data where it is: institutions pass around the partially trained AI model and each improves it with its own dataset. As such, federated learning is regarded as a “privacy-preserving” model training method.
Two years ago, the University of Pennsylvania Perelman School of Medicine and Intel pioneered the application of federated learning to real-world medical imaging data. Since then, researchers at Penn Medicine have been actively deploying the method to analyze MRI scans of brain tumor patients and distinguish healthy brain tissue from cancerous regions.
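One common way to combine the institutions' contributions is federated averaging, sketched below for a toy linear model. This is an assumption-laden illustration rather than a description of any deployed system: the three "sites", their data (generated from y = 2x + 1) and the hyperparameters are hypothetical. Only model weights leave each site; the raw data points never do.

```python
def local_update(weights, data, lr=0.05, epochs=5):
    """Train on one site's private data, starting from the shared weights."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error for the model y = w * x + b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(site_datasets, rounds=100):
    """Average the locally trained weights, weighted by each site's sample count."""
    global_w, global_b = 0.0, 0.0
    total = sum(len(data) for data in site_datasets)
    for _ in range(rounds):
        updates = [local_update((global_w, global_b), data) for data in site_datasets]
        global_w = sum(len(d) * w for d, (w, _) in zip(site_datasets, updates)) / total
        global_b = sum(len(d) * b for d, (_, b) in zip(site_datasets, updates)) / total
    return global_w, global_b


# Hypothetical private datasets held by three institutions (y = 2x + 1).
SITES = [
    [(x, 2 * x + 1) for x in (0.0, 1.0, 2.0)],
    [(x, 2 * x + 1) for x in (1.0, 2.0, 3.0)],
    [(x, 2 * x + 1) for x in (-1.0, 0.0, 1.0)],
]

if __name__ == "__main__":
    print(federated_average(SITES))
```

Because every site's data here follows the same underlying relationship, the shared model recovers it closely; real institutions rarely have such well-aligned data.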
Pitfalls of federated learning
Some believe federated learning does not adequately account for statistical and system heterogeneity. As mentioned at the beginning, healthcare data tends to be kept in silos, in separate systems that do not, or cannot, communicate with one another. Some of this data may not be kept in a standardized format. Before federated learning can truly take place, partnering institutions have to spend time setting up a structure for sharing the model and ensure that every site has compatible computing power to hold and train it.
More importantly, federated learning is not bulletproof. Some sensitive information may still be revealed when institutions communicate the updates they have made to the model. Although secure multiparty computation and differential privacy can be adopted to enhance privacy, they may undermine model performance and overall efficiency. In sum, opening up data access is great and having alternatives is never a bad thing, as long as institutions weigh the advantages and disadvantages of each and make decisions that are not driven by hype.