Leveraging Technology to Support Integrative Approaches for IMID Precision Medicine
May 5, 2021 Justyna Lisowska & Marie-Ange Kouassi
When it comes to the treatment of Immune-Mediated Inflammatory Diseases (IMIDs), determining the right treatment for a particular patient is challenging. The development of novel treatments is also difficult because these diseases, and the responses of patients to existing drugs or therapeutics in development, are highly complex. For heterogeneous and progressive diseases such as IMIDs, it is imperative to obtain an improved understanding of the underlying biological mechanisms, identify different endotypes, and uncover novel, promising drug targets. It is now clear that this cannot be achieved by examining a single dimension of the biological system (i.e., changes at one molecular level, such as gene expression); it requires combining data from different levels (e.g., epigenomics, transcriptomics, proteomics) through multi-omic data integration.
The arrival of advanced profiling technologies such as Single-Cell RNA Sequencing (scRNA-Seq) and Spatial Transcriptomics/Proteomics allows the collection of such data in vast amounts and at ever-increasing resolution. However, integrating and analyzing this data is laborious. Advanced technological approaches such as deep learning have improved the understanding and treatment of cancer and hold significant promise for the field of IMIDs. These approaches could allow scientists to better understand IMID mechanisms, develop and validate biomarkers, repurpose existing treatments, and increase the precision and success of clinical trials. In this article, we discuss the power of integrative technological approaches and tools for the development of IMID precision medicine.
Access to the Right Data
To undertake successful research, scientists require an in-depth knowledge of previous discoveries, as well as their limitations and shortfalls. It is also beneficial to analyze available data to uncover insights that will guide future studies. Given the high attrition and failure rate of clinical development, which results in severe economic losses, it is fundamental to learn from previous studies and transfer knowledge between organizations regarding protocol design, scientific outcomes, disease mechanisms, and patient responses to treatment. Every clinical development project, lasting about 10-12 years, generates vast amounts of valuable data at research institutions, biopharmaceutical companies, and clinical research organizations, even if the clinical trial is not successful. This data is underused, which delays the progress of future trials. Insights from such data could inform protocol and clinical trial design, contribute towards the understanding of disease mechanisms, or inform GO/NO-GO decisions in related trials. However, integrating data from different studies or previous clinical trials is difficult because access is hindered by silos and by barriers imposed due to data sensitivity.
Today, efforts are being made to develop a “bedside-to-bench” approach, which facilitates the transfer of knowledge gained from the clinic back to the laboratory. This approach, also known as “reverse translation,” makes it possible to unlock scientific insights that strengthen the understanding of scientific principles while uncovering promising opportunities early in clinical development. However, because data collection approaches and curation standards often differ between organizations, integrating the data and making comparisons through further analysis is challenging.
Several public repositories have been created to gather data from previous studies and make them accessible for future research, with the aim of strengthening the understanding of disease mechanisms and bringing to light opportunities for new therapeutic discoveries. In the field of oncology, these include the Cancer Cell Line Encyclopedia (CCLE) and The Cancer Genome Atlas (TCGA), which have enabled discoveries related to the manifestation of cancer, its progression, and treatment outcomes (2). These repositories are highly valuable because they contain data from different levels of the biological system (-omics data), which can be integrated and analyzed holistically to identify scientific insights or make predictions. For example, the CCLE, which provides access to genomic data from almost 1000 human cell lines and around 40 tumor types as well as pharmacological profiles of 24 anticancer drugs, has proven useful in generating a predictive model of drug response in cancer cell lines (3).
The Immune-mediated Inflammatory Disease Biobank in the UK (IMID-Bio-UK) is an initiative funded by the Medical Research Council which aims to unify molecular and clinical data from eight IMID cohorts/tissue banks (RA, SLE, psoriasis, primary biliary cholangitis (PBC), autoimmune hepatitis (AIH), and primary Sjögren’s syndrome) and make it available to academic researchers and commercial institutions to facilitate research and clinical development. Efforts to compile IMID data are increasingly being made by academic players through collaborative research programs (e.g., 3TR by a European consortium of 69 academic and industry partners) and by not-for-profit organizations (e.g., IBD Plexus by The Crohn’s and Colitis Foundation). However, to improve the treatment of IMIDs, such initiatives should be organized continuously on a much larger scale and not be limited by time or funding. It is also important to define standards for data sharing across electronic data capture (EDC) systems, sites, and organizations to stimulate productive collaboration while ensuring data security.
To perform comparative analyses using complex data from a variety of sources, computational solutions are required that enable the aggregation of datasets from disparate locations, and conversion into a comparable format (data integration).
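As a rough illustration of the aggregation and conversion step described above, the following Python sketch maps records from two sources into one comparable format. Everything in it is an assumption made for illustration: the two studies, their field names, the identifier conventions, and the choice of C-reactive protein (CRP) as the shared measurement are all hypothetical. The only factual element is the unit conversion (1 mg/dL equals 10 mg/L).

```python
# Hypothetical example: two studies report the same measurement using
# different field names, identifier styles, and units.
study_a = [
    {"patient_id": "A-001", "crp_mg_per_l": 12.0},
    {"patient_id": "A-002", "crp_mg_per_l": 3.5},
]
study_b = [
    {"SubjectID": "b_17", "CRP (mg/dL)": 0.8},
    {"SubjectID": "b_18", "CRP (mg/dL)": 2.1},
]


def harmonize_a(row):
    """Study A already uses mg/L; only the field names change."""
    return {
        "study": "A",
        "subject": row["patient_id"].upper(),
        "crp_mg_per_l": row["crp_mg_per_l"],
    }


def harmonize_b(row):
    """Study B reports mg/dL, so values are converted (1 mg/dL = 10 mg/L)."""
    return {
        "study": "B",
        "subject": row["SubjectID"].upper().replace("_", "-"),
        "crp_mg_per_l": round(row["CRP (mg/dL)"] * 10.0, 6),
    }


def integrate(sources):
    """Apply each source's harmonizer and concatenate the results."""
    combined = []
    for rows, harmonize in sources:
        combined.extend(harmonize(row) for row in rows)
    return combined


integrated = integrate([(study_a, harmonize_a), (study_b, harmonize_b)])
for record in integrated:
    print(record)
```

Keeping one small harmonizer per source means each source's conventions are documented in one place, and adding a further study only requires one more function.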
Multi-Omic Data Integration and Analysis
Since the emergence of genomics, the first omics discipline, several other omic types have appeared. These include epigenomics, the study of reversible chemical modifications to DNA, or to the histones that bind DNA, that alter gene expression; proteomics, the study of the complete set of proteins expressed by a cell, tissue, or organism; and metabolomics, the study of the complete set of small-molecule metabolites in a biological sample. There are other types of omic data, and each provides a view of how biological processes are affected in different disease and control groups (4). During translational and clinical research, it is important to integrate these different types of data, as analysis of a single type may only reflect processes indicative of the reaction to disease rather than its cause (4). Integrating the different data types improves biological insight, providing a clearer and more in-depth understanding of the disease. This could enable disease subtyping and classification, biomarker identification, and the building of predictive models of risk, treatment response, and clinical outcomes (1).
Integrative data analysis has applications that are likely to revolutionize disease diagnosis, prognosis, and treatment. A variety of open-source tools and methods allow for the integration of multi-omic data using different approaches. For network-based integration, similarity network fusion (SNF), PARADIGM, and NetICS are commonly used. For a Bayesian approach, typical choices are patient-specific data fusion (PSDF), iCluster, multiple dataset integration (MDI), multi-omics factor analysis (MOFA), and PARADIGM (1). Pattern fusion analysis, SNF, and PSDF are also commonly used for a fusion-based approach. Additional approaches include similarity-, correlation-, and multivariate-based integration, for each of which several tools exist. Following integration, the data can be explored, analyzed, and visualized using a variety of computational tools and platforms. This analysis can be complex, as generating visualizations or predictive models (e.g., using artificial intelligence) may require expertise in bioinformatics.
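To give a feel for what similarity-based integration involves, the sketch below builds one patient-by-patient similarity matrix per omic layer and then averages them. This is a deliberately simplified stand-in, not an implementation of SNF or of any tool named above (SNF, for instance, fuses networks iteratively rather than by a plain average). The data values, the Gaussian kernel, and the function names are all illustrative assumptions.

```python
import math

# Hypothetical toy data: each row is the same patient in both omic layers.
transcriptomics = [[5.0, 1.0, 0.2], [4.8, 1.1, 0.3], [0.5, 3.9, 2.2]]
proteomics = [[10.0, 2.0], [9.5, 2.2], [1.0, 8.0]]


def similarity_matrix(samples):
    """Gaussian-kernel similarity between patients within one omic layer.

    Distances are scaled by the layer's mean pairwise distance so that
    layers measured on different scales contribute comparably.
    """
    n = len(samples)
    dists = [[math.dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    off_diagonal = [dists[i][j] for i in range(n) for j in range(n) if i != j]
    scale = sum(off_diagonal) / len(off_diagonal)
    return [[math.exp(-((d / scale) ** 2)) for d in row] for row in dists]


def fuse(matrices):
    """Element-wise average of the per-layer similarity matrices."""
    n = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
        for i in range(n)
    ]


fused = fuse([similarity_matrix(transcriptomics), similarity_matrix(proteomics)])
for row in fused:
    print([round(value, 3) for value in row])
```

In this toy data the first two patients resemble each other in both layers, so their fused similarity is higher than either patient's similarity to the third. A fused matrix of this kind could then be passed to a clustering method to look for patient subgroups.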
Artificial Intelligence Applications in Drug Development
Artificial Intelligence (AI) is the use of computers to perform tasks normally requiring human intelligence (e.g., making decisions). Machine learning, a subfield of AI, involves learning from past experience and has beneficial applications in all stages of drug development, from target identification and validation, compound screening, and lead discovery to clinical development (5).
Machine learning is the use of algorithms to parse data, learn from the relationships within it, and predict the state of new datasets. This approach is best suited to situations with large amounts of data and several variables whose relationships are unknown but of interest. Using machine learning, discovery and decision-making can be vastly improved, provided the data is plentiful and of high quality. Supervised machine learning methods train a model that learns the pattern between input data and outcome, which can then be applied to predict the outcome from new data. Unsupervised machine learning methods find relationships within the data without knowing the desired outcome. The third paradigm of machine learning is reinforcement learning, which enables learning in an interactive environment through feedback from previous actions or decisions (to maximize cumulative reward). A more modern approach to AI is deep learning (a subfield of machine learning), which involves the use of deep artificial neural networks (e.g., convolutional neural networks; CNNs) and has many beneficial applications, including image classification.
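The supervised setting described above, in which a model is trained on inputs with known outcomes and then applied to new data, can be sketched with a nearest-centroid classifier. This is a minimal toy example chosen for brevity, not a method taken from the cited studies. The two features, the labels, and all values are hypothetical.

```python
import math

# Hypothetical training data: two expression features per patient,
# each paired with a known treatment outcome.
training = [
    ([2.1, 0.4], "responder"),
    ([1.9, 0.6], "responder"),
    ([0.3, 1.8], "non-responder"),
    ([0.5, 2.2], "non-responder"),
]


def fit_centroids(examples):
    """Training step: compute the mean feature vector for each label."""
    sums, counts = {}, {}
    for features, label in examples:
        totals = sums.setdefault(label, [0.0] * len(features))
        for k, value in enumerate(features):
            totals[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [t / counts[label] for t in totals] for label, totals in sums.items()}


def predict(centroids, features):
    """Prediction step: return the label whose centroid is closest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


centroids = fit_centroids(training)
prediction = predict(centroids, [1.8, 0.5])  # a new, unlabeled patient
print(prediction)
```

An unsupervised method would receive the same feature vectors without the labels and would have to discover the two groups on its own, for example by clustering.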
A few applications of AI in drug discovery include the identification of novel targets, the understanding of disease mechanisms, and the development of new biomarkers to monitor prognosis, progression, and drug efficacy (5,6). To generate a therapeutic hypothesis, predictions made using machine learning require further validation. When using this approach, it is important to consider the type of data available, the type of algorithm selected for use (whether it is indeed suitable for the data and problem at hand), and how well the method separates the signal from noise. Machine learning is powerful during drug development when well-annotated data is generated systematically with minimal noise. Where this is not the case, it can present challenges during data cleaning and processing (5).
In the field of oncology, innovative approaches such as artificial intelligence are being used to perform advanced data analysis, and they are increasingly used for the study of IMIDs as well. For oncology, these techniques enable precision medicine for diagnosis, molecular and tumor microenvironment (TME) characterization, the prediction of treatment outcome and drug response, and pharmacogenomics discovery (5, 7, 8). In the field of IMIDs, Egis Pharmaceuticals PLC used machine learning to identify biomarkers for the classification of Rheumatoid Arthritis patients according to treatment response. This involved training a model on gene expression data, and the work has recently been developed into an approved clinical in-vitro diagnostic test named PREDYSTIC® (9).
Challenges of Integrative Data Analysis and AI
Implementing machine learning involves several steps, beginning with data collection and preprocessing, followed by model selection, training, and tuning. Most of the time is spent on data processing, with comparatively little spent on applying the algorithm. This is because, to maximize predictive performance, the data used for training must be accurate, curated, and complete. The approach is often difficult to apply because data gathered during drug development is frequently heterogeneous and requires standardization before downstream analysis.
Large amounts of high-quality training data are also required to improve the robustness and interpretability of the models. When applying a predictive model, it is important to consider the dangers of bias and underfitting, where the model cannot capture the training data or generalize to new data, and of overfitting, where the model learns the noise and incorporates it in the prediction (5). These risks can be reduced by strategies such as cross-validation, increasing the size of the training set, or manually selecting the predictive features. To minimize biases that could lead to underfitting, it is highly important to ensure the dataset is representative (i.e., obtained from a diverse population of patients).
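Cross-validation, one of the strategies mentioned above, can be sketched as follows: the data is split into k folds, and each fold in turn is held out for testing while the model is fitted on the rest. A large gap between accuracy on the training data and the cross-validated accuracy is a common sign of overfitting. The 1-nearest-neighbour model, the single biomarker feature, and all values are hypothetical and were chosen only to keep the example short.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def nearest_neighbour_predict(train_x, train_y, x):
    """Toy 1-nearest-neighbour model on a single numeric feature."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]


def cross_validated_accuracy(xs, ys, k=4):
    """Fraction of samples predicted correctly while held out of training."""
    correct = 0
    for train, test in k_fold_splits(len(xs), k):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct += sum(
            nearest_neighbour_predict(train_x, train_y, xs[i]) == ys[i] for i in test
        )
    return correct / len(xs)


# Hypothetical biomarker values and outcomes (1 = responder, 0 = non-responder).
xs = [0.1, 0.3, 0.2, 0.4, 1.9, 2.1, 2.0, 2.2]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
accuracy = cross_validated_accuracy(xs, ys)
print(accuracy)
```

Because every sample is predicted only by a model that never saw it, the resulting accuracy is an estimate of performance on new data rather than a measure of how well the training data was memorized.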
Machine learning in drug discovery requires large quantities of data, which often means combining data from multiple datasets or even multiple studies. As these datasets may be difficult to access and may exist in heterogeneous formats, the effectiveness of this approach can be limited. There are ongoing efforts to develop standards to ensure data is accessible and well-annotated in drug development. This will improve the accessibility of data from clinical trials, both failed and successful, for further analysis.
While ensuring that data is accessible to scientists for analysis to answer hypotheses, it is also important to guarantee that patient-sensitive data is protected. This is referred to as data governance and involves consistently monitoring the consent information from subjects involved in these studies.
Integrating data from high-throughput technologies can be extremely difficult, as the data exists in large quantities and in a variety of formats. This variability makes processing and managing the data challenging. The data cannot be integrated or analyzed without standardization and harmonization steps that translate it into a unified format, which is time-consuming (10). Although new analytical tools are continually being developed, they exist in isolation from data storage, processing, and management systems (1). Using separate systems for data processing, integration, and analysis leads to silos and may prevent reuse of data, leave room for error, and reduce efficiency.
Scientists require a solution that enables the right data from previous studies to be accessible and serves as a uniform end-to-end framework for efficient processing, integration, and analysis of multi-omics data (1). Such a framework would also be highly beneficial if it enabled scientists to generate visualizations and interpretations independent of their level of bioinformatics knowledge or experience.
Genedata Profiler®, the Single Point of Truth for Multi-Omic Data Integration and Analysis
Pharmaceutical companies such as Genmab, Merck, and Chugai, which focus on leveraging innovative technology to develop precision medicine, are productively using Genedata Profiler in their day-to-day translational and clinical research activities. Genedata Profiler serves as a single point of truth that enables the storage, processing, and analysis of high-dimensional data in an efficient, secure, and traceable manner. Embedded in the enterprise software are workflows for the harmonization and integration of heterogeneous data types, which can be configured depending on the type of data or analysis a scientist needs to perform.
The software is flexible and can therefore be connected to existing databases to facilitate data transfer upon query. Using permission-based access and workflows to control the flow of patient data (11), the software also enables real-time governance, ensuring that only those with the right to access the data can do so and that sensitive patient information is protected. Genedata Profiler is a high-performing analytics layer that allows the integration of data from across studies and technologies to investigate specific scientific questions on command. These integrated datasets can be presented to business intelligence, statistical, and artificial intelligence tools for further data interrogation. The interoperability of Genedata Profiler allows scientists to use their favorite analytical tools or easily adopt newly emerging technologies.
As the software is purpose-built for collaboration, scientists can easily share visualizations or reports containing full documentation of the activities performed on the data with team members or other organizations such as regulatory authorities. Genedata Profiler has numerous applications in the development of precision medicine for IMIDs, from patient profiling (to identify responders and non-responders to treatment) and endotyping to biomarker discovery and validation. It is the tool of choice for addressing the complex technical and organizational challenges mentioned above, providing the path to precision medicine.