Unlocking the Power of Posit and R for Precision Medicine Development

February 27, 2023
Marie-Ange Kouassi & Hagen Klett

At R/Pharma, the annual conference organized by Posit (formerly RStudio), speakers shared how the programming language R and Posit’s advanced development environment are currently being leveraged to unlock value from pre-clinical and clinical data. But while open access to solutions for data processing, analysis, and visualization has improved, the challenge remains to ensure that everyone involved in drug development has access to the right data in a high-performance environment to answer diverse questions.

Introduction

At the last R/Pharma conference, it was evident that the applications of R in the biopharma space are evolving quickly, with a rising focus on unlocking insights from Real-World Data (RWD), generating interoperable data products, and improving clinical regulatory submissions. Posit, previously known as RStudio, includes open-source software tools such as the RStudio Integrated Development Environment (IDE) for R, Shiny, and the tidyverse, and allows data consumers to create toolboxes known as packages (e.g., pharmaverse) to address their data-related needs. Through Posit’s seamless integration with Genedata Profiler®, the validation-ready, cloud-based data integration and analytics platform (Figure 1), bioinformaticians, scientists, and clinicians can maximize the value of R and Python for the applications explained in this article.

R for Utilizing Real-World Data

In the early stages of drug development, when researchers explore drug targets for intervention or evaluate compounds with disease-reversing activity, navigating various datasets to identify specific datasets of interest is crucial. At R/Pharma, in a joint talk entitled “Disrupting drug development for data-driven approaches”, Alice Walsh, VP of Translational Research at Pathos, and Catherine Igartua, Ph.D., Senior Director of Computational Biology at Tempus, highlighted that RWD will enable more innovation in drug discovery. Unlocking useful insights from heterogeneous RWD will require extensive data mining and curation, statistical analysis, and machine learning on large-scale multi-omics data, as well as interactive visualization. As data volumes continue to grow, it becomes increasingly important to retrieve the required datasets easily for downstream analysis. As an integrative environment, Genedata Profiler lets users efficiently access diverse data from its high-performance analytics database for computationally intensive analysis in Posit. The scalability and elasticity of the platform, and the seamless feed of data into applications, enable users to rapidly query billions of data points and condense them to a selected subset for future use and result extraction.
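The idea of condensing a large table to a small subset inside the database, before any data reaches the analysis session, can be sketched in a few lines. The example below is only an illustration in Python: an in-memory SQLite database stands in for a high-performance analytics database, and the table, column names, and values are made up for the example. It does not show the actual interfaces of Genedata Profiler or Posit.

```python
import sqlite3

# Assumption: in-memory SQLite is a stand-in for an analytics database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expression (sample_id TEXT, gene TEXT, tissue TEXT, tpm REAL)"
)

# Hypothetical toy records; a real table would hold many more rows.
rows = [
    ("S1", "EGFR", "lung", 120.5),
    ("S1", "TP53", "lung", 35.0),
    ("S2", "EGFR", "lung", 98.2),
    ("S2", "TP53", "lung", 40.1),
    ("S3", "EGFR", "colon", 15.3),
    ("S3", "TP53", "colon", 60.7),
]
conn.executemany("INSERT INTO expression VALUES (?, ?, ?, ?)", rows)


def mean_expression(connection, gene, tissue):
    """Filter and aggregate inside the database.

    Returns (number_of_matching_rows, mean_tpm), so only one small
    result row is transferred to the analysis session.
    """
    cursor = connection.execute(
        "SELECT COUNT(*), AVG(tpm) FROM expression "
        "WHERE gene = ? AND tissue = ?",
        (gene, tissue),
    )
    return cursor.fetchone()


n_rows, mean_tpm = mean_expression(conn, "EGFR", "lung")
print(n_rows, mean_tpm)
```

The design point is that the filter and the aggregation run where the data lives; the analysis environment receives only the condensed result rather than the full table.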

R for Translational Research

The transfer of insights from basic research to the clinical stages of drug development is a well-adopted approach that facilitates the development of novel treatment interventions. Conversely, reverse translation, the delivery of insights from late drug development stages back to basic research, is swiftly gaining momentum. Insights derived using this approach can support faster decision-making throughout drug development and improve clinical success by enabling patient stratification (through the discovery of novel biomarkers) and by supporting clinical trial enrollment, drug repurposing, and companion diagnostic development. Leveraging data from various sources and all drug development stages is critical for this, but it first requires that all these data are easily reusable. Generating data products involves preparing and packaging data in a way that allows an individual to quickly identify how to use it (which analytics tool to use) for their specific use case.

At R/Pharma, Patrick Hilden, Associate Director of Market Access Analytics at Janssen (Johnson & Johnson), highlighted during his talk “RMarkdown for enhanced Quality Control & Documentation” that data should be prepared with possible future questions in mind. The Posit environment, including R Markdown and Quarto, as introduced by Thomas Mock, Ph.D., Customer Enablement Lead at Posit, streamlines the quality control and generation of data products that can be leveraged for different purposes using data consumption applications. These can also be used to enable interactive analysis and visualization for insight generation, as highlighted in many talks about R’s interactive framework, Shiny, such as “Building Shiny Frameworks: Some Lessons” presented by Harvey Lieberman, Associate Director, Early Development Analytics, Digital & Discovery, Novartis. Organizations conducting translational and reverse translational research also need to democratize data to appropriate users, ensuring data is only in the hands of the right people, handled the right way, and used for the right purpose. This is crucial, as failure to comply with GDPR could result in penalties of up to €20 million or 4% of the company’s global annual revenue from the preceding financial year, whichever is higher. Genedata Profiler reduces this risk by providing fine-grained permissions governing the visibility, access, and handling of imported and federated data. It also controls access to data via external tools like Posit, maintaining a high level of data security.
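The “whichever is higher” rule for the GDPR penalty ceiling quoted above is simple arithmetic, shown here as a minimal Python sketch. The function name and the example revenue figures are hypothetical and chosen only for illustration; the sketch computes the upper limit stated in the text, not the fine an authority would actually impose.

```python
def gdpr_max_fine_eur(global_annual_revenue_eur: float) -> float:
    """Return the penalty ceiling described in the text: the greater of
    EUR 20 million or 4% of the preceding year's global annual revenue."""
    if global_annual_revenue_eur < 0:
        raise ValueError("revenue must be non-negative")
    fixed_cap_eur = 20_000_000.0
    revenue_cap_eur = 0.04 * global_annual_revenue_eur
    return max(fixed_cap_eur, revenue_cap_eur)


if __name__ == "__main__":
    # Hypothetical revenues, purely for illustration.
    for revenue in (100_000_000, 2_000_000_000):
        ceiling = gdpr_max_fine_eur(revenue)
        print(f"revenue EUR {revenue:,} -> ceiling EUR {ceiling:,.0f}")
```

For revenue below €500 million, 4% is less than €20 million, so the fixed amount sets the ceiling; above that threshold, the revenue-based amount does.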

Furthermore, Genedata Profiler makes it easy for data engineers and data scientists to collaboratively develop tools in R by providing an integrated code repository solution with CI/CD at their fingertips. As a development environment that includes a plugin SDK to augment and facilitate routine software development tasks, Genedata Profiler automates the building, versioning, testing, and deployment of pipelines, data products, analyses, and visualizations. This enhances accuracy and efficiency, streamlining the provision of point-and-click solutions and interactive reports to non-coding data consumers for insight generation. This way, data engineers and data scientists gain additional time for research and for developing new exploratory tools to unravel hidden relationships within diverse datasets, enhancing their organization’s precision medicine pipeline. CI/CD enhances R package development by allowing developers to assess the integrity of code and gain continuous feedback (“How CI/CD enhances development of R packages”, presented by Ben Straub, Principal Programmer at GlaxoSmithKline, and Craig Gower-Page, Data Scientist at Roche). Similarly, internal and external developers can use self-contained applications (e.g., Docker containers or conda environments) to build on the functionalities of Genedata Profiler using the integrated plugin SDK with incorporated CI/CD, while the core of the platform remains unchanged and easily operable. This isolation of external development activities further improves security and reduces the risk of introducing malware into your organization’s computational infrastructure.

Figure 1. A diagram showing how a data consumer can benefit from Genedata Profiler’s high-performance analytics database, data lineage, governance, and access permissions while leveraging Posit for GitLab-supported programming in R, Jupyter, and VS Code.

R for Clinical Development

The closer biopharma organizations come to filing an investigational new drug (IND) application, the more they need to ensure that data gathering and handling throughout clinical development comply with established standards such as CDISC, as explained by CDISC’s VP of Data Science, Sam Hume, in his talk “CDISC initiatives and collaborations”. Also important are data governance and an audit trail depicting where and how data is stored, as well as who has done what with which data. Providing all this information to regulatory authorities is crucial, in conjunction with other documentation of GxP validation. While the analysis for FDA submissions is mostly done in SAS, more companies are considering moving to open-source technology such as R to gain greater flexibility and to leverage their growing number of skilled experts. (Presented by Kevin Lee, Assistant VP of Data Science and Machine Learning at Genpact, in the talk “Enterprise-level Transition from SAS to Open-Source Programming for the whole department”; by Claus Dethlefsen, Statistical Director at Novo Nordisk, in the talk “Co-existence of R and SAS for multiple imputation in Novo Nordisk”; and by Rose Grandy, Principal Statistical Analyst at Abbott, in the talk “10 Practical Considerations for moving to R”.)

At R/Pharma, speakers highlighted the need for implemented standards for clinical data formats to unify and simplify the way regulatory submissions are performed. Collaborative attempts to streamline reporting to regulatory authorities were also discussed during “Pharmaverse: Breaking boundaries through open-source collaboration”, the talk by Ross Farrugia, Data Engineer and Product Family Lead for Genentech, Roche. Some inherent challenges mentioned include reproducibility, system validation, and the submission of R packages with private information to regulatory authorities. During her joint talk “R Pilot Submissions to the FDA” with Ning Leng, People and Product Lead of Product Development Data Sciences at Genentech, Roche, Hye Soo Cho, a statistician from the FDA, also mentioned the need to prove the reliability of data management and statistical analysis software, as well as of the testing procedures performed. Lastly, any code used during the process should be submitted for review. When using code within Genedata Profiler to process and analyze clinical data formats (e.g., CDISC SDTM and ADaM), users benefit from built-in full data lineage, code version control, CI/CD, data access reports (audit trails), and full documentation for streamlined computer system validation (e.g., IQ, PQ, and QC). As the Genedata Profiler environment can also be connected to SAS and Posit, it enables a side-by-side comparison, making the transition to open source a smooth process. With Genedata Profiler, complete validation of this environment is possible through automated testing procedures for compliance with regulatory requirements.

So, What Are Some New Posit Features for Biopharma Development?

While the R/Pharma conference served primarily as a platform for biopharma organizations all over the world to share how they are using R to achieve their goals, overcome challenges, and unlock opportunities, attendees also had the chance to learn about a few new developments at Posit. One was Quarto, introduced during a talk by Thomas Mock, Ph.D., Customer Enablement Lead at Posit. Quarto is essentially the next generation of R Markdown. It is an open-source scientific and technical publishing system built on Pandoc, the command-line document converter. As a language-agnostic command-line interface, Quarto can be used with R, Python, Julia, and JavaScript and integrates with Posit (RStudio) and VS Code. This enhances collaboration on scientific communication, data analysis, and data science projects between scientists and non-technical domain experts (e.g., clinicians), as each user can interact with the data in the way they know best.

Max Kuhn, Software Engineer at Posit, also shared new developments for modeling in the tidyverse with tidymodels, a collection of packages for machine learning models using tidyverse principles. These included models for censored regression, which featured new modeling modes, computational engines, and standardized survival packages for predicting survival time. A new AI platform, known as H2O, was also mentioned, which can handle large datasets without expensive data transfer penalties. As the platform has APIs for Python and R, parallel processing is made easy with other external tools. Some added enhancements to the clustering models were also mentioned. If you’re curious, watch the talks here.

Conclusion

R/Pharma 2022 was an insightful conference that brought experts in data science, bioinformatics, and drug development together to exchange insights on the applications of R. The community continues to make strides in working collaboratively to overcome the challenges associated with unlocking value from data, all to bring treatments to patients faster.

Mention of individuals, companies or research organizations in this article does not indicate their endorsement of Genedata or its products.