Simplifying Target Discovery with an Artificial Intelligence Engine and Natural Language Chat

Across the pharmaceutical industry, companies are finding new ways to leverage artificial intelligence (AI) to accelerate and revolutionize drug discovery and development, clinical trials, and other critical activities. Simultaneously, generative AI platforms, most notably ChatGPT, are becoming popular with the public and are being applied in an ever-growing number of contexts. In this Q&A, Petrina Kamya, Ph.D., Head of AI Platforms and President of Insilico Medicine Canada, discusses Insilico’s AI-based software platforms and how adding a natural language–based chat functionality to their PandaOmics target discovery engine enables researchers to probe large, heterogeneous data sets through conversations with the platform.

David Alvaro (DA): To begin, can you introduce us to Insilico Medicine in general and the company’s overall mission?

Petrina Kamya (PK): Insilico Medicine is an AI (artificial intelligence) drug discovery company. We have two discrete but overlapping business models: we develop generative AI–driven software, and we use that software internally to develop our own assets. We license both the software and the assets that we create.

Insilico was established in 2014, having emerged from Johns Hopkins University in Maryland. Since then, we have grown to be a global company with headquarters in Hong Kong and New York and offices in Abu Dhabi and Montreal, Canada. Our R&D center is in Shanghai, and we have a robotics lab in Suzhou, China. We are now located literally all over the world; I believe that we cover all time zones.

In terms of our mission, we are focused on pursuing diseases for which there is high unmet need and accelerating the discovery of new targets and new therapeutics to get them to patients faster, primarily by using AI.

DA: Was that dual business model planned from the beginning, or did it emerge along the way?

PK: Initially, we began developing deep learning algorithms to address many of the challenges associated with drug discovery and development. We put these algorithms together and built three specific platforms: PandaOmics, which focuses on target discovery; Chemistry42, which handles small molecule design and optimization; and InClinico, the platform we launched most recently, which we developed for clinical trial planning and outcome prediction, specifically to predict the probability that a program will transition from phase II to phase III.

That was the founding goal of the company. However, we quickly realized that we needed to validate the software platforms in order to show that they truly worked as we intended. To do so, we started to build out our own programs. Beyond successfully validating our platforms, we saw clear value in developing these programs to the point where they can be licensed out, since that becomes another revenue-generating business model.

DA: Across the areas that your platforms support, where are some of the most pressing bottlenecks, data burdens, or shortcomings that are best served through AI approaches?

PK: First and foremost, for target discovery and chemistry more broadly, there is a real need to overcome human bias in terms of identifying novel targets and candidate molecules that could become first-in-class therapeutics. In target discovery, AI will take multimodal data and tease out patterns that can help identify novel targets. Beyond the targets themselves, it can help us understand the pathways and the genes that are implicated in the disease, as well as other diseases that are linked to that disease. AI has a very strong ability to uncover patterns in multimodal data that are otherwise quite difficult for us to decipher on our own.

In chemistry, AI is very good at imagining things without the inherent human bias that we have. Many generative AI technologies have emerged that can be leveraged to discover novel chemical molecules. In addition, we are using other techniques, including reinforcement and active learning, to improve those molecules and to optimize them so that they satisfy certain properties that are necessary for drugs.

In the realm of clinical trial outcome prediction, the AI models underlying the InClinico platform again allow us to tease out features that affect the probability that a clinical program will succeed during the critical transition from phase II to phase III, features that you would otherwise not be able to identify. Essentially, the core of all of these platforms is this ability to find patterns in multimodal data that are out of reach of human researchers.

DA: For the sake of clarity for those of us who may be a little behind on the AI field, can you explain what is meant by “generative AI?”

PK: In the simplest terms, you can think of “generative AI” as any use of AI to create something new based on data on which it was trained –– essentially, any time AI is asked to generate something. You might be familiar with some of the popular applications that generate text (like ChatGPT), voices, or images; in the same way, you can use AI to generate molecules. It uses neural nets, deep learning algorithms, and so forth, but the aim is to generate something new.

DA: What do you think differentiates PandaOmics from other computational target discovery methods that have been developed?

PK: We have quite a few strong differentiators. For example, we have our time machine approach and our iPanda algorithm. Both of those are used to identify the relationships between a gene and a disease in a manner that is unique to PandaOmics. In addition, we recently added a transformer-based knowledge graph, which is a feature that takes available information related to a disease and maps out all of the connections that are found in the literature. A user can then use this knowledge graph to better understand the relationships among genes, diseases, and medications that are used, the pathways that connect them all, and other diseases as well.

Most recently, we have connected that knowledge graph to a chat functionality based on large language models. I believe that we’re one of the only companies, if not the only company, that has done this with a target discovery engine. This chat functionality, which we call ChatPandaGPT, allows the user to query that knowledge graph and identify what these relationships are, based on exactly what the user would like to know. ChatPandaGPT makes the knowledge graph more accessible and more user friendly, and it makes the information more understandable as well.

DA: This really seems like an important development, since no matter how good AI is at accomplishing things beyond the reach of humans, the results ultimately need to be translated in a way that a human operator can understand. Before ChatPandaGPT was developed, what was the user interface and experience with PandaOmics like?

PK: Things were definitely somewhat disconnected. The knowledge map would be a beautiful image centered on the disease, with edges connecting different nodes (little circles representing the genes that are implicated), all of which would be connected. But you’d have to access this information in a piecemeal manner. You’d go: “Oh, this gene is interesting,” and you could click on it and be taken to a gene page with more information about that gene. And then, depending on what’s written on the edges that connect the different nodes, the gene might be either upregulated or downregulated. So, the burden would to some extent be on the user to aggregate this piecemeal information and assemble it meaningfully at the end.

In contrast, with ChatPandaGPT, you can just type in a prompt: “Show me the genes that are implicated in this disease and any other diseases, and list those other diseases.” And all the information that you are looking for will be listed out for you. Whatever sort of relationship you’re looking to learn more about, you can just type it into the prompt. The chat functionality will talk to the knowledge graph, which holds very specialized information, and then transform that into a form that is more informative and more comprehensible to you.
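As a rough illustration of that pattern (and not Insilico’s actual implementation), the sketch below shows how a chat layer might translate a question into a query over a toy gene–disease knowledge graph and hand the structured result back to a language model to phrase as an answer; all names and relations here are hypothetical.

```python
# Hypothetical sketch: a chat layer over a toy gene-disease knowledge graph.
# Node names, relations, and structure are illustrative only.
import networkx as nx

kg = nx.Graph()
kg.add_edge("GeneA", "Disease1", relation="upregulated_in")
kg.add_edge("GeneA", "Disease2", relation="implicated_in")
kg.add_edge("GeneB", "Disease1", relation="downregulated_in")

def genes_shared_with_other_diseases(graph, disease):
    """List genes linked to `disease` along with any other diseases they touch."""
    results = {}
    for gene in graph.neighbors(disease):
        other = [d for d in graph.neighbors(gene) if d != disease]
        if other:
            results[gene] = other
    return results

# A chat front end would translate a prompt such as
# "Show me the genes implicated in Disease1 and any other diseases they touch"
# into a structured call like this, then pass the result back to the language
# model to phrase as a readable answer.
print(genes_shared_with_other_diseases(kg, "Disease1"))  # {'GeneA': ['Disease2']}
```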

DA: In my limited experience with ChatGPT and things of that nature, I’m constantly discovering new features and benefits beyond what I was originally seeking. In the development of ChatPandaGPT, did you begin with certain goals in mind but found that unexpected benefits emerged along the way?

PK: In my personal experience using ChatPandaGPT, I was surprised at how useful it actually is. Initially, I thought that it would essentially ingest information and spit it back out in different ways that would be useful. But I found the interface to be so much more informative, and it simplified the whole process a lot more than I thought it would. In addition to that, we’re now looking into how we can incorporate this technology into our other platforms as well. We’ve found that it’s surprisingly useful, and we’ll see what additional uses and benefits evolve as we go.

DA: Will that process of applying the chat interface to your other platforms, like Chemistry42 and InClinico, be fairly straightforward?

PK: It should be relatively straightforward. It took our team about a week to do it for PandaOmics. That’s very fast, although I won’t say it was easy, and they were able to integrate this very, very cool technology into the platform, where it has proven to be incredibly useful.

DA: Does the chat functionality shift who the potential user can be and make it more widely accessible to a less specialized person?

PK: Absolutely. The work that goes into creating the disease page and analyzing the data is not for everyone. But once the disease page has been analyzed and created, the ChatPandaGPT functionality definitely increases the usability and the accessibility of this information.

DA: This clearly represents a significant step forward in probing these multimodal data sets. Do you think that there is still a great deal more that can be unlocked here, perhaps as natural language processing itself continues to evolve?

PK: I definitely think so. There is a lot of information out there already, and biology is still not very well understood. We are always trying to investigate diseases in a much more in-depth way, and the etiology, pathology, and epidemiology are still very much unknown for a lot of diseases. The heterogeneity of disease adds yet another layer of complexity.

All of that is data, and everything can be processed and hopefully be used to train a large language model that will help us better understand the biology of diseases. Ultimately, I think we are just at the very beginning of all of this.

DA: I’d love to briefly touch on the second business model at Insilico Medicine. Can you tell us about the therapeutic areas where the company is focusing and to what extent you believe your platforms have enabled discoveries that may not have been possible using other approaches?

PK: We primarily focus on a few therapeutic areas: fibrosis, oncology, CNS diseases, and immunology. Our CEO is particularly passionate about aging, and so a lot of the diseases and targets that we pursue are implicated in aging, such as fibrosis, inflammation, and some of the key pathways associated with aging. There is great synergy in that many of the diseases that we’re investigating, whether they are chronic diseases or diseases that people are suffering from now, involve targets that are also implicated in aging. What would be really cool is to see whether these drugs that we’re developing have a dual effect on patients: on the disease itself but also on people’s lives and their quality of life. I think there is the potential to unlock a lot of interesting outcomes from our pipeline.

DA: It seems like the study of aging aligns very well with what your platforms can achieve, since it is so inherently heterogeneous and has eluded more traditional, conventional approaches to target discovery.  

PK: That’s exactly right. Aging is not classified as a disease, but there are a lot of diseases that develop as the result of aging. In essence, if you’re investigating a disease, you’re looking into an underlying pathway that is probably linked to aging anyway, even though aging itself is not a disease. To date, as a pharma company, you still can’t really just say you’re targeting aging.

DA: You mentioned that you license both your software platforms and your pipeline assets. Particularly with regard to the software, how do those relationships typically work?

PK: We are very flexible and adaptable, and we work with different companies in different ways, depending on their needs. Every pharma company works in a unique way, and many are looking for a partner who can enable them to develop their own pipeline of therapeutics in their own way rather than take over their drug discovery programs. Those companies can license our software.      

Other companies are more interested in bolstering their internal pipelines with additional programs without taking time away from their focus with their internal resources, so they outsource the entire process. We can nominate an initial target and develop everything up to a stage where they are ready to in-license it as a partner.

DA: Since, as we’ve said, we are just at the very beginning of unlocking the potential of AI in drug discovery, clinical trials, and beyond, can you share a bit about Insilico’s vision of the full potential of AI and how you see it revolutionizing all of these areas in the coming years?

PK: Right now, I think that exactly what we set out to do is going to continue happening for some time. There are many, many stages of drug discovery and development, going all the way to commercialization. At the moment, we are just at the very beginning. At every single stage, there are definitely bottlenecks and challenges.

I think that what you’re going to see happening is that more and more of these challenges will be addressed using AI techniques. It’s just inevitable. In most processes there are certain things being done that are redundant, repetitive, or lacking in imagination –– not through anyone’s fault, but that’s just the way it is. For all those, you can adapt AI algorithms to help alleviate those bottlenecks, address challenges associated with insufficient imagination, and improve and streamline the process. I believe that’s what’s going to happen in our industry, piece by piece.

DA: In the near term, is Insilico Medicine more focused on further elaborating and tightening up these existing platforms or expanding and applying a similar approach to these different aspects of drug development?

PK: Both! We have thought leaders in the company who are really, really passionate about the products that we’ve created and about elaborating them further. We also have innovators who are always thinking about the next thing and pushing the envelope. I anticipate many developments on both fronts.

Originally published on PharmasAlmanac.com on April 13, 2023.

Combining RNA Splicing and AI Technologies to Accelerate Drug Discovery and Development

Envisagenics is a techbio company that uses machine learning and advanced AI for the rapid discovery and validation of next-generation drug targets based on RNA splicing errors, which are associated with almost 400 diseases. 

Building Upon Successful Proof of Concept in Spinraza

Envisagenics focuses on oncology, neurodegenerative, and metabolic disorders. We have developed a proprietary target discovery platform to look for RNA-splicing-derived drug targets for antisense oligonucleotides (ASOs) and immunotherapeutics (including antibodies and cell-based therapies). 

The company was spun out of the lab of Adrian Krainer, Ph.D., at Cold Spring Harbor Laboratory, which studies mechanisms of RNA splicing. The lab was involved in the development of a therapeutic for spinal muscular atrophy (SMA), a neuromuscular disorder that is the leading genetic cause of infant deaths. This drug later became Spinraza — developed in collaboration with Ionis and Biogen — the first therapeutic approved by the FDA based on modulating RNA splicing.

The success of Spinraza and the impact that it had on so many children’s lives inspired us to found Envisagenics in 2014. We saw the promise of RNA splicing therapeutics and knew we could leverage AI and machine learning to automate, accelerate, and significantly improve the manual process of drug target discovery across multitudes of indications.

Building a Platform that Improves over Time

Envisagenics is at the forefront of a new era in biopharma — the emerging intersection of advanced AI and RNA splicing-based therapeutics. Through the years, Envisagenics has analyzed thousands of RNA-sequencing data sets to identify and validate quality assets for its discovery pipeline. Its AI-driven platform SpliceCore® continues to become more robust over time as it processes additional sequencing and experimental data. As progress has been made with these in silico-identified targets, Envisagenics has built out its own laboratory space to translate in silico findings into validated drug target candidates.

Envisagenics is one of the founding members of The Alliance for Artificial Intelligence in Healthcare (AAIH), a coalition of technologists, pharmaceutical companies, and research organizations that have a shared goal of realizing the full potential of AI and machine learning in healthcare. Since its inception, Envisagenics has remained on the cutting edge of AI drug discovery, executing multiple partnerships with biopharma companies that revolve around the drug targets discovered by the SpliceCore platform.

The Importance of RNA Splicing

Newly transcribed pre-mRNAs are not continuous coding sequences; they include coding segments (exons) frequently interrupted by noncoding segments (introns). Splicing is an RNA-processing step that removes the introns and connects the coding exons together to generate mature mRNAs. Without robust and accurate splicing, the production of mature mRNA molecules breaks down.

Splicing is regulated by the spliceosome, a large, dynamic complex comprising over 300 proteins. As one of the largest complexes in the cell, the spliceosome often fails due to mutations and differences in gene expression. These alterations often lead to splicing errors that can affect many mRNAs throughout the cell, some of which cause disease. To date, approximately 400 diseases with diverse pathologies have been connected to splicing errors.

Connecting Splicing Errors to Diseases

Cancers are often associated with spliceosomal failure and splicing errors in general. Many hematopoietic tumors have mutations in the core spliceosome proteins responsible for catalytic activity (the cutting and pasting of RNA). Solid tumors (breast, lung, colorectal cancer) tend to have altered expression of core, as well as ancillary, splicing proteins. The altered state of these proteins disrupts splicing activity, creating a vulnerability that can be exploited for the discovery of novel targets and the development of therapeutics. Likewise, similar widespread misregulation of RNA processing underlies neurodegenerative and metabolic diseases. Aberrant splicing-derived transcripts in many cases drive disease progression and likely also contribute to the pathophysiology of these diseases.

Many RNA splicing diseases have been identified lately and, until recently, there was no easy way to profile the spliceosome. With mRNA sequencing technology, it is now possible to quickly focus on the spliceosome and look for alterations. As a result, in the last 10-15 years, more evidence of splicing errors has emerged. 

Given that the splicing-based therapeutics sector is still emerging, there are relatively few examples of approved therapies. Spinraza provided a compelling proof of concept and roadmap for further work, including treatments for Duchenne muscular dystrophy (DMD) and other neurodegenerative diseases. More candidates are progressing through clinical trials, many of which are related to neuromuscular disease. In oncology, there are several therapies under development that target the spliceosome itself with the intention of correcting splicing errors, with most currently at the preclinical stage and a few in clinical stages.

Genomic versus Splicing Approaches for Immunotherapy Target Identification

The current approach to identifying immunotherapy targets involves leveraging genomics data. A DNA sequence is evaluated, and variants that potentially encode neoantigens are identified. One of the biggest limitations to this strategy is that not every tumor exhibits multiple mutations that can be leveraged. Breast cancer, prostate cancer, and certain types of leukemia have low tumor mutational burdens, so it can be difficult to identify targetable neoantigens. The tumor types that have low mutation burdens are often rich in splicing errors, and a splicing approach can fill in the gaps and aid in the discovery of novel neoantigens.

If a neoantigen is identified through DNA analysis, it is still necessary to confirm that the corresponding gene is actually expressed. A mutation in the DNA sequence does not necessarily lead to a real effect on gene expression. Neoantigens found on mRNA are, by definition, expressed at the mRNA level. Therefore, identifying them using mRNA rather than DNA can save time.

Using AI to Predict Drug Targets

A predictive ensemble approach involves bringing together several algorithms in a voting system. Each unique algorithm trains a different model with different data to answer the same question and proposes the best answer. In the case of Envisagenics’ technology, the goal is to identify optimal drug targets.

Developing bespoke algorithms with very specific questions and training them with adequate data facilitates interpretation of the results. Understanding the reasons behind the prediction is also easier, which is critical in drug discovery. Drug discovery is risky, expensive, and time-consuming, and it is necessary to have a narrative to support the pursuit of drug targets proposed using AI.

The SpliceCore platform from Envisagenics uses RNA sequencing data and outputs drug target candidates for specific modalities. Different modalities have different requirements, so it is often not appropriate to use the same set of algorithms for the identification of any target. For example, antisense therapies must affect splicing regulation, while immunotherapy targets must be expressed on the membrane. 

Regardless of the modality, the incoming data are used to reconstruct the transcriptome, leveraging an exon-centric approach, focusing on smaller exons that contain all of the information needed for target discovery. Fractions of the transcriptome are created first, which allows the software to consider approximately 7,000,000 potential splicing events when constructing the transcriptome, rather than only about 30,000 genes. Consequently, the search base is much larger and enables the identification of pathogenic splicing errors.

In the next stage, a predictive ensemble is employed to seek optimal drug targets. The different algorithms evaluate different criteria, such as expression pattern, protein stability, localization, antibody accessibility, regulator blocking, and so on. The platform synthesizes all of this information and generates a list of targets that are optimal for the greatest number of individual algorithms. No target is optimal for all predictors, because not all targets are optimally predicted in the same manner.
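To make the voting idea concrete, the following is a minimal, hypothetical sketch of an ensemble in which several simple predictors nominate candidate splicing events and candidates are ranked by how many predictors consider them optimal; the predictor names and candidates are invented and are not the actual SpliceCore algorithms.

```python
# Hypothetical ensemble "vote": each predictor answers a different question
# (expression, localization, accessibility, ...) and nominates the splicing
# events it considers optimal; candidates are then ranked by vote count.
from collections import Counter

candidates = ["exon_event_1", "exon_event_2", "exon_event_3"]

def expression_vote(c):    return c in {"exon_event_1", "exon_event_2"}
def localization_vote(c):  return c in {"exon_event_2", "exon_event_3"}
def accessibility_vote(c): return c in {"exon_event_2"}

predictors = [expression_vote, localization_vote, accessibility_vote]

votes = Counter(c for c in candidates for predict in predictors if predict(c))

# No candidate needs to win every vote; the list is simply ordered by how
# many predictors consider each one optimal.
print(votes.most_common())
# [('exon_event_2', 3), ('exon_event_1', 1), ('exon_event_3', 1)]
```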

From the generated list, a pool of targets is selected to take into the lab for validation. Generally, the pool comprises a diverse set of targets to create resilience. Rather than using a funnel approach where a large set of potential targets is continuously narrowed down, Envisagenics uses many different approaches, because algorithms can also test perfectly on paper but fail in practice. The more algorithms included in the ensemble, the better. 

Overall, the platform interrogates approximately 3.5 million hypotheses per hour. This approach is compelling and unique. Ultimately, however, target validation is what matters. In the lab, Envisagenics works to show that these novel targets exist — that they are indeed being expressed — and that they behave as expected in terms of subcellular localization and mechanism, heterogeneity, and other relevant factors.

Continuously Innovating with a Focus on Talent and Science

Envisagenics’ success can be largely attributed to the mission-driven team that has been assembled. Hiring efforts focus first on talent and science. As a result, the team is highly diverse, both professionally and culturally. Everyone is passionate about developing splicing-based therapeutics for patients using AI — but everyone has different ideas and a unique vision of how to get there. By establishing an environment where people are encouraged to voice their opinions, we ensure that everyone is heard, and the myriad talents are fully leveraged. With shared values guided by our mission to help patients in need, we can inspire insight-generating discussions that result in innovative ideas and solutions. 

One of those shared missions is constant innovation. The team at Envisagenics is constantly thinking about how to integrate new algorithms and new data types into the platform. The data science team in particular focuses on creating new ways to identify drug targets, efforts that are shaped by the evolution of data types, developments by other groups reported in the literature, and general technological advances, such as new cloud-computing technology. 

Envisagenics’ Partnerships

Partnerships are a core part of the Envisagenics business model. In general, each research collaboration focuses on a specific disease or disease subtype. Determining which disease indication the company should focus on requires synthesizing many factors. The need in the marketplace is one consideration, as is the feasibility of the disease with respect to the potential for identifying splicing-derived targets, and the ability to validate identified targets in the lab. 

It is equally important to assess the availability of data. The ideal scenario will involve thousands of patients and five or six independent cohorts to evaluate reproducibility and other attributes. Within Envisagenics, a group is dedicated to searching for data and creating separate collaborations based on data. One example is a current collaboration with Cancer Research Horizons, in which Envisagenics is performing research using their patient data.

For biopharma partners tackling relevant disease indications, meanwhile, Envisagenics identifies optimal candidates, and the partner uses its expertise in drug development to take the assets through the clinic. Over the years, the company has worked with a range of biopharma companies, such as Johnson & Johnson and Biogen. Most recently (November 2022), a collaboration was established with Bristol Myers Squibb to leverage Envisagenics’ SpliceCore AI platform to accelerate the discovery and development of oncology splicing-derived targets and expand BMS’ oncology pipeline.

Over the near and intermediate terms, Envisagenics will pursue additional partnerships with biopharma companies in which the SpliceCore platform is leveraged to identify novel targets in combination with the biopharma partner’s expertise to successfully develop therapeutic candidates. In the future, the company intends to advance its internal assets as far as possible. 

Bigger Role for Data and AI Beyond mRNA Splicing

Envisagenics’ expertise lies in leveraging AI to analyze mRNA splicing. The role of big data and AI in biopharma is the next frontier of scientific advancement. Society today is data-driven and, given that biopharma is a place where smart scientists and technology coincide, innovation in the field cannot take place without leveraging the latest developments in big data and AI. 

Many companies, Envisagenics among them, will be looking to eliminate inefficiencies and bottlenecks to speed up drug development. Some will focus on target discovery, others on drug design, and yet others on improving clinical trials. Some of these companies will grow and scale into full-blown biopharmaceutical firms that happen to integrate AI into their processes.

Envisagenics is motivated and inspired by the promise of advanced AI technologies. Using these technologies, we can streamline and optimize the drug discovery process, reducing the cost of medicines and ensuring that treatments are readily available to patients in need. These goals are already within our reach.

Envisagenics secured Series A financing in September 2021, which was led by Red Cell Partners, and which included follow-on investments from Dynamk Capital and investors from Envisagenics’ seed rounds, including Microsoft’s M12, Madrona Venture Group, Third Kind Venture Capital, and Empire State Development’s venture capital arm, New York Ventures. Envisagenics’ investors understand and uphold the company’s goals and have the vision needed to support the development of therapeutics through the combination of classic research and AI.

Originally published on PharmasAlmanac.com on April 12, 2023.

Using Artificial Intelligence to Get the Right Drug to the Right Patient at the Right Time

For years, we have known that we have an abundance of real-world data and evidence that could drive real insights and enable much more personalized medicine — in which a drug is more precisely matched to a patient rather than a more generalized population — but that it is largely hidden in clinical narratives and unstructured electronic medical records, preventing the data from being analyzed at scale. Recent innovations in artificial intelligence and natural language processing technologies, however, are removing these barriers and helping to realize the promise of real-world data and personalized medicine. Stefan Weiss, M.D., FAAD, Managing Director of Dermatology at OM1, a real-world data, outcomes, and technology company with a focus on chronic diseases, discusses the promise of these data and how OM1 is working to overcome the inherent challenges, with Pharma’s Almanac Editor in Chief David Alvaro, Ph.D.

David Alvaro (DA): Can you introduce OM1 and the mission to which the company is dedicated?

Stefan Weiss (SW): OM1 focuses on finding new ways to drive improved patient outcomes. All of healthcare and medicine ultimately aims to extend the lives of patients while at the same time offering a better quality of life. OM1 approaches that broader mission from the standpoint of data to determine how best to leverage healthcare data to optimize outcomes. How does data drive our ability to make healthcare more efficient? How can we improve clinical trial design by understanding the patient population? How can we identify patients who tend to be underrepresented in clinical trials and incorporate those groups more into clinical investigation so that we develop drugs for the entire population that suffers from a given disease rather than only a portion of that population? How do we use data to understand where certain therapies work better or worse, either intrinsically or within different sub-populations?

Over the last 50 years or so, we have seen that different patient populations — whether classified by race, gender, or other discrete categories — have different responses to drugs, including in immunology. If, for example, certain population groups have better responses to a particular mechanism of action for psoriatic arthritis, we want to ensure that we can explain those data and present them to the greater scientific community so physicians know that the corresponding product would be a better or worse choice for patients within these groups.

Real-world data are also critical to truly understanding the safety of a particular product, because the patients enrolled in a given clinical trial may not always be fully representative of the patient population that will ultimately need that drug. It’s important to investigate whether we find the same side effect profiles and safety outcomes in large community-based populations that we do in the clinical trial setting.

All of those areas of big data analysis have become critical to driving improved outcomes and improved health status of populations. At OM1, we focus on three of the four primary areas of drug development — immunology, cardiometabolic medicine, and mental health / neuroscience — and all of the disease states within those areas, representing chronic conditions with significant unmet needs. If you can intercede appropriately and drive better outcomes in these areas, you can have a real impact on people for a very long time.

DA: What are the foundational technologies that OM1 leverages to unlock all the secrets within all of those healthcare data?

SW: We have a number of different technologies that play and build upon each other. The first task for the technology is to efficiently extract data from disparate sources. Community physicians use different EMRs (electronic medical records), hospitals typically use EMRs that differ from those used by community physicians, pharmacies and pharmacy claims providers use their own systems, and medical payments come from yet another system. Until you can get all of those systems to speak to each other in a language that everybody understands, those data remain siloed, which prevents generating a full picture of the patient. A key differentiator for OM1 is that we center on the patient, whereas many other data vendors center on the claim.

We see the world of medicine as driven by the human beings who suffer from disease. If you start at that point and focus on the clinical description of the patient journey, you can then bolt on data from a host of other sources using a process of tokenization. Our technological innovation focuses on consolidating data from different EMRs into a usable format; with that foundation, it is possible to layer on a variety of different data sources.

Central among these technologies is natural (or medical) language processing to allow the “machine” to read the notes. Certain medical specialties are very structured in their data — think cardiology as it captures blood pressure, cholesterol levels, etc. — simple, consistent, and repeatable numbers. On the other hand, notes from a psychiatrist typically take the form of an hour-long re-telling of a patient’s story without much structured data. Somebody — or something — needs to read that note and pull out the important parts. I’m a dermatologist, and for dermatologists, it’s very much a clinical narrative. “Red, itchy rash on elbow” describes the patient’s story and provides a huge amount of information to support our understanding of how that disease impacts the patient. In dermatology, that becomes critical, because as we think about treatments, we want to know where the individual is being impacted.

When converting that narrative into something more structured, disease presentation and disease location are two of the most important fields. The next priorities would be the severity of the disease and how it is measured in a standard outcome. Sometimes, we record it in a structured field, but more often than not it’s recorded in that unstructured narrative. If a person has moderate psoriasis or a BSA (body surface area) of 10%, we won’t be able to use those data unless the machine can read them. But if it can, we can track disease severity, and then bring in AI on top of that to apply big data models to create disease estimations.
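As a purely illustrative sketch (far simpler than the medical language processing described here), the snippet below pulls a body surface area percentage and a severity adjective out of an unstructured note; the note text and patterns are invented.

```python
# Hypothetical, greatly simplified sketch of pulling severity signals out of
# an unstructured note; real medical language processing is far more involved.
import re

note = "Pt reports red, itchy rash on elbows; BSA approximately 10%; moderate psoriasis."

def extract_severity(text):
    structured = {}
    # Capture a body-surface-area percentage if one is written in the note.
    bsa = re.search(r"BSA\s*(?:approximately|~)?\s*(\d+)\s*%", text, re.I)
    if bsa:
        structured["bsa_percent"] = int(bsa.group(1))
    # Capture an explicit severity adjective if the clinician recorded one.
    severity = re.search(r"\b(mild|moderate|severe)\b", text, re.I)
    if severity:
        structured["severity"] = severity.group(1).lower()
    return structured

print(extract_severity(note))  # {'bsa_percent': 10, 'severity': 'moderate'}
```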

In mental health, we have a whole collection of PHQ-9s (patient health questionnaires), a disease severity score for people with depression, but they aren’t always recorded consistently. However, if there’s enough information captured in the notes, the machine can process and understand the patient’s overall mental health status and assign a category level, or ePHQ-9. The ePHQ-9, or estimated PHQ-9, has a high level of concordance with the recorded PHQ-9 and thus becomes a very useful tool for research.

Understanding what disease severity looks like at a single point in time allows you to track the severity over time, which provides a better understanding of the response to a given drug. Is the patient’s disease severity improving, worsening, or staying the same? What side effects is this person having (weight gain or loss; more itching; sleeplessness)? How do those symptoms correlate with the severity of the disease and/or the improvement of that individual’s disease severity with a change in therapy?

Gathering all of those data can help us understand why people stop taking a drug, which allows us to create discontinuation reports. We’ve all been prescribed medication for one condition or another, and then at some point that medicine changes. But why? Is it because we weren’t responsive, because we were experiencing side effects, or because we suddenly developed another illness for which that drug may have been contraindicated? All of that is typically documented in the clinical narrative in the EMR, but there’s no structured field for it to be easily recorded and extracted.

Once a machine structures those data, it can create models to characterize this phenomenon, which enables insights. Hypothetically, we know that 32% of patients with a particular immunologic disease (e.g., psoriasis) stopped drug A because of a contraindication or because of a lack of tolerance. We can even drill down more precisely to weight gain, weight loss, or diarrhea. Researchers can then understand — at scale — why patients are stopping a particular drug or why providers are stopping patients on that drug. This type of data is much better than what was formerly possible: doing market research with 15 physicians, hoping that they can recall why they may have stopped drug X for disease Y, and then extrapolating the insights from those 15 across thousands of physicians.
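As a toy illustration of the kind of roll-up described above, once discontinuation reasons have been structured, figures like the hypothetical 32% become a simple aggregation; the records below are invented.

```python
# Invented records: once discontinuation reasons are structured, percentages
# of this kind reduce to counting.
from collections import Counter

records = [
    {"drug": "A", "reason": "contraindication"},
    {"drug": "A", "reason": "weight gain"},
    {"drug": "A", "reason": "lack of efficacy"},
    {"drug": "A", "reason": "contraindication"},
    {"drug": "B", "reason": "lack of efficacy"},
]

reasons = Counter(r["reason"] for r in records if r["drug"] == "A")
total = sum(reasons.values())
for reason, count in reasons.most_common():
    print(f"{reason}: {100 * count / total:.0f}% of drug A discontinuations")
```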

The AI model also enables our PhenOM™ platform, which is focused on phenotypically fingerprinting a patient to both help make better sense of clinical and real-world data and enable more personalized medicine. Are there particular patients who would respond better to drug X for rheumatoid arthritis, for instance? The best way to understand that is to analyze patients who have done well versus patients who have done poorly on that drug. Once that has been established, a new patient can be matched to those subtypes of patients. For example, it becomes possible to infer that, based on the phenotypical fingerprint of the patient, that patient would have an 80–90% chance of responding well to drug A but only a 30–40% chance of responding well to drug B. This personalization facilitates an unbelievable shift in the arc of medicine in terms of getting the right patient on the right drug at the right time.
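The sketch below illustrates the general idea of matching a new patient’s phenotypic fingerprint against prior responders and non-responders using a simple nearest-neighbor rule; the features, data, and method are hypothetical and are not the PhenOM model itself.

```python
# Hypothetical nearest-neighbor sketch: compare a new patient's feature vector
# against prior patients with known outcomes on drug A. A real model would use
# many more features, scale them, and validate carefully.
import numpy as np

# Columns: age, disease duration (years), prior biologic failures, baseline severity.
history = np.array([
    [45.0, 3.0, 0.0, 6.1],
    [62.0, 10.0, 2.0, 7.4],
    [38.0, 1.0, 0.0, 5.2],
    [55.0, 8.0, 1.0, 6.9],
])
responded = np.array([1, 0, 1, 0])  # 1 = did well on drug A

def estimated_response_rate(new_patient, k=3):
    """Share of the k most similar past patients who responded well."""
    distances = np.linalg.norm(history - new_patient, axis=1)
    nearest = np.argsort(distances)[:k]
    return responded[nearest].mean()

print(estimated_response_rate(np.array([50.0, 4.0, 0.0, 6.0])))
```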

DA:  Before natural language processing and AI became as advanced as they are today, discussions about leveraging real-world data and evidence often emphasized the need to build more of that structure you discussed directly into the EMRs. Have you seen much progress on that front, and is that still critical?

SW: It will always be easier to capture data that are already in structured fields, and having those structured fields within an EMR serves as a reminder to providers to document those individual data points, which is important. However, a really transformative change in the structuring and standardization of data elements would need to come from the EMR companies, and because there isn’t a single EMR used even within a given specialty, buy-in would be required from every EMR vendor and every provider. As providers have fought technological intrusions into clinical medicine for decades, this would just add more fuel to that fire.

DA:  I wanted to speak more about the Reasons for Discontinuation (RfD) reports that OM1 has published. Can you discuss your vision for how these data can be used and the impact the reports can have?

SW: The RfD reports present very important information that could never before be analyzed at scale, because most of the reasons for discontinuation of a drug have always been captured in narrative form. Without a way to aggregate all of this EMR information and validate it comprehensively, the industry could only do market research in focus groups to understand the perception of a drug and why it was started or stopped among very small segments of clinicians. With these RfD reports, discontinuation can be investigated at scale across broad swaths of a population, assessing both a diversity of providers and a diversity of patients. The reasons why those using the drug then stopped can now be identified.

The first reports that have been launched focus on treatments for chronic inflammatory diseases because of the breadth of data within dermatology and rheumatology already in the system. This allowed the models to be tested and validated more easily. After the process was established, it became possible to expand into mental health diseases to gain an enhanced appreciation of how drugs targeting depression or schizophrenia are being utilized in the marketplace, as well as of why some of those therapies have better or worse persistence than others.

The RfD reports are also expanding our understanding of how subtypes of patients respond to a given drug. This will enable clinicians to make better determinations about whether a given patient from a specific sub-population is more or less appropriate to start on a given drug. The RfD report can either confirm or dispel myths: a side effect that may only have been experienced by a handful of people could have been assumed to be more prevalent (i.e., a sampling bias). If a physician only treats a small number of patients with a condition and two of them randomly experience a side effect, and the physician publishes a case report, it may be incorrectly extrapolated that the side effect is far more common than it is in actuality. With more data and more evidence, a much better understanding can be achieved of which side effects are actually widespread versus those that are just amplified anecdotes.

DA: To what extent do you see — now or in the future — an intersection or synergy between the work you are doing with the RfD reports and PhenOM?

SW: For now, the reports are separate, but it is easy to envision how the analyses could work together as each expands. It’s typical to launch new products in silos and then watch the inevitable intersection. Imagine being able to design a clinical trial that specifically recruits patients who have failed or had specific side effects on a given drug because the trial is investigating a drug that isn’t believed to have those side effects. Ultimately, clinical trial insights, PhenOM, and RfD all drive toward the same central peak: improving patient outcomes.

DA: In the long run, how do you see real-world data and evidence intersecting with and impacting traditional clinical trials models?  

SW: I see real-world data as complementing rather than supplanting what we’re doing in clinical trials. For example, large real-world data sets enable clinical trials to have better control arms. Clinical trials typically compare investigational drug A versus placebo. However, in certain disease states, do you really want to deprive any patient of treatment? Obviously not. Thus, the answer is to collect real-world data on matched controls.

For example, we can design a clinical trial investigating a new drug for a rare immunologic disease versus methotrexate, a common therapy used across immunologic diseases. Methotrexate may work reasonably well. Is it appropriate to deny a patient methotrexate and replace it with a placebo, in order to understand if and how the investigational drug works? We are better served by comparing the investigational drug to a control arm of similar patients on methotrexate in the real world.

Real-world data also again offer the opportunity to meet some of the FDA’s initiatives that seek to ensure that a diversity of patient populations is being studied in clinical trials. Without diverse representation in trials, it will be challenging to assess whether the drugs are going to work in the overall population versus only in a specific population that is located geographically near more traditional research sites.

DA: I also understand that OM1 has been working with the American Academy of Dermatology (AAD) and DataDerm to explore drug safety implications and personalized medicine in dermatology. Can you discuss that?

SW: DataDerm is a registry created by AAD in 2016 — with data extending back to 2013 — that tracks data across patient populations that are being seen by a large representation of the community of dermatologists in the United States. It was designed largely to facilitate dermatologists qualifying for the meaningful use requirements that were set up by CMS (Centers for Medicare & Medicaid Services) as part of the Medicare value payment structure. However, in so doing, it became a rich source of data on dermatology patients that can be used to explore many of these research questions.

AAD is the voice of the dermatology community, patients, and physicians. Partnering with specialty societies is an excellent opportunity to work with established authorities and leverage knowledge gleaned outside the specialty to support dermatology. A similar relationship exists with the American Academy of Otolaryngology to focus on diseases that are relevant to that patient population.

DA: Beyond everything we have discussed and what is forthcoming on those fronts, is there anything else you can share or tease about what else may be coming in next few years for OM1 or from this sector more broadly?  

SW: The overall focus will continue to be rolling out more and more products that focus on the personalization of medicine. Personalized medicine has been a goal in medicine for more than two decades. How is it achieved? When a patient walks into a doctor’s office with a particular disease, how can the doctor know if that patient is really well suited to secure the desired outcomes from a particular drug? I really do believe that this long-sought goal — the right drug for the right patient at the right time — is closer than ever.

Everything builds upon everything else. As has been suggested, the RfD reports lead into PhenOM, and PhenOM leads into clinical trial insights. All of that is predicated on having really good models of AI to understand disease severity, which can only be accomplished when the technology to obtain and “read” the clinical notes has been optimized.

If all of this is considered as a series of building blocks, much of the foundational work has been accomplished. The question now is how best to leverage all of those learnings to create an environment where that level of personalization can be achieved. After that, the focus can broaden: start with a particular disease like psoriasis or therapeutic discipline like immunology, and then take the lessons learned and bring those insights into heart failure or depression. From there, move into epilepsy or obesity.

Working one step at a time allows skills and capacities to be built, tested, and validated. Most of the immunologic diseases are similar, so the lessons learned in rheumatoid arthritis (RA) can play out in psoriatic arthritis (PSA), which we can then play out in psoriasis, and so on.

It’s not really different from how drugs get developed. Take Humira, which was the best-selling drug in America for almost a decade and addressed 10 indications. All 10 of those indications were not pursued at the same time. First, it was shown to work in RA, and then in PSA. Seeing that it worked really well in PSA, psoriasis was studied, and so on. I think that we are following a model of product development that is very similar to the way that our partners on the pharmaceutical side approach complex problems. By validating and testing the models and proving out the research, we can jump from one therapeutic area to the next, as was evidenced with the RfD report. The approach was validated in rheumatology and dermatology, and now it is being launched into mental health. If that works, it can then be applied to neurologic disorders, cardiac disease, and beyond.

DA: Do you think that the underlying AI technology is sufficiently advanced at this point to achieve those aims and it is more a question of training it on more data, or is more technological evolution still critical?  

SW: I think the evolution will continue, and we will keep tweaking what we are doing as things evolve. Had we not demonstrated the ability to develop a disease estimation model in RA, where we had tremendous quantities of data to train and test the model, we would never have figured out that the technology was applicable to hidradenitis suppurativa (HS). Starting with HS would never work, however, because there just are not enough data points to say with confidence that such an approach works. But it is possible to transition from RA to multiple sclerosis and then to Crohn’s disease, PSA, and ankylosing spondylitis and eventually refine the models to approach a rarer disease like HS. All of the work builds on what came before and enables what comes next.

DA: Is there anything you’d like to particularly underscore as a closing thought?

SW: Everything reduces to one central theme: how to drive better outcomes for patients. At the end of the day, that’s why those of us in our respective areas of medicine and healthcare do what we do. It’s a mission-driven activity. I began my career as a physician to help individual patients who walked into the office. I’ve spent time in drug development, and I’ve brought multiple drugs to market for patients who suffered from psoriasis and atopic dermatitis but needed a better therapy than what was available. Now, with big data, I can take a larger perspective, but the real focus remains the same: how we can deliver the right drug to the right person at the right time.

Originally published on PharmasAlmanac.com on May 2, 2023.

Integrating Experimental Data and Artificial Intelligence to Accelerate Drug Discovery

Leading preclinical contract research organization Charles River Laboratories and human data–centric artificial intelligence (AI) technology provider Valo Health recently announced the launch of Logica™, a collaborative AI-powered drug discovery solution that leverages the expertise and experience of both partners to rapidly deliver optimized preclinical assets — at both the advanceable lead and candidate stages — to pharma clients. Charles River’s Executive Director of Business Development, Early Discovery, Ronald Dorenbos, Ph.D., and Valo Health’s Vice President of Integrated Research, Guido Lanza, discussed Logica, the inefficiencies in drug development it seeks to overcome, the underlying business model, and what the future holds for the partnership, with Pharma’s Almanac Editor in Chief David Alvaro, Ph.D.

David Alvaro (DA): To start things off, can you tell me about the inception of the partnership between Charles River Laboratories and Valo Health and why both organizations felt that there was a potentially productive synergy between them?

Guido Lanza (GL): I don’t believe that I’ve ever been a part of a partnership where the vision and the framework for achieving it came together as quickly as what occurred between Valo and Charles River. I think that was possible because the idea had already been incubating for a very long time, essentially independently at each company. Charles River had a vision of undergoing a very deep digital transformation that would enable the company to combine, unite, and extract more value from the data that they generate across their operations, which would unlock some significant new opportunities.

There was a complementary vision on the Valo side. While Valo is a relatively young company, the relevant digital platform for computational drug design was built in part through the acquisition of another company called Numerate, of which I had been CEO. We felt that if we could figure out a way to partner with a data-generation powerhouse like Charles River, we could overcome some of those bottlenecks.

Ultimately, setting up our first meeting to discuss what our combined capabilities could offer was the trickiest part. After that, it was a smooth journey to establish the actual model, what the offering would look like, and the benefits to the customer.

Ronald Dorenbos (RD): Over the last 10–15 years, we’ve seen hundreds of millions of dollars poured into the industry, with lots of AI companies trying to perform drug discovery and development using AI alone, which has not been particularly successful. While AI unlocks all kinds of new possibilities, it’s clear that it’s not sufficient on its own for productive drug discovery. This collaboration was designed to advance to the next logical step: Valo brings extensive AI expertise from the chemistry perspective (because a lot of people there are also chemists themselves), which combines with the powerful data generation and experimental engine that Charles River has. Charles River provides an enormous arsenal of capabilities that are unmatched by any other company in the world. By combining the considerable traditional drug discovery capabilities of Charles River with Valo’s AI expertise and technology, we could create a very effective platform that could really advance drug discovery, which we have named Logica.

DA: Can you expand a bit about the conventional drug discovery and development process and where you see the most critical bottlenecks or inefficiencies that inspired the platform and why this combination of traditional discovery and AI is the most sensible way to overcome them?

GL: I’ve been working in the AI space for over 20 years. If you look at the history of the deployment of AI or machine learning (ML) and where it has had an impact, most of it has occurred within the traditional siloes of the pharma industry: image analytics as a screening platform, virtual molecule design, and so on. The data and the algorithms unlocked a lot of new possibilities, but they operated within the traditional chevrons. As a result, we saw a great opportunity to rethink the whole paradigm by removing the traditional chevrons and focusing on the real moments of value generation.

I would argue that there are three key moments of value. The first is performing some magic biology — omics or the like — and finding a target. The next is a chemistry that allows you to test a hypothesis that is advanceable and patentable. And the third is the moment in which you have a candidate that is ready to enter IND-enabling studies and beyond on the road to the clinic. At every point in between, you can’t really be certain how close you are to that value. So, we wanted to take a step back to focus on defining those value-generation points and assessing where we are underutilizing data that could increase our chances of reaching those points.

AI essentially provides a means of cheating and looking into the future, or at least a good simulation of it. For example, if we have a good model for tox studies, we can simulate the results of those studies much earlier, which reduces risk and improves the odds of success downstream. If you break down siloes, you can use the data about future success and failure to inform decisions today. AI lets you melt away those chevrons and think about data as something more fluid that supports the reduction of uncertainty, which can allow you to apply totally unrelated data from a different project to guide your decisions.

I can’t imagine a greater data generation platform than Charles River, which supports more than 1,300 IND programs every year. We just needed to figure out how to unlock the value in those data for future programs to increase the chance of success, or at least to help programs fail fast and early rather than later and at greater cost.

RD: We see Logica as version 3.0 of applying AI to drug discovery. Version 1.0 had a very narrow problem scope, a siloed approach, an inability to extend the analysis beyond the initial problems, and no intentional, large-scale data generation. Version 2.0 added a limited amount of data generation, as well as expansion into broader problem categories and some wet lab access. With Logica, we are breaking down those siloes across early drug development, integrating wet lab work with the AI capability, and focusing on cycle numbers and data intentionality to, as Guido was saying, predict a likely future as early as possible.

DA: With Logica, were you looking to tackle all relevant pain points in drug discovery or begin with some low-hanging fruit and then build up to more complicated challenges?

RD: Before we start a project with a pharma or academic partner, we take a good look at the target, because targets come in all kinds of varieties, from easier ones to approach, like kinases, to RNA, epigenetic, and more exotic targets. Across more than 25 projects, we have had more than 90% success. Logica has processes and methods that enable it to work on various types of targets of varying difficulty.

While the technology is essentially target agnostic, we always perform a feasibility study at the start of a project to determine which targets we feel comfortable with pursuing, because we don’t want to get involved with a project that our platform and our experience indicates has a very low chance of success. To that end, it helps that Charles River is such a large organization, with around 18,000 people, including many with 15–20 years of experience working at the major pharmaceutical companies, like Pfizer, Novartis, Merck, AstraZeneca, and GSK. That experience helps with the feasibility studies but also in navigating certain challenges and bottlenecks.

GL: Over time, the offering will get better and better: the output quality will improve, and the time will be reduced. At a high level, this all helps better align the CRO model with what the customer wants: they want the best product as fast as possible, we do better when we can make that happen, and doing so improves our platform so that the results are even better in the future.

In traditional drug discovery, you typically start off by running a screen and then build your set within the universe of compounds that comes from the screen, analogs of those compounds, and so on. What we do is a little bit different; we see three parts to the process. The first is the generation of data to train the model. It's great if you can understand the chemotypes and get to some starting point, but what you really want first is information. That whole universe is flattened in our mind because it's a data-generation universe.

The second step is to unleash that on very large spaces of chemistry that are bespoke for your problem — if the first space is tens of billions of compounds, you want to go even larger on your second space, evaluating hundreds of billions or trillions of compounds specifically designed for your problem. Then, third, you want to pick the series that are most advanceable, because you’ve made millions of virtual analogs of those and simulated your future against models of all the things that can go wrong. Some series are of course intrinsically going to be better than others, so the ability to measure that a priori sets you up for success later. This provides a significant quality advantage because you’ve looked at so much more information about that series than people typically would.
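
As a rough illustration of the series-selection step described above, the following sketch ranks virtual compound series by a composite "advanceability" score computed from several predictive models. The data are random stand-ins and the scoring scheme is an assumption for illustration, not Logica's method.

```python
# Illustrative ranking of virtual compound series by an "advanceability" score:
# each virtual analog is scored against several predictive models (stand-ins for
# potency, solubility, and toxicity here), and a series is ranked by the strength
# of its best members rather than a single lucky hit.
import numpy as np

rng = np.random.default_rng(1)
n_series, n_per_series, n_models = 5, 1000, 3     # virtual analogs grouped into series

# Stand-in model outputs: probability of clearing each downstream hurdle
pass_prob = rng.uniform(size=(n_series, n_per_series, n_models))

# A compound "survives the simulated future" only if it clears every hurdle
compound_score = pass_prob.prod(axis=2)

# A series is advanceable if it contains many strong compounds
series_score = np.quantile(compound_score, 0.9, axis=1)

for s in np.argsort(series_score)[::-1]:
    print(f"series {s}: advanceability {series_score[s]:.3f}")
```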

RD: It's critical to start with the highest-quality, highest-value compounds. Obviously, clinical trials come further down the road, and a lot of molecules will eventually fail in trials, but if you can increase the chances of success even by a tiny amount, that will have tremendous benefits. So, it's not just a case of better molecules but of an increased success rate further down the road that helps get these molecules to the market and the patient.

GL: We work with clients ranging from early seed companies all the way to big pharma, and they have very different drivers: the pipeline, the timing, or the cost. We offer a model that is very transparent and very straightforward: six to nine months for the first phase to get to the advanceable lead — which we call Logica-AL — and then another 12 to 18 months to get to the IND-enabling candidate that is ready to go into GLP tox and safety — which is Logica-C.

The whole trajectory of going from scratch to an IND-enabling molecule takes at least 36 months. Logica can get there within 18 months; if we run into some challenges or need to set up special assays, that may extend to 27 months, which is still significantly faster than the traditional method. And being able to reach one critical conclusion in six to nine months and the second in another 12 to 18 months is very attractive.

DA: As you mentioned before, the more data that you put into an algorithm like this, the more refined and accurate it becomes. To that end, are you able to leverage data from customer projects to feed back into the platform, or do you run your own internal experiments to generate data?

GL: There’s a continuum. Some customer data is pre-competitive, and some is not, so there are some kinds of data that customers are quite willing to share and others that they generally are not. In some cases, we have to generate our own data or import published data.

The questions “Can I use the data to learn from?” and “Can I see how my model did?” are very different. Both Charles River and Valo have a lot of experience handling confidential customer data and building the appropriate firewalls, which helps customers be confident that we will only use their data in approved ways. Of course, many customers see the value of more people sharing data and how that benefits their projects and are very happy to share what isn’t hypersensitive.

DA: Can you explain the business model underlying the Logica platform?

RD: We use a risk-sharing model where the cost is tied to success and the creation of value, which aligns incentives with the customer. Rather than charging on the basis of the number of experiments run or the hours needed, most of the payment is tied to those moments when the customer receives real value. We typically divide things into the two phases we discussed, but everything that is needed to reach the advanceable lead series is included in the milestone payment for that phase. Sometimes people ask us how many FTE hours they get for the price, but that’s not really a relevant question, because it’s as important for us as for the client to reach the milestone — that’s how we get paid and how we advance to the IND-enabling phase. If the client then wants us to pursue optimization, there is a continuation payment, which is typically higher than the payment for the first phase, because this second phase requires more lab work, chemistry, and animal experiments. Then, after we spend another 12–18 months to get to an IND-enabling candidate that is consistent with the target product profile and the specifications that were agreed on at the beginning of the project, there is another milestone payment. Finally, clinical milestone payments and royalties will come into play when the candidate moves through the clinical phases and goes to market. The client’s success is our success, and we keep everything very straightforward and transparent.

DA: What response have you seen from the market? Has it been relatively easy to convince potential customers of the value of this approach?

RD: There is a great hunger for a value-based offering in the small molecule discovery space. At the BIO International in June, I spoke with many people who were really excited about this new approach to drug discovery, and we are in further conversations with many of them. We are in discussions with big pharma companies, venture capital firms, small biotech companies, and seed companies from universities, and, across the board, people are enthusiastic and see Logica as a great model that could fit into their strategy.

I have not really heard anything negative, but we are relatively new and still need to build our track record and our history together to match the very strong individual track records of the two companies.

DA: Assuming that Logica leads to the optimal outcomes you envision and ends up in widespread use, how transformative do you think it could be for the industry as a whole and the ways that drug discovery is conducted?

GL: At the moment, there's an interesting economic argument to be made for Logica as a small molecule generation engine across a range of applications. If we can consistently reduce the uncertainty in small molecule discovery, we re-empower the people doing the earliest work: if we can level the playing field on chemistry, then biology will become the dominant piece.

Those who can best define human disease and translate those definitions into preclinical models will be the winners of the future, not those with the biggest libraries. For example, Parkinson's disease is currently defined by the FDA and the ICD-9 code as a single disease, but in reality it's probably 50–100 different diseases. Those who can better define targeted subpopulations and develop specific compounds against them will benefit. Before you even begin, you can focus on your translational path and your translational journey and establish a patient ID for a compound before running a screen. If we can define that journey in a frictionless way, we totally change the economics by dramatically increasing the probability of success (POS) and the quality of the compound, which empowers the people who are defining the disease as the value-generation hub.

That’s where I think AI is going to make the biggest impact after Logica, because that’s where you have complex, high-volume data reflecting all the omics on a per-patient basis. To me, that’s the really exciting development a little way down the line.

RD: Everything boils down to getting better medications to patients faster. What Logica can unlock is the ability to make the whole process more efficient and more economically attractive and to operate as a well-oiled machine, where you also can consider targets that you normally would not consider because of cost concerns. As Guido often puts it, Logica is “democratizing” the process of drug discovery and the AI capabilities for a much wider audience, and the whole world will benefit from that.

DA: Before we wrap up, is there anything you can share about what might come next for Logica or for the partnership more broadly? Is it possible to build on this success and tackle large molecules?

GL: We are looking at other modalities, although we can’t disclose anything right now. Beyond that, we want to advance the concept to predict ever-more complex phenomena. That intersects with the need to avoid or minimize failures and how they become costlier the later in discovery that they occur. We continue working on determining the best sources to inform our design decisions. Where we really want to push the envelope is in making sure that we’re not modeling intermediate steps that are poor proxies for the ultimate goal but instead finding the best proxies and focusing there.

RD: The expansion of our platform will be tied into what's happening in the rest of the field. Lots of groups are applying AI and machine learning to particularly complex biological systems: neuroscience, gastrointestinal disease, oncology. As these technologies become much more effective at natural language processing, they can read all of the relevant published literature, which will lead to new insights and new targets that can be combined with our efforts to develop new drugs against those targets. I think we will see that happening more in the future.

Another potential impact, suggested by the insights we've gained from AI models that interact closely with the wet lab work at Charles River, is that Logica can also help reduce the amount of wet lab work that needs to happen. We can scale down the number of animals that need to be used for these kinds of studies and some of the assays, even eliminating some assays altogether, because the AI can predict the result without a single experiment being performed. I think that any of these adjacent areas where a lot of development is occurring will become very important to how Logica develops in the future.

Originally published on PharmasAlmanac.com on December 7, 2022

Accelerating and Improving Bioprocess Development with Machine Learning Solutions

In an industry where precision and efficiency are paramount, DataHow has emerged as a pioneer in leveraging advanced machine learning techniques to optimize process development, manage risks, and support data-driven decision making. Bridging machine learning from the world of big data to the far smaller data sets typical of bioprocessing, DataHow's core innovation lies in its hybrid modeling technology, which combines process data with engineering knowledge to enhance process development and manufacturing robustness. In this Q&A, two of DataHow's founders, Chief Executive Officer Alessandro Butté, Ph.D., and Chief Operating Officer Michael Sokolov, Ph.D., discuss DataHow's journey from concept to implementation, highlighting the transformative potential of their solutions in accelerating process development, reducing errors, and facilitating a more agile response to the dynamic demands of pharmaceutical production, in conversation with Pharma's Almanac Editor in Chief David Alvaro, Ph.D.

David Alvaro (DA): To begin, as two of DataHow’s four founders, can you give us a concise history of the company’s origins?

Alessandro Butté (AB): My background is in academia; I've been in the university environment for more than 20 years. At a certain point, I decided that what I was truly passionate about was solving practical problems around what my colleagues called technology, and not typically in a nice way. I made the transition into industry when I entered into a collaboration with Lonza, one of the largest pharmaceutical CDMOs in the world, to explore the use of modeling techniques to support quality by design.

At that point in time, when a CDMO engaged with a new client to produce product for a clinical study or the commercial phase, they essentially had to start everything from scratch: a blank page, as though they had never developed such processes and had no useful data to leverage. I found that very frustrating, because it was incredibly inefficient, new learnings were constantly lost, and there was inherently a lot of uncertainty surrounding the processes, because their success depended almost entirely on the quality of the scientist in charge of developing the project.

To me, the clearest solution to these issues involved a tool like machine learning, which could process a huge amount of data and find rational paths to support process development work. At that time, the main challenge was that machine learning had always been associated with big data: in statistics, the more relevant data you have, the more powerful the results. However, pharma typically involves very small data. The real challenge lay in determining how to use such powerful tools with only a small number of experiments, maybe just 10 to 20 or so on the path from start to manufacturing. The answer was hybrid modeling, which is the core of our technology. Hybrid models are mathematically complicated but conceptually quite simple: they combine two sources of knowledge, process data and engineering process knowledge, by coupling mechanistic and machine learning models. The result is that fewer experiments (or less data) are needed to optimize a process, while the model still adapts to the specifics of the process under consideration. Mathematically, this means constraining the space of solutions that a machine learning tool can explore using prior knowledge in the form of equations, while learning some components of those equations adaptively as dynamic, nonlinear functions of the process control conditions.
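
The hybrid-modeling concept can be sketched in a few lines of Python: a mechanistic mass balance supplies the structure, while a data-driven regressor supplies one of its components (here, the specific productivity) as a function of process conditions. The kinetics, data, and model choices below are simplified assumptions for illustration, not DataHow's implementation.

```python
# Minimal hybrid-model sketch: a mechanistic mass balance provides the backbone,
# and a machine learning regressor supplies the specific productivity q_p as a
# function of process conditions. All values here are illustrative stand-ins.
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Hypothetical small data set: (temperature, pH) -> measured specific productivity
conditions = rng.uniform([30.0, 6.6], [37.0, 7.4], size=(40, 2))
q_measured = (0.002 * np.exp(-((conditions[:, 0] - 34.0) ** 2) / 4.0)
                    * np.exp(-((conditions[:, 1] - 7.0) ** 2) / 0.05)
              + rng.normal(0, 5e-5, size=40))

# Data-driven component: learn q_p(conditions) from the few available experiments
q_model = GradientBoostingRegressor().fit(conditions, q_measured)

# Mechanistic component: logistic biomass growth and a product mass balance
def hybrid_rhs(t, y, temp, ph, mu=0.04, x_max=10.0):
    X, P = y                                    # biomass (g/L), product titer (g/L)
    q_p = q_model.predict([[temp, ph]])[0]      # learned term embedded in the equation
    return [mu * X * (1.0 - X / x_max), q_p * X]

# Simulate a candidate operating point with the hybrid model
sol = solve_ivp(hybrid_rhs, (0.0, 240.0), [0.3, 0.0], args=(34.5, 7.0), t_eval=[240.0])
print(f"predicted titer after 240 h: {sol.y[1, -1]:.2f} g/L")
```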

Michael Sokolov (MS): To put it another way, we identified a gap that needed to be closed. Machine learning was exploding in other fields that are spoiled with far more data, so our goal was to figure out how to leverage it in an environment in which every data point comes at a very large labor cost, despite the considerable complexity of the bioprocesses to be modeled.

AB: The beginning of DataHow's journey was determining whether it was possible to decrease the number of experiments to a level that competes with the average number used to develop a process today. Since then, we have transitioned to pursuing our main vision: supporting pharmaceutical companies, CMOs, and others in leveraging their manufacturing data, especially process quality data, to make development faster and more robust, to reduce errors and failures in manufacturing, and to allow pharmaceutical companies to handle larger pipelines because they need fewer resources to develop each process.

MS: It is important to note that the vision takes different forms depending on the processes or the underlying modality being explored. For well-established bioprocesses, such as the production of therapeutic proteins through platform processes, this technology is very likely to be of great help to accelerate programs, reduce costs, and transform how people are operating on a two-digit percentage basis: cutting costs and timelines by maybe 30–70%. However, in the new modality space, where processes are not yet well understood or established, we foresee the technology having an enabling effect — it simply might not be possible to bring a certain therapy to the market at all, or at the speed required for patient needs, without a digital technology playing an integral role. That applies to things like cell and gene therapies, but also food tech, cultivated meat, and so on.

DA: To realize that vision, did you intentionally build a team with different expertise and contrasting viewpoints and priorities?

AB: We have always aimed to bring together very different points of view on the technology, and in some cases on our strategy and tactics. The team comes from a range of backgrounds, some more academic and others more industrial, and hence brings very different personal experiences. However, we have very much focused tactically on the key concerns of the sector in which I was working before, simply because you have to have a focus. We already had a lot of expertise in that sector, not to mention contacts we could speak with to hash out ideas. When we speak with clients, we have been able to offer advice drawn from our experience, our knowledge of the science and the data, and our understanding of the business. We can aggregate all these different perspectives into a coherent strategy to support our clients' goals.

MS: On the big vision, I think we have always been closely aligned. However, we feel that it is critical to continuously realign on more incremental details based on the feedback we receive from clients. The constant prototyping and exploration of solutions with clients has led us to understand those needs better, especially across different segments: big pharma versus CDMOs versus small biotechs. While they all converge on bioprocessing, they have different expectations: some are looking for optimization, others for enabling technologies. We also had to understand who the potential users might be in-house and how we could best help customers embrace the technology as part of their organizational digital transformation.

AB: At the same time, the way we are perceived by customers is continuously changing. In the beginning, we were seen more as experts brought in to consult, whereas today we are solution providers and even software providers. That is reflected in a radical shift in the discussions that we have with clients, who have started playing a more active role in that digital transformation.

DA: I’m sure we could spend hours on the nuances of your technology, but could you give me a concise explanation of the key principles and their importance in achieving your goals and those of your customers?

AB: Our technology is based on three main pillars. The first, as I mentioned earlier, is hybrid modeling: machine learning for small data, which unlocks the ability to autonomously learn from process data.

The next pillar is a direct consequence of applying machine learning, which we call transfer learning. In cases where the processes for a given drug have been fully developed, and all the data from those processes are available, machine learning allows you to extract common knowledge from those data that can then be specifically readopted for the development of a new process. In many cases, this means you don't have to perform some of the experiments for the new process, because you can simply transfer what you have seen in the past and perform experiments only to generate genuinely new information or to verify what was transferred.
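
One minimal way to picture this kind of transfer learning is to train a "platform" model on pooled historical process data and then learn only a small correction to it from a handful of new-process experiments, as in the hedged sketch below. The data, features, and model choices are invented placeholders, not DataHow's implementation.

```python
# Hedged transfer-learning sketch: a "platform" model trained on pooled historical
# data is reused for a new process, with only a small residual correction learned
# from a handful of new experiments.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)

def titer(x, offset=0.0):
    # Hypothetical ground truth: titer as a function of two process parameters
    return 3.0 + 1.5 * x[:, 0] - 0.8 * x[:, 1] ** 2 + offset

# Abundant historical data pooled across past processes
X_hist = rng.uniform(-1, 1, size=(300, 2))
y_hist = titer(X_hist) + rng.normal(0, 0.1, 300)
platform_model = GradientBoostingRegressor().fit(X_hist, y_hist)

# Only eight experiments on the new process, which behaves slightly differently
X_new = rng.uniform(-1, 1, size=(8, 2))
y_new = titer(X_new, offset=0.6) + rng.normal(0, 0.1, 8)

# Transfer: keep the platform model and learn only the residual for the new process
residual_model = Ridge().fit(X_new, y_new - platform_model.predict(X_new))

def predict_new_process(X):
    return platform_model.predict(X) + residual_model.predict(X)

X_test = rng.uniform(-1, 1, size=(5, 2))
print(np.round(predict_new_process(X_test), 2))
```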

The third pillar is optimizing the design of experiments and other activities to support the end user in decision making. Today, the models typically used are deterministic or, technically, based on "classical statistics": you provide an input to the model, and it outputs a single number representing its best estimate. In contrast, our tools are based on Bayesian statistics, which means that every prediction from our models comes not only as a number but as a probability distribution. That enables us to integrate risk considerations into our decisions, depending on how much data we possess and how ambitious the decision objective is for a given process under certain conditions.
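
A simple way to illustrate the Bayesian point is a Gaussian process model, which returns a predictive mean and standard deviation rather than a single number, from which the probability of meeting a specification can be computed. The data and specification limit below are invented for the example and do not represent DataHow's models.

```python
# Illustrative Bayesian prediction: a Gaussian process returns a mean and a
# standard deviation, from which the probability of meeting a quality
# specification can be computed.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)

# A dozen experiments: process parameter -> critical quality attribute (CQA)
X = rng.uniform(0, 1, size=(12, 1))
y = 1.0 + 0.8 * np.sin(3 * X[:, 0]) + rng.normal(0, 0.05, 12)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True).fit(X, y)

# Prediction at a new operating point: a distribution, not just a number
x_query = np.array([[0.35]])
mean, std = gp.predict(x_query, return_std=True)

# Probability that the CQA stays above a hypothetical lower specification limit
lsl = 1.4
p_in_spec = 1.0 - norm.cdf(lsl, loc=mean[0], scale=std[0])
print(f"predicted CQA: {mean[0]:.2f} +/- {std[0]:.2f}; P(CQA > {lsl}) = {p_in_spec:.2f}")
```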

As a result, a distinctive feature of our approach to developing processes is that we typically do so based on utility. We can aggregate different considerations in our tools, ranging from how likely it is that we meet all the quality constraints or improve the productivity of a process, to how expensive it is to run certain processes, to how likely a given new experiment is to truly create new knowledge. The user can combine all these different aspects and their corresponding probabilities and risks to arrive at a very efficient way to develop processes or simply to manage risks in biomanufacturing.
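
Building on the same idea, a utility-based ranking of candidate experiments might combine the probability of meeting a quality target, an exploration bonus for reducing uncertainty, and the cost of running the experiment, as in the sketch below. The weights, cost model, and data are illustrative assumptions, not DataHow's actual utility formulation.

```python
# Sketch of utility-based experiment selection: each candidate experiment is scored
# by combining its chance of meeting a target, how much it would reduce
# uncertainty, and a simple cost term.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)

# Small existing data set: process parameter -> product titer
X = rng.uniform(0, 1, size=(10, 1))
y = 2.0 + np.sin(4 * X[:, 0]) + rng.normal(0, 0.1, 10)
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True).fit(X, y)

# Candidate experiments to choose from
candidates = np.linspace(0, 1, 50).reshape(-1, 1)
mean, std = gp.predict(candidates, return_std=True)

target = 2.5                                              # minimum acceptable titer
p_success = 1.0 - norm.cdf(target, loc=mean, scale=std)   # chance of meeting the target
information = std / std.max()                             # how much a run would teach us
cost = 0.2 + 0.3 * candidates[:, 0]                       # e.g., pricier conditions cost more

utility = 1.0 * p_success + 0.5 * information - 0.4 * cost
best = candidates[np.argmax(utility), 0]
print(f"most useful next experiment: parameter setting {best:.2f}")
```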

MS: Another aspect of our value proposition is that all of this is packaged as a user-friendly cloud solution, which facilitates collaboration and enables a team with little or no experience in modeling to collaborate on the creation of the predictive model, which then becomes the engine for answering all their practical questions. Additionally, the tool is fully customized to the needs of the pharma industry. Having worked with several dozen companies, we have a comprehensive understanding of the key questions they would like answered. With our tool, the final step is not the creation of the model, as it is for many other software solutions; it's the decision derived from the model, which follows very practical needs: get more product, understand the process better, understand how to design and scale up the process, and so on.

This collaborative cloud architecture — combined with the very customized way the software is established in terms of workflow — allows us to democratize the use of machine learning across an organization that is conservative and not digital native. The magic happens in the background, but the solution is a user-friendly tool that serves as a bridge to our vision, and machine learning moves from just being a buzzword to a technology used every day — a commodity to create consistent value from the routinely measured data.

This requires that the technology be customized to the problem to be solved and to the users who need it. If you compare the current, third version of our software with the initial version zero, you can see how much more customer centric it has become. We began with what we thought the customer needed but have updated that to what different customers have said they need. Beyond continuously generalizing and diversifying the software itself, seeing the added value it brings to users has helped us evolve our perspective.

AB: Ultimately, the definition of key terms like digital twin can vary significantly from field to field and even from one individual background to another. An engineer probably does not define a digital twin the same way a data scientist would. But in the end, the digital twin is a tool that allows the vast majority of stakeholders within a pharmaceutical company to interact with prior knowledge about their processes and act on an ongoing process without needing to understand the underlying algorithmic details of the digital twin.

DA: Can you discuss the process of a customer adopting your technology and integrating it into their existing process development and manufacturing systems?

AB: We could probably discuss this for two days from 20 different perspectives, but in short, adoption is currently a relatively long and painful path. On the one hand, we sometimes enable processes that could not be developed otherwise; in many other cases, though, especially at the beginning of a customer's journey, we provide a fairly incremental improvement.

We have certain established tools. The challenge is not that the scientists we engage with are unable to understand our tools, but that these tools are not yet well accepted by the broader scientific community, particularly by regulators. Changing tools poses challenges for a pharmaceutical company, because they have to restart a lot of their discussions with regulators, who will challenge the approach and the underlying tools in depth. With this type of technology, things often really depend on the first adopter, who takes up the challenge of clarifying all the problems for everybody; from that point on, it is a downhill journey. Five years ago, hybrid modeling started to become a topic in such discussions with regulators, and now we see it becoming a more regular interaction point, alongside other machine learning techniques that are clearly superior to the old way of relying on simple statistical tools alone.

Another challenge is that most of our customers today are focused on process development — in other words, short- to intermediate-term improvements — while the greatest improvements the technology can achieve manifest over the middle to long term. Ultimately, machine learning has the potential to radically change the way we approach every concept, from how we develop a process to how we manage the quality of drugs in a broader sense. Unfortunately, this is very much in the future, but as you can imagine, it is complicating adoption. People tend to be ultra-focused on what is happening tomorrow and the next week rather than what’s over the horizon.

DA: Do you see different responses depending on the nature of the customer you are speaking with? I’d imagine that a small biopharma might be more willing to take risks than a big pharma company, but they might also be far more focused on the near term.

AB: The answer is totally counterintuitive. You would expect the early adopters to be companies really focused on manufacturing, like CMOs, rather than larger pharma companies where manufacturing is just one of many activities. Additionally, you’d expect, as you suggested, that small companies that are more flexible in their procedures would jump on the technology faster than big companies.

In both cases, however, what we have seen is the opposite. In the first case, I think that pharma companies are more interested than CMOs for two reasons. First, pharma is much more willing to invest in R&D, whereas CMOs need to be extremely efficient, with any innovation activity being perceived as a waste of time. The second reason is that it is ultimately difficult for CMOs to adopt innovative technologies without being backed by pharma, because if their pharma customers are not interested in using hybrid models to develop bioprocesses, that’s the end of the discussion.

In the end, the activation energy barrier that you have to overcome to adopt the technology is inversely proportional to your degree of digitalization, or your readiness to digitalize. The more digitalized you are, the faster you can harvest results from our tools. Larger companies are far more digitalized than smaller companies, which value flexibility over a very standardized way of running things.

MS: In many cases, the key enabler of acceptance for a potential customer is an internal believer in the technology. The believer or influencer may be someone at the C-level, but we have also had cases where the believer was a scientist or a manager. Either way, identifying a single champion who can open doors has been our best means of acceleration. This believer can be someone who really gets the technology, but it can also just be someone who has seen what we have done for others and become convinced of the value. Then this believer needs to convince a critical mass of people, from the CEO to the end users or the other way around, that there is a business case or a clear need. We are still learning how best to segment things into different paths and customize our sales cycle to maximize success regardless of our path into an organization.

It's also always helpful to have references and tangible results, even if they were created for others. As soon as our software was tangible, discussions became much easier because we could run a demo and show what is possible. When you are a thought leader, though, that's generally not the case at first, because you're running ahead of the industry with the original vision, which then becomes a prototype and then, gradually, a solution. This critical mass of people believing in the tool and creating a community across organizations was a key enabler for us along the way.

Ultimately, since the benefits can take a while to really manifest, those early believers who invested in the solution first are the ones reaping the most benefits. That doesn’t apply only to our solution; in general, companies who embraced a digital maturity vision combining different digital solutions as part of their assets already started to measure a return on investments last year, whereas the others are following.

DA: Can you expand on how you aim to make the solutions more accessible and user-friendly for a much broader and potentially data-naïve audience who may want the outputs without understanding how they were achieved?

AB: In the end, all the stakeholders associated with manufacturing or the development of manufacturing processes have to interact. Today, without a tool like a digital twin, they have to interact from different perspectives on the processes. The scientist has a perspective, maybe focused on optimization and robustness. The technician has their perspective in terms of organizing the experiments and the data and collecting the results in an organic way. But then there is the QA person, who wants to understand the risks and the regulatory side of things and who is running risk analysis, validations, and so forth. There is the production team that has to control or simply schedule processes. This is the next challenge of our tool: to address all these stakeholders with a platform solution containing specific features that are designed to support each of their individual perspectives.

MS: Another important angle to all of this is the need for a clear change in mindset, where we can play an educational role. We need to have different storylines ready for someone with a more stubborn or conservative mindset versus someone who is very open. We are quite active in teaching, less about what our tool alone can do and more about how new methodologies compare with old ones. Alessandro teaches at the university level to prepare young chemical engineers to see the value of machine learning, and we run a few courses each year, which have drawn 250–300 industry participants from all the major pharma companies. Independent of the solution they choose, they can learn what to expect from the technology and how to get answers to practical bioprocessing questions.

In addition to this crucial educational component, we need to help organizations overcome another barrier: even when people are convinced that it's the right tool to use, the ideal users are often too busy in the lab to have time for adoption. This requires another shift: giving experimentally focused people enough time in front of the computer to learn and use the technology so that it can improve their work in the lab. If you want a return on investment from a digital solution, you need to allow your team to use it for at least a certain number of hours per week.

DA: In your ongoing R&D efforts, are you primarily focused on training the system on new data and creating more user-specific applications, or is there yet work to be done to make the fundamental machine learning technology “smarter?”

AB: That is a very good question, because we have gone through several different phases. At the beginning, we had to learn how to implement our technology, so we focused on making the hybrid models perform increasingly sophisticated tasks. Today, we are more in a phase of simplifying things. In a sense, we are going backwards, not in terms of results but in refocusing on decreasing the barrier to adoption. For that, the tools have to be very simple. Ultimately, we aren't arguing that we provide the best hybrid machine learning; we are essentially the only company providing it. Instead, we are contrasting ourselves with the state-of-the-art technology, which is far less efficient but widely adopted.

Additionally, we are taking a few concrete actions, mostly increasing the spectrum of tools that we provide to pharma in two dimensions: different unit operations to cover the entire manufacturing process, possibly including formulation; and modalities, going from therapeutic proteins to mRNA, cell therapies, and so on.

We have a few other innovative initiatives in the works. One is exploring the ability of generative AI to aggregate all the historical data from process development (the results, the tests, the history of how the process has run) and create an R&D report describing what has been done, for filing purposes and for presenting these activities, which can otherwise be extremely time-consuming. We could also standardize how such reports are created.
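
As a purely illustrative sketch of that idea, historical development records could be assembled into a structured prompt for a generative model to draft the report from. The records, section list, and the omitted model call below are hypothetical and are not DataHow's implementation.

```python
# Purely illustrative: assembling historical development records into a structured
# prompt that a generative model could turn into a first-draft R&D report.
import json

batch_records = [  # hypothetical development history
    {"run": "DEV-01", "goal": "screen temperature", "outcome": "titer 2.1 g/L, all CQAs in spec"},
    {"run": "DEV-02", "goal": "optimize feed rate", "outcome": "titer 3.4 g/L, aggregation high"},
    {"run": "DEV-03", "goal": "confirm setpoint", "outcome": "titer 3.2 g/L, all CQAs in spec"},
]

sections = ["Objective", "Process history", "Key results", "Deviations", "Conclusions"]

prompt = (
    "Draft a process development report with the following sections: "
    + ", ".join(sections)
    + ".\nUse only the data provided below and flag any gaps explicitly.\n\n"
    + json.dumps(batch_records, indent=2)
)

# The prompt would then be passed to whichever generative model is used; that call
# is omitted here because the interview does not specify a model.
print(prompt)
```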

DA: Since the company has already evolved considerably, from a more advisory role into solution providers, do you foresee further evolution of DataHow as the technology evolves and your relationships deepen over the coming years?

AB: The scalable parts of the company will always focus on the software. But like many similar companies, the service part will play a growing role. For our customers to get the most out of the technology, they have to be supported.

But the most straightforward direction where we will be heading is into manufacturing. Eventually, all these activities and all this knowledge will be transferred to manufacturing to concretely support the full range of the activities for drug production, regulatory, and so on. We will also be involved in enabling technology providers to integrate our technology with what they are producing. For example, a bioreactor producer could integrate our base technology into their software to support the management of the data coming from that platform and normal activities. The same would be true for a company producing sensors — we could aggregate all this knowledge together so they could come up with a package of sensors that are able to capture knowledge and support manufacturing activities.

MS: We want to become a very established provider of this technology in the field. Of course, scalability will come from the software. But our entanglement around changes in mindset and the support of a growing base will help us to stay in touch with where the industry is going, and we will have the advantage of being the first mover in that direction. There are many doors we want to go through in manufacturing. But maintaining long-term relations with the customer is a critical goal for us, and a big part of that will be diversification of the software beyond the launch version. We want to have a full platform solution covering small and large scales and all sorts of unit operations.

Building this platform step by step will allow us to be a very established provider in that space, with confirmation that other players are going our way and switching to hybrid machine learning models. That affirms that we are on the right track but reminds us that we need to move quickly and partner smartly.

Originally published on PharmasAlmanac.com on June 18, 2024