Getting the most from RNA-seq in drug discovery
In our line of work—as data scientists and discovery partners to pharma companies and translational labs—we see a lot of sequencing data. It's not all good data, mind you; far from it. Often the datasets we're asked to analyze are lacking in quantity, or quality, or both. Sometimes the hardest-to-deliver but most valuable conclusion we reach is that an experiment needs to be repeated with a different design or data collection strategy.
In this post, I'll walk through a number of points to consider when planning a high-throughput sequencing experiment, and discuss the implications for downstream analysis. I'll argue that the quality and quantity of (the right kind of) data are not just a matter of resources, but perhaps even more so of a well-framed question and thoughtful experimental design.
Numbers game & sequencing strategy
One of the most common questions we field from our partners is some variation on: "How many samples do I need?" Invariably, I answer with "What do you want to do?" In practical terms, it depends on whether one is interested primarily in hypothesis generation or hypothesis validation. (Spoiler alert: the latter requires more independent replicates.) Scientists are quick to jump on knowledge generation for knowledge's sake, but in an effort to balance curiosity with resource limitations they may generate data that is less than ideal for hypothesis validation. Mind you, there's nothing wrong with using NGS data just to get a sense of the molecular underpinnings of a phenomenon or study subject. But if the goal is simply to survey the transcriptional landscape, make that a deliberate decision and embrace the shortcomings of the approach.
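To make the replicate question concrete, here is a minimal power sketch of the kind we might run when advising on group sizes. It assumes a negative binomial count model with illustrative values for mean expression, dispersion, and fold change (swap in estimates from pilot data where available), and it is no substitute for dedicated RNA-seq power tools; it simply shows how detection power climbs with the number of independent replicates per group.

```python
# A minimal power sketch, assuming a negative binomial count model.
# All parameters (mean count, dispersion, fold change) are illustrative;
# pilot-data estimates and purpose-built tools should drive real designs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def nb_counts(mean, dispersion, size):
    """Draw negative binomial counts with var = mean + dispersion * mean^2."""
    r = 1.0 / dispersion
    p = r / (r + mean)
    return rng.negative_binomial(r, p, size=size)

def power(n_reps, mean_count=100, dispersion=0.2, fold_change=2.0,
          n_sim=2000, alpha=0.05):
    """Fraction of simulated experiments in which a Welch t-test on
    log counts detects the fold change with n_reps replicates per group."""
    hits = 0
    for _ in range(n_sim):
        ctrl = nb_counts(mean_count, dispersion, n_reps)
        trt = nb_counts(mean_count * fold_change, dispersion, n_reps)
        _, pval = stats.ttest_ind(np.log1p(ctrl), np.log1p(trt), equal_var=False)
        hits += pval < alpha
    return hits / n_sim

for n in (3, 5, 8, 12):
    print(f"{n} replicates per group -> power ~ {power(n):.2f}")
```

Even this toy version makes the price tag of hypothesis validation visible: the jump from three to eight replicates per group buys far more certainty than any amount of extra sequencing on the same three samples.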
We often talk to researchers who are eager to make use of transcriptomics, for example, but unfamiliar with the methods and their variations, and too strapped for time to become experts themselves. To illustrate, here are some of the bifurcations in the RNA-seq decision tree:

- Bulk tissue vs. single-cell (or single-nucleus) profiling
- Poly(A) selection vs. ribosomal RNA depletion during library preparation
- Full-length transcript coverage vs. 3' end counting
- Short reads vs. long reads, single-end vs. paired-end
- Stranded vs. unstranded libraries
Another important consideration is sequencing depth. The depth required depends on the assay, but also on the questions the data must answer (e.g., quantifying differentially expressed genes versus detecting splice variants). With the ever-decreasing per-base-pair cost of sequencing, depth is less and less of an issue. In fact, we often advise reallocating the sequencing budget toward more replicates sequenced to a lesser depth, rather than the other way around. Understanding the advantages and limitations of different sequencing strategies is crucial to designing efficient experiments.
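A quick back-of-the-envelope calculation shows why more, shallower replicates usually win under a fixed read budget. The sketch below uses the standard negative binomial variance decomposition (Poisson shot noise plus biological dispersion); the budget, gene abundance, and dispersion values are assumptions chosen for illustration only.

```python
# Replicates vs. depth under a fixed total read budget (illustrative numbers).
# The variance of a group's mean expression splits into technical shot noise,
# which depends only on total reads, and biological variance, which shrinks
# with every added replicate.
total_reads = 400e6       # fixed sequencing budget for the whole group (assumed)
gene_fraction = 5e-6      # gene captures ~5 reads per million (moderate expression)
bio_dispersion = 0.1      # squared biological CV between replicates (assumed)

for n_reps in (2, 4, 8, 16):
    depth = total_reads / n_reps               # reads per sample
    mu = depth * gene_fraction                 # expected counts per sample
    shot_noise = 1.0 / mu                      # technical CV^2 per sample (Poisson)
    cv2_of_mean = (shot_noise + bio_dispersion) / n_reps
    print(f"{n_reps:>2} reps x {depth / 1e6:.0f}M reads: "
          f"~{mu:.0f} counts/sample, CV^2 of group mean ~ {cv2_of_mean:.4f}")
```

The shot noise contribution to the group mean stays essentially constant because the total read count is fixed, while the biological variance term keeps shrinking as replicates are added, which is exactly the argument for spreading a budget across more samples.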
Getting a sense of the data vs. getting answers from the data
I empathize entirely with researchers who are in the discovery phase of a project, have collected a bit of data to get their creative muscles going, and indulge in a bit of data dredging. While the results of poking at such data must be kept in context and not overinterpreted, they can certainly provide new points of view. If new hypotheses are what you're after, you may very well benefit from such an approach, so long as you dispense with the (fundamentally flawed) expectation of finding and validating a hypothesis with the same small dataset.
However, here's an increasingly common scenario: a biopharma with a promising new drug runs a Phase I clinical trial with up to several dozen patients and successfully establishes a safe dosing regimen in humans. Along the way, they also collect secondary data, including RNA-seq from patient samples. Perhaps they even see a handful of enticing yet preliminary drug responses in this early patient cohort. While waiting to kick off the Phase II trial and focus properly on drug efficacy and adverse effects, they wish to probe the Phase I data for all sorts of answers, from predictors of efficacy to biomarkers for patient stratification. "So, can you do it?" How about some "fancy modeling"?
While the desire to get more information from existing data is commendable—and the potential to design Phase I trials for more than dose and safety is a hot topic of conversation—the limitations of the data at hand must be taken seriously. I believe the more one is willing to treat the exercise as a hypothesis generation process to inspire future clinical design and preclinical experiments, the more value one can extract from the data.
Another trap is falling for the hype and demanding neural networks and random forests when a critical review by a biostatistician is really what the doctor ordered. (Good news: we've got both talented statisticians and machine learning experts on the same team!) Fixation on methods rather than the scientific question invariably brings about disappointment. While there may be other (public) data that can inform potential machine learning efforts, small studies yield datasets of the wrong size and shape (too few samples and too many features, e.g., genes) for typical machine learning approaches. First and foremost, relationships between variables, batch effects, and confounders should be critically examined using established statistical approaches.
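To see why the shape of the data matters, consider the toy example below. It builds a dataset of pure noise with a typical small-cohort shape (24 samples, 20,000 "genes", labels that carry no signal); the classifier and the sizes are arbitrary choices for illustration, and the point is simply the gap between an in-sample score and an honest cross-validated one.

```python
# A cautionary sketch: with many more features than samples, even random data
# can look "predictive" in-sample. Sizes and classifier are illustrative; the
# point is the gap between the training score and the cross-validated score.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_samples, n_genes = 24, 20000               # small cohort, whole-transcriptome shape
X = rng.normal(size=(n_samples, n_genes))    # pure noise standing in for expression
y = np.array([0, 1] * (n_samples // 2))      # "responder" labels; X carries no signal

clf = RandomForestClassifier(n_estimators=200, random_state=0)
train_acc = clf.fit(X, y).score(X, y)              # memorizes the noise almost perfectly
cv_acc = cross_val_score(clf, X, y, cv=5).mean()   # honest estimate: roughly a coin flip

print(f"training accuracy: {train_acc:.2f}")   # close to 1.00
print(f"5-fold CV accuracy: {cv_acc:.2f}")     # close to 0.50
```

Selecting features on the full dataset before cross-validating produces the same kind of optimism, which is why a statistician's review of the analysis plan is often worth more than a fancier model.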
Finally, AI
Meanwhile, we are working on several projects in which machine learning is a valid and valuable approach to data-driven discovery. With enough data of the right kind, machine learning can uncover hidden patterns and suggest novel predictive and prognostic biomarkers. Several methods allow us to peek into the black box and learn about the rules that govern the predictions. Interpreting these rules in the context of biology can radically improve our understanding of pathophysiological processes in the studied system. Still, this takes time and labor, and iteration between domain and computational experts.
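As one small illustration of peeking into the black box, the sketch below uses permutation importance: shuffle one feature at a time on held-out samples and measure how much the model's performance drops. The synthetic data, the gradient boosting model, and the "gene" labels are all stand-ins chosen for this example; the real work lies in interpreting such importance scores in their biological context.

```python
# A minimal sketch of permutation importance on a synthetic dataset.
# Only the first two "genes" actually drive the (synthetic) response label,
# and the analysis should recover them as the top-ranked features.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n_samples, n_genes = 200, 50
X = rng.normal(size=(n_samples, n_genes))
# Response depends on genes 0 and 1 plus noise; the rest are irrelevant.
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Drop in held-out accuracy when each feature is shuffled, averaged over repeats.
result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
top = np.argsort(result.importances_mean)[::-1][:5]
for idx in top:
    print(f"gene_{idx}: importance {result.importances_mean[idx]:.3f}")
```

The same idea extends to richer interpretability tools, but the workflow is the same: train, interrogate, and then sit down with the biologists to decide which of the highlighted features make mechanistic sense.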
While we are bullish on the power of machine learning to transform biomedical discovery, we also wish to temper expectations, because data science is hard, like any other branch of science. For example, findings from one model do not always readily transfer between species, systems, or the technologies used to generate the data. Inherent bias in data distributions and model overfitting are frequent causes of headache, and the garbage-in, garbage-out principle may matter even more in machine learning than in standard statistical modeling. We are excited about the new possibilities brought about by the application of AI in biomedicine, but to do it properly, modeling must start with well-articulated questions, careful experimental design, and diligently planned data collection.
Extracting more and deeper insights from data has always been at the core of what we do at Genialis. From the outset, we identified three key problems to solve in order to transform biomedical discovery with data science. The first one we tackled, perhaps counter-intuitively, is the communication and explanation of results. In our minds, even AI is largely a human enterprise, and engaging all stakeholders to collaborate in data interpretation seemed a worthy challenge. Thus, we humbly started with real-time interactive visualizations that helped life scientists autonomously explore pre-processed gene expression data, and have grown from there.
The second problem is data management and (pre)processing. We then addressed primary processing and annotation of the data to ensure reproducibility, quality, and ease of integration. Over the years we've built professional-grade software for NGS data analysis and management, and have expanded our reach to many different flavors of sequencing.
Today our main focus is on the third piece: curating model-ready datasets to which we can apply the latest and greatest AI algorithms to mine the hidden gems. With a growing number of collaborators in biopharma and translational R&D, we are excited to flex our muscles and apply our expertise in machine learning to the rich datasets each partner brings to bear. The most successful of these projects are part of long-term partnerships in which we are able to provide guidance from tip to tail. But no matter the stage at which we get involved, we deliver the best possible outcomes with uncompromising scientific integrity.
If you think your research or clinical program can benefit from smarter data analysis, we’re here to help, and here for the long run.
About the Author

Luka Ausec, PhD
VP Scientific Discovery
Luka directs internal R&D and external partner projects, with the common goal of advancing therapeutic discovery through the rigorous application of data science. Luka’s expertise in biology and computational disciplines makes him uniquely adept at innovating solutions at this nexus. He believes a successful discovery process is built on clear lines of communication and unwavering scientific integrity. In addition, Luka oversees the implementation of Genialis’ informatics platform, and manages the team that helps customers engage directly via Genialis software. Luka earned his doctorate in molecular biology and biotechnology at the University of Ljubljana.
About Genialis
Genialis is a data science and drug discovery company focused on new ways to treat disease. Blending computational biology and AI-based methods, Genialis merges and models data at the intersection of clinical and translational medicine. Genialis is trusted by biopharma and big pharma alike, to validate targets, predict biomarkers and optimally position novel drugs. Together, Genialis and its partners are bringing improved solutions to drug discovery to change people’s lives.
For more information, visit www.genialis.com and follow @genialis on LinkedIn and Twitter. Or contact:
Nejc Škoberne
CCO, Genialis, Inc.
info@genialis.com