AI-Driven Target ID: What's Real, What's Hype

Every drug discovery company claims an AI platform now. We looked at what the actual diligence questions should be — the ones that separate genuine computational biology from a slide deck dressed up as a moat.


We have now reviewed more than forty companies with some version of "AI-driven target identification" in their pitch. The variance in what that phrase actually means is extraordinary. Some of these are genuinely building new computational biology infrastructure that could change how targets get prioritized. A significant number are applying off-the-shelf models to public data and calling the output a platform.

The problem is that both stories look similar at the top of the funnel. Both involve impressive visualizations of multidimensional genomic data, both claim to identify targets that human researchers would miss, and both have founding teams with publications and institutional affiliations that pass initial scrutiny. The difference emerges when you ask specific questions.

The Questions We Ask

The first question is about data. Where does the proprietary data actually come from, and what makes it non-replicable? Publicly available genomic datasets — TCGA, GTEx, UK Biobank, ENCODE — are valuable, but training a model on public data does not create a defensible data moat. If the competitive advantage is supposed to be computational rather than data-based, that requires a very different set of questions about model architecture and validation.

The second question is about the prediction-to-program pipeline. How many AI-generated target hypotheses has the company actually advanced to experimental validation? What fraction validated? What is the false positive rate on their predictions, and how do they know? Every serious platform company should have a developing body of empirical data on prediction accuracy. "We haven't gotten to experimental validation yet" at Series A is a different answer than at Series B.
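One way to make the "how do they know?" part concrete is to put an interval around the reported hit rate, since early validation counts are usually small. The sketch below is illustrative only: the function names and the example counts are hypothetical, and "false positive fraction" here means the share of advanced predictions that failed experimental validation, which is an assumption about how a company might define the term.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def validation_summary(advanced: int, validated: int) -> dict:
    """Summarize how AI-generated hypotheses fared in the wet lab.

    advanced:  hypotheses taken into experimental validation (hypothetical input)
    validated: hypotheses that passed validation (hypothetical input)
    """
    hit_rate = validated / advanced
    low, high = wilson_interval(validated, advanced)
    return {
        "hit_rate": hit_rate,
        # Share of advanced predictions that did not validate.
        "false_positive_fraction": 1 - hit_rate,
        "hit_rate_ci95": (low, high),
    }


# Hypothetical example: 20 hypotheses advanced, 5 validated.
summary = validation_summary(advanced=20, validated=5)
print(summary)
```

With counts this small, the interval around a 25% hit rate is wide, which is why the size of the validation set matters as much as the headline number.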

The third question is about the target output class. Are these genuinely novel targets — gene products with no prior drug discovery attention and strong genetic evidence of disease relevance — or are they variations on known targets in known pathways? The former is harder to generate and harder to validate, but it represents the actual value proposition of computational target ID. The latter is essentially a fast-follower strategy in existing competitive landscapes.

We have found that the most honest founders in this space lead with their false positive rate, not just their hit rate. A company that knows its failure modes has thought carefully about its biology. A company that only presents successes has not.

The AlphaFold Effect

Structure prediction deserves specific attention because it has become a shorthand for "we use AI" in drug discovery that sometimes misrepresents what is actually happening. AlphaFold and its successors are genuinely powerful — the ability to generate high-confidence protein structure predictions at scale has changed structure-based drug design in important ways. But structure prediction and target identification are not the same thing. Knowing a protein's structure does not tell you whether it is druggable, whether it is causally involved in disease, or whether modulating it will produce a therapeutic effect.

Companies that use AlphaFold-derived structures as one input into a broader target ID and prioritization framework are doing something real. Companies that describe AlphaFold use as equivalent to having an AI target ID platform are conflating two different things. We see both regularly.

Genetic Evidence as the Anchor

Our view, developed through work with multiple portfolio companies and through competitive analysis, is that genetic evidence of disease relevance is the most reliable anchor for evaluating an AI-generated target. Targets with strong human genetic evidence have historically succeeded in clinical trials at roughly twice the rate of targets without it. That statistical advantage has been replicated across multiple analyses and multiple disease areas.

AI platforms that prioritize targets by integrating GWAS signals, rare variant data, and functional genomics evidence are doing something that is both computationally non-trivial and biologically grounded. The more interesting ones are also pulling in epigenomic data, protein-protein interaction networks, and tissue-specific expression to contextualize which genetic signals are likely to represent actionable biology versus noise.
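As a rough illustration of what "integrating" evidence can mean at its simplest, the sketch below combines per-target evidence scores into a single priority score with fixed weights. Everything here is an assumption made for illustration: the field names, the 0-to-1 scaling, the weights, and the gene labels are hypothetical, and real platforms use far richer models than a weighted sum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetEvidence:
    """Per-target evidence, each field pre-scaled to the range 0..1 (assumed)."""
    gene: str
    gwas: float          # strength of common-variant (GWAS) association
    rare_variant: float  # strength of rare-variant evidence
    functional: float    # functional genomics support
    expression: float    # disease-tissue expression correlation


# Hypothetical weights: genetic evidence dominates, expression alone counts little.
WEIGHTS = {"gwas": 0.35, "rare_variant": 0.35, "functional": 0.20, "expression": 0.10}


def priority_score(target: TargetEvidence) -> float:
    """Weighted sum of evidence scores; raises if a score is outside 0..1."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        value = getattr(target, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} for {target.gene} must be in [0, 1]")
        score += weight * value
    return score


def rank_targets(targets: list[TargetEvidence]) -> list[TargetEvidence]:
    """Return targets sorted from highest to lowest priority score."""
    return sorted(targets, key=priority_score, reverse=True)


# Hypothetical candidates: one genetically supported, one expression-only.
candidates = [
    TargetEvidence("GENE_B", gwas=0.0, rare_variant=0.0, functional=0.1, expression=0.9),
    TargetEvidence("GENE_A", gwas=0.8, rare_variant=0.6, functional=0.5, expression=0.2),
]
for t in rank_targets(candidates):
    print(t.gene, round(priority_score(t), 2))
```

Under these assumed weights, the genetically supported candidate outranks the one backed only by expression correlation. The useful diligence question is how a platform's actual weighting was chosen and validated.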

Platforms that identify targets through expression correlations in disease transcriptomics without integrating genetic validation face a much higher rate of spurious associations. Correlation in disease tissue is interesting. Genetic evidence of causality is what you want before spending two years and significant capital on a program.

What Good Looks Like

The best computational target ID companies we have seen share a few characteristics. First, their data generation strategy is deliberate — they are collecting specific types of patient-derived data that are not publicly available and that are mechanistically relevant to their disease focus. Second, they have a wet lab team that runs systematic experimental validation, not to confirm the AI's outputs uncritically, but to build an honest feedback loop that improves prediction accuracy over time. Third, their target output strategy is integrated with a medicinal chemistry or drug modality view — they are not generating targets in the abstract but generating targets that map to accessible binding sites and tractable chemistry.

That combination is genuinely rare. It requires strong computational biology, strong experimental biology, and strong medicinal chemistry in the same organization, all coordinated around the same target validation pipeline. Most companies are strong in one or two of these and thin in the third. Finding the ones where all three are working together — and where the AI is actually closing the gap between them — is the work.