Genomics Data Infrastructure Is the Next Picks-and-Shovels Play

The genomics field has a data problem. Not a shortage of data — the opposite. Clinical sequencing volumes have grown roughly 30% annually for the last five years. Population genomics initiatives in the UK, US, and across Europe have generated hundreds of thousands of whole genome sequences attached to longitudinal clinical records. Research biobanks, hospital sequencing programs, and direct-to-consumer genetic testing companies have collectively created a corpus of human genomic data that is growing faster than the infrastructure to use it.

The challenge is not sequencing. The cost of whole genome sequencing has dropped from $3 billion in 2003 to under $200 today, and it continues to fall. The challenge is everything that happens after the sequencer produces a FASTQ file: data storage, quality control, variant calling, annotation, harmonization across datasets generated by different instruments and analysis pipelines, integration with clinical records, privacy-preserving analysis, and governance systems that allow data to be used for research while protecting patient rights.

These are largely unsolved problems, and they are the rate-limiting step in realizing the clinical and research value of the genomic data that already exists.

The Harmonization Problem Is Bigger Than It Looks

A whole genome sequence generated by one instrument, processed through one analysis pipeline, and annotated with one reference genome is not directly comparable to a whole genome sequence generated by a different instrument, processed through a different pipeline, and using a slightly different reference. The differences are small in absolute terms but meaningful at scale: population-scale analysis that combines datasets from multiple sources will systematically misclassify variants that differ by analysis methodology rather than by actual biology.

This is not a hypothetical problem. It has produced replication failures in large-scale genomic studies where a variant found to be statistically significant in one cohort fails to replicate in another cohort using different sequencing methodology. Some portion of those failures are genuine biology — the variant is population-specific or the original finding was a false positive. Some portion are methodology artifacts. Without harmonization infrastructure that can separate the two, it is hard to know which you are looking at.

The companies building harmonization tools — software platforms that can ingest genomic data from multiple sources, normalize it against common quality standards, re-call variants using consistent pipelines, and output harmonized datasets suitable for cross-cohort analysis — are building genuinely valuable infrastructure. The market for these tools is every research hospital, biobank, pharmaceutical company, and population genomics program that has heterogeneous genomic data it wants to use collectively.

We have spoken to genomics informatics teams at large academic medical centers who describe spending 40-60% of their analytical capacity on data wrangling before any actual biology gets done. That is a real productivity loss with a real market for tools that can reduce it.

Privacy-Preserving Analysis

Genomic data presents unique privacy challenges compared to other health data. A genome sequence is permanently identifying — you cannot change your genome the way you can change a password. The re-identification risk from genomic data, even when demographic identifiers are removed, has been demonstrated convincingly in research settings. This creates a genuine tension between the scientific value of large-scale genomic analysis and the privacy obligations that health systems and researchers carry.

Privacy-preserving computation approaches — federated learning, secure multi-party computation, differential privacy — offer partial solutions. The idea is to enable analysis across distributed genomic datasets without ever moving the raw data to a centralized location, or to add mathematical noise to query outputs that prevents re-identification while preserving statistical signal at population scale.

None of these technologies is fully mature for routine genomic research use, and there are real tradeoffs between privacy guarantee strength and analytical utility. But the regulatory pressure driving genomics data governance — GDPR in Europe, state-level genetic privacy laws in the US, and an increasingly active FDA interest in genomic data handling in the context of clinical applications — is making privacy-preserving analysis infrastructure from a nice-to-have to a must-have for programs that want to work across national or institutional boundaries.

The Interoperability Layer

Clinical genomic data exists in a fragmented landscape of health information systems. The electronic health record systems that house most clinical data were not designed to accommodate genomic data, which is orders of magnitude larger than conventional clinical data and requires specialized query and visualization interfaces. The genomic data management platforms that research institutions use are often custom-built, poorly documented, and not interoperable with each other or with clinical systems.

The HL7 FHIR standards for genomic data representation have been developing for several years and are reaching the point of deployment in a few forward-looking health systems. GA4GH, the Global Alliance for Genomics and Health, has been developing data access standards and interoperability frameworks that are increasingly being adopted by major genomics programs. The standards infrastructure exists; the implementation is the hard part.

Companies building middleware that connects genomic data platforms to existing clinical information systems, implements the emerging standards in ways that are practical for health system IT teams to deploy, and provides clinical-grade genomic data access for point-of-care decision support are addressing a real gap. The customer is every major health system that wants to use clinical sequencing data in its electronic workflows, which is increasingly all of them.

Investment Thesis

Our interest in genomics data infrastructure is direct. The sequencing volumes are growing and will not stop growing. The regulatory and privacy requirements around genomic data are tightening. The scientific value of combining genomic datasets across institutions and populations is well established. The tools to realize that value at scale do not fully exist yet.

The business models are varied: SaaS platforms, managed services for data harmonization, laboratory information management systems with genomics-native architecture, and analytics services built on federated data networks. The common thread is that these businesses benefit from the same secular growth in clinical sequencing volumes that the sequencing instrument companies benefit from — without the capital intensity of hardware development or the clinical trial risk of drug development.

That picks-and-shovels dynamic is one of the more durable investment theses in life sciences. We have deployed against it before in related contexts and we are doing so again here.

Genomics Data Infrastructure Is the Next Picks-and-Shovels Play

The Harmonization Problem Is Bigger Than It Looks

Privacy-Preserving Analysis

The Interoperability Layer

Investment Thesis

Continue Reading

Long-Read Sequencing Is Finally Moving Into the Clinic

What Spatial Transcriptomics Actually Means for Drug Targets

La Jolla's Biotech Ecosystem: A Primer for Founders