This is how you do AI for science: open source, trained on real data, solving actual problems that matter.
Evo 2 is a 40-billion-parameter model trained on 8.8 trillion DNA bases from bacteria, archaea, and eukaryotes. It can identify genetic features that are difficult for humans to spot - splice sites, regulatory DNA, disease-causing mutations, protein-coding regions. And unlike most frontier AI, it's fully open: model parameters, training code, inference code, and the entire training dataset.
The challenge with eukaryotic genomes is that they're messy. Genes are interrupted by introns that don't code for protein. Regulatory sequences can sit hundreds of thousands of base pairs away from the genes they control. The patterns that define important features are statistically subtle - not absolute rules, just tendencies.
That's exactly the kind of problem neural networks excel at: finding patterns in massive amounts of noisy data. Evo 2 was trained on genomes from all three domains of life, learning to recognize features through sheer repetition. If something is evolutionarily conserved across many species, it appears in multiple contexts, and the model learns to identify it.
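The training signal behind "learning through sheer repetition" is the same one language models use on text: predict the next token, get penalized for surprise. Here is a toy illustration of that objective on DNA - the probabilities and the splice-donor context are invented for the example, and nothing here reflects Evo 2's actual architecture, only the shape of the loss it minimizes:

```python
# Toy illustration of the next-base prediction objective a genomic language
# model trains on. The probability numbers are made up for this sketch; a real
# model produces them from billions of parameters.
import math

BASES = "ACGT"

def cross_entropy(probs, target):
    """Negative log-probability the model assigned to the true next base."""
    return -math.log(probs[BASES.index(target)])

# Hypothetical model output after seeing some exon-end context: if the model
# has learned that GT splice donors often follow exon boundaries, it puts most
# of its probability mass on G.
probs = [0.05, 0.05, 0.60, 0.30]   # P(A), P(C), P(G), P(T)

loss_if_G = cross_entropy(probs, "G")   # low loss: the expected base
loss_if_A = cross_entropy(probs, "A")   # high loss: a surprising base
print(loss_if_G < loss_if_A)            # → True
```

Sequences conserved across many species show up in the training data again and again, so mispredicting them is repeatedly punished - which is how a pure next-base objective ends up encoding biological features.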
What makes this genuinely impressive is that the model wasn't fine-tuned on specific tasks. It learned to recognize splice sites, regulatory regions, and disease mutations just from seeing enough genomes. It even figured out that different species use different genetic codes and learned to apply the right code based on which organism it was analyzing.
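The zero-shot mutation-scoring idea is simple at its core: score the reference sequence under the model, score the same sequence with the variant substituted in, and compare. A big drop in likelihood suggests the variant disrupts a pattern the model learned. The sketch below shows that logic with a toy 3-mer probability table standing in for the model - `score_sequence`, `delta_loglik`, and the table are inventions for this example, not Evo 2's API:

```python
# Sketch of zero-shot variant-effect scoring by log-likelihood comparison.
# The toy 3-mer table below is a stand-in for a trained model's likelihoods;
# a real workflow would query the model for sequence scores instead.
import math

def score_sequence(seq, kmer_logp):
    """Sum of 3-mer log-probabilities - a crude proxy for model log-likelihood.
    Unseen 3-mers get a small floor probability."""
    return sum(kmer_logp.get(seq[i:i + 3], math.log(1e-4))
               for i in range(len(seq) - 2))

def delta_loglik(ref, pos, alt_base, kmer_logp):
    """Log-likelihood change from substituting alt_base at pos.
    Strongly negative values suggest the variant disrupts a learned pattern."""
    alt = ref[:pos] + alt_base + ref[pos + 1:]
    return score_sequence(alt, kmer_logp) - score_sequence(ref, kmer_logp)

# Hand-built table that favors 3-mers from the reference context, mimicking
# a model that has learned this region well.
ref = "AAGGTAAGT"  # exon/intron boundary with a canonical GT splice donor
kmer_logp = {ref[i:i + 3]: math.log(0.2) for i in range(len(ref) - 2)}

# Disrupting the GT donor (position 3, G -> C) drops the likelihood sharply.
print(delta_loglik(ref, 3, "C", kmer_logp))  # negative: variant looks harmful
```

The real model replaces the 3-mer table with context windows spanning up to a million bases, which is what lets it notice when a distant mutation breaks a regulatory relationship no short-window tool would see.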
Compare this to consumer AI hype. Evo 2 isn't claiming to revolutionize everything. It's designed to solve a specific scientific problem: annotating genomes and identifying important features. It's not replacing researchers - it's giving them a tool to accelerate their work.
And because it's fully open source, other scientists can build on it, verify the results, and adapt it to their own research questions. That's how scientific AI should work: transparent, reproducible, focused on actual problems.
The technology is impressive. The question is how many other fields could benefit from this approach instead of chasing hype.