
Insights from sequencing nearly a million exomes

Happy Saturday! My colleagues at the Regeneron Genetics Center (RGC) recently published an important paper in Nature, describing an analysis of exome sequencing data from nearly one million humans (n=983,578). The impressive part of this work is that the exomes were combined so that variant calling could be performed on a single dataset. This was not an easy task, as it involves many technical and computational challenges. My colleague Suganthi Balasubramanian and her computational biology team, who mainly led this work at the RGC, accomplished it elegantly. I wrote a Twitter post on Monday summarizing the major findings, which I am sharing here with some additional thoughts.

Whenever I read a new genetics paper, I travel back in time and glance through old papers on the same topic to appreciate what was known back then and how the knowledge has evolved over time. It gives important perspective on the value of the current work.

Tracing through large-scale exome sequencing studies in the literature, I found a Science paper from 2012 on the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project. It was based on whole exome sequencing data of 2,440 humans, which is 0.25% of the current sample size. In a span of just 12 years, we have been able to scale the sample size roughly 400-fold, thanks to the steep drop in sequencing costs over the past decade and large-scale investment in human genetics by biotech companies like Regeneron.

By sequencing 2,440 participants, the NHLBI team identified ~500,000 single nucleotide variants (SNVs). Our genome is approximately 3 billion base pairs long, ~1% of which, that is, 30 million base pairs, is typically captured by whole exome sequencing. So, in 2012, researchers were able to capture, on average, one spelling error per 60 base pairs of the exome. You’ll be impressed to see how these statistics have changed now with the sequencing of ~980k humans.
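For readers who like to check the back-of-the-envelope numbers, here is a minimal Python sketch of the arithmetic above. It uses only the rounded figures quoted in this post (a 3 billion base pair genome, ~1% captured by exome sequencing, ~500,000 SNVs in 2,440 participants, and 983,578 exomes today). The function and constant names are my own, chosen for illustration, and are not taken from either paper.

```python
# Rounded figures quoted in the post; all are approximations.
GENOME_BP = 3_000_000_000   # approximate human genome length
EXOME_FRACTION = 0.01       # ~1% of the genome captured by exome sequencing
ESP_SAMPLES = 2_440         # NHLBI Exome Sequencing Project, 2012
ESP_SNVS = 500_000          # approximate SNV count reported in 2012
RGC_SAMPLES = 983_578       # sample size of the current study


def exome_bp(genome_bp=GENOME_BP, fraction=EXOME_FRACTION):
    """Approximate number of base pairs targeted by exome sequencing."""
    return round(genome_bp * fraction)


def bp_per_variant(target_bp, n_variants):
    """Average spacing between observed variants, in base pairs."""
    return target_bp / n_variants


def fold_increase(new_n, old_n):
    """How many times larger the new sample size is than the old one."""
    return new_n / old_n


if __name__ == "__main__":
    target = exome_bp()
    print(f"Exome target: {target:,} bp")
    print(f"One SNV per {bp_per_variant(target, ESP_SNVS):.0f} bp in 2012")
    print(f"Sample size growth: {fold_increase(RGC_SAMPLES, ESP_SAMPLES):.0f}-fold")
    print(f"2012 cohort as share of today's: {100 * ESP_SAMPLES / RGC_SAMPLES:.2f}%")
```

Because the inputs are rounded, the outputs should be read as order-of-magnitude checks rather than exact statistics from the papers.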

One million human silhouettes tiled in the shape of a DNA double helix, as imagined by DALL-E

Background of RGC

The Regeneron Genetics Center (RGC) was established in 2014, just as major pharma companies started entering the human genomics field. Last year, RGC celebrated its 10th anniversary. I've written about the origin story of RGC before.

The business model of RGC is simple and efficient. It collaborates with academic institutions across the world and provides sequencing as a free service in exchange for access to genotypic

...
Read full article on GWAS Stories →