The exome fallacy

Are you fully covered? My experience with a phenomenon I shall call the exome fallacy began in 2011. While browsing the exomes of a few patients with epileptic encephalopathies, we wanted to have a quick look at whether we could use the exome data to exclude mutations in the epilepsy gene SCN1A. As some of you might already guess, the words “exome” and “exclude” don’t go well together. We learned the hard way that each individual exome covers certain parts of the gene quite well, but not necessarily all of it. If you expect your exome data to have sufficient quality to cover an entire gene in several individuals, you end up disappointed. But there is even more to the exome fallacy than you might think…

Coverage. I will use the term coverage quite often in this post, but what exactly is meant by this? Basically, the building blocks of exome data are the so-called reads. Reads are short sequences that are determined by the next-generation sequencing (NGS) machine. NGS technologies can produce millions of these reads, which cover various parts of the target sequence. An exome, i.e. a panel of a large number of exons in the human genome, might be such a target. Once the reads leave the sequencer, there is no more wet lab involved until researchers decide to validate findings with PCR. Fitting all reads to the target sequence (alignment), identifying genetic variants that differ from the human reference genome (calling) and any further interpretation of these variants are purely computer-based. A particular base pair in the human genome is usually included in many sequence reads. The number of reads stretching across a given base pair is referred to as coverage, a quality criterion for NGS data. While a meaningful interpretation can be achieved with 5x or 10x coverage in some special cases, many studies use a 20x cut-off. For NGS data used for diagnostics, even higher coverage (100x) is sometimes required. Last but not least, coverage and read lengths might differ between NGS platforms, complicating matters even further.
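To make the definition concrete, here is a minimal sketch of how coverage could be counted by hand. It assumes that aligned reads are given as simple (start, end) intervals on the target, which is a simplification; the function name `per_base_coverage` and the toy reads are made up for illustration and are not part of any real pipeline.

```python
def per_base_coverage(reads, target_start, target_end):
    """Count how many reads stretch across each base of a target interval.

    reads: iterable of (start, end) alignments, 0-based, end-exclusive
           (an assumed, simplified representation of aligned reads).
    Returns a list with one depth value per target position.
    """
    depth = [0] * (target_end - target_start)
    for start, end in reads:
        # Clip each read to the target so off-target bases are ignored.
        lo = max(start, target_start)
        hi = min(end, target_end)
        for pos in range(lo, hi):
            depth[pos - target_start] += 1
    return depth


# Hypothetical toy reads on a 10 bp target; the last read lies off-target.
reads = [(0, 5), (2, 8), (4, 10), (20, 25)]
depth = per_base_coverage(reads, 0, 10)
print(depth)
```

In this toy case the middle of the target is covered by up to three reads while the edges are covered by only one, which is the kind of unevenness that a single "mean coverage" figure hides.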

Figure 1. Circling in on exome data. When comparing the coverage of exons in a particular region across three individuals, the percentage of base pairs with sufficient quality in all three patients is merely 68%.

The 68% solution. Exome data can be quite heterogeneous when researchers have very specific questions. Figure 1 shows such an example, which was presented by Bobby Koeleman from UMC Utrecht at the EuroEPINOMICS meeting of the 1000 exomes. In this case, a family with three affected children was investigated for mutations in a very specific region that had been identified with linkage analysis. We would usually think that we have a good chance of finding the causative mutation with exome data. However, the exomes don’t keep their promise. If we compare the exons in the region of interest, only about two thirds of the targeted base pairs have sufficient coverage in all three patients. This means that roughly every third exonic mutation would simply be missed in this region.
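The drop from "well covered in each patient" to "well covered in all patients" can be sketched in a few lines. The depth values below are invented toy numbers, not the data behind Figure 1, and the function name `fraction_covered_in_all` and the default 20x cut-off are assumptions for illustration.

```python
def fraction_covered_in_all(depths_by_sample, min_depth=20):
    """Fraction of target positions with at least min_depth in every sample.

    depths_by_sample: list of per-base depth lists, one per individual,
                      all describing the same target positions.
    """
    n_positions = len(depths_by_sample[0])
    if any(len(d) != n_positions for d in depths_by_sample):
        raise ValueError("all samples must cover the same target positions")
    # zip(*...) walks through the target position by position across samples.
    shared = sum(
        1
        for column in zip(*depths_by_sample)
        if all(depth >= min_depth for depth in column)
    )
    return shared / n_positions


# Hypothetical toy depths for three individuals over ten target bases.
patient_a = [30, 30, 30, 30, 30, 30, 30, 30, 5, 5]
patient_b = [5, 5, 30, 30, 30, 30, 30, 30, 30, 30]
patient_c = [30, 30, 30, 30, 5, 5, 30, 30, 30, 30]

for name, depths in [("A", patient_a), ("B", patient_b), ("C", patient_c)]:
    print(name, fraction_covered_in_all([depths]))
print("all three", fraction_covered_in_all([patient_a, patient_b, patient_c]))
```

Each toy patient looks fine in isolation, but the poorly covered stretches fall in different places, so the shared fraction is much lower than any individual one. This is why the intersection across individuals, rather than the per-sample figure, matters for a family study.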

Exomes are screening tools. This example tells a very clear and simple story. Exomes are screening tools for sequence, not sequencing tools in the conventional sense. Exomes can never exclude mutations in a particular gene unless we are very sure and thorough about the coverage in the complete coding region. We simply don’t know about false negatives as there is no gold standard to compare to. In addition, exomes are rife with false positive findings.

The “Me, Me, Me” genes. Some genes show up in every exome. Parts of the human genome are highly polymorphic, i.e. they carry many variants, and genes in these regions tend to turn up in candidate lists, where they blur the otherwise clear view of potentially causative variants. Lists of these genes are available online, but as a brief guideline it is sufficient to simply ignore all genes starting with MUC… or USP…
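As a rough illustration of this rule of thumb, the sketch below drops variants by gene-name prefix. The variant records are invented, the dictionary layout with a "gene" key is an assumption, and a curated gene list would be the more careful choice in practice.

```python
# Prefixes from the rule of thumb above; extend or replace with a curated list.
NOISY_PREFIXES = ("MUC", "USP")


def drop_noisy_genes(variants, prefixes=NOISY_PREFIXES):
    """Remove variants whose gene symbol starts with one of the prefixes.

    variants: list of dicts with a "gene" key (a hypothetical record layout).
    """
    return [
        variant
        for variant in variants
        # str.startswith accepts a tuple and matches any of its entries.
        if not variant["gene"].upper().startswith(prefixes)
    ]


# Hypothetical variant list for illustration only.
variants = [
    {"gene": "SCN1A", "change": "missense"},
    {"gene": "MUC16", "change": "missense"},
    {"gene": "USP17L2", "change": "frameshift"},
    {"gene": "EFHC1", "change": "missense"},
]
kept = drop_noisy_genes(variants)
print([v["gene"] for v in kept])
```

A prefix filter is blunt: it is quick to apply, but it would also hide a genuine finding in one of these gene families, so it is best treated as a first pass rather than a final verdict.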

The unknown unknowns. It wasn’t the intention of this post to discourage exome sequencing, quite the contrary. However, exome data should be handled with caution. Even though some published papers make interpretation of exome data look like a breeze, we deal with many unknowns. Just imagine that variants in many possible high-impact candidate genes might be hiding in exome data at 19x coverage while you apply a 20x cut-off. We simply don’t know what we don’t know.
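One modest way to shrink the pile of unknowns is to list what narrowly fails the cut-off instead of discarding it silently. The sketch below assumes each variant record carries a "depth" value; the records, the function name `near_miss_variants` and the margin of 5 reads are all made up for illustration.

```python
def near_miss_variants(variants, cutoff=20, margin=5):
    """Return variants whose depth falls just below the coverage cut-off.

    A variant is a near miss if cutoff - margin <= depth < cutoff.
    """
    return [
        variant
        for variant in variants
        if cutoff - margin <= variant["depth"] < cutoff
    ]


# Hypothetical variant calls with their read depth at the variant position.
calls = [
    {"gene": "GENE_A", "depth": 19},
    {"gene": "GENE_B", "depth": 35},
    {"gene": "GENE_C", "depth": 8},
    {"gene": "GENE_D", "depth": 20},
]
for variant in near_miss_variants(calls):
    print(variant["gene"], variant["depth"])
```

Such a list does not make the calls at 19x any more reliable, but it shows which candidates were lost to the threshold and might deserve a second look or validation.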

