Exome sequencing in epileptic encephalopathies – the powers that be

The power, over and over again. I must admit that I am thoroughly confused by power calculations for rare genetic variants, particularly for de novo variants that are identified through trio exome sequencing. Carolien has recently written a post about the results we can expect from exome sequencing studies. For a current grant proposal, I have now tried to estimate the rate of de novos using a small simulation experiment. And I have realized that we need to re-think the concept of power.

Power and errors. The idea of statistical power is often used, referring to studies that are “sufficiently powered” to detect a difference. However, I always need to think twice when I try to explain this concept to somebody, as it is not easy to grasp. Power has something to do with testing errors and hypotheses, so please give me a minute to review these concepts first. If you don’t feel that you need a review on these topics, please skip to the last sentence of this paragraph. Doing statistics sometimes involves spinning around a countless number of times until you don’t remember anymore where you eventually came from. This, for example, is the case for the p-value. Just imagine you are interested in the height of men and women. Even though you already assume that men are taller than women, statistics require that you turn a blind eye to this. You have to assume that both groups are equally tall (your null hypothesis) only to be surprised that they are not. The level of surprise is captured by your p-value: the smaller it is, the more surprising your data would be if the null hypothesis were true. The critical element in this scenario is the null hypothesis, as you can make two different mistakes pertaining to this. Depending on your criteria, you can reject the null hypothesis even though it’s true (Type I error), i.e. you conclude that men and women differ in height, even though they are equally tall. Alternatively, you might conclude that men and women are equally tall, even though they are not. This is a Type II error. And the Type II error (β) is closely related to the concept of power (1-β). In brief, power is the probability of finding a difference if it is actually there. However, if we search for de novo variants in epileptic encephalopathies, what exactly is the difference that we are looking for?
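For readers who prefer symbols, the two error types and power can be written down compactly. This is only the standard textbook notation, with α and β standing for the Type I and Type II error rates and H_0 for the null hypothesis; it is not specific to de novo variants.

```latex
\begin{align*}
\alpha &= P(\text{reject } H_0 \mid H_0 \text{ true}) && \text{(Type I error)} \\
\beta &= P(\text{do not reject } H_0 \mid H_0 \text{ false}) && \text{(Type II error)} \\
\text{power} &= 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ false})
\end{align*}
```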

Where is the difference? We have reviewed many of the recent papers on de novo mutations in autism, schizophrenia and intellectual disability and, in a certain way, all these papers convey the same message: every individual, affected or not, carries between 3 and 5 de novo variants (“innocent genes”). In patients with neurodevelopmental disorders, some of these variants are causative or contributory to disease (“guilty genes”). How can we tell these two classes apart? On a group level, we would expect a higher frequency of de novo variants in patients with neurodevelopmental disorders or epileptic encephalopathies. However, the recent studies have shown that sample sizes need to be very large to detect this difference, which brings us back to the concept of power. A sample size of ~200 trios is probably sufficient to find a difference in the frequency of de novo mutations with a p-value of 0.05. Looking through our GWAS goggles, this is troubling. We are used to much more stringent p-values in genetics. If we then include a comparison of mutation subclasses based either on prediction or on putative gene function, we will soon run into a multiple testing problem. In addition, these results on a group level don’t tell us anything about the actual genes. Therefore, another criterion is frequently used to assess the role of de novo variants: recurrence. Genes are accepted to be implicated in the etiology of a disease if they are found to be mutated in more than one patient. For example, SCN2A is implicated in autism as several patients with autism were found to have de novo mutations in this gene. Compared to the number of genes found to be mutated, the number of recurrent genes is much smaller. As recurrence instead of group difference appears to be the most straightforward way of judging the role of particular genes, we have to rethink our way of talking about the power of a study.

The null hypothesis, revisited. What is our null hypothesis, what is our p-value for recurrence of de novo mutations in epileptic encephalopathies? The most obvious answer is quite simple. There is no such thing. We don’t compare groups, we don’t perform a statistical test. If we judge the relevance of genes simply by recurrence, we need to address this issue differently. One solution for this is to estimate the number of samples needed to find a gene at least twice.

An issue of architecture. We have tried such an estimation using a small simulation. The ingredients we need for this are (a) the frequency of a given disease variant in patients and (b) the number of genes involved. For our simulation, we assumed that 400 genes contribute to the disease and that each gene is mutated in 0.2% of the patient population. This, at least to us, might be a reasonable guess for the genetic architecture of the epileptic encephalopathies, modeled after what is known in autism: many genes contribute to the disease and mutations within a given gene are rare. Assuming that each of these 400 genes is mutated independently at a frequency of 0.2%, different constellations can be expected. We simulated these constellations 1000 times, each time allowing each gene to be mutated within a patient with a probability of 0.2%. We then counted the number of genes that were mutated at least twice in a cohort of 100 patients (Figure).

Figure. Estimated number of genes found to be mutated at least twice in patients with epileptic encephalopathies, assuming 400 risk genes with a frequency of 0.2% each. In more than 80% of simulations, 5 or more genes will be identified in a cohort of 100 trios.
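For readers who want to try this themselves, here is a minimal Python sketch of such a simulation. It is not the code behind the figure. The function name `recurrent_gene_counts`, its parameter names and the fixed seed are our own choices for illustration. To keep it fast without extra libraries, the sketch does not test every gene in every patient separately; it jumps directly from one mutated (gene, patient) pair to the next using geometrically distributed gaps, which should be equivalent to drawing each pair independently with a probability of 0.2%.

```python
import math
import random
from statistics import median


def recurrent_gene_counts(n_genes=400, freq=0.002, n_patients=100,
                          n_sims=1000, seed=1):
    """For each simulated cohort, count the genes mutated in >= 2 patients.

    All names and defaults are illustrative assumptions; the defaults mirror
    the scenario in the text (400 genes, 0.2% each, 100 patients, 1000 runs).
    """
    rng = random.Random(seed)
    n_cells = n_genes * n_patients      # one cell per (gene, patient) pair
    log_miss = math.log(1.0 - freq)     # log-probability that a cell is NOT mutated
    results = []
    for _ in range(n_sims):
        hits = [0] * n_genes            # number of mutated patients per gene
        cell = -1
        while True:
            # Every cell is mutated independently with probability `freq`, so
            # the number of unmutated cells before the next mutated one follows
            # a geometric distribution. Drawing that gap directly replaces
            # testing each cell one by one.
            u = 1.0 - rng.random()      # uniform on (0, 1], avoids log(0)
            cell += int(math.log(u) / log_miss) + 1
            if cell >= n_cells:
                break
            hits[cell // n_patients] += 1   # cells are laid out gene by gene
        results.append(sum(1 for h in hits if h >= 2))
    return results


if __name__ == "__main__":
    counts = recurrent_gene_counts()
    share = sum(1 for c in counts if c >= 5) / len(counts)
    print("median number of recurrent genes:", median(counts))
    print("share of cohorts with >= 5 recurrent genes:", round(share, 2))
```

Changing `n_patients`, `n_genes` or `freq` gives a feeling for how strongly the number of recurrent genes depends on the assumed genetic architecture.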

The bandwidth of possibilities. In summary, the number of genes occurring at least twice has a median of 7. This means that, using the parameters outlined above, we expect 7 or more genes to be mutated in at least two patients in at least half of the simulations, i.e. of the 400 genes involved, our cohort of 100 patients (actually trios, as we look for de novo variants) will pick up a few. In 80% of experiments, at least 5 genes are found to be recurrent, which may serve as a proxy for the 80% power that is customary in genetic association studies.
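As a rough cross-check that does not rely on simulation, the expected number of recurrent genes can be worked out from the binomial distribution. The symbols G (number of genes), f (frequency per gene) and n (number of patients) are our shorthand for the parameters used above, and the calculation assumes that genes and patients are independent.

```latex
\begin{align*}
P(\text{gene hit} \ge 2\ \text{times}) &= 1 - (1-f)^{n} - n f (1-f)^{n-1} \\
 &= 1 - 0.998^{100} - 100 \cdot 0.002 \cdot 0.998^{99} \approx 0.0174 \\
E[\text{recurrent genes}] &= G \cdot P(\text{gene hit} \ge 2\ \text{times}) \approx 400 \cdot 0.0174 \approx 7
\end{align*}
```

The expected value of about 7 agrees with the median from the simulation, which suggests that the simulated distribution behaves as it should under these assumptions.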

16 thoughts on “Exome sequencing in epileptic encephalopathies – the powers that be”

  1. Very nice to see the distribution of multiple hits and not just the average. I think the null hypothesis would be that no genes are involved in the disease, whereas the alternative hypothesis would be that 1 or more (or some specific number) genes would be involved in the disease. Under the null hypothesis, there would still be de novo variants found in the sample, but double hits would be rare and triple hits very rare. An example of this null-distribution can be found in Neale et al (2012) in Nature, supplementary table 7. It shows that if 100 de novo variants are found, 0.6 genes will have double hits just randomly. Similarly, Sanders et al (2012) show in their figure S7 expectations of double and triple hits under various hypotheses, but now split by missense and nonsense variants. Using the null distribution, one can determine whether a double hit is sufficient evidence for the involvement of a particular gene. Assuming equal sequencing efficiency etc. for Neale and RES, if seven double hits would be found in a set of 100 de novo variants in the RES experiment, I would guess that on average six of them are probably true disease genes, and the last one may be true or not.

    • Thanks for the comment. I think that it is an interesting way to look at this and I am curious how this entire field of “de novo power studies” will evolve with larger number of samples sequenced. I think the bottom line of your calculations and the calculations by Neale (2012) and Sanders (2012) is that some random double hits may occur given the null distribution. This basically means that 5-10% of all genes with recurrent de novo mutations are not to be trusted. However, as there is no replication, we have no way of telling noise from true positives.
