How much? Last week, we discussed the probability of finding de novo variants in patients with epileptic encephalopathies, but our calculations were only half the story. Genes that are identified through genome-wide sequencing technologies are often validated in additional cohorts. In many cases, we will only be able to establish a given gene as causative if we find another patient with a mutation in this gene. I was therefore asked to write an additional post on power calculations for rare variants in validation cohorts. Let me tell you the story of how I stumbled across a little bit of almost forgotten high school math.
Again, the power. In our previous post, I pointed out that power studies for rare variants are different from the power calculations that we know from association studies for common variants. In brief, for rare variants, we are happy if we find anything at all. We are interested in the probability of finding at least two de novo variants in a given gene or, in the case of validation cohorts, at least a single mutation to show that the gene under investigation is involved. As we assume a monogenic model, there is no odds ratio or effect size; we simply don’t expect mutations of this gene in controls.
Back to school. I have voiced my criticism of Excel in the past, but I must admit that I came back to using it for the graph in the figure. I had already started up R to write some scripts when I realized that the answer to the question I was looking for is actually quite simple, something that I remember from high school math. For example, a typical question might be the following: we have a cohort of 500 patients and we would like to know the probability that a gene with a frequency of 0.001 is found at least once. We assume that mutations in this gene are rare in patients (one in one thousand) and we would be happy to find it at least once. This probability can be derived by calculating the probability that this gene is not found in any of 500 consecutive patients (0.999^500 ≈ 0.61) and subtracting this from 1 (0.39). Accordingly, the probability that a gene with a frequency of 1/1000 is found at least once in a cohort of 500 patients is 39%. These probabilities can be calculated for different gene frequencies and cohort sizes.
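For readers who would rather not open a spreadsheet, the same arithmetic can be sketched in a few lines of Python. This is only a minimal illustration; the function name `prob_at_least_one` and the list of cohort sizes are my own choices for the example, and it assumes that patients are independent and that every patient has the same chance of carrying a mutation in the gene.

```python
def prob_at_least_one(frequency: float, cohort_size: int) -> float:
    """Probability that a gene is found at least once in a cohort.

    Assumes independent patients, each carrying a mutation in the
    gene with the same probability `frequency`.
    """
    # Probability that none of the patients carries a mutation
    prob_none = (1.0 - frequency) ** cohort_size
    # Complement: at least one patient carries a mutation
    return 1.0 - prob_none


# The example from the text: frequency 1/1000, cohort of 500 patients
print(round(prob_at_least_one(0.001, 500), 2))  # 0.39

# The same calculation for a few illustrative cohort sizes
for n in (100, 500, 1000, 2000):
    print(n, round(prob_at_least_one(0.001, n), 2))
```

Looping over several frequencies as well as several cohort sizes gives the kind of table that can be plotted as a family of curves.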
Null hypothesis and power. How can we interpret this probability in terms of the power studies we know from association studies? Basically, our null hypothesis is that a given gene is not involved in the disease. If the gene is in fact involved, we would like to reject this null hypothesis, and we are interested in how often we would fail to do so. In these cases, we would commit a type II error. This would be the case if we did not find this gene in our cohort, i.e. the 61% probability mentioned above. Power refers to the probability of rejecting the null hypothesis if it is false, i.e. finding the gene at least once if it is involved in the disease. Therefore, the 39% probability mentioned above is the power of our study.
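Once the probability is read as power, the question can also be turned around: how large does a validation cohort need to be to reach a given power? The sketch below simply solves 1 - (1 - f)^n ≥ power for n. The function names and the 80% target are assumptions chosen for illustration, not values from our study, and the same independence assumption as before applies.

```python
import math


def power_at_least_one(frequency: float, cohort_size: int) -> float:
    """Power: probability of finding the gene at least once."""
    return 1.0 - (1.0 - frequency) ** cohort_size


def required_cohort_size(frequency: float, power: float) -> int:
    """Smallest cohort size that reaches the requested power.

    Solves 1 - (1 - frequency)^n >= power for n, which gives
    n >= log(1 - power) / log(1 - frequency).
    """
    return math.ceil(math.log(1.0 - power) / math.log(1.0 - frequency))


frequency = 0.001  # gene frequency from the example in the text

# Power and type II error for the cohort of 500 patients
power = power_at_least_one(frequency, 500)
print(round(power, 2), round(1.0 - power, 2))  # 0.39 0.61

# Hypothetical target of 80% power (an assumption for this sketch)
print(required_cohort_size(frequency, 0.8))
```

Under these assumptions, a gene with a frequency of 1/1000 needs a cohort of roughly 1,600 patients to be found at least once with 80% power, more than three times the 500 patients in the example.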