Now that the experiments to find de novo variants for epileptic encephalopathies within the EuroEPINOMICS RES project are well underway and the first data are coming out, it is a good moment to pause and think about what results we can expect and how they should be interpreted. Helpfully, recent large experiments in autism have provided a great deal of useful data. In this post, I will explore what we can expect from experiments in which we perform whole exome sequencing in a group of patients and their parents to identify de novo variants that could be the cause of the disorder.
The case of the innocent gene. Recent experiments in autism (1-4) have shown that, when de novo variants play no role in causing the disorder, the average number of de novo variants detected in whole exome experiments is 0.65 to 1.0 per individual, depending on the quality of the data. If the number of de novo variants per individual follows a Poisson distribution, this means that in 37% to 52% of the individuals we would expect to find no de novo variant at all, while 14% to 26% of the patients would have more than one de novo variant. If de novo variants play a considerable role in the disease, the average number of de novo variants in patients should be higher.
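The percentages above can be checked with a few lines of Python. This is a minimal sketch that assumes the number of de novo variants per individual is Poisson distributed; that assumption, and the function names, are mine and not taken from the cited papers.

```python
import math

def poisson_pmf(k, mean):
    """Probability of observing exactly k events for a Poisson mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def fraction_without_variant(mean):
    """Expected fraction of individuals with no de novo variant."""
    return poisson_pmf(0, mean)

def fraction_with_multiple(mean):
    """Expected fraction of individuals with more than one de novo variant."""
    return 1.0 - poisson_pmf(0, mean) - poisson_pmf(1, mean)

# The two average rates quoted in the text (assumed Poisson means).
for mean in (0.65, 1.0):
    none = fraction_without_variant(mean)
    multiple = fraction_with_multiple(mean)
    print(f"mean {mean}: none {none:.0%}, more than one {multiple:.0%}")
```

A clearly higher average in patients than this background range would be the first hint that de novo variants contribute to the disease.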
Finding double hits. If we sequence a series of S patients and their parents, and no de novo variants are involved in the disease, we can calculate how often we would find, just by chance, two non-synonymous mutations in the same gene in independent patients. In fact, Sanders et al. (4) ran simulations that show how often this would happen. Note that it happens more often than you might expect, considering that we have over 20,000 genes. In a set of 50 patients, finding one gene with non-synonymous mutations in two different patients would happen with a probability of about 5%. Because nonsense mutations are much rarer, finding two nonsense mutations in an innocent gene is much less likely: in a set of 50 patients, it would happen with a probability of less than 5 × 10⁻⁴. Finding three non-synonymous mutations in an innocent gene is also a much rarer event: in a set of 50 patients, it would happen with a probability of less than 10⁻⁴. So we can conclude that, if we have sequenced a set of 50 patients and their parents and find a single gene with two non-synonymous de novo mutations, this does not constitute significant evidence for the involvement of the gene in the disease. Finding two nonsense mutations or three non-synonymous de novo mutations in the same gene would, however, be so unlikely that it would strongly support the involvement of the gene in the disease. For larger sets of patients, the probabilities can be read from the figures in Sanders et al. or be computed relatively easily.
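To get a feeling for why chance double hits are not that rare, here is a rough sketch in Python. It is not the simulation of Sanders et al.: it assumes that chance hits in each gene are independent and Poisson distributed, with a mean proportional to a per-gene weight (relative mutability). The rate of 0.65 non-synonymous de novo mutations per patient, the gene count of 20,000 and the toy weights are illustrative assumptions of mine.

```python
import math

def prob_any_double_hit(n_patients, rate_per_patient, gene_weights):
    """Probability that at least one gene collects two or more chance hits.

    Assumption: hits per gene are independent Poisson counts whose mean is
    proportional to the gene's weight. Two hits in the same patient are
    counted as well, which is negligible at these rates.
    """
    total_weight = sum(gene_weights)
    log_p_no_double = 0.0
    for weight in gene_weights:
        lam = n_patients * rate_per_patient * weight / total_weight
        # log P(0 or 1 hit) = log(exp(-lam) * (1 + lam))
        log_p_no_double += -lam + math.log1p(lam)
    return -math.expm1(log_p_no_double)

n_genes = 20000  # round number taken from the text
equal = [1.0] * n_genes
# Toy example of unequal mutability: half the genes are three times as mutable.
unequal = [1.0] * (n_genes // 2) + [3.0] * (n_genes // 2)

print(prob_any_double_hit(50, 0.65, equal))
print(prob_any_double_hit(50, 0.65, unequal))
```

With equally mutable genes this lands in the low single-digit percent range, and unequal gene sizes push the probability up, which is one reason a proper simulation with realistic gene sizes gives a higher figure than the naive calculation.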
The case of the guilty gene. But what should we expect if a de novo mutation in the patient is causing the disease?
This depends on what proportion of the cases is caused by dominant de novo mutations. I would expect that some proportion of the cases have a non-genetic cause: birth trauma, infection or something else. I also expect some cases to be caused by inherited mutations. In a recent publication, Tavyev Asher & Scaglia (5) describe twelve known Early Infantile Epileptic Encephalopathies, of which three are recessive (EIEE3, 10 and 12) and three others are X-linked and usually inherited (EIEE1, 8 and 9). Of course there may be a detection bias for recessive and X-linked genes, but still, inherited causes do exist. So, a proportion of the patients will have inherited or environmental factors causing the disorder. Could this be 30%? The next important question is how many different genes can cause dominant (de novo) forms of epileptic encephalopathy. So far at least seven are known in OMIM (one of them X-linked), but there could easily be as many as 40 in total, I think. Clearly, if there are 40 genes that may cause the disease and we sequence 40 patients, each patient may have a mutation in a different gene. If 30% of the patients have the disease because of other causes, 12 of those 40 patients may not even have a causative de novo mutation (though about half of these 12 may have a non-pathogenic de novo mutation).
A follow-up cohort. Suppose a hit is found in a particular gene and it is a true hit. How many additional patients should we sequence to find at least one second hit in the same gene with 90% probability? If we assume that 30% of the patients in the follow-up cohort have the disease for other or undetectable causes, the size of the required follow-up cohort can be read from the graph below. If, for example, there are 40 equally mutable genes involved, we would have to sequence about 130 additional cases and, if a mutation is found, their parents as well to establish whether the mutation is de novo. If only 10% of the patients have the disease because of other reasons, sequencing about 100 follow-up cases should be enough.
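The cohort sizes quoted here follow from a simple calculation, sketched below in Python. It assumes that every patient with a causal de novo mutation is equally likely to carry it in any of the candidate genes and that patients are independent; the function name and parameters are hypothetical.

```python
import math

def followup_cohort_size(n_genes, frac_other_causes, target_prob=0.9):
    """Smallest cohort giving at least target_prob of one or more hits.

    Assumption: a follow-up patient carries a causal de novo mutation in the
    gene of interest with probability (1 - frac_other_causes) / n_genes.
    """
    p_hit = (1.0 - frac_other_causes) / n_genes
    # Solve 1 - (1 - p_hit)^n >= target_prob for the smallest integer n.
    n = math.log(1.0 - target_prob) / math.log(1.0 - p_hit)
    return math.ceil(n)

print(followup_cohort_size(40, 0.30))  # 131
print(followup_cohort_size(40, 0.10))  # 102
```

These values agree with the rounded figures of about 130 and about 100 cases mentioned in the text.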
Finding double hits in guilty genes. Above, I argued that finding one innocent gene with a double hit just by chance in a discovery set of 50 patients would occur with a probability of ~5%. Guilty genes are much more likely to produce double hits, depending on the number of genes able to cause the disorder and on the proportion of other causes for the disease, as argued above. If our efficiency for finding existing de novo mutations is 100%, we can do some rough calculations to see how many double hits we can expect. In the figure below these are shown for the cases in which 10 and 40 genes are involved. One of many caveats is that these calculations assume that mutations causing the disorder are equally likely in each of those genes. Experience seems to indicate a different distribution: in patients with epileptic encephalopathy, mutations in some genes are found more often than in others. A further caveat is that the above numbers are based on rather rough calculations, not on sophisticated simulations. Still, I think having some indication of what can be expected, as sketched above, can guide our experiments and prevent rash conclusions.
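A rough calculation of this kind can be sketched in Python as follows. It uses the same simplifications named in the text (100% detection efficiency, equally mutable guilty genes) and treats the number of hits per gene as binomial; the function name and the example parameters are illustrative assumptions, not the exact calculation behind the figure.

```python
def expected_double_hit_genes(n_patients, n_genes, frac_other_causes):
    """Expected number of guilty genes hit in two or more patients.

    Assumptions: each patient carries a causal de novo mutation in a given
    gene with probability (1 - frac_other_causes) / n_genes, every such
    mutation is detected, and patients are independent.
    """
    p_hit = (1.0 - frac_other_causes) / n_genes
    p_zero = (1.0 - p_hit) ** n_patients
    p_one = n_patients * p_hit * (1.0 - p_hit) ** (n_patients - 1)
    # By linearity of expectation, sum P(two or more hits) over all genes.
    return n_genes * (1.0 - p_zero - p_one)

# Discovery set of 50 patients with 30% other causes, for 10 and 40 genes.
for n_genes in (10, 40):
    print(n_genes, expected_double_hit_genes(50, n_genes, 0.30))
```

Because the calculation assumes equal mutability, it will misjudge the number of double hits if a few genes account for most cases, which is what experience suggests.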