The river of genetic variants. The era of high-throughput sequencing has given us several unexpected insights into the human genome. One of these insights is the observation that mutations or variations can occur in parts of our genome without any major consequences. Every individual is a “knockout” for at least two genes in the human genome. This means that in every individual, there are at least two genes in which both copies are disrupted through mutations, small deletions or duplications. In addition, there are dozens, if not hundreds, of genes with disruptive mutations that affect only a single copy of the gene. Similar mutations in specific disease-associated genes, however, will invariably result in an early-onset genetic disorder. This comparison already shows that the genes in the human genome differ with respect to the amount of disruptive genetic variation they can tolerate. A recent study in PLOS Genetics now tries to catalogue the genes in the human genome by assessing their mutation intolerance based on the genetic variation seen in large-scale exome datasets. Many genes for neurodevelopmental disorders are highly intolerant to mutations. Furthermore, some genes for monogenic epilepsies show surprising results in this assessment.
The problem. High-throughput sequencing technologies such as exome sequencing generate a plethora of data, and several thousand rare genetic variants may factor into the analysis of each exome. Some study designs may significantly reduce the amount of resulting data; for example, in trio exome sequencing for de novo mutations, only mutations that are new in the child and not present in the parents are assessed. But even with these study designs, interpretation is difficult. Every individual has de novo mutations in the coding region of at least two genes. Identifying pathogenic variation for neurodevelopmental disorders such as epilepsy therefore requires additional information. This additional information may be the fact that the disrupted gene has been observed in similar cases before or that this gene represents a prime candidate based on functional considerations. Petrovski and collaborators now add a novel aspect to this story by classifying genes based on their mutation intolerance.
Mutation intolerance. The basic idea of the study by Petrovski and collaborators can be summarized in a single figure (see above). The authors assessed the existing variation in ~17,000 genes in the 6,500 exomes in the Exome Variant Server. Next, they compared the number of common variants that probably affect gene function with the total number of genetic variants per gene. For example, in the case of DYNC1H1, there are barely any functionally relevant variants compared to the overall genetic variation in this gene. In contrast, MUC17 has a high number of stop mutations, splice site mutations, etc. compared to the overall genetic variation in this gene. Based on this distribution, Petrovski and collaborators derived a Residual Variation Intolerance Score (RVIS). An RVIS < 0 means that a gene has fewer common functional mutations than expected; an RVIS > 0 indicates that a given gene has a comparatively high frequency of mutations that affect function. Based on this score, all genes in the human genome were ranked, and the results were both comforting and surprising.
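To make the idea of a residual-based score more concrete, here is a minimal sketch in Python. It is not the authors' code: it assumes a plain least-squares line fitted to per-gene counts, with the residual scaled by the residual standard deviation, and the gene names and counts are invented purely for illustration. The published score may differ in how variants are filtered and how the residuals are standardized.

```python
import numpy as np

def residual_intolerance_scores(genes, total_variants, common_functional):
    """Regress common functional variant counts on total variant counts
    and return one scaled residual per gene (negative = fewer than expected)."""
    x = np.asarray(total_variants, dtype=float)
    y = np.asarray(common_functional, dtype=float)
    # Least-squares line: expected common functional variants given total variants.
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    # Assumption: simple scaling by the residual standard deviation
    # (two fitted parameters, hence ddof=2).
    scaled = residuals / residuals.std(ddof=2)
    return dict(zip(genes, scaled))

# Hypothetical counts, not taken from the Exome Variant Server.
genes = ["gene_A", "gene_B", "gene_C", "gene_D", "gene_E", "gene_F"]
total_variants = [50, 100, 150, 200, 250, 300]
common_functional = [5, 10, 15, 2, 25, 45]

scores = residual_intolerance_scores(genes, total_variants, common_functional)
for gene, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{gene}: {score:+.2f}")
```

In this toy data set, gene_D has many variants overall but very few common functional ones, so it falls below the regression line and receives a negative score, which is the pattern described for DYNC1H1. gene_F sits above the line and receives a positive score, the pattern described for MUC17.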
Tolerant and intolerant genes. The ranking of all genes is provided as an Excel table, and browsing through this list starting at the bottom (with the most mutation-tolerant genes) made me stumble upon the usual suspects. MAGEC1, MUC16, HLA-A, PRAMEF2, CRIPAK, HLA-B, CMYA5, MUC5B and MUC17 are genes that we often encounter in exome studies, and these genes are now officially among the most mutation-tolerant genes in the human genome. Therefore, a role of these genes in disease is unlikely. On the other end, many genes for Mendelian disorders rank amongst the most mutation-intolerant genes. This is particularly true for early-onset neurodevelopmental disorders such as the epileptic encephalopathies. CDKL5, STXBP1, SPTAN1, SCN1A, KCNQ2, PCDH19, SCN2A, SCN8A, GRIN2A and KCNT1 are mutation intolerant and are amongst the top 10-15% of genes in the intolerance ranking. This indicates that the mutation intolerance scoring system devised by Petrovski and collaborators has the potential to pick out genes for severe neurodevelopmental disorders. Some aspects of this assessment have already been used in the studies of the Epi4K consortium on epileptic encephalopathies.
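The percentile ranking mentioned above can be sketched in a few lines. This is an illustrative assumption about how such a ranking could be built, not the procedure used for the published table: genes are sorted by score, the lowest (most intolerant) score gets the lowest percentile, and ties are not handled specially. The scores and gene names below are hypothetical.

```python
import numpy as np

def intolerance_percentiles(scores):
    """Convert per-gene scores to percentiles (low percentile = most intolerant)."""
    genes = list(scores)
    values = np.array([scores[gene] for gene in genes], dtype=float)
    # Double argsort gives each gene its 0-based rank in ascending score order.
    ranks = values.argsort().argsort()
    percentiles = 100.0 * (ranks + 1) / len(values)
    return dict(zip(genes, percentiles))

# Hypothetical intolerance scores for five made-up genes.
example_scores = {
    "gene_A": -2.1,
    "gene_B": -0.4,
    "gene_C": 0.0,
    "gene_D": 0.9,
    "gene_E": 3.5,
}

percentiles = intolerance_percentiles(example_scores)
# Flag candidates in the most intolerant 20% of this toy list.
top_candidates = [gene for gene, pct in percentiles.items() if pct <= 20.0]
print(percentiles)
print(top_candidates)
```

With a genome-wide table, the same lookup would tell you whether a candidate gene sits in the intolerant 10-15% or at the tolerant end of the ranking.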
Surprising findings. Some genes for human epilepsies are amongst the genes with the highest (not lowest) mutation tolerance scores. These genes include SCN1B (83rd percentile), EFHC1 (93rd percentile) and GPR98 (99.9th percentile). GPR98, also known as the MASS1 or FEB4 gene, is both a very large gene and very mutation tolerant. In light of the findings by Petrovski and collaborators, the actual role of some of these genes in monogenic epilepsies may need to be revisited. TTN is another gene that is highly mutation tolerant. TTN, which codes for titin and is the largest gene in the human genome with 363 exons, is frequently encountered in exome studies. The fact that this gene is extremely large and mutation tolerant may contribute to this phenomenon.
Sneak peek. As a brief disclaimer, I have only focused on particular aspects of the study by Petrovski and collaborators, which also extends into comparisons with other disorders, functional scores and evolutionary aspects. However, I felt that the table ranking the genes in the human genome by their mutation tolerance is the centerpiece of this study. This table is extremely useful in the assessment of de novo mutations or autosomal recessive mutations derived from exome data. We already know many of the common suspects such as the MUC genes that repeatedly pop up in exome studies. However, finding that your top candidate is in good company amongst the top 10-15% of mutation-intolerant genes is reassuring.