The Pareto Principle versus the Long Tail

80/20. In every scientist’s life there is a point when someone points out that you should not waste your time and that you should work more efficiently. If that someone, be it your boss, supervisor or a close friend with a superior track record, is inclined to resort to management language, you might hear about the Pareto Principle or the Eisenhower matrix. Join me for a brief motivational blog post that your boss probably doesn’t want you to read, telling you why it is good to keep doing what you are doing. Continue reading

The exome fallacy

Are you fully covered? My experience with a phenomenon I shall call the exome fallacy began in 2011. While browsing the exomes of a few patients with epileptic encephalopathies, we wanted to check quickly whether we could exclude mutations in the epilepsy gene SCN1A in our patients using exome data. As some of you might already guess, the words “exome” and “exclude” don’t go well together, and we learned the hard way that while each individual exome covers certain parts of the gene quite well, if you expect your exome data to cover an entire gene sufficiently in several individuals, you end up disappointed. But there is even more to the exome fallacy than you might think… Continue reading
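The coverage problem can be made concrete with a toy sketch. All depth values below are invented for illustration; real per-base depths would come from something like a BAM pileup. Each individual exome covers a respectable fraction of the gene, but the set of positions covered adequately in *every* individual, which is what you would need to “exclude” a mutation across the cohort, is far smaller.

```python
# Toy sketch of the exome fallacy: per-base read depths for a hypothetical
# 8-bp stretch of a gene in three exomes. All numbers are invented.
MIN_DEPTH = 20  # common threshold for confident variant calling

exomes = {
    "patient_1": [35, 40, 12, 0, 50, 45, 8, 30],
    "patient_2": [22, 5, 30, 28, 0, 60, 25, 21],
    "patient_3": [18, 25, 33, 40, 41, 3, 27, 29],
}

def covered(depths, min_depth=MIN_DEPTH):
    """Fraction of positions with depth >= min_depth."""
    return sum(d >= min_depth for d in depths) / len(depths)

for name, depths in exomes.items():
    print(f"{name}: {covered(depths):.3f} of the gene at >= {MIN_DEPTH}x")

# Positions adequately covered in *every* exome -- the only positions at
# which a mutation could be excluded for the whole cohort:
jointly = [all(depths[i] >= MIN_DEPTH for depths in exomes.values())
           for i in range(8)]
print(f"jointly covered: {sum(jointly)}/8 positions")
```

Each exome individually covers roughly 60% of the toy gene, yet only a single position is covered in all three, which is exactly why “exome” and “exclude” don’t mix.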

Will the relevant SNPs please stand up

The flood of variants. Every re-sequencing of a genome yields far more variants than can be validated with functional assays, and many strategies exist to select candidate variants. Filtering on hard criteria risks removing all variants, including the causal one, so efforts focus instead on re-ranking the list of variants so that the most promising appear on top. A recent review in Nature Reviews Genetics aims to give users a hand with the available bioinformatics tools. As a bioinformatician, I find a number of important points missing.

Continue reading
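The filter-versus-rank distinction can be sketched with a toy example. The variant records, allele frequencies and scores below are all invented: a hard frequency filter silently discards a slightly-too-common causal variant, while re-ranking keeps everything and merely sorts the best candidates to the top.

```python
# Toy variant list: (gene, allele_frequency, deleteriousness_score).
# All values are invented for illustration.
variants = [
    ("GENE_A", 0.002, 0.95),
    ("GENE_B", 0.015, 0.99),   # the causal variant in this toy example
    ("GENE_C", 0.0001, 0.40),
    ("GENE_D", 0.300, 0.10),
]

# Hard filter: keep only very rare variants (AF < 1%). GENE_B is lost.
filtered = [v for v in variants if v[1] < 0.01]

# Re-ranking: keep everything, sort by a simple score combining rarity
# and predicted deleteriousness, so good candidates float to the top.
ranked = sorted(variants, key=lambda v: v[2] * (1 - v[1]), reverse=True)

print([v[0] for v in filtered])   # GENE_B is gone
print([v[0] for v in ranked])     # GENE_B is ranked first
```

The scoring function here is a deliberately naive stand-in; real tools combine conservation, predicted protein impact, inheritance patterns and population frequencies, but the structural point is the same: ranking degrades gracefully where filtering fails absolutely.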

Be literate when the exome goes clinical

Exomes on Twitter. Two different trains of thought eventually prompted me to write this post. First, a report of a father identifying the mutation responsible for his son’s disease pretty much dominated the exome-related twittersphere. In Hunting down my son’s killer, Matt Might describes his family’s journey that finally led to the identification of the gene coding for N-Glycanase 1 as the cause of his son’s disease, West Syndrome with associated features such as liver problems. The exome sequencing that finally led to the discovery was part of a larger program on identifying the genetic basis of unknown, putatively genetic disorders, reported in a paper by Anna Need and colleagues that is available through open access. This paper is an interesting proof-of-principle study showing that exome sequencing is ready for prime time; Need and colleagues suggest exome sequencing can find causal mutations in up to 50% of patients. Incidentally, a gene that turned up again was SCN2A, in a patient with severe intellectual disability, developmental delay, infantile spasms, hypotonia and minor dysmorphisms. This represents a novel SCN2A-related phenotype, expanding the spectrum to severe epileptic encephalopathies.

The exome consult. My second experience last week was my first “exome consult”. A colleague asked me to look at a patient’s gene list to see whether any of the genes identified (there were 300+) might be related to the patient’s epilepsy phenotype. Since I wasn’t sure how best to handle this, I ran an automated PubMed search for combinations of 20 search terms with a small R script I wrote. Nothing really convincing came up, except the realisation that this will be an issue we will increasingly be faced with in the future: working our way through exome datasets after the first “flush” of data analysis fails to reveal convincing results. Two terms that came to mind were bioinformatic literacy, as something we need to improve, and Program or be Programmed, a book by Douglas Rushkoff on the “Ten Commands of the Digital Age”. In his book, he basically points out that in the future, understanding rather than simply using IT will be crucial.
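The original script was in R, but the idea can be sketched in a few lines of Python: pair every gene from the list with every phenotype search term and build a PubMed E-utilities query for each combination. The gene and term lists below are short illustrative stand-ins; the real consult involved 300+ genes and about 20 terms.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical inputs: a few genes from an exome gene list plus a few
# phenotype terms (the real lists were much longer).
genes = ["SCN1A", "SCN2A", "NGLY1"]
terms = ["epilepsy", "epileptic encephalopathy", "infantile spasms"]

# NCBI E-utilities esearch endpoint for PubMed.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_query_url(gene, term):
    """Build an esearch URL for '<gene> AND <term>' against PubMed."""
    params = {"db": "pubmed", "term": f"{gene} AND {term}"}
    return f"{BASE}?{urlencode(params)}"

# One query per (gene, term) combination -- 300 genes x 20 terms would
# already mean 6,000 queries, which is why this needs automation.
urls = [pubmed_query_url(g, t) for g, t in product(genes, terms)]
print(len(urls), "queries to run")
print(urls[0])
```

Fetching each URL (politely rate-limited, per NCBI’s usage guidelines) and parsing the hit counts would then give a rough literature-support ranking of the gene list, which is essentially what the R script did.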

The cost of interpretation is rising. The Genome Center in Nijmegen suggests on its homepage that by the year 2020, whole-genome sequencing will be a standard tool in medical research. What this webpage does not say is that by 2020, 95% of the effort will go not into the technical aspects of data generation, but into data interpretation. For the biotechnology industry, interpretation will be the largest market.

By 2020, probably more than 10 million genomes will have been sequenced. Data interpretation rather than data generation will represent the most pressing issue.

So, what about epilepsy? “50% of cases to be identified” sounds good for any grant proposal I would write, but this might be a clear overestimate. Need and colleagues used a highly selected patient population, and even among the variants they identified, causality is sometimes difficult to assess. We are perhaps much further away from clinical exome sequencing in the epilepsies than we would like to admit. The only reference points we have for seizure disorders to date are the large datasets for patients with autism and intellectual disability. While some genes with overlapping phenotypes can be identified, we would virtually be drowning in exome data without being capable of making sense of it.

10,000 exomes now. I would like to predict that after the low-hanging fruit of monogenic disorders has been picked, 10,000 or more “epilepsy exomes” will have to be collected before we make significant progress. It is therefore crucial not to be tempted by the wishful thinking that particular epilepsy subtypes, such as epileptic encephalopathies or other severe epilepsies, necessarily have to be monogenic. Much of the genetic architecture of the epilepsies might be more complex than anticipated, requiring larger cohorts and unanticipated perseverance.