How to detect de novo mutations in exome data

Taking things apart. Looking for de novo variants using trio exome sequencing is a powerful technique to identify disease-related genes. After having introduced samtools during the last post, this will be post 2/3 in a series on how to perform an analysis of exome data for de novo variants. This time, I would like to take apart the methods that take us from Gigabyte BAM files to small tables with likely variants. So buckle up.

The variant files. In many instances, the data you receive from your genome center comes as a variant file (VCF file), which lists all positions where a particular calling algorithm found a difference from the human reference genome. Methods for variant calling include samtools itself, but also GATK and Dindel. GATK and samtools both call SNPs and indels, while Dindel only identifies indels. For an average exome, this variant file will be 80K-90K rows in length, which is the number of variants identified in a single exome. Variant files might be a good opportunity to have a first look at the data after rigorous filtering and annotation. For dominant and recessive disorders, variant files are the primary starting point. For de novo mutations, however, they might be misleading.

Don’t go there. It is tempting to think that variant files might be a good way to look for de novo variants. However, experience suggests otherwise. Let’s assume that we align the parental exomes with the proband exome within a trio. We then search the data for variants present in the child that are not present in the parents. It won’t take five minutes until we get frustrated and bored. Hundreds of variants are likely to show up, virtually all of them false positives. We will find old acquaintances like the MUC genes and many genes in segmental duplications. Even stringent filtering will not get rid of all these artefacts. The real de novo mutations will be hidden in there, drowning in genomic noise. What has happened?
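To make the naive approach concrete, here is a minimal sketch of "child minus parents" on called variants. The function names and the toy records are invented for illustration, and the parsing only reads the first five VCF columns; it is not how any particular pipeline does it.

```python
def variant_keys(vcf_text):
    """Return the set of (chrom, pos, ref, alt) keys found in VCF-formatted text."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # one key per alternate allele
            keys.add((chrom, int(pos), ref, allele))
    return keys


def naive_de_novo(child_vcf, mother_vcf, father_vcf):
    """Variants called in the child but in neither parent (called genotypes only)."""
    return variant_keys(child_vcf) - variant_keys(mother_vcf) - variant_keys(father_vcf)


# Toy records, invented for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\n"
child = header + "chr1\t100\t.\tA\tT\nchr2\t200\t.\tG\tC\n"
mother = header + "chr1\t100\t.\tA\tT\n"
father = header + "chr3\t300\t.\tC\tG\n"

print(naive_de_novo(child, mother, father))
```

Note that the comparison only sees the final calls. Whatever read evidence existed in the parents at a site that was not called is invisible here, which is exactly where the false positives come from.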

The weather forecast. Each time the weatherman tells you that it will rain today, he does not refer to an ultimate truth, but to a certain probability. The same is true for variant calling algorithms. Next generation sequencing technologies usually cover each position in the exome with a certain number of reads, i.e. a given position has been “read” several times during the sequencing process. In order to call a particular position a reference base or variant, the calling algorithm must make a choice. If, in one parent, 55 reads suggest “A”, whereas only 5 reads suggest “T”, the position will be called as “A”. Now suppose a similar situation in the other parent. In the child, however, 10 reads are “T”, so the position in the child will be called heterozygous and, when aligning the variant files, will appear as a de novo mutation. In reality, however, we might be dealing with either a heterozygous variant in all three individuals or no variant at all, but simply a slight technical imprecision. These artefacts are what we select for when we simply base our de novo calling on aligning variant files. At first glance, coincidences like this may appear odd. However, we shouldn’t forget that we’re working with “omics” data, which is short for “every artifact you can possibly imagine”.
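The example can be played through with a deliberately crude caller. The sketch below uses a hard cut-off on the fraction of non-reference reads; the 15% threshold, the function names, and the child's 50 reference reads are assumptions made for illustration (the text only gives the child's 10 “T” reads), and real callers are considerably more sophisticated.

```python
def call_genotype(ref_reads, alt_reads, min_alt_fraction=0.15):
    """Toy genotype call based only on the fraction of non-reference reads."""
    total = ref_reads + alt_reads
    if total == 0:
        return "no_call"
    alt_fraction = alt_reads / total
    if alt_fraction < min_alt_fraction:
        return "hom_ref"
    if alt_fraction > 1 - min_alt_fraction:
        return "hom_alt"
    return "het"


def looks_de_novo(child_call, mother_call, father_call):
    """What comparing variant files effectively asks: het child, reference parents."""
    return child_call == "het" and mother_call == "hom_ref" and father_call == "hom_ref"


mother = call_genotype(55, 5)    # 8% "T" reads, below the cut-off
father = call_genotype(55, 5)    # same situation in the other parent
child = call_genotype(50, 10)    # 17% "T" reads, just above the cut-off

print(mother, father, child, looks_de_novo(child, mother, father))
```

A difference of five reads flips the call in the child, and the trio comparison reports a de novo variant even though all three samples carry the same kind of weak “T” signal.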

Bayesian frameworks. One possibility to improve de novo calling is an analysis that looks at the reads individually. DeNovoGear, the algorithm we use for analyzing the EuroEPINOMICS trios, takes advantage of an algorithm called FIGL. Without going into too much technical detail (that I don’t fully understand, anyway), FIGL uses a Bayesian analysis. This type of analysis stipulates a prior probability that is sequentially adjusted as new data come in. Using the weatherman example again, we expect rain when the weather forecast predicts it. However, if it doesn’t rain in the morning, by midday or by the afternoon, our suspicion increases that the weatherman might have made a mistake and that it will not rain at all. Using Bayesian language, we constantly generate new posterior probabilities by adding further data. The same can be done for de novo variants and the reads at a specific genomic site. Imagine that all reads at position X in the proband or the parents are put into a big box and are drawn one after the other. Initially, before seeing any read, we have a very low probability that position X might be mutated in the proband. More than 99.9% of the exome is reference sequence, and if we find a variant, it is much more likely to be a SNP inherited from the parents than a de novo variant. Now let’s follow the sequence of our experiment through. This, of course, is a simplification of the models that are used in the algorithms.

Reads, one by one. First, I would like to cheat a little and come back to the above example in a second. Let’s assume for now that we have already drawn all the parental reads, all of which show the reference base. If ~20 reads indicate the same reference genotype, the probability that the parents carry anything but the reference genotype is very small. We only have to decide between two alternatives, reference or de novo mutation, based on the reads of the child that are still in the box. The prior probability of a mutation at any site in the human genome is very low (~1 x 10^-8). We start drawing the reads and first draw a “T”. If the real situation is a mutation, this would be the expected outcome and is very likely. If we assume that the underlying genotype is reference, this can only be due to error. Once we put numbers on these conditional probabilities, they modify our estimate: the posterior probability, adjusted for the data coming out of this read, is slightly higher for a de novo mutation at this site. Now we draw another few “Ts” and we have to further adjust the probability. Following this lucky run, however, we draw four “As” in a row. “A” is the reference base, very likely to be seen if the real situation is reference, but also possible, albeit to a lesser extent, if we have a heterozygous de novo mutation. Our probability now needs to be adjusted backwards, decreasing the possibility of a de novo variant again. Now let’s fast-forward and assume that half of all reads turned out to be “T” in a total of 40 reads. This modifies the probability of a de novo mutation to such an extent that it is now much more likely than the reference sequence. Even though this hypothesis had a difficult uphill battle to fight given the minuscule prior probability, the data made us adjust the probability.
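The read-by-read updating can be sketched in a few lines. This is a toy model under stated assumptions, not DeNovoGear's actual implementation: it compares only two hypotheses (reference vs. heterozygous mutation in the child), assumes a per-read error rate of 0.5%, and treats each read as independent. The function name and the order of the reads are made up for illustration. Working in log-odds is equivalent to turning each posterior into the prior for the next read.

```python
import math


def posterior_trajectory(reads, prior=1e-8, ref_base="A", alt_base="T", error_rate=0.005):
    """Posterior probability of a heterozygous mutation after each read in turn."""
    log_odds = math.log10(prior / (1 - prior))
    trajectory = [prior]
    for base in reads:
        if base == alt_base:
            # likely under a het mutation (0.5), only an error under reference
            likelihood_ratio = 0.5 / error_rate
        elif base == ref_base:
            # expected under reference, still possible under a het mutation
            likelihood_ratio = 0.5 / (1 - error_rate)
        else:
            likelihood_ratio = 1.0  # other bases treated as uninformative here
        log_odds += math.log10(likelihood_ratio)
        trajectory.append(1 / (1 + 10 ** (-log_odds)))
    return trajectory


# Three "T"s, then four "A"s in a row, then the rest: 20 "T" and 20 "A" in total.
reads = "TTTAAAA" + "TA" * 16 + "T"
trajectory = posterior_trajectory(reads)

print(trajectory[1] / trajectory[0])  # change after the first "T"
print(trajectory[3], trajectory[7])   # after the run of "T"s vs. after the four "A"s
print(trajectory[-1])                 # after all 40 reads
```

With these assumed numbers, a single “T” raises the probability of a mutation by roughly two orders of magnitude, the four “A”s pull it back down, and the full set of 40 reads overwhelms the tiny prior.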

The first round of a Bayesian analysis. Prior to any data, the probability of a de novo variant is very low (~1 x 10^-8). If a read in the proband suggests a mutation rather than the reference sequence, this modifies the probability. In fact, the posterior probability of a mutation is now about 100x higher than before. If multiple rounds with similar results are performed, the posterior probability of the reference sequence decreases (red), while the probability of an underlying mutation increases (blue). In each case, the posterior probability becomes the adjusted prior for the next round.

Now, the parents. But how can a Bayesian analysis help prevent the situation described in the earlier example? If a parental read turns out to be “T” rather than “A”, this will make us adjust towards a possible transmitted SNP rather than a de novo variant or reference sequence. In fact, the number of reads mentioned above would prevent the posterior probability of a de novo mutation from reaching any range that would be noticed. We do not have to care whether the position is in fact a SNP; all we care about is the extent to which our probabilities are modified. By taking into account the single reads rather than called genotypes, this information can be integrated into our decision.
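To see how parental reads enter the picture, here is a small sketch that scores four competing explanations for a trio using the raw read counts. It is a simplification under stated assumptions and not DeNovoGear's model: the error rate, the prior for an inherited variant, the restriction to reference and heterozygous genotypes, and all names are assumptions for illustration. Only the de novo prior of ~1 x 10^-8 comes from the text, and the child's 50 reference reads are again assumed.

```python
import math

ERROR_RATE = 0.005       # assumed per-read error rate
PRIOR_DE_NOVO = 1e-8     # prior quoted in the post
PRIOR_INHERITED = 1e-3   # assumed: a given parent is heterozygous and transmits the allele


def log_likelihood(counts, genotype):
    """Log-likelihood of (ref_reads, alt_reads) under genotype 'ref' or 'het'."""
    ref_reads, alt_reads = counts
    if genotype == "ref":
        return ref_reads * math.log(1 - ERROR_RATE) + alt_reads * math.log(ERROR_RATE)
    return (ref_reads + alt_reads) * math.log(0.5)


def trio_posteriors(child, mother, father):
    """Posterior probability of each explanation, given read counts per sample."""
    prior_reference = 1 - PRIOR_DE_NOVO - 2 * PRIOR_INHERITED
    # explanation: (prior, child genotype, mother genotype, father genotype)
    hypotheses = {
        "all_reference": (prior_reference, "ref", "ref", "ref"),
        "de_novo": (PRIOR_DE_NOVO, "het", "ref", "ref"),
        "inherited_from_mother": (PRIOR_INHERITED, "het", "het", "ref"),
        "inherited_from_father": (PRIOR_INHERITED, "het", "ref", "het"),
    }
    log_joint = {}
    for name, (prior, g_child, g_mother, g_father) in hypotheses.items():
        log_joint[name] = (math.log(prior)
                           + log_likelihood(child, g_child)
                           + log_likelihood(mother, g_mother)
                           + log_likelihood(father, g_father))
    top = max(log_joint.values())  # subtract the maximum before exponentiating
    weights = {name: math.exp(value - top) for name, value in log_joint.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# The weather-forecast example: 5 "T" reads in each parent, 10 in the child.
noisy = trio_posteriors(child=(50, 10), mother=(55, 5), father=(55, 5))
# A clean site: no "T" reads in the parents, half of the child's reads are "T".
clean = trio_posteriors(child=(20, 20), mother=(40, 0), father=(40, 0))

print(noisy["de_novo"], clean["de_novo"])
```

Under these assumptions, the noisy site never gets a de novo posterior worth noticing, while the clean site does. The absolute numbers depend heavily on the assumed error rate and priors, so they should be read as an illustration of the principle rather than as calibrated probabilities.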

Bayes & EuroEPINOMICS. DeNovoGear is so far unbeaten, performing much better than the other algorithms and techniques that we have used to find de novo variants. Up to now, we have not identified any de novo variant that was missed by this program. Naturally, this algorithm has difficulties at sites where the coverage is low, as our priors can only be adjusted a few times. Either way, there are still a number of false positive findings, which suggests that algorithms for de novo calling may still be developed further.

  1. A brief update. In an earlier version of this post, I suggested that samtools only calls SNPs. In fact, samtools works for both SNPs and indels and I have changed this accordingly.
