Taking things apart. Looking for de novo variants using trio exome sequencing is a powerful technique to identify disease-related genes. After introducing samtools in the last post, this is post 2/3 in a series on how to analyze exome data for de novo variants. This time, I would like to take apart the methods that take us from gigabyte-sized BAM files to small tables of likely variants. So buckle up.
The variant files. In many instances, the data you receive from your genome center comes in the format of a variant file (VCF file), listing all the variants that were called using a particular algorithm, i.e. positions that differ from the human reference genome. Methods for variant calling include samtools itself, but also GATK and Dindel. GATK and samtools both call SNPs and indels, while Dindel only identifies indels. For an average exome, this variant file will be 80K-90K rows long, one row per variant identified in a single exome. After rigorous filtering and annotation, variant files might be a good opportunity to have a first look at the data. For dominant and recessive disorders, variant files are the primary starting point. For de novo mutations, however, they might be misleading.
Don’t go there. It is tempting to think that variant files might be a good place to look for de novo variants. However, experience suggests otherwise. Let’s assume that we compare the parental exomes with the proband exome within a trio and search the data for variants present in the child that are not present in the parents. It won’t take five minutes until we get frustrated and bored: hundreds of variants are likely to show up, virtually all of them false positives. We will find old acquaintances like the MUC genes, many genes in segmental duplications, and so on. Even stringent filtering will not get rid of all these artefacts, and the real de novo mutations will be hidden in there, drowning in genomic noise. What has happened?
The weather forecast. Each time the weatherman tells you that it will rain today, he does not refer to an ultimate truth, but to a certain probability. The same is true for variant calling algorithms. Next generation sequencing technologies usually cover each position in the exome with a certain number of reads, i.e. a given position has been “read” several times during the sequencing process. In order to call a particular position reference or variant, the calling algorithm must make a choice. If, in one parent, 55 reads suggest “A” whereas only 5 reads suggest “T”, the position will be called “A”. Now assume a similar situation in the other parent. In the child, however, 10 reads show “T”, and the position will be called heterozygous, which, when comparing the variant files, appears as a de novo mutation. In reality, we might be dealing with either a heterozygous variant in all three individuals or with no variant at all, just a slight technical imprecision. These artefacts are exactly what we select for when we base our de novo calling on a simple comparison of variant files. At first glance, strange coincidences like this may appear odd. However, we shouldn’t forget that we’re working with “omics” data, which is short for “every artifact you can possibly imagine”.
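To make this failure mode concrete, here is a minimal Python sketch of naive, per-sample genotype calling with a hard threshold. The function name, thresholds, and read counts are made up for illustration and are not taken from any real caller:

```python
# Hypothetical sketch of naive threshold-based genotype calling,
# applied to each sample of a trio independently.
# All numbers are illustrative, not from any real pipeline.

def call_genotype(ref_reads, alt_reads, het_threshold=0.2):
    """Call hom-ref, het, or hom-alt from raw read counts."""
    frac = alt_reads / (ref_reads + alt_reads)
    if frac < het_threshold:
        return "A/A"   # called reference
    elif frac < 0.8:
        return "A/T"   # called heterozygous
    return "T/T"

# Father: 55 reference reads, 5 alternate reads -> called reference
father = call_genotype(55, 5)
# Mother: a similar borderline picture -> also called reference
mother = call_genotype(48, 6)
# Child: 10 of 40 reads show "T" -> just over threshold, called het
child = call_genotype(30, 10)

print(father, mother, child)  # A/A A/A A/T
# Comparing the three variant files, the child's call looks de novo,
# although all three samples may carry the same borderline signal.
```

The hard cutoff throws away exactly the information that matters: how close each sample was to the threshold.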
Bayesian frameworks. One possibility to improve de novo calling is an analysis that looks at the reads individually. DeNovoGear, the algorithm we use for analyzing the EuroEPINOMICS trios, builds on an algorithm called FIGL. Without going into too much technical detail (which I don’t fully understand, anyway), FIGL uses a Bayesian analysis. This type of analysis stipulates a prior probability that is sequentially adjusted as new data comes in. Using the weatherman example again, we expect rain when the weather forecast predicts it. However, if it doesn’t rain in the morning, by midday or by the afternoon, our suspicion grows that the weatherman might have made a mistake and that it’s not going to rain at all. In Bayesian language, we constantly generate new posterior probabilities by adding further data. The same can be done for de novo variants and the reads at a specific genomic site. Imagine that all reads at position X in the proband and the parents are put into a big box and drawn one after the other. Initially, before seeing any read, we assign a very low probability to position X being mutated in the proband: >99.9% of the exome is reference sequence, and if we find a variant, it is much more likely to be a SNP inherited from the parents than a de novo variant. Now let’s follow the sequence of our experiment through. This, of course, is a simplification of the models that are used in the actual algorithms.
Reads, one by one. First, I would like to cheat a little and come back to the above example in a second. Let’s assume for now that we have already drawn all the parental reads, all of which show the reference base. If ~20 reads indicate the same reference genotype, the probability that the parents carry anything but the reference genotype is very small. We only have to decide between two alternatives based on the reads of the child that are still in the box. The prior probability of a mutation at any given site in the human genome is very low (~10^-8). We start drawing the reads and first draw a “T”. If the real situation is a mutation, this would be the expected outcome and is very likely. If we assume that the underlying genotype is reference, this read might be due to error. Once we put numbers on these conditional probabilities, they modify our estimate: the posterior probability, adjusted for the data coming out of this read, is slightly higher for a de novo mutation occurring at this site. Now we draw another few “T”s and have to adjust the probability further. Following this lucky run, however, we draw four “A”s in a row. “A” is the reference base, very likely to be seen if the real situation is reference, but also possible, albeit to a lesser extent, if we have a heterozygous de novo mutation. Our probability now needs to be adjusted backwards, decreasing the possibility of a de novo variant again. Now let’s fast-forward and assume that half of all reads turned out to be “T” in a total of 40 reads. This modifies the probability of a de novo mutation to such an extent that it is now much more likely than the reference sequence. Even though this hypothesis faced an uphill battle given the minuscule prior probability, the data made us adjust the probability.
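The drawing-from-the-box procedure can be sketched in a few lines of Python. This toy model assumes only two hypotheses at the site, homozygous reference (“A/A”) and a heterozygous de novo mutation (“A/T”); the prior of 10^-8 and the 1% error rate are illustrative assumptions, not the values DeNovoGear actually uses:

```python
# Toy read-by-read Bayesian update for a single genomic site.
# Two hypotheses only; prior and error rate are illustrative.

PRIOR_DENOVO = 1e-8  # per-site de novo mutation probability
ERROR_RATE = 0.01    # chance a single read shows the wrong base

def read_likelihood(base, genotype):
    """P(observed base | true genotype) for one read."""
    if genotype == "A/A":
        return 1 - ERROR_RATE if base == "A" else ERROR_RATE
    return 0.5  # heterozygote: either allele is sequenced equally often

def posterior_denovo(reads):
    """Draw reads from the box one by one and update P(de novo)."""
    p = PRIOR_DENOVO
    for base in reads:
        num = p * read_likelihood(base, "A/T")
        den = num + (1 - p) * read_likelihood(base, "A/A")
        p = num / den
    return p

# Each "T" nudges the probability up, each "A" pushes it back down.
# 20 "T"s among 40 reads overwhelm the tiny prior, and the posterior
# ends up very close to 1:
print(posterior_denovo(["T"] * 20 + ["A"] * 20))
```

Real implementations work in log space and over many more genotype combinations, but the mechanics of the sequential update are the same.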
Now, the parents. But how can a Bayesian analysis help prevent the situation described in the earlier example? If a parental read turns out to be “T” rather than “A”, this will make us adjust towards a possible transmitted SNP rather than a de novo variant or reference sequence. In fact, the number of parental “T” reads mentioned above would prevent the posterior probability of a de novo mutation from reaching any range that would be noticed. We do not have to decide whether the position is in fact a SNP; all we care about is to what extent our probabilities are modified. By taking into account the single reads rather than the called genotypes, this information can be integrated into our decision.
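Extending the toy model to the whole trio shows this effect. Here the reads of all three individuals are weighed against three competing hypotheses; the priors (~10^-3 for an inherited SNP, ~10^-8 for a de novo event), the error rate, and the restriction to paternal transmission are all simplifying assumptions for illustration, and real models consider far more genotype combinations:

```python
# Toy trio model: weigh three explanations for the observed reads.
# Priors, error rate, and hypothesis set are illustrative only.

ERROR = 0.01  # per-read sequencing error

def sample_likelihood(reads, genotype):
    """P(all reads | genotype) for one individual, reads independent."""
    p = 1.0
    for base in reads:
        if genotype == "ref":
            p *= (1 - ERROR) if base == "A" else ERROR
        else:  # heterozygous: either allele is sampled equally often
            p *= 0.5
    return p

def trio_posterior(father, mother, child):
    # (father, mother, child) genotypes and illustrative priors;
    # "inherited" is simplified to paternal transmission only
    hypotheses = {
        "reference": (("ref", "ref", "ref"), 1 - 1e-3 - 1e-8),
        "inherited": (("het", "ref", "het"), 1e-3),
        "de novo":   (("ref", "ref", "het"), 1e-8),
    }
    weights = {}
    for name, ((f, m, c), prior) in hypotheses.items():
        weights[name] = (prior
                         * sample_likelihood(father, f)
                         * sample_likelihood(mother, m)
                         * sample_likelihood(child, c))
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# The child looks heterozygous either way. A handful of "T" reads in
# the father lets the transmitted-SNP hypothesis absorb the evidence:
noisy = trio_posterior(["A"] * 32 + ["T"] * 8,
                       ["A"] * 40,
                       ["A"] * 20 + ["T"] * 20)
# With clean parental reads, the de novo hypothesis wins instead:
clean = trio_posterior(["A"] * 40,
                       ["A"] * 40,
                       ["A"] * 20 + ["T"] * 20)
```

Note that we never call a parental genotype: the parental reads simply shift probability mass between the hypotheses.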
Bayes & EuroEPINOMICS. So far, DeNovoGear is unbeaten, performing much better than the other algorithms and techniques we have used to find de novo variants. Up to now, it has not missed a single de novo variant that we identified by other means. Naturally, the algorithm has difficulties at sites where coverage is low, as our priors can only be adjusted a few times. Either way, there is still a number of false positive findings, which suggests that algorithms for de novo calling can still be developed further.