The flood of variants. Every re-sequencing of a genome yields many more variants than can be validated with functional assays. Many strategies exist for selecting candidate variants. Filtering on hard criteria might remove all variants, so efforts focus instead on re-ranking the list so that the most promising variants appear on top. A recent review in Nature Reviews Genetics aims to give users a hand with the available bioinformatics tools. As a bioinformatician, I find a number of important points missing.
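To make the difference between filtering and re-ranking concrete, here is a minimal sketch in Python. The variants, the feature names (`allele_freq`, `conservation`, `in_coding_region`), the thresholds and the scoring formula are all made up for illustration and are not taken from the review or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    allele_freq: float       # population allele frequency (hypothetical values)
    conservation: float      # conservation score scaled to 0..1 (hypothetical)
    in_coding_region: bool


def hard_filter(variants):
    """Keep only variants that pass every cut-off (thresholds are assumptions)."""
    return [
        v for v in variants
        if v.allele_freq < 0.01 and v.conservation > 0.9 and v.in_coding_region
    ]


def score(v):
    """Toy additive score: rarer, more conserved, coding variants score higher."""
    rarity = 1.0 - min(v.allele_freq / 0.05, 1.0)
    return rarity + v.conservation + (1.0 if v.in_coding_region else 0.0)


def rank(variants):
    """Re-rank instead of filtering: nothing is discarded, only reordered."""
    return sorted(variants, key=score, reverse=True)


variants = [
    Variant("var_a", 0.020, 0.95, True),
    Variant("var_b", 0.005, 0.80, True),
    Variant("var_c", 0.001, 0.95, False),
]

print("filtered:", [v.name for v in hard_filter(variants)])
print("ranked:  ", [v.name for v in rank(variants)])
```

In this toy set every variant fails exactly one cut-off, so the filter returns nothing, while the ranking still tells you where to start.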
The review is a light read and rather introductory material for scientists who have had little contact with systems biology, as it explains what machine learning is. I am trying to fight the knee-jerk reaction to say that no one should use these tools in a professional setting if machine learning needs to be explained to them in a little sidebar. If you use such tools to follow up with experiments, which cost time and money, you had better ask for a second opinion. And a third.
But trying several tools – as the authors suggest – and using the one that fits feels like bad advice. There’s a better way: sit down and write a specification of how you would rank SNPs. Then compare your list of features to the available sites and use the ones that implement your ideas. The review advertises a collection of databases by its authors, which I do indeed find useful and well maintained.
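Comparing a specification against the available tools can be as simple as a set comparison. The sketch below assumes you have written down the features you want to rank by and have read, from each tool's paper, which features it actually uses. The tool names and feature names are placeholders, not real products.

```python
# Features from your own written specification (hypothetical names).
required = {"conservation", "allele_frequency", "protein_domain"}

# Features each tool documents in its paper (placeholder tools and features).
tools = {
    "tool_x": {"conservation", "allele_frequency"},
    "tool_y": {"conservation", "allele_frequency", "protein_domain", "splice_site"},
    "tool_z": {"splice_site"},
}


def coverage(required, tools):
    """Return (tool, fraction of required features covered, missing features),
    best-covering tool first."""
    if not required:
        raise ValueError("the specification must list at least one feature")
    report = []
    for name, features in tools.items():
        covered = required & features
        missing = required - features
        report.append((name, len(covered) / len(required), sorted(missing)))
    # Sort by coverage (descending), then by name for a stable order.
    report.sort(key=lambda row: (-row[1], row[0]))
    return report


report = coverage(required, tools)
for name, fraction, missing in report:
    print(f"{name}: {fraction:.0%} covered, missing: {missing or 'nothing'}")
```

The point is not the code but the order of work: the specification comes first, and the tools are judged against it rather than against the results you hoped for.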
And get in touch with the providers of the tools. Most developers of bioinformatics tools are eager to learn what experimentalists think of their methods and want to improve them. But before you contact them, read the papers carefully. If you invest time in the communication, you will get a better return. (If your only goal is to decorate your findings, I suspect you won’t read a review anyway.)
Genes vs proteins. Another peculiarity of the review is that it highlights the differences between genes, proteins and isoforms. The amalgamation of these concepts is commonly encountered in everything that someone has applied the term systems biology to. Note, for instance, that the nodes of a protein-protein interaction network are genes. The term protein comes from the detection method. If proteins were modeled in such networks, one would have to represent each protein species in the cell (millions for the abundantly expressed genes) and would arrive at a model that is nearly as complex as the whole-cell simulation recently published by the Venter Institute.
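As a small illustration of why the nodes end up being genes, the following sketch collapses isoform-level interactions onto gene-level edges. The identifiers and interactions are invented, and the mapping is an assumption about how such a network could be built, not a description of any specific database.

```python
# Hypothetical mapping from protein isoforms to the genes that encode them.
isoform_to_gene = {
    "GENE1-iso1": "GENE1",
    "GENE1-iso2": "GENE1",
    "GENE2-iso1": "GENE2",
    "GENE3-iso1": "GENE3",
}

# Invented isoform-level interactions, as a detection method might report them.
isoform_interactions = [
    ("GENE1-iso1", "GENE2-iso1"),
    ("GENE1-iso2", "GENE2-iso1"),
    ("GENE1-iso2", "GENE3-iso1"),
]


def collapse_to_genes(interactions, mapping):
    """Map each isoform pair to an undirected gene pair and drop duplicates."""
    edges = set()
    for a, b in interactions:
        gene_a, gene_b = mapping[a], mapping[b]
        if gene_a != gene_b:  # ignore interactions between isoforms of one gene
            edges.add(tuple(sorted((gene_a, gene_b))))
    return edges


gene_edges = collapse_to_genes(isoform_interactions, isoform_to_gene)
print(sorted(gene_edges))
```

Three isoform-level observations become two gene-level edges, and the information about which isoform was actually involved is gone.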
For the EuroEPINOMICS project COGIE, SNPs will be ranked for follow-up work, and we will use not a black box but a transparent and traceable solution to select them.