Lessons. Today was the first day of our bioinformatics workshop in Leuven, Belgium. We started out with some basic command line programming and eventually moved on to working with RStudio. What is this all about? It’s about getting a basic understanding of what your computer does and how it handles files. It’s about good data and bad data, and about losing the fear of the command line. We collected responses from the participants about today’s take-home messages.
When to call the bioinformatician. Life science, and genome science in particular, is becoming a data-driven field. We are slowly developing a new understanding of how to live and work in a scientific environment that is virtually flooded with data. We will always be required to do some of the sifting, filtering and interpretation ourselves, but is there a general rule of thumb for what you should know and what you don’t have to know? This is the uncomfortable truth: there is no boundary. You are always better off understanding the data you are dealing with from scratch. The good news is that understanding basic bioinformatic techniques is becoming more and more achievable. Tomorrow, for example, we will have our first tutorials dealing with raw sequence data. I had always felt that everything more complex than BAM files was way beyond my reach.
Losing the fear of the command line. Some of the participants commented that our basic command line tutorials today helped them with their fear of this magic little black window that opens up when you click on Terminal (on Mac or Linux). In fact, working with basic Unix commands is kind of rewarding: jumping up and down in directories, creating, manipulating and deleting files, or retrieving pieces of information from gigabyte-sized files in a split second using the grep command. I am actually very happy that we got these comments unprompted from some participants. There are a few new bioinformaticians in the making!
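For readers who were not in the room, here is a minimal sketch of what such a grep search can look like. The file name, its content and the gene labels are made up for illustration and are not the workshop's actual tutorial material; a real data file would simply be much larger.

```shell
# Create a tiny example file in a temporary directory
# (hypothetical content standing in for a large data file)
tmpdir=$(mktemp -d)
cat > "$tmpdir/variants.txt" <<'EOF'
chr1 12345 A G GENE_A
chr2 67890 C T GENE_A
chr2 67999 G A GENE_B
chr20 11111 T C GENE_B
EOF

# Print every line that mentions GENE_B
grep 'GENE_B' "$tmpdir/variants.txt"

# Count the matching lines instead of printing them
gene_b_count=$(grep -c 'GENE_B' "$tmpdir/variants.txt")
echo "Lines mentioning GENE_B: $gene_b_count"

# Match whole words only, so that 'chr2' does not also match 'chr20'
chr2_count=$(grep -c -w 'chr2' "$tmpdir/variants.txt")
echo "Lines on chr2: $chr2_count"

# Clean up the temporary directory
rm -r "$tmpdir"
```

The -w option is the kind of detail you only discover by trying things out: without it, a search for chr2 would happily return the chr20 line as well.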
Installation is half the job. No matter how well you test-run your tutorial, there is always one thing that you didn’t think of. And being able to solve this issue with everybody in the room is another little success of such a tutorial. In fact, as one of our tutors mentioned earlier today, installing a program is half the job. You learn quite a bit from all the trial and error that you have to go through when getting a program to work.
WOW and not-so-WOW people. A few of our participants have already come to the realization that there are people who get easily fascinated by everything computer and people who don’t. Some have also indicated that they have come to a new appreciation of why they prefer lab work. My comment on this: fascination is not something that you have or don’t have, but a matter of investment. If you feel invested in having a certain program running on your computer, you develop fascination almost automatically. I would suggest a healthy mixture of peer pressure, continuous praise and time. After all, there are the magic 10,000 hours that you need to spend prior to calling yourself an expert.
R is complicated, but also a calculator. We covered both extremes of what our Linux laptops can offer in terms of environmental complexity, from the barren surface of the Unix shell to the complicated graphical user interface of RStudio. R is a very flexible statistical programming language with exquisite graphics features. Some refer to R as the lingua franca of statistical programming. Getting started with R is not very easy, as you need to get acquainted with different sorts of variables, tables, data frames and error messages. Nevertheless, R can show you things that you didn’t expect.
You can draw pedigrees with R. Packages expand the basic functionality of R by providing additional capacity for specific user requirements. There are R packages dealing with gene expression, exome data etc. And there are also packages such as kinship2 that allow you to manage and visualize pedigree data. And it is very satisfying when you realize that you have turned a simple table into a nicely designed pedigree simply by writing a few lines of code.
Yes, it’s rewarding. That’s a response that we got from probably half the participants today. One way that programming rewards you is by giving you a feeling of success. Even if you start with a simple one-liner at the Unix command line, it is difficult not to have some feeling of success when you realize that you have taken control of your computer. A few participants were concerned that you don’t learn much about the foundations of programming by working on specialized problems. I think that the opposite is true. Problem solving is one of the key skills in working with your computer in any capacity and you will soon find out that the key concepts are always the same.
You realize that you are a hacker, don’t you? “I liked the fact that this program worked. Then I changed a few things and just looked what happened”. What sounds like an innocent comment made by one of our participants in passing basically captures the essence of hacking. You take a program, system or computer apart, see what happens when you change certain aspects, and learn from this. This is the kind of exploring and experimenting with electronic telecommunication systems that started in the late 1950s with phreaking and later became associated with figures such as Captain Crunch. Of course there is the art and science of proper programming, but the essence of getting to know your computer is experimenting.
There is bad data. Finally, there are good habits and bad habits when working with computers and data. Roland opened up an Excel table today and then asked us about the 14 bad data mistakes that were contained in this table, ranging from empty cells to inconsistent formatting and inconsistent definitions. These are all mistakes that every one of us makes every day when dealing with data. Part of the learning curve is also adopting some of the standards that real computer scientists have developed. “We didn’t do bioinformatics today” was one of the final comments of the evening, a motivating comment that should remind us that the really complicated things are still to come. We will continue tomorrow morning by following the sequence right from scanner to fully annotated, interpretable data sets.
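Two of these mistakes, empty cells and inconsistent formatting, can be spotted from the command line once a table has been exported as a CSV file. The following is only a rough sketch: the file, its columns and its values are invented for illustration and are not Roland's table, and the naive comma splitting assumes that no cell contains a quoted comma.

```shell
# Create a small example table (hypothetical data, not the workshop's Excel file)
tmpdir=$(mktemp -d)
cat > "$tmpdir/patients.csv" <<'EOF'
id,sex,age_at_onset
P1,F,3
P2,female,
P3,M,12
P4,m,7
EOF

# Count the rows after the header that contain at least one empty cell
empty_rows=$(awk -F',' 'NR > 1 { for (i = 1; i <= NF; i++) if ($i == "") { n++; break } } END { print n + 0 }' "$tmpdir/patients.csv")
echo "Rows with empty cells: $empty_rows"

# Count the distinct spellings used in the 'sex' column (column 2);
# more than two usually points to inconsistent formatting
sex_values=$(tail -n +2 "$tmpdir/patients.csv" | cut -d',' -f2 | sort -u | wc -l | tr -d ' ')
echo "Distinct values in the sex column: $sex_values"

# Clean up the temporary directory
rm -r "$tmpdir"
```

In this toy table, one row has a missing age and the sex column is spelled four different ways, which is exactly the kind of thing that looks harmless in a spreadsheet and breaks an analysis later.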
As this is just the first day of the workshop, my acknowledgements are of course not complete at this point. However, I wanted to thank our tutors Roland and Patrick today and of course Arvid, who did much of the organization of this workshop.