5 Simple Bioinformatics Tips That Will Improve Your Research

These five simple tips should give you some ideas on how to improve your research through better data analysis. Feedback in the comments below is very welcome!

1. Visualize and Explore Your Data

In our experience, easy data visualization is one of the most desired features in bioinformatics software. It is invaluable for troubleshooting data analysis, as we described in the previous blog post. However, it also opens a whole new world in the way we do research. Currently, most research is purely hypothesis-driven, which means that experimental data and its visualization serve only as a validation method. In a data-driven research approach, new hypotheses are generated by recognizing patterns in a series of interactive data visualizations that are produced in automated or semi-automated ways. We see the systematic use of interactive data visualization as a core biological research tool.

Here is a short clip from GenExpress, our visualization platform for genomic data.




2. Script Your Analyses

According to our poll on the most used bioinformatics tools, Microsoft Excel is still the most used “bioinformatics tool”. Setting aside Excel’s regular data formatting issues, a more important problem is that it lacks features for reproducible analysis. When you do your analysis, you perform a certain set of steps to produce the desired results. If the input data changes for some reason, it is extremely difficult to redo the analysis in exactly the same way. It is also virtually impossible to formally describe the analysis you have performed and to share it with the rest of the research community.

That’s why scripting your analysis in one (or several) scripting languages is key to repeatable and well-defined research. It is also a big time saver in the long run, and a wonderful way to keep your projects self-documented. Ideally, every analysis should be completely reproducible by running a single console command. All the data parsing, configuration file preparation and execution of third-party tools can be driven by this single script.
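As a rough illustration of the single-command idea, here is a minimal Python sketch. Everything specific in it is an assumption made for the example: the two-column tab-separated input (gene, count), the MIN_COUNT threshold, the file names, and the use of the Python interpreter itself as a stand-in for a real third-party tool. It records its parameters, parses the data, and writes the result in one run.

```python
#!/usr/bin/env python3
"""Sketch of a one-command analysis: python run_analysis.py counts.tsv results/"""
import csv
import json
import subprocess
import sys
from pathlib import Path

MIN_COUNT = 10  # hypothetical threshold, chosen only for illustration


def parse_counts(path):
    """Read a two-column, tab-separated file (gene, count) into a dict."""
    with open(path, newline="") as handle:
        return {row[0]: int(row[1]) for row in csv.reader(handle, delimiter="\t") if row}


def tool_version(command):
    """Run an external command and return its output; raise if it fails."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def run_pipeline(counts_path, out_dir, min_count=MIN_COUNT):
    """Run every step of the (toy) analysis and return the filtered counts."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: record parameters and tool version so the run documents itself.
    # The Python interpreter stands in for a real third-party tool here.
    config = {
        "input": str(counts_path),
        "min_count": min_count,
        "tool_version": tool_version([sys.executable, "--version"]),
    }
    (out_dir / "config.json").write_text(json.dumps(config, indent=2))

    # Step 2: parse the input and keep genes at or above the threshold.
    counts = parse_counts(counts_path)
    kept = {gene: n for gene, n in counts.items() if n >= min_count}

    # Step 3: write the result in a fixed order so reruns are comparable.
    with open(out_dir / "filtered.tsv", "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for gene in sorted(kept):
            writer.writerow([gene, kept[gene]])
    return kept


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: run_analysis.py COUNTS_TSV OUT_DIR")
    else:
        run_pipeline(sys.argv[1], sys.argv[2])
```

If the input file changes, rerunning the same command redoes every step in the same way, and the config.json written next to the results records what was done.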

3. Consider Cloud

Uploading and analyzing data on an external infrastructure is becoming very popular in the research community. While it may be less suitable for large biotech companies with high security requirements, it usually solves many problems for researchers. First, it lets you focus on your research, as setting up a computing infrastructure is usually not something life scientists want to deal with. Second, it is flexible: you pay only for the resources you use. It also usually allows for better data organization and data archiving. Instead of having those reads stored on multiple external hard drives somewhere around the office, everything is in one place, stored for potential future reuse.

4. Don’t Reinvent The Wheel

In bioinformatics, standing on the shoulders of giants is especially important. Many people have already solved tons of data analysis problems, so when you are starting your analysis, it makes a lot of sense to first search the most popular bioinformatics forums, not only when you stumble upon problems but also to get some best-practice advice. SeqAnswers and BioStars are the two most popular forums for bioinformatics questions. Of course, search the archives before posting or expect people to flame you.

There are also plenty of lists of bioinformatics tools out there.

5. Learn Linux Console

Most bioinformatics tools run on Linux. While it may seem very complicated in the beginning, after a few weeks of active learning you can easily manipulate data using standard command-line tools like samtools and bowtie. Many people are “afraid” of the console, but armed with at least the basic skills of parsing and format manipulation, you can do a lot of basic data analysis in a very short time. After you master these basics, the hard part becomes tuning the gazillion parameters of these command-line tools that influence the data analysis.
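To give a feel for what “parsing and format manipulation” means in practice, here is a small sketch of one common task, converting FASTQ reads to FASTA. It is written in Python rather than as a shell one-liner, and it assumes the simple layout of exactly four lines per FASTQ record; the function name fastq_to_fasta is made up for this example.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert 4-line FASTQ records to FASTA and return the record count.

    Assumes each record is exactly four lines: '@' header, sequence,
    '+' separator, quality string.
    """
    count = 0
    while True:
        header = fastq_handle.readline().rstrip("\r\n")
        if not header:
            break  # end of input
        sequence = fastq_handle.readline().rstrip("\r\n")
        fastq_handle.readline()  # '+' separator line, not needed for FASTA
        fastq_handle.readline()  # quality line, not needed for FASTA
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header, got: " + header)
        fasta_handle.write(">" + header[1:] + "\n" + sequence + "\n")
        count += 1
    return count


# Small in-memory demonstration with two made-up reads.
reads = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")
fasta = io.StringIO()
n_records = fastq_to_fasta(reads, fasta)
```

The same function works on real files opened with open(), since it only relies on readline() and write().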