# Data is sexy. Fine tune your lens to capture it

Bioinformatics. If you are a biologist or work in a life-science field, this word might sound cool to you. Integrating programming with high-throughput biological data to find the underlying causes of disease with high confidence is definitely exciting. I graduated in pharmacy, and it has been almost three years since I got into bioinformatics. Here, I would like to share a few tips (drawn solely from personal experience) for beginner bioinformaticians and anyone keen to learn bioinformatics.

As a beginner bioinformatician or programmer (in any field, for that matter), you should be able to use Google effectively. Learn how to use keywords and different search patterns. The combination of search keywords makes a lot of difference in the results you get.

Look at the difference in results when I added biostars to the query. The search results become much more specific when you add the right keywords.

Here are some suggestions on how to use Google productively - http://time.com/4116259/google-search/

#### Learn a *nix based operating system

I can’t stress this enough. If you want to learn bioinformatics, the first thing to do is switch from Windows to a *nix (Linux or Unix) based operating system. Get comfortable with the terminal. You can easily parallelize chores on Linux, which is very difficult or not possible on Windows. Unless you are sending results to a non-bioinformatician, avoid using Excel (Here is why). Once you know the basics, you will understand how to use shell utilities and command-line tools in your day-to-day work.

Always keep in mind what you are trying to identify from the data. Given the number of methods available and the rate at which new ones are published, you will be tempted to try methods on your data that might not be relevant. Control your excitement. Only a few packages/methods will be appropriate for your goal. Spend some time reading about the methods (i.e. their main purpose) and the packages that implement them. Before getting your feet wet, consider the user base of a package and how responsive the authors are to questions/requests raised by users (you can ask Google).

#### Polish the shoes before wearing ‘em

It’s important to analyse good-quality data to produce significant results. Devoting a good amount of time to assessing the quality of the data will save a lot of time in later stages. From the data, you can identify samples whose experiments did not work (properly). Those samples will mess up the actual signal you are hoping to identify. Removing them makes a huge difference in the results.
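To give a flavour of what such a check can look like, here is a minimal sketch that flags samples whose total read count is far below the rest. The function name, the `min_fraction` cutoff and the toy counts are all made up for illustration; real quality control looks at many more metrics than library size.

```python
from statistics import median


def flag_low_quality_samples(counts, min_fraction=0.2):
    """Return names of samples whose library size is far below the median.

    counts: dict mapping sample name -> list of raw counts per gene.
    min_fraction: hypothetical cutoff; a sample is flagged when its total
    count is below min_fraction * (median total across all samples).
    """
    totals = {sample: sum(values) for sample, values in counts.items()}
    cutoff = min_fraction * median(totals.values())
    return sorted(sample for sample, total in totals.items() if total < cutoff)


# Toy data (invented): four samples, four genes.
counts = {
    "ctrl_1": [120, 30, 560, 8],
    "ctrl_2": [110, 41, 610, 12],
    "treat_1": [95, 60, 480, 20],
    "treat_2": [3, 0, 9, 1],  # almost no reads: likely a failed library
}

flagged = flag_low_quality_samples(counts)
print(flagged)
```

A flagged sample is a prompt to go back and look at it, not an automatic reason to drop it.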

#### Don’t hurray if you see clusters with one method

When you work with high-dimensional data, a proper transformation is important before you apply any dimension reduction/clustering method. If you see distinct clusters in the data with one method, check whether they are consistent with an independent method (e.g. PCA). The observed clusters might be due to an improper transformation of the data. If you consistently see those clusters with different independent methods, that’s the time to hurray.
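One way to put a number on "consistent" is to compare the cluster labels from two methods pair by pair. Below is a minimal sketch of the Rand index, the fraction of sample pairs on which two clusterings agree. The labels are invented, and the sketch covers only the comparison step, not the transformation or the clustering itself.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree.

    A pair agrees when both clusterings put the two samples together,
    or both put them apart. 1.0 means the partitions are identical.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same samples")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical cluster labels for six samples from two independent methods.
# The label names themselves do not matter, only who is grouped with whom.
method_1 = [0, 0, 0, 1, 1, 1]
method_2 = ["b", "b", "b", "a", "a", "b"]

print(rand_index(method_1, method_1))
print(rand_index(method_1, method_2))
```

A value close to 1 across methods is reassuring. A low value suggests the clusters depend on the method (or the transformation) rather than on the biology.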

As you work on different projects, you will have to perform some tasks that are common to many (if not all) of them. Identify those tasks and try to write generalized scripts/functions for them. It will save a lot of time in the long run.

For example, I found myself using DESeq2-transformed counts in many projects, so I wrote a small tool for myself to get them quickly:

    Rscript getMe_DESeq2_counts.R --input counts_matrix.txt --coldata coldata.txt --output requiredCounts.txt


This way, I don’t need to open RStudio to get things done.
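The same pattern works in any language. Here is a minimal Python sketch of a reusable command-line script with the same `--input`/`--output` style. To be clear, this is not my DESeq2 tool: it applies a plain log2 counts-per-million transform as a stand-in, and the input layout (tab-delimited, header row, gene IDs in the first column) is an assumption.

```python
import argparse
import csv
import math
import os
import tempfile


def log2_cpm(counts, pseudocount=1.0):
    """Return log2(counts-per-million + pseudocount) for a genes x samples matrix."""
    if not counts:
        raise ValueError("counts matrix is empty")
    n_samples = len(counts[0])
    totals = [sum(row[k] for row in counts) for k in range(n_samples)]
    if any(total == 0 for total in totals):
        raise ValueError("every sample needs at least one read")
    return [
        [math.log2(row[k] / totals[k] * 1e6 + pseudocount) for k in range(n_samples)]
        for row in counts
    ]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Transform a counts matrix.")
    parser.add_argument("--input", required=True, help="tab-delimited counts, genes in rows")
    parser.add_argument("--output", required=True, help="path for the transformed counts")
    args = parser.parse_args(argv)

    # Assumed layout: header row, then one row per gene with the ID first.
    with open(args.input, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        records = [row for row in reader if row]
    genes = [row[0] for row in records]
    counts = [[float(value) for value in row[1:]] for row in records]

    transformed = log2_cpm(counts)

    with open(args.output, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        for gene, values in zip(genes, transformed):
            writer.writerow([gene] + [f"{value:.4f}" for value in values])


# Small demo on a made-up matrix, passing the arguments explicitly.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "counts_matrix.txt")
    out_path = os.path.join(tmp, "transformed_counts.txt")
    with open(in_path, "w") as handle:
        handle.write("gene\ts1\ts2\nA\t10\t0\nB\t30\t50\n")
    main(["--input", in_path, "--output", out_path])
    with open(out_path) as handle:
        demo_output = handle.read().splitlines()

print(demo_output)
```

Saved as a script, you would replace the demo at the bottom with the usual `if __name__ == "__main__": main()` guard and call it from the terminal with `--input` and `--output`.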

#### Frequently discuss your work with an experienced biologist

The results you get at the end of the analysis can be overwhelming. Make use of online over-representation/enrichment tools (if your result is a gene list) to get a quick idea of what is in your data, and discuss it with a biologist. An experienced biologist might see what you are not seeing in the data. Whatever comes out of the analyses, in the end it should make biological sense.
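It helps to know roughly what an over-representation tool computes before discussing its output. Many of them boil down to a hypergeometric test: given a universe of genes, how surprising is the overlap between your gene list and a gene set? Here is a minimal sketch with invented numbers. Real tools differ in their choice of background and add multiple-testing correction on top.

```python
from math import comb


def overrepresentation_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(overlap >= n_overlap).

    n_universe: number of genes in the background.
    n_in_set:   number of background genes annotated to the gene set.
    n_selected: size of your gene list.
    n_overlap:  genes in your list that belong to the gene set.
    """
    max_overlap = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, a pathway of 100 genes,
# a list of 200 differentially expressed genes, 10 of them in the pathway.
p_value = overrepresentation_pvalue(20000, 100, 200, 10)
print(p_value)
```

With these numbers you would expect about one pathway gene in the list by chance, so an overlap of ten gives a very small p-value.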

#### Visualize your data like a pro

Learn what type of plot to pick (and not to pick) depending on what you want to show.

E.g., a Venn diagram with more than 4 circles is not a good plot. No one wants to track the numbers displayed on these plots. Instead, one can use UpSet plots to represent the same data.

From the following two plots, which one are you more likely to read?

(P.S. I used the Intervene tool to generate these figures)
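If you are wondering what an UpSet plot actually shows, it is the number of items that belong to each exact combination of sets, drawn as bars instead of overlapping shapes. Here is a minimal sketch that computes those numbers for five invented gene lists. It only does the counting, not the plotting.

```python
from collections import Counter


def exclusive_intersections(named_sets):
    """Count items by the exact combination of sets they belong to."""
    membership = Counter()
    all_items = set().union(*named_sets.values())
    for item in all_items:
        combo = tuple(name for name, members in named_sets.items() if item in members)
        membership[combo] += 1
    return membership


# Five hypothetical gene lists: already too many for a readable Venn diagram.
gene_lists = {
    "A": {"g1", "g2", "g3", "g4", "g8"},
    "B": {"g2", "g3", "g5"},
    "C": {"g3", "g4", "g6"},
    "D": {"g3", "g7"},
    "E": {"g1", "g3", "g7", "g8"},
}

sizes = exclusive_intersections(gene_lists)

# Largest combinations first, as the bars of an UpSet plot are usually ordered.
for combo, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
    print(" & ".join(combo), size)
```

Each printed line corresponds to one bar, which is why this layout stays readable as the number of sets grows.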

Whatever programming language you use (though I strongly suggest R or Python), master its best modules/libraries for data visualization. Not all the plots you generate in a project will go into your manuscript. On the other hand, you never know which ones you will include in the end. While generating a plot, spend a few extra minutes polishing the axis labels, axis ticks, title and other parameters. If you are using ggplot2 in R, write your own theme (Here is how) and use it for all the plots you generate for a project.

Here is an engaging book to get you started on Data Visualization - http://serialmentor.com/dataviz/

#### Try to work with standard file formats

I used to generate intermediate files from the standard files (meaning, the file formats most tools accept/generate) to work with. Believe me, I would not recommend this even in my worst dreams. Generating intermediate file formats is not ideal (except in some rare cases), unless you are writing a generalized pipeline that you will use all the time. It is highly likely that you will forget the file format after a few months, and it is time-consuming to go through your old scripts to find out what you did there.

#### Document it. Document it. Did I say “Document it”?

I emphasized the significance of documentation in one of my previous blog posts - A note to my six months younger-self

Sometimes it may be boring, but remember that otherwise you will have to dig through your folders to find the script you wrote to generate that figure. Don’t you think it is better to spend some time, daily or immediately after the analysis, documenting what you did? It is, absolutely.

Documentation is a love letter you write to yourself - Anonymous

A few days back, I came across the following tweet by Ming Tang, which said:

This is very good advice. When you are documenting the details, you are writing the first draft of the methods section of your paper.
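One low-effort way to build the habit is to let your scripts leave a note next to every file they produce. Here is a minimal sketch that writes a small JSON record alongside an output file. The function name, the fields and the example figure are all invented for illustration, so adapt them to whatever you would want to read six months from now.

```python
import json
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(output_path, description, parameters=None):
    """Save a small JSON note next to an output file describing how it was made."""
    output_path = Path(output_path)
    record = {
        "output": output_path.name,
        "description": description,
        "script": Path(sys.argv[0]).name,  # the script that was run
        "parameters": parameters or {},
        "created_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    note_path = output_path.with_name(output_path.name + ".provenance.json")
    note_path.write_text(json.dumps(record, indent=2) + "\n")
    return note_path


# Demo with a placeholder file standing in for a real figure.
with tempfile.TemporaryDirectory() as tmp:
    figure = Path(tmp) / "fig2_heatmap.pdf"
    figure.write_bytes(b"")
    note = write_provenance(figure, "Heatmap of top 50 variable genes", {"top_n": 50})
    saved = json.loads(note.read_text())

print(saved["output"], saved["parameters"])
```

This does not replace a written lab notebook or README, but it answers the "which script made this figure?" question without any digging.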

Become a member of open bioinformatics communities such as biostars.org, r/bioinformatics and SEQanswers. Answer and follow questions relevant to you. Observe what views other people in the community have on a question. This broadens your knowledge of the topic. It is not only useful for building your expertise in the subject, but also adds weight to your CV.

Let me share my story. I’m an active member of the Biostars community. In my PhD interview, I included a slide with my Biostars profile and explained what the community is about and what I do there. I would definitely say this added a few points to my application. (BTW, I got that PhD position).