Genomics workflows often require reshaping data before applying a statistical method or visualizing patterns in it. I use R for 90% of my analyses, and I would like to share some of the code I regularly use in my work.
Before joining the party, let me tell you that I love the UNIX command line a lot. Early on, I desperately learnt how to use command piping and process substitution and became addicted to them. Later, while learning R in depth, I came to know that R has something similar to piping. I started experimenting with it and gradually became comfortable using R pipes in my day-to-day (epi)genomics analysis.
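For readers coming from the shell, the analogy can be sketched in a few lines. This is only an illustration: the `counts` vector and the log-transform steps are made up for the example, and the `%>%` operator is the magrittr pipe that dplyr re-exports.

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

# Made-up read counts, used only to illustrate the pipe
counts <- c(1, 0, 14, 35, 12, 56)

# Nested calls are read inside-out
nested <- round(mean(log2(counts + 1)), 2)

# The piped version reads left to right, like cmd1 | cmd2 | cmd3 in a shell
piped <- (counts + 1) %>% log2() %>% mean() %>% round(2)

# Both forms compute the same value
identical(nested, piped)
```

Each `%>%` passes the result on its left as the first argument of the function on its right, which is what makes long chains of data-frame operations readable.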
Here is a piece of R code that requires the tibble package (along with dplyr) to format data such as RNA-seq or ChIP-seq count tables. Most read-counting tools output their results as table-formatted text files, and these tables always need some polishing before downstream analysis. A typical ChIP-seq count table looks like this:
    cat notes_01.txt
    chr     start   end     sample1 sample2 sample3
    chr1    120     130     1       0       14
    chr1    150     200     35      12      56
    chr2    300     500     67      56      78
    chr4    250     400     13      24      90
    ...
    ...
The table above should be reformatted so that chr, start, and end are combined into the row names; in downstream analysis one can then identify which regions are interesting from the results. Using the tibble package, the following piped code gives us what we need.
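    library(dplyr)

    # Read the tab-delimited count table
    dat <- read.delim("notes_01.txt", header = TRUE)

    # Build a region ID from chr, start and end, drop those columns,
    # and use the region ID as row names
    dat <- dat %>%
      dplyr::mutate(region = paste(chr, start, end, sep = "_")) %>%
      dplyr::select(-c(chr, start, end)) %>%
      tibble::column_to_rownames(var = "region")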
This will result in
                 sample1 sample2 sample3
    chr1_120_130       1       0      14
    chr1_150_200      35      12      56
    chr2_300_500      67      56      78
    chr4_250_400      13      24      90
This modified table (a data.frame) can now be used for clustering or visualization within R. With a slight modification, the same code can also be applied to RNA-seq count tables. That is for you to experiment with.
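As one possible starting point for that experiment, here is a minimal sketch. It assumes a hypothetical RNA-seq table whose first column is called `gene_id` and which carries one extra annotation column, `gene_length`. These column names and all the values are invented for illustration; real counting tools name their columns differently, so adjust accordingly.

```r
library(dplyr)

# Hypothetical RNA-seq count table; gene IDs, lengths and counts are made up
rna <- read.delim(text = "gene_id\tgene_length\tsample1\tsample2\tsample3
geneA\t1500\t10\t0\t25
geneB\t900\t3\t7\t1
geneC\t2300\t120\t98\t143", header = TRUE)

# The gene ID already identifies each row, so no paste() step is needed:
# drop the annotation column and move gene_id into the row names
rna <- rna %>%
  dplyr::select(-gene_length) %>%
  tibble::column_to_rownames(var = "gene_id")

rna
```

The only real change from the ChIP-seq version is that the identifier column already exists, so the `mutate()` step disappears.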
Have fun!