Unix embraces the pipe (|). An example using the Ensembl gene annotation:

$ curl ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz \
	| zcat \
	| head

Even in R, the dplyr package provides an interface for piping input through multiple operations (see my previous post for details). I am a big fan of this kind of pipe, and under this design it is very common to chain several operations in a single command. The only downside was that I couldn't check the intermediate output of the pipeline, until I discovered the tee command in Unix.

$ curl ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz \
	| zcat \
	| awk '$1==1' \
	| grep miRNA \
	| tee ~/Desktop/chr1.miRNA.gtf \
	| grep transcript \
	| wc -l

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 47.4M  100 47.4M    0     0   324k      0  0:02:29  0:02:29 --:--:--  306k
     632

This saves all the chromosome 1 miRNA records to the file chr1.miRNA.gtf on the desktop and, further down the same pipe, counts the chromosome 1 miRNA lines that also match "transcript" (632).
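To try the same pattern without the download, here is a minimal offline sketch. The rows are made up and use a simplified three-column layout (chromosome, feature, attributes) rather than real Ensembl GTF records, and the scratch directory and file names are hypothetical:

```shell
tmp=$(mktemp -d)   # hypothetical scratch directory, not from the post

# Toy, made-up rows: chromosome, feature type, attributes (tab-separated).
printf '1\tgene\tgene_id "G1"; gene_biotype "miRNA";\n' > "$tmp/toy.gtf"
printf '1\ttranscript\tgene_id "G1"; transcript_id "T1"; gene_biotype "miRNA";\n' >> "$tmp/toy.gtf"
printf '2\ttranscript\tgene_id "G3"; transcript_id "T3"; gene_biotype "miRNA";\n' >> "$tmp/toy.gtf"
printf '1\tgene\tgene_id "G2"; gene_biotype "protein_coding";\n' >> "$tmp/toy.gtf"

# Same shape as the pipeline above: filter, save a checkpoint with tee, keep filtering.
count=$(awk '$1==1' "$tmp/toy.gtf" \
	| grep miRNA \
	| tee "$tmp/chr1.miRNA.gtf" \
	| grep transcript \
	| wc -l)

echo "lines after the second grep: $count"
echo "lines saved by tee: $(wc -l < "$tmp/chr1.miRNA.gtf")"
```

With these toy rows, the checkpoint file should hold the two chromosome 1 miRNA lines, while only one of them survives the second grep. Note that grep transcript matches any line containing that string, so a line that only carries a transcript_id attribute would be counted too.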

So the tee command writes its input to a file and prints it to standard output at the same time. I am surprised that this command is so underused among Unix users. I think it can be handy for everyone, and it is worth incorporating into at least one of your pipelines.
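As a small illustration of that behaviour, the sketch below puts two tee checkpoints in one pipe. The scratch directory and file names are hypothetical, and the input is just three made-up lines:

```shell
workdir=$(mktemp -d)   # hypothetical scratch directory

# tee writes its input to the named file and repeats it on standard output,
# so the next command in the pipe still receives every line.
printf 'a\nb\nc\n' \
	| tee "$workdir/all.txt" \
	| grep -v b \
	| tee "$workdir/filtered.txt" \
	| wc -l

cat "$workdir/all.txt"        # the lines as they entered the pipe
cat "$workdir/filtered.txt"   # the lines left after grep -v b
```

Each file captures the stream at the point where its tee sits, so comparing them shows what each step of the pipeline removed.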