Posted on

GNU provided two handy tools: datamash and parallel

datamash can summarize data frames, just like the summary() function in R.

from a file that I generated two :

$ head  try.tsv
31	57
43	12
68	59
9	35
22	42
27	5
26	25
58	38
40	77
60	79

And to run datamash, just pipe in the input or use standard input..

$ datamash sum 1, sum 2 > try.tsv
384	429
$ cat try.tsv | datamash mean 1, mean 2
38.4	42.9

Apart from mean and sum, different operation can be used:

datamash --help
Usage: datamash [OPTION] op [col] [op col ...]

Performs numeric/string operations on input from stdin.

'op' is the operation to perform;
For grouping operations 'col' is the input field to use.

File operations:
  transpose, reverse
Numeric Grouping operations:
  sum, min, max, absmin, absmax
Textual/Numeric Grouping operations:
  count, first, last, rand
  unique, collapse, countunique
Statistical Grouping operations:
  mean, median, q1, q3, iqr, mode, antimode
  pstdev, sstdev, pvar, svar, mad, madraw
  pskew, sskew, pkurt, skurt, dpo, jarque

It is a lot more powerful than just getting summary, it can used for grouping and transposing data as well.

For Parallel,

its very handy if you want to run several commands in parallel such as gunzip,

$ cp try.tsv try1.tsv
$ parallel gzip {} ::: `ls *tsv`
$ ls
generateRandomNum.R try.tsv.gz          try1.tsv.gz
$ parallel gzip {} ::: `ls *tsv`
$ ls
generateRandomNum.R try.tsv             try1.tsv

the {} will be your files.

Alternatively, {.} and {/.} and use for substitution of basenames and file type.

$ parallel mv {} {.}.txt ::: *tsv
$ ls
generateRandomNum.R try.txt             try1.txt
$ cd ..
$ parallel mv {} gnu-commands/{/.}.tsv ::: gnu-commands/*txt
$ ls gnu-commands/
generateRandomNum.R try.tsv             try1.tsv

for osx:

brew install datamash
brew install parallel

Codes used in the post are deposited in here.