← back to GNU Datamash main page
GNU datamash - Examples
GNU datamash is command-line program which performs simple calculation (e.g. count, sum, min, max, mean, stdev, string coalescing) on input files.
# sum the values in the first column $ seq 10 | datamash sum 1 55
Datamash has a rich set of statistical functions, to quickly assess information in textual input files. An example of calculating basic statistic (mean, 1st quartile, median, 3rd quartile, IQR, sample-standard-deviation, and p-value of Jarque-Bera test for normal distribution:
$ datamash -H mean 1 q1 1 median 1 q3 1 iqr 1 sstdev 1 jarque 1 < FILE.TXT mean(x) q1(x) median(x) q3(x) iqr(x) sstdev(x) jarque(x) 45.32 23 37 61.5 38.5 30.4487 8.0113e-09
Topics
- Example files
- Example: Test Scores
- Example: Header Lines
- Example: Genes
- Example: Datamash with other unix programs
- Example: Grouping Multiple Fields
- Example: Descriptive Statistics
- Example: Transpose File (swap rows, columns)
- Example: Reverse fields
Example Files
The input files used in the following examples are distributed with datamash's source code,
and can also be downloaded directly:
scores.txt,
scores_h.txt,
genes.txt,
genes_h.txt
(The "_h
" files have an additional header line as the first line of the file).
When datamash is locally installed, example files are typically in
/usr/share/datamash/examples
or
/usr/local/share/datamash/examples
.
Example: Test Scores
The file scores.txt
contains tests scores of
college students of different majors (Arts, Business, Health and Medicine,
Life Sciences, Engineering, Social Sciences).
The files has three columns: Name, Major, Score:
$ cat scores.txt Shawn Arts 65 Marques Arts 58 Fernando Arts 78 Paul Arts 63 Walter Arts 75 ...Using datamash, find the lowest (min) and highest (max) score for each College Major (Major is in column 2, the score values are in column 3):
$ datamash -g 2 min 3 max 3 < scores.txt Arts 46 88 Business 79 94 Health-Medicine 72 100 Social-Sciences 27 90 Life-Sciences 14 91 Engineering 39 99Similarly, find the number of students, mean score and sample-standard-deviation for each College major:
$ datamash -g 2 mean 3 sstdev 3 < scores.txt Arts 68.9474 10.4215 Business 87.3636 5.18214 Health-Medicine 90.6154 9.22441 Social-Sciences 60.2667 17.2273 Life-Sciences 55.3333 20.606 Engineering 66.5385 19.8814
Example: Header Lines
A header line is an optional first line in the input or output files,
which labels each column. Datamash can generate
header line in the output file, even if the input file doesn't have a header line
(scores.txt
does not have a header line,
the first line in the file contains data).
Use --header-out
to add a header line to the output
(when the input does not contain a header line):
$ datamash --header-out -g 2 count 3 mean 3 pstdev 3 < scores.txt GroupBy(field-2) mean(field-3) sstdev(field-3) Arts 68.9474 10.4215 Business 87.3636 5.18214 Health-Medicine 90.6154 9.22441 Social-Sciences 60.2667 17.2273 Life-Sciences 55.3333 20.606 Engineering 66.5385 19.8814When the input file has a header line, datamash will use the names from each column in the output header line.
scores_h.txt
contains the same information as
scores.txt
, with an additional header line:
$ cat scores_h.txt Name Major Score Shawn Arts 65 Marques Arts 58 Fernando Arts 78 Paul Arts 63 Walter Arts 75 ...Use
-H
(equivalent to --header-in --header-out
)
to use input headers and print output headers:
$ datamash -H -g 2 count 3 mean 3 pstdev 3 < scores_h.txt GroupBy(Major) mean(Score) sstdev(Score) Arts 68.9474 10.4215 Business 87.3636 5.18214 Health-Medicine 90.6154 9.22441 Social-Sciences 60.2667 17.2273 Life-Sciences 55.3333 20.606 Engineering 66.5385 19.8814
Example: Genes
note: The following examples assume basic bioinformatics knowledge.
The genes.txt
file contains a small subset of the Human Genome genes.
The columns of genes.txt
are:
- bin
- name - isoform/transcript identifier
- chromosome
- strand
- txStart - transcription start site
- txEnda - transcription end site
- cdsStart - coding start site
- cdsEnd - coding end site
- exonCount - number of exons
- exonStarts
- exonEnds
- score
- GeneName - gene identifier
- cdsStartStat
- cdsEndStat
- exonFrames
Find Number of isoforms per gene
The gene identifiers are in column 13, the transcript identifiers are in column 2.
To count how many isoforms each gene has, use datamash
to group by column 13, and for each group, count the values in column 2
(use -s
to automatically sort the input file):
$ datamash -s -g 13 count 2 < genes.txt ABCC1 1 ABCC10 2 ABCC11 3 ABCC12 1 ABCC13 2 ...Using the
collapse
operation, datamash can print all
the isoforms for each gene:
$ datamash -s -g 13 count 2 collapse 2 < genes.txt ABCC1 1 NM_004996 ABCC10 2 NM_001198934,NM_033450 ABCC11 3 NM_032583,NM_033151,NM_145186 ABCC12 1 NM_033226 ABCC13 2 NR_003087,NR_003088 ...When using a file with a header line, add
-H
:
$ datamash -H -s -g 13 count 2 collapse 2 < genes_h.txt GroupBy(name2) count(name) collapse(name) ABCC1 1 NM_004996 ABCC10 2 NM_001198934,NM_033450 ABCC11 3 NM_033151,NM_145186,NM_032583 ABCC12 1 NM_033226 ABCC13 2 NR_003088,NR_003087 ...
Datamash with other unix programs
Datamash is designed as a unix filter program, reading from standard input and writing to standard output. It works well with the common unix programs (such as awk) for additional functionality:
Find genes with more than 5 isoforms:$ cat genes.txt | datamash -s -g 13 count 2 collapse 2 | awk '$2>5' AC159540.1 6 NR_040097,NR_103732,NR_103733,NR_040097,NR_103732,NR_103733 ACSF3 6 NM_001127214,NM_001243279,NM_001284316,NM_174917,NR_045667,NR_104293 ADAM29 7 NM_001130703,NM_001130704,NM_001130705,NM_001278125,NM_001278126,NM_001278127,NM_014269 AIPL1 8 NM_001033054,NM_001033055,NM_001285399,NM_001285400,NM_001285401,NM_001285402,NM_001285403,NM_014336 ANXA8 6 NM_001040084,NM_001271702,NM_001271703,NM_001040084,NM_001271702,NM_001271703 ...
Which genes are transcribes from both strands?
(that is, they have isoforms with both positive and negative strands. strand column is number 4)
$ cat genes.txt | datamash -s -g 13 countunique 4 | awk '$2>1' AC159540.1 2 AMY1C 2 ANXA8 2 BMS1P17 2 BMS1P18 2 ...
Which genes are transcribes from multiple chromosomes?
(that is, they have isoforms from multiple chromosomes. Chromosome column is number 3):
$ cat genes.txt | datamash -s -g 13 countunique 3 unique 3 | awk '$2>1' AKAP17A 2 chrX,chrY ASMT 2 chrX,chrY ASMTL 2 chrX,chrY ASMTL-AS1 2 chrX,chrY BMS1P17 2 chr14,chr22 ...Examine Exon-count variability
(for each gene, list the minimum, maximum, mean and stddev of the exon-count of its isoforms. Exon-Count column is number 9):
$ cat genes.txt | datamash -s -g 13 count 9 min 9 max 9 mean 9 pstdev 9 | awk '$2>1' ABCC10 2 20 22 21 1 ABCC11 3 29 30 29.3 0.471405 ABCC13 2 5 6 5.5 0.5 ABCC3 2 12 31 21.5 9.5 AC159540.1 6 4 5 4.1 0.372678 ...
Example: Grouping multiple fields
Chromosome name is in column 3. How many transcripts are in each chromosome?$ datamash -s -g 3 count 2 < genes.txt chr1 365 chr10 164 chr11 189 chr12 187 chr13 66 ...
Strand information is in column 4. How many transcripts are in each chromsome AND strand?
To find out, group by two columns: Column 3 (chromosome name) and Column 4 (strand):
$ datamash -s -g 3,4 count 2 < genes.txt chr1 - 183 chr1 + 182 chr10 - 52 chr10 + 112 chr11 - 105 chr11 + 84 chr12 - 117 chr12 + 70 ...
Example: Descriptive Statistics
Calculating the Five-Number Summary: of all values in the first column of the input file:
$ datamash min 1 q1 1 median 1 q3 1 max 1 < FILE.TXT 78 93 100 107 120
The same command, with header lines for better clarity:
$ datamash -H min 1 q1 1 median 1 q3 1 max 1 < FILE.TXT min(x) q1(x) median(x) q3(x) max(x) 78 93 100 107 120
Finding out the count,mean and sample standard-deviation:
$ datamash -H count 1 mean 1 sstdev 1 < FILE.TXT count(x) mean(x) sstdev(x) 100 100.06 9.5767184
Example: Transpose File (swap rows, columns)
Use transpose to swap rows and columns in a file:
$ cat input.txt Sample Year Count id-123 2014 1002 id-99 2013 990 id-42 2014 2030 id-13 2014 599 $ datamash transpose < input.txt Sample id-123 id-99 id-42 id-13 Year 2014 2013 2014 2014 Count 1002 990 2030 599By default, transpose verifies the input has the same number of fields in each line, and fails with an error otherwise:
$ cat input1.txt Sample Year Count id-123 2014 1002 id-99 2013 id-42 2014 2030 id-13 2014 599 $ datamash transpose < input1.txt datamash: transpose input error: line 3 has 2 fields (previous lines had 3); see --help to disable strict modeUse --no-strict to allow missing values:
$ datamash --no-strict transpose < input1.txt Sample id-123 id-99 id-42 id-13 Year 2014 2013 2014 2014 Count 1002 N/A 2030 599 # Use --filler to set the missing-field-filler value: $ datamash --no-strict --filler XYZ transpose < input1.txt Sample id-123 id-99 id-42 id-13 Year 2014 2013 2014 2014 Count 1002 XYZ 2030 599
Example: Reverse Fields
Use reverse to reverse the fields order in a file:
$ cat input.txt Sample Year Count id-123 2014 1002 id-99 2013 990 id-42 2014 2030 id-13 2014 599 $ datamash reverse < input.txt Count Year Sample 1002 2014 id-123 990 2013 id-99 2030 2014 id-42 599 2014 id-13By default, reverse verifies the input has the same number of fields in each line, and fails with an error otherwise. Use --no-strict to disable this behaviour (See transpose section above for an example).