Summary Statistics (GNU Datamash 1.8)

5.1 Summary Statistics

The following examples use datamash to quickly calculate summary statistics. They use a file with three fields (name, subject, score) representing students' grades:

$ cat scores.txt
Shawn     Arts  65
Marques   Arts  58
Fernando  Arts  78
Paul      Arts  63
Walter    Arts  75
...

Count how many students study each subject (the subject is the second field in the input file, hence groupby 2):

$ datamash --sort groupby 2 count 2 < scores.txt
Arts            19
Business        11
Engineering     13
Health-Medicine 13
Life-Sciences   12
Social-Sciences 15
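
To make the grouping step explicit, here is a minimal Python sketch of what a group-by count computes. It is an illustration only, not how datamash is implemented. It assumes whitespace-separated fields and uses just the five rows visible in the listing above (the rest of scores.txt is elided), so it reports 5 for Arts rather than 19.

```python
from collections import Counter

# Only the five rows shown in the listing; the real scores.txt has more.
rows = """\
Shawn     Arts  65
Marques   Arts  58
Fernando  Arts  78
Paul      Arts  63
Walter    Arts  75
"""

# Split each line on whitespace and tally the second field (the subject).
counts = Counter(line.split()[1] for line in rows.splitlines())

# Print one line per group, in sorted order, tab-separated.
for subject in sorted(counts):
    print(f"{subject}\t{counts[subject]}")
```

Sorting the group keys before printing mirrors the effect of --sort, which ensures that rows belonging to the same group are adjacent before they are counted.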

Similarly, find the minimum and maximum score in each subject:

$ datamash --sort groupby 2 min 3 max 3 < scores.txt
Arts             46      88
Business         79      94
Engineering      39      99
Health-Medicine  72     100
Life-Sciences    14      91
Social-Sciences  27      90

Find the mean and (population) standard deviation in each subject:

$ datamash --sort groupby 2 mean 3 pstdev 3 < scores.txt
Arts              68.947  10.143
Business          87.363   4.940
Engineering       66.538  19.101
Health-Medicine   90.615   8.862
Life-Sciences     55.333  19.728
Social-Sciences   60.266  16.643
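
The word "population" matters: the population standard deviation divides the sum of squared deviations by n, whereas the sample standard deviation (datamash's sstdev operation) divides by n - 1. The following Python sketch illustrates the difference using only the five Arts scores visible in the listing above, so its results will not match the full-file output.

```python
from statistics import mean, pstdev, stdev

# The five Arts scores visible in the scores.txt listing (file is truncated).
scores = [65, 58, 78, 63, 75]

m = mean(scores)
pop_sd = pstdev(scores)   # population form: divides by n
samp_sd = stdev(scores)   # sample form: divides by n - 1

print(f"mean   = {m:.3f}")
print(f"pstdev = {pop_sd:.3f}")
print(f"stdev  = {samp_sd:.3f}")
```

For the same data the sample value is always the larger of the two, and the gap shrinks as the group size grows.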

Find the median, the first and third quartiles, and the inter-quartile range in each subject:

$ datamash --sort groupby 2 median 3 q1 3 q3 3 iqr 3 < scores.txt
Arts              71      61.5      75.5     14
Business          87      83        92        9
Engineering       56      51        83       32
Health-Medicine   91      84       100       16
Life-Sciences     58.5    44.25     67.75    23.5
Social-Sciences   62      55        70.5     15.5
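
Fractional results such as 44.25 arise when a quartile falls between two sorted values. The Python sketch below assumes linear interpolation between neighbouring order statistics, which is consistent with the values shown above but is an assumption about the method, not a statement about datamash's internals. It uses only the five Arts scores visible in the listing, so its numbers differ from the full-file output.

```python
from statistics import median, quantiles

# The five Arts scores visible in the scores.txt listing (file is truncated).
scores = [65, 58, 78, 63, 75]

# method="inclusive" treats the data as containing its own min and max and
# interpolates linearly between sorted values to locate each cut point.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

# The inter-quartile range is the distance between the third and first quartiles.
iqr = q3 - q1

print(median(scores), q1, q3, iqr)
```

The relation iqr = q3 - q1 can be checked directly against the table above, for example 75.5 - 61.5 = 14 for Arts.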

See Header Lines and Column Names for examples of dealing with header lines.