The following are examples of using datamash to quickly calculate summary statistics. The examples will use a file with three fields (name, subject, score) representing grades of students:
$ cat scores.txt
Shawn	Arts	65
Marques	Arts	58
Fernando	Arts	78
Paul	Arts	63
Walter	Arts	75
...
Counting how many students study each subject (subject is the second field in the input file, thus groupby 2):
$ datamash --sort groupby 2 count 2 < scores.txt
Arts	19
Business	11
Engineering	13
Health-Medicine	13
Life-Sciences	12
Social-Sciences	15
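To make explicit what `groupby 2 count 2` computes, the same tally can be cross-checked with standard tools. The following is only a sketch: the five-row sample is hypothetical (the Business rows are invented for illustration), and the awk program is an independent re-computation, not how datamash works internally.

```shell
# Hypothetical tab-separated sample in the same name/subject/score layout.
sample=$(printf 'Shawn\tArts\t65\nMarques\tArts\t58\nAda\tBusiness\t90\nLin\tBusiness\t84\nFernando\tArts\t78\n')

# Tally rows per subject (field 2); sort makes the output order deterministic.
counts=$(printf '%s\n' "$sample" |
  awk -F'\t' '{ n[$2]++ } END { for (s in n) printf "%s\t%d\n", s, n[s] }' |
  sort)

printf '%s\n' "$counts"
```

On this sample the result is one line per subject: Arts with 3 rows and Business with 2.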
Similarly, find the minimum and maximum score in each subject:
$ datamash --sort groupby 2 min 3 max 3 < scores.txt
Arts	46	88
Business	79	94
Engineering	39	99
Health-Medicine	72	100
Life-Sciences	14	91
Social-Sciences	27	90
Find the mean and (population) standard deviation in each subject:
$ datamash --sort groupby 2 mean 3 pstdev 3 < scores.txt
Arts	68.947	10.143
Business	87.363	4.940
Engineering	66.538	19.101
Health-Medicine	90.615	8.862
Life-Sciences	55.333	19.728
Social-Sciences	60.266	16.643
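The population standard deviation divides the sum of squared deviations by n (not n-1). As a rough illustration of that formula, the sketch below recomputes the per-subject mean and population standard deviation in awk on a hypothetical five-row sample. It is an independent cross-check under those assumptions, not datamash's own implementation, and its three-decimal formatting is chosen here rather than taken from datamash.

```shell
# Hypothetical tab-separated sample in the same name/subject/score layout.
sample=$(printf 'Shawn\tArts\t65\nMarques\tArts\t58\nAda\tBusiness\t90\nLin\tBusiness\t84\nFernando\tArts\t78\n')

# Per subject: mean = sum/n, population variance = sumsq/n - mean^2.
stats=$(printf '%s\n' "$sample" |
  awk -F'\t' '
    { n[$2]++; sum[$2] += $3; sumsq[$2] += $3 * $3 }
    END {
      for (s in n) {
        m = sum[s] / n[s]
        v = sumsq[s] / n[s] - m * m
        if (v < 0) v = 0          # guard against tiny negative rounding error
        printf "%s\t%.3f\t%.3f\n", s, m, sqrt(v)
      }
    }' |
  sort)

printf '%s\n' "$stats"
```

For the Business rows (90 and 84) this gives a mean of 87.000 and a population standard deviation of 3.000, since both scores lie exactly 3 points from the mean.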
Find the median, the first and third quartiles, and the inter-quartile range in each subject:
$ datamash --sort groupby 2 median 3 q1 3 q3 3 iqr 3 < scores.txt
Arts	71	61.5	75.5	14
Business	87	83	92	9
Engineering	56	51	83	32
Health-Medicine	91	84	100	16
Life-Sciences	58.5	44.25	67.75	23.5
Social-Sciences	62	55	70.5	15.5
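Fractional results such as 44.25 suggest that quartiles are interpolated between neighboring scores. The sketch below assumes linear interpolation at position (n-1)*p in the sorted values, which is consistent with the figures above but is an assumption here, not a statement of datamash's algorithm. For brevity it handles a single group: the first four Arts scores from scores.txt.

```shell
# First four rows of scores.txt as shown above (tab-separated).
sample=$(printf 'Shawn\tArts\t65\nMarques\tArts\t58\nFernando\tArts\t78\nPaul\tArts\t63\n')

# Sort the scores numerically, then interpolate linearly between neighbors.
quartiles=$(printf '%s\n' "$sample" | cut -f3 | sort -n |
  awk '
    function q(p,   h, lo, f) {
      h = (NR - 1) * p + 1        # 1-based fractional position
      lo = int(h); f = h - lo
      if (lo >= NR) return x[NR]  # position falls on the last value
      return x[lo] + f * (x[lo + 1] - x[lo])
    }
    { x[NR] = $1 }                # values arrive in ascending order
    END { printf "%g\t%g\t%g\t%g\n", q(0.5), q(0.25), q(0.75), q(0.75) - q(0.25) }')

printf '%s\n' "$quartiles"
```

With the sorted scores 58, 63, 65, 78 this prints a median of 64, a first quartile of 61.75, a third quartile of 68.25 and an inter-quartile range of 6.5, in the same column order as the datamash command above.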
See Header Lines and Column Names for examples of dealing with header lines.