

2 Invoking datamash

The format for running the datamash program is:

datamash [option]… op1 column1  [op2 column2 …]

Here op1 is the operation to perform on the values in column1. datamash reads input from standard input and performs one or more operations on it. If --group is used, each operation is performed separately on every group; otherwise, each operation is performed on all the values in the input.

The LC_NUMERIC locale specifies the decimal-point character and the thousands separator.

datamash supports the following operations:

Primary operations:

groupby, crosstab, transpose, reverse, check

Line-Filtering operations:

rmdup

Per-Line operations:

base64, debase64, md5, sha1, sha224, sha256, sha384, sha512, bin, strbin, round, floor, ceil, trunc, frac, dirname, basename, extname, barename, getnum, cut, echo

Group-by Numeric operations:

sum, min, max, absmin, absmax, range

Group-by Textual/Numeric operations:

count, first, last, rand, unique, uniq, collapse, countunique

Group-by Statistical operations:

mean, geomean, harmmean, mode, median, q1, q3, iqr, perc, antimode, pstdev, sstdev, pvar, svar, ms, rms, mad, madraw, sskew, pskew, skurt, pkurt, jarque, dpo, scov, pcov, spearson, ppearson

Grouping options:

--skip-comments
-C

Skip comment lines (starting with ’#’ or ’;’ and optional whitespace).

--full
-f

Print entire input line before op results (default: print only the grouped keys). While using this option with non-linewise operations was historically permitted, it never produced very sensible output. Such usage has been deprecated, and in a future release it will result in an error.

--group=X[,Y,Z]
-g X[,Y,Z]

Group input via fields X[,Y,Z]. By default, fields are separated by TABs. Use --field-separator to change the delimiter character. The input must be sorted by the same fields X[,Y,Z]. Use --sort to sort the input automatically. If --group is not specified, each operation is performed on the entire input.

--header-in

Indicates the first input line is column headers, and should not be used for any calculations.

--header-out

Print column headers as the first line. If the column header names are known (i.e. the input file had a header line, and the command was invoked with --header-in, -H or --headers), prints the operation and the name of the field (e.g. ‘mean(X)’). Otherwise, prints the operation and the field number (e.g. ‘mean(field-3)’).

--headers
-H

Same as ‘--header-in --header-out’. A short option indicating the input file has a header line, and the output should contain a header line as well.

--ignore-case
-i

Ignore upper/lower case when comparing text for grouping, sorting, and comparing unique values in the ‘countunique’ and ‘unique’ (or ‘uniq’) operations.

--sort
-s

Sort the input before grouping. datamash requires sorted input; if the input is not sorted, --sort sorts it automatically before further processing. Sorting is based on the fields given to --group and respects the --ignore-case option (if used). The following commands are equivalent:

$ cat FILE | sort -k1,1 | datamash --group 1 sum 1
$ cat FILE | datamash --sort --group 1 sum 1

--sort-cmd=PATH

Use the given program to sort instead of the system sort.

File Operation options:

--no-strict

Allow lines with varying number of fields. By default, transpose and reverse will fail with an error message unless all input lines have the same number of fields.

--filler=x

When the --no-strict option is used, missing fields are filled with this value.

General options:

--format=FORMAT

Print numeric values with a printf-style floating-point FORMAT.

--field-separator=x
-t x

Use character X instead of TAB as input and output field delimiter. If --output-delimiter is also used, it will override the output field delimiter.

--narm

Skip NA or NaN values.

--output-delimiter=x

Use character X as the output field delimiter. This option overrides the output delimiter implied by --field-separator/-t and --whitespace/-W.

--collapse-delimiter=x
-c x

Use character X instead of comma to delimit items in a ‘collapse’ or ‘unique’ (aka ‘uniq’) list.

--round=N
-R N

Round numeric output to N decimal places.

--whitespace
-W

Use whitespace (one or more spaces and/or tabs) for field delimiters. Leading whitespace is ignored, trailing whitespace results in an empty field. TAB character will be used as output field separator. If --output-delimiter is also used, it will override the output field delimiter.

--zero-terminated
-z

End lines with a 0 byte, not newline.

--help

Print an informative help message on standard output and exit successfully.

--version

Print the version number and licensing information of Datamash on standard output and then exit successfully.

