Invoking datamash (GNU Datamash 1.9)

2 Invoking `datamash`

The format for running the datamash program is:

datamash [option]... op1 column1  [op2 column2 ...]

Here op1 is the operation to perform on the values in column1. datamash reads input from stdin and performs one or more operations on it. If --group is used, each operation is performed on every group; otherwise, each operation is performed on all the values in the input.

The LC_NUMERIC locale specifies the decimal-point character and the thousands separator.

datamash supports the following operations:

Primary operations: groupby, crosstab, transpose, reverse, check
Line-Filtering operations: rmdup
Per-Line operations: base64, debase64, md5, sha1, sha224, sha256, sha384, sha512, bin, strbin, round, floor, ceil, trunc, frac, dirname, basename, extname, barename, getnum, cut, echo
Group-by Numeric operations: sum, min, max, absmin, absmax, range
Group-by Textual/Numeric operations: count, first, last, rand, unique, uniq, collapse, countunique
Group-by Statistical operations: mean, geomean, harmmean, mode, median, q1, q3, iqr, perc, antimode, pstdev, sstdev, pvar, svar, ms, rms, mad, madraw, sskew, pskew, skurt, pkurt, jarque, dpo, scov, pcov, spearson, ppearson, dotprod

Grouping options:

--skip-comments

-C

Skip comment lines (starting with ‘#’ or ‘;’ and optional whitespace).

--full

-f

Print entire input line before op results (default: print only the grouped keys). While using this option with non-linewise operations was historically permitted, it never produced very sensible output. Such usage has been deprecated, and in a future release it will result in an error.

--group=X[,Y,Z]

-g X[,Y,Z]

Group input via fields X[,Y,Z]. By default, fields are separated by TABs. Use --field-separator to change the delimiter character. The input must be sorted by the same fields X[,Y,Z]; use --sort to sort it automatically. If --group is not specified, each operation is performed on the entire input. Ranges of field numbers such as X-Z are also supported.

--header-in

Indicates that the first input line contains column headers and should not be used in any calculations.

--header-out

Print column headers as the first line. If the column header names are known (i.e. the input file had a header line, and the command was invoked with --header-in, -H or --headers), prints the operation and the name of the field (e.g. ‘mean(X)’). Otherwise, prints the operation and the field number (e.g. ‘mean(field-3)’).

--headers

-H

Same as ‘--header-in --header-out’. A short option indicating the input file has a header line, and the output should contain a header line as well.

--vnlog

Enable experimental support for the vnlog data file format for both input and output. This format is explained at https://github.com/dkogan/vnlog.

--ignore-case

-i

Ignore upper/lower case when comparing text for grouping, sorting, and comparing unique values in the ‘countunique’ and ‘unique’ (or ‘uniq’) operations.

--sort

-s

Sort the input before grouping. datamash requires sorted input. If the input is not sorted, using --sort will automatically sort it before further processing. Sorting is performed based on the specified --group parameter, respecting the --ignore-case option (if used). The following commands are equivalent:

$ cat FILE | sort -k1,1 | datamash --group 1 sum 1
$ cat FILE | datamash --sort --group 1 sum 1

--sort-cmd=PATH

Use the given program to sort instead of the system sort.

File Operation options:

--no-strict: Allow lines with a varying number of fields. By default, transpose and reverse will fail with an error message unless all input lines have the same number of fields.
--filler=x: When the --no-strict option is used, missing fields are filled with this value.

General options:

--format=FORMAT: Print numeric values with printf-style floating-point FORMAT.
--field-separator=x
-t x: Use character x instead of TAB as the input and output field delimiter. If --output-delimiter is also used, it overrides the output field delimiter.
--narm: Skip NA or NaN values.
--output-delimiter=x: Use character x as the output field delimiter. This option overrides --field-separator/-t and --whitespace/-W.
--collapse-delimiter=x
-c x: Use character x instead of comma to delimit items in a ‘collapse’ or ‘unique’ (aka ‘uniq’) list.
--round=N
-R N: Round numeric output to N decimal places.
--whitespace
-W: Use whitespace (one or more spaces and/or tabs) as field delimiters. Leading whitespace is ignored; trailing whitespace results in an empty field. The TAB character is used as the output field separator. If --output-delimiter is also used, it overrides the output field delimiter.
--seed
-S: Select a specific random seed. By default, GNU Datamash uses getrandom(2), which should be suitable for most purposes. You may wish to force a specific seed either to draw on a specific entropy source or to ensure the reproducibility of a specific test.
--zero-terminated
-z: End lines with a 0 byte, not newline.
--help
-h: Print an informative help message on standard output and exit successfully.
--version
-V: Print the version number and licensing information of Datamash on standard output and then exit successfully.
