combine Manual 0.4.0: 5.2 Aggregation

5.2 Aggregation

One feature of combine is aggregation. In other words, based on a set of keys in a data file, create a new file containing a summary record for each unique combination of values that appears in those keys.

This is a special case of reference-file based output in that the reference file and the data file are the same file. We are using it both to build the list of keys and to provide the values to summarize.

Here is a simple example. Suppose my 4-year-old were flipping Sacajawea(3) dollar coins and recording the results, along with the time and the date. A portion of the resulting file might look like this:

Monday	27 Oct 2003	14:02:01	Head	1.00
Monday	27 Oct 2003	14:03:05	Head	1.00
Monday	27 Oct 2003	14:03:55	Tail	1.00
Monday	27 Oct 2003	14:04:30	Head	1.00
Monday	27 Oct 2003	14:06:12	Tail	1.00
Monday	27 Oct 2003	14:08:43	Head	1.00
Monday	27 Oct 2003	14:54:52	Head	1.00
Monday	27 Oct 2003	19:59:59	Tail	1.00
Tuesday	28 Oct 2003	08:02:01	Tail	5.00
Tuesday	28 Oct 2003	08:02:16	Tail	5.00
Tuesday	28 Oct 2003	08:02:31	Head	5.00
Tuesday	28 Oct 2003	08:02:46	Tail	5.00
Wednesday	29 Oct 2003	12:02:09	Head	10.00

Then if I wanted to measure her daily performance, I could count the number of coin flips per day, with the following command.

The option ‘--data-is-reference’ tells the program to read the file only once, bulding the list of keys as it goes. This is useful for aggregations, and it requires that the key fields be the same and that we only keep one copy of each key (with the ‘--unique’ option).

The records come out in an unspecified order. If you require a specific order, you can pipe the result through the sort command.

% combine --input-field-delimiter="	" --data-is-reference \
          --count --reference-file=testcoin.txt \
          --output-field-delimiter="," -D "	" \
          --key-fields=2 --data-key-fields=2 --write-output \
          --output-fields=2 --unique testcoin.txt
28 Oct 2003,4
29 Oct 2003,1
27 Oct 2003,8
Statistics for reference file testcoin.txt
  Number of records read:                           13
  Number of records dropped by filter:               0
  Number of records stored:                          3
  Number of records matched on key:                  3
  Number of records matched fully:                   3
  Number of reference-data matches:                 13

An equivalent SQL statement, assuming that the table has been created would be.

SELECT Date, COUNT (Result)
  FROM Coinflip
  GROUP BY Date;

If she wanted to count the number of heads and tails, and also the amount she bet on each, she could do the following. (I have shortened the options you have already seen to their 1-byte equivalents.)

The specification of the field to sum says to keep 2 decimal places. combine works with integers if you don’t say otherwise. The option ‘--counter-size’ tells the program to use at least that many bytes to present counters and sums.

% combine -D "	" -M -n --sum-fields=5.2 --counter-size=8 \
          -r testcoin.txt -d "," -D "	" -k 4 -m 4 -w -o 4 \
          -u testcoin.txt
Head,       7,   20.00
Tail,       6,   18.00
Statistics for reference file testcoin.txt
  Number of records read:                           13
  Number of records dropped by filter:               0
  Number of records stored:                          2
  Number of records matched on key:                  2
  Number of records matched fully:                   2
  Number of reference-data matches:                 13

An equivalent SQL statement, assuming again that the table has been created would be.

SELECT Result, COUNT (Result), Sum (Wager)
  FROM Coinflip
  GROUP BY Result;

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

This document was generated by Daniel P. Valentine on July 28, 2013 using texi2html 1.82.