[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
One feature of combine
is aggregation. In other words, based on
a set of keys in a data file, create a new file containing a summary record
for each unique combination of values that appears in those keys.
This is a special case of reference-file based output in that the reference file and the data file are the same file. We are using it both to build the list of keys and to provide the values to summarize.
Here is a simple example. Suppose my 4-year-old were flipping Sacajawea(3) dollar coins and recording the results, along with the time and the date. A portion of the resulting file might look like this:
Monday 27 Oct 2003 14:02:01 Head 1.00 Monday 27 Oct 2003 14:03:05 Head 1.00 Monday 27 Oct 2003 14:03:55 Tail 1.00 Monday 27 Oct 2003 14:04:30 Head 1.00 Monday 27 Oct 2003 14:06:12 Tail 1.00 Monday 27 Oct 2003 14:08:43 Head 1.00 Monday 27 Oct 2003 14:54:52 Head 1.00 Monday 27 Oct 2003 19:59:59 Tail 1.00 Tuesday 28 Oct 2003 08:02:01 Tail 5.00 Tuesday 28 Oct 2003 08:02:16 Tail 5.00 Tuesday 28 Oct 2003 08:02:31 Head 5.00 Tuesday 28 Oct 2003 08:02:46 Tail 5.00 Wednesday 29 Oct 2003 12:02:09 Head 10.00 |
Then if I wanted to measure her daily performance, I could count the number of coin flips per day, with the following command.
The option ‘--data-is-reference’ tells the program to read the file only once, bulding the list of keys as it goes. This is useful for aggregations, and it requires that the key fields be the same and that we only keep one copy of each key (with the ‘--unique’ option).
The records come out in an unspecified order. If you require a specific
order, you can pipe the result through the sort
command.
% combine --input-field-delimiter=" " --data-is-reference \ --count --reference-file=testcoin.txt \ --output-field-delimiter="," -D " " \ --key-fields=2 --data-key-fields=2 --write-output \ --output-fields=2 --unique testcoin.txt 28 Oct 2003,4 29 Oct 2003,1 27 Oct 2003,8 Statistics for reference file testcoin.txt Number of records read: 13 Number of records dropped by filter: 0 Number of records stored: 3 Number of records matched on key: 3 Number of records matched fully: 3 Number of reference-data matches: 13 |
An equivalent SQL statement, assuming that the table has been created would be.
SELECT Date, COUNT (Result) FROM Coinflip GROUP BY Date; |
If she wanted to count the number of heads and tails, and also the amount she bet on each, she could do the following. (I have shortened the options you have already seen to their 1-byte equivalents.)
The specification of the field to sum says to keep 2 decimal places.
combine
works with integers if you don’t say otherwise. The
option ‘--counter-size’ tells the program to use at least that
many bytes to present counters and sums.
% combine -D " " -M -n --sum-fields=5.2 --counter-size=8 \ -r testcoin.txt -d "," -D " " -k 4 -m 4 -w -o 4 \ -u testcoin.txt Head, 7, 20.00 Tail, 6, 18.00 Statistics for reference file testcoin.txt Number of records read: 13 Number of records dropped by filter: 0 Number of records stored: 2 Number of records matched on key: 2 Number of records matched fully: 2 Number of reference-data matches: 13 |
An equivalent SQL statement, assuming again that the table has been created would be.
SELECT Result, COUNT (Result), Sum (Wager) FROM Coinflip GROUP BY Result; |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] |
This document was generated by Daniel P. Valentine on July 28, 2013 using texi2html 1.82.