[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
There are lots of ways to apply combine
to problems that will pop up
in the course of day-to-day life. (Of course this fits into your day-to-day
life a bit more if you have a job or hobby that regularly puts you into
contact with data files.)
Here are a few topics explored in more detail.
5.1 Rearranging Fields | ||
5.2 Aggregation | ||
5.3 Finding Changes | ||
5.4 Hierarchies |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
When you do not use any reference files, combine
still gives you the
opportunity to create a new record layout based on the records you read.
This is an advantage over the cut
utility because while cut
only allows you to omit portions of the record, combine
also allows
you to reorder those fields you keep and to add a constant field somewhere in the
order. In addition, combine
gives you the chance to convert
between fixed-width and delimited formats, where cut
keeps the
format you started with (although the GNU version does let you change
delimiters).
Clearly, flexible tools like awk
or sed
or any programming
language will also make this kind of thing (and with a little work anything
combine
can do) possible. It may be that they are a more efficient choice,
but I have never tested it.
As an example, here is a fixed width file, which contains in its record layout some address book information. If I need to make a tab-delimited file of names and phone numbers to upload into my mobile phone, I can use the command that follows to get the output I want.
$ cat testadd.txt 2125551212Doe John 123 Main StreetNew York NY10001 2025551212Doe Mary 123 Main StreetWashingtonDC20001 3015551212Doe Larry 123 Main StreetLaurel MD20707 6175551212Doe Darryl 123 Main StreetBoston MA02115 6035551212Doe Darryl 123 Main StreetManchesterNH02020 |
Here is a command that grabs the first and last name and the phone number and tacks the word "Home" on the end so that my phone marks the number with a little house.(2)
Note that the statistics and the output all show up on the screen if you do
not say otherwise. The statistics are on stderr
and the output on
stdout
, so you can redirect them differently. You can also use the
option ‘--output-file’ (or ‘-t’) to provide an output file,
and you can suppress the statistics if you want with ‘--no-statistics’.
% combine --write-output --output-field-delimiter=" " \ --output-fields=21-30,11-20,1-10 \ --output-constant="Home" testadd.txt Statistics for data file testadd.txt Number of records read: 5 Number of records dropped by filter: 0 Number of records matched on key: 5 Number of records written: 5 John Doe 2125551212 Home Mary Doe 2025551212 Home Larry Doe 3015551212 Home Darryl Doe 6175551212 Home Darryl Doe 6035551212 Home |
The delimiter between the quotes and in the output was a tab character, and it worked, but in some formats it comes out in a space when printed.
For reference, here is a comparable SQL query that would select the same data assuming a table were set up containing the data in the file above.
SELECT First_Name, Last_Name, Phone_Number, 'Home' FROM Address_Book; |
A comparable gawk
program would be something like this.
BEGIN {OFS = "\t"} { print substr ($0, 21, 10), substr ($0, 11, 10), substr ($0, 1, 10), "Home"; } |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
One feature of combine
is aggregation. In other words, based on
a set of keys in a data file, create a new file containing a summary record
for each unique combination of values that appears in those keys.
This is a special case of reference-file based output in that the reference file and the data file are the same file. We are using it both to build the list of keys and to provide the values to summarize.
Here is a simple example. Suppose my 4-year-old were flipping Sacajawea(3) dollar coins and recording the results, along with the time and the date. A portion of the resulting file might look like this:
Monday 27 Oct 2003 14:02:01 Head 1.00 Monday 27 Oct 2003 14:03:05 Head 1.00 Monday 27 Oct 2003 14:03:55 Tail 1.00 Monday 27 Oct 2003 14:04:30 Head 1.00 Monday 27 Oct 2003 14:06:12 Tail 1.00 Monday 27 Oct 2003 14:08:43 Head 1.00 Monday 27 Oct 2003 14:54:52 Head 1.00 Monday 27 Oct 2003 19:59:59 Tail 1.00 Tuesday 28 Oct 2003 08:02:01 Tail 5.00 Tuesday 28 Oct 2003 08:02:16 Tail 5.00 Tuesday 28 Oct 2003 08:02:31 Head 5.00 Tuesday 28 Oct 2003 08:02:46 Tail 5.00 Wednesday 29 Oct 2003 12:02:09 Head 10.00 |
Then if I wanted to measure her daily performance, I could count the number of coin flips per day, with the following command.
The option ‘--data-is-reference’ tells the program to read the file only once, bulding the list of keys as it goes. This is useful for aggregations, and it requires that the key fields be the same and that we only keep one copy of each key (with the ‘--unique’ option).
The records come out in an unspecified order. If you require a specific
order, you can pipe the result through the sort
command.
% combine --input-field-delimiter=" " --data-is-reference \ --count --reference-file=testcoin.txt \ --output-field-delimiter="," -D " " \ --key-fields=2 --data-key-fields=2 --write-output \ --output-fields=2 --unique testcoin.txt 28 Oct 2003,4 29 Oct 2003,1 27 Oct 2003,8 Statistics for reference file testcoin.txt Number of records read: 13 Number of records dropped by filter: 0 Number of records stored: 3 Number of records matched on key: 3 Number of records matched fully: 3 Number of reference-data matches: 13 |
An equivalent SQL statement, assuming that the table has been created would be.
SELECT Date, COUNT (Result) FROM Coinflip GROUP BY Date; |
If she wanted to count the number of heads and tails, and also the amount she bet on each, she could do the following. (I have shortened the options you have already seen to their 1-byte equivalents.)
The specification of the field to sum says to keep 2 decimal places.
combine
works with integers if you don’t say otherwise. The
option ‘--counter-size’ tells the program to use at least that
many bytes to present counters and sums.
% combine -D " " -M -n --sum-fields=5.2 --counter-size=8 \ -r testcoin.txt -d "," -D " " -k 4 -m 4 -w -o 4 \ -u testcoin.txt Head, 7, 20.00 Tail, 6, 18.00 Statistics for reference file testcoin.txt Number of records read: 13 Number of records dropped by filter: 0 Number of records stored: 2 Number of records matched on key: 2 Number of records matched fully: 2 Number of reference-data matches: 13 |
An equivalent SQL statement, assuming again that the table has been created would be.
SELECT Result, COUNT (Result), Sum (Wager) FROM Coinflip GROUP BY Result; |
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
If you get the same file again and again, and you want to see what’s
different, the canonical method is to use the diff
command.
It will list for you those records that are unique to each file, and
those that appear to have changed. One drawback is that it requires
the records in the two files to be in the same order. Another is that
it does not pay attention to the key fields that may be important to
you.
When I want to find the changes in a file, including additions,
deletions, and any changes for the information about a key I normally
use 4 applications of combine
. One command is suffucuent to
find the new keys. One is enough to find those that are no longer there.
A third finds everything that is different, and the fourth finds the
before and after image of keys whose information has changed.
As an example, suppose my bridge club’s roster has a name and a phone number, separated by commas, for each member. The following four commands will identify the new members, the departed members, and those whose phone number has changed.
%# Reference file is optional, so unmatched data records will have %# 2 blank fields at the end. % combine -D , -d , -w -o 1-2 -r old_roster.txt -D , -k 1 -m 1 -p \ -o 1-2 inew_roster.txt | grep -v ' , $' > new_member.txt %# Data file is optional, so unmatched reference records will have %# 2 blank fields at the beginning. % combine -D , -d , -p -w -o 1-2 -r old_roster.txt -D , -k 1 -m 1 \ -o 1-2 inew_roster.txt | grep -v '^ , ' > departed_member.txt %# Both files are optional, so any unmatched records will have %# blank fields at the beginning or end. Here we match on the %#entire record rather than the key as in the above two examples. % combine -D , -d , -p -w -o 1-2 -r old_roster.txt -D , \ -k 1-2 -m 1-2 -o 1-2 -p inew_roster.txt | grep -v ' , ' \ > all_changes.txt %# Match up the before and after versions of an entry to see changes. % combine -D , -d , -w -o 1-2 -r all_changes.txt -D , -k 3 -m 1 -o 2 \ all_changes.txt | grep -v ' , ' > changed_member.txt |
TO BE CONTINUED
[ < ] | [ > ] | [ << ] | [ Up ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
A hierarchy tends to be an organization where there is a directional
one-to-many relationship between parent records and their associated
children. combine
works with hierarchies within reference files
when the file has a record for each node and each record points to its
parent in the hierarchy.
Because the hierarchy is assumed to be stored in a reference file, it is accessed by matching to a data record. Once an individual reference record has been matched to the data record, its relationship to other records within the hierarchy is followed through the hierarchy until there is no further to go.
The standard process is to assume that the key that matched to the
data file key is at the top of the hierarchy. When traversing the
hierarchy, combine
looks for the key on the current record in
the hierarchy key of other reference records. This repeats until
there are no further linkages from one record to the next. For each
record that is linked to the hierarchy, that record is treated as a
reference record that matched the data record.
In this section, we’ll use the following hierarchy file. It is a simple hierarchy tree with ‘Grandfather’ as the top node and 2 levels of entries below.
Grandfather, Father,Grandfather Uncle,Grandfather Me,Father Brother,Father Cousin,Uncle |
If my data file consisted only of a record with the key ‘Grandfather’, then the following command would result in the records listed after it. Each record written includes the entry itself and its parent.
combine -D ',' -w -d ',' -r test1.tmp -k 1 -m 1 -a 2 -D ',' \ -o 1-2 test2.tmp Grandfather, Father,Grandfather Me,Father Brother,Father Uncle,Grandfather Cousin,Uncle |
If we are only interested in the endpoints (in this case all the lowest-level descendants of ‘Grandfather’), we can use the option ‘-l’.
combine -D ',' -w -d ',' -r test1.tmp -k 1 -m 1 -a 2 -D ',' -o 1 \ -l test2.tmp Me Brother Cousin |
We can arrive at the same number of records, each containing the entire hierarchy
traversed to get to the leaf nodes, by using the option ‘--flatten-hierarchy’
(‘-F’). This option takes a number as an argument, and then includes
information from that many records found in traversing the hierarchy, starting from
the record that matched the data record. This example tells combine
to report
three levels from the matching ‘Grandfather’ record.
combine -D ',' -w -d ',' -r test1.tmp -k 1 -m 1 -a 2 -D ',' -o 1 \ -F 3 test2.tmp Grandfather,Father,Me Grandfather,Father,Brother Grandfather,Uncle,Cousin |
As with other areas within combine
, the hierarchy manipulation is extensible
through Guile. The key fields can be modified as with any other fields. See section Field-specific extensions, for details. The matches within the hierarchy
can be further filtered, using the ‘h’ suboption of the option ‘-x’.
(see section Extending combine
.) As with matches between reference records and data
this filtering can allow you to perform fuzzy comparisons, to do more complex
calculations to filter the match, or to decide when you have gone far enough and would
like to stop traversing the hierarchy.
[ << ] | [ >> ] | [Top] | [Contents] | [Index] | [ ? ] |
This document was generated by Daniel P. Valentine on July 28, 2013 using texi2html 1.82.