5. Applying combine

There are lots of ways to apply combine to problems that will pop up in the course of day-to-day life. (Of course this fits into your day-to-day life a bit more if you have a job or hobby that regularly puts you into contact with data files.)

Here are a few topics explored in more detail.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

5.1 Rearranging Fields

When you do not use any reference files, combine still gives you the opportunity to create a new record layout based on the records you read.

This is an advantage over the cut utility because while cut only allows you to omit portions of the record, combine also allows you to reorder those fields you keep and to add a constant field somewhere in the order. In addition, combine gives you the chance to convert between fixed-width and delimited formats, where cut keeps the format you started with (although the GNU version does let you change delimiters).

Clearly, flexible tools like awk or sed or any programming language will also make this kind of thing (and with a little work anything combine can do) possible. It may be that they are a more efficient choice, but I have never tested it.

As an example, here is a fixed width file, which contains in its record layout some address book information. If I need to make a tab-delimited file of names and phone numbers to upload into my mobile phone, I can use the command that follows to get the output I want.

$ cat testadd.txt
2125551212Doe       John      123 Main StreetNew York  NY10001
2025551212Doe       Mary      123 Main StreetWashingtonDC20001
3015551212Doe       Larry     123 Main StreetLaurel    MD20707
6175551212Doe       Darryl    123 Main StreetBoston    MA02115
6035551212Doe       Darryl    123 Main StreetManchesterNH02020

Here is a command that grabs the first and last name and the phone number and tacks the word "Home" on the end so that my phone marks the number with a little house.(2)

Note that the statistics and the output all show up on the screen if you do not say otherwise. The statistics are on stderr and the output on stdout, so you can redirect them differently. You can also use the option ‘--output-file’ (or ‘-t’) to provide an output file, and you can suppress the statistics if you want with ‘--no-statistics’.

% combine --write-output --output-field-delimiter="	" \
          --output-fields=21-30,11-20,1-10 \
          --output-constant="Home" testadd.txt
Statistics for data file testadd.txt
  Number of records read:                            5
  Number of records dropped by filter:               0
  Number of records matched on key:                  5
  Number of records written:                         5
John	Doe	2125551212	Home
Mary	Doe	2025551212	Home
Larry	Doe	3015551212	Home
Darryl	Doe	6175551212	Home
Darryl	Doe	6035551212	Home

The delimiter between the quotes and in the output was a tab character, and it worked, but in some formats it comes out in a space when printed.

For reference, here is a comparable SQL query that would select the same data assuming a table were set up containing the data in the file above.

SELECT First_Name, Last_Name, Phone_Number, 'Home'
  FROM Address_Book;

A comparable gawk program would be something like this.

BEGIN {OFS = "\t"}
{ 
  print substr ($0, 21, 10), substr ($0, 11, 10), substr ($0, 1, 10), "Home";
  }

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

5.2 Aggregation

One feature of combine is aggregation. In other words, based on a set of keys in a data file, create a new file containing a summary record for each unique combination of values that appears in those keys.

This is a special case of reference-file based output in that the reference file and the data file are the same file. We are using it both to build the list of keys and to provide the values to summarize.

Here is a simple example. Suppose my 4-year-old were flipping Sacajawea(3) dollar coins and recording the results, along with the time and the date. A portion of the resulting file might look like this:

Monday	27 Oct 2003	14:02:01	Head	1.00
Monday	27 Oct 2003	14:03:05	Head	1.00
Monday	27 Oct 2003	14:03:55	Tail	1.00
Monday	27 Oct 2003	14:04:30	Head	1.00
Monday	27 Oct 2003	14:06:12	Tail	1.00
Monday	27 Oct 2003	14:08:43	Head	1.00
Monday	27 Oct 2003	14:54:52	Head	1.00
Monday	27 Oct 2003	19:59:59	Tail	1.00
Tuesday	28 Oct 2003	08:02:01	Tail	5.00
Tuesday	28 Oct 2003	08:02:16	Tail	5.00
Tuesday	28 Oct 2003	08:02:31	Head	5.00
Tuesday	28 Oct 2003	08:02:46	Tail	5.00
Wednesday	29 Oct 2003	12:02:09	Head	10.00

Then if I wanted to measure her daily performance, I could count the number of coin flips per day, with the following command.

The option ‘--data-is-reference’ tells the program to read the file only once, bulding the list of keys as it goes. This is useful for aggregations, and it requires that the key fields be the same and that we only keep one copy of each key (with the ‘--unique’ option).

The records come out in an unspecified order. If you require a specific order, you can pipe the result through the sort command.

% combine --input-field-delimiter="	" --data-is-reference \
          --count --reference-file=testcoin.txt \
          --output-field-delimiter="," -D "	" \
          --key-fields=2 --data-key-fields=2 --write-output \
          --output-fields=2 --unique testcoin.txt
28 Oct 2003,4
29 Oct 2003,1
27 Oct 2003,8
Statistics for reference file testcoin.txt
  Number of records read:                           13
  Number of records dropped by filter:               0
  Number of records stored:                          3
  Number of records matched on key:                  3
  Number of records matched fully:                   3
  Number of reference-data matches:                 13

An equivalent SQL statement, assuming that the table has been created would be.

SELECT Date, COUNT (Result)
  FROM Coinflip
  GROUP BY Date;

If she wanted to count the number of heads and tails, and also the amount she bet on each, she could do the following. (I have shortened the options you have already seen to their 1-byte equivalents.)

The specification of the field to sum says to keep 2 decimal places. combine works with integers if you don’t say otherwise. The option ‘--counter-size’ tells the program to use at least that many bytes to present counters and sums.

% combine -D "	" -M -n --sum-fields=5.2 --counter-size=8 \
          -r testcoin.txt -d "," -D "	" -k 4 -m 4 -w -o 4 \
          -u testcoin.txt
Head,       7,   20.00
Tail,       6,   18.00
Statistics for reference file testcoin.txt
  Number of records read:                           13
  Number of records dropped by filter:               0
  Number of records stored:                          2
  Number of records matched on key:                  2
  Number of records matched fully:                   2
  Number of reference-data matches:                 13

An equivalent SQL statement, assuming again that the table has been created would be.

SELECT Result, COUNT (Result), Sum (Wager)
  FROM Coinflip
  GROUP BY Result;

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

5.3 Finding Changes

If you get the same file again and again, and you want to see what’s different, the canonical method is to use the diff command. It will list for you those records that are unique to each file, and those that appear to have changed. One drawback is that it requires the records in the two files to be in the same order. Another is that it does not pay attention to the key fields that may be important to you.

When I want to find the changes in a file, including additions, deletions, and any changes for the information about a key I normally use 4 applications of combine. One command is suffucuent to find the new keys. One is enough to find those that are no longer there. A third finds everything that is different, and the fourth finds the before and after image of keys whose information has changed.

As an example, suppose my bridge club’s roster has a name and a phone number, separated by commas, for each member. The following four commands will identify the new members, the departed members, and those whose phone number has changed.

%# Reference file is optional, so unmatched data records will have
%# 2 blank fields at the end.
% combine -D , -d , -w -o 1-2 -r old_roster.txt -D , -k 1 -m 1 -p \
          -o 1-2 inew_roster.txt | grep -v ' , $' > new_member.txt
%# Data file is optional, so unmatched reference records will have
%# 2 blank fields at the beginning.
% combine -D , -d , -p -w -o 1-2 -r old_roster.txt -D , -k 1 -m 1 \
          -o 1-2 inew_roster.txt | grep -v '^ , ' > departed_member.txt
%# Both files are optional, so any unmatched records will have
%# blank fields at the beginning or end.  Here we match on the 
%#entire record rather than the key as in the above two examples.
% combine -D , -d , -p -w -o 1-2 -r old_roster.txt -D , \
          -k 1-2 -m 1-2 -o 1-2 -p inew_roster.txt | grep -v ' , ' \
          > all_changes.txt
%# Match up the before and after versions of an entry to see changes.
% combine -D , -d , -w -o 1-2 -r all_changes.txt -D , -k 3 -m 1 -o 2 \
          all_changes.txt | grep -v ' , ' > changed_member.txt

TO BE CONTINUED

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

5.4 Hierarchies

A hierarchy tends to be an organization where there is a directional one-to-many relationship between parent records and their associated children. combine works with hierarchies within reference files when the file has a record for each node and each record points to its parent in the hierarchy.

Because the hierarchy is assumed to be stored in a reference file, it is accessed by matching to a data record. Once an individual reference record has been matched to the data record, its relationship to other records within the hierarchy is followed through the hierarchy until there is no further to go.

The standard process is to assume that the key that matched to the data file key is at the top of the hierarchy. When traversing the hierarchy, combine looks for the key on the current record in the hierarchy key of other reference records. This repeats until there are no further linkages from one record to the next. For each record that is linked to the hierarchy, that record is treated as a reference record that matched the data record.

In this section, we’ll use the following hierarchy file. It is a simple hierarchy tree with ‘Grandfather’ as the top node and 2 levels of entries below.

Grandfather,
Father,Grandfather
Uncle,Grandfather
Me,Father
Brother,Father
Cousin,Uncle

If my data file consisted only of a record with the key ‘Grandfather’, then the following command would result in the records listed after it. Each record written includes the entry itself and its parent.

combine -D ',' -w -d ',' -r test1.tmp -k 1 -m 1 -a 2 -D ',' \
        -o 1-2 test2.tmp

Grandfather,   
Father,Grandfather
Me,Father
Brother,Father
Uncle,Grandfather
Cousin,Uncle

If we are only interested in the endpoints (in this case all the lowest-level descendants of ‘Grandfather’), we can use the option ‘-l’.

combine -D ',' -w -d ',' -r test1.tmp -k 1 -m 1 -a 2 -D ',' -o 1 \
        -l test2.tmp

Me
Brother
Cousin

We can arrive at the same number of records, each containing the entire hierarchy traversed to get to the leaf nodes, by using the option ‘--flatten-hierarchy’ (‘-F’). This option takes a number as an argument, and then includes information from that many records found in traversing the hierarchy, starting from the record that matched the data record. This example tells combine to report three levels from the matching ‘Grandfather’ record.

combine -D ',' -w -d ',' -r test1.tmp -k 1 -m 1 -a 2 -D ',' -o 1 \
        -F 3 test2.tmp

Grandfather,Father,Me
Grandfather,Father,Brother
Grandfather,Uncle,Cousin

As with other areas within combine, the hierarchy manipulation is extensible through Guile. The key fields can be modified as with any other fields. See section Field-specific extensions, for details. The matches within the hierarchy can be further filtered, using the ‘h’ suboption of the option ‘-x’. (see section Extending combine.) As with matches between reference records and data this filtering can allow you to perform fuzzy comparisons, to do more complex calculations to filter the match, or to decide when you have gone far enough and would like to stop traversing the hierarchy.

[ << ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

This document was generated by Daniel P. Valentine on July 28, 2013 using texi2html 1.82.