2. Invoking `combine`

combine is invoked with the command

combine [general options | data file options]… [reference file options]… [data files]

The various options are described in the following sections.

The arguments to the command are the data files: an optional list of file names to be processed. If no file names are given, the input comes from stdin.

2.1 General Options

The general options are not specific to any particular file and apply to the entire current run of the program. They are listed below.

‘-R’
‘--check-all-reference’
‘--no-check-all-reference’: For each data record, check for a match with each of the reference files. This option is set by default. When turned off with ‘--no-check-all-reference’, processing of that data file record will be stopped as soon as a required match is not fulfilled. This is a shortcut for speed, but it will change the meaning of any flags or counters set in reference-file-based output.
‘-f’
‘--flag’: In any output based on a reference file, print a flag with the value ‘1’ if the record was matched by a data record or ‘0’ otherwise.
‘-n’
‘--count’: In any output based on a reference file, print a new field which is a count of the number of data records that matched the current reference record.
‘-z number’
‘--counter-size=number’: If counters or sums of data fields are being written into reference-based output, make them number bytes long.
‘--statistics’
‘--no-statistics’: At the end of processing, write to stderr counts of the number of records read and matched from each file processed. The option is enabled by default. When turned off with ‘--no-statistics’, no statistics are printed.
‘--verbose’: Write all the information that is possible about the processing to stderr. This includes a summary of the options that have been set and running counters of the number of records processed.

2.2 File Options

There are some attributes that apply to any file. They include such things as the way to tell one record from another, the way to identify fields within a record, and the fields in a file that you want to write out.

‘-d string’

‘--output-field-delimiter=string’

Print string between each pair of fields written to output files. This can be a comma, a tab character, your mother’s maiden name, or the full text of the GNU General Public License. It is your choice. Of course, it is most useful for it to be something that would not normally occur in your data fields and something that your target system will understand. Because combine doesn’t care about such things, you can even use the null character (\0) or the newline character (\n).

‘--no-output-field-delimiter’

This means that there is to be no delimiter between fields in the output file or files. It will prevent the input field delimiter from being taken as the output field delimiter. In cases like this, it might behoove you either to ensure that the fields have a fixed length or to use Guile to insert your own delimiters before they are written. When specified along with ‘--output-field-delimiter’, the option that appears later will override the prior one.

‘-b string’

‘--output-record-delimiter=string’

Similarly here string is whatever string you need to use to tell the records apart in output files. If you don’t specify it, combine will assume that the usual text format applies, the newline on GNU operating systems. (As of this writing, no port has been made to systems where a newline is not the standard end-of-line character.)

‘-D string’

‘--input-field-delimiter=string’

Look for string in the associated input file to determine where fields start and end. It can be any string, but it should be the string that really separates fields in your file. Because combine doesn’t care about such things, you can even use the null character (\0) or the newline character (\n). At present, there is no interpretation of quoted strings. As a result, you may get unexpected results if the delimiter shows up in the middle of one of your fields.

When specified for the data file, the value specified is used as the input field delimiter for all reference files and as the output field delimiter for any output files, unless other action is taken. The options ‘--no-input-field-delimiter’ and ‘--no-output-field-delimiter’ will tell combine that a given file should not have a delimiter. The options ‘--input-field-delimiter’ and ‘--output-field-delimiter’ will use whatever delimiter you specify rather than the data file input field delimiter.
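
The lack of quoted-string interpretation can be illustrated outside of combine. The following Python sketch is only a model of splitting on a literal delimiter; the function name split_fields and the sample record are assumptions made for illustration and are not part of combine.

```python
def split_fields(record: str, delimiter: str) -> list[str]:
    """Split a record on a literal delimiter string.

    Hypothetical helper: quotes get no special treatment, which mirrors
    the behaviour described above for delimited input.
    """
    return record.split(delimiter)


# A quoted field that contains the delimiter is cut in two.
record = 'Smith,"Portland, OR",42'
fields = split_fields(record, ",")
print(len(fields), fields)
```

A record that looks like three fields to a quote-aware reader yields four fields here, which is the kind of unexpected result the text warns about. Choosing a delimiter that never occurs inside the data avoids the problem.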

‘--no-input-field-delimiter’

This means that there is no delimiter between fields in the specified input file or files. It will prevent the data file input field delimiter from being taken as the input field delimiter for a reference file. When specified along with ‘--input-field-delimiter’, the option that appears later will override the prior one.

‘-B string’

‘--input-record-delimiter=string’

Similarly here string is whatever string you need to use to tell the records apart in input files. If you don’t specify it, combine will assume that the usual text format applies, the newline on GNU operating systems.

‘-L number’

‘--input-record-length=number’

If the records are all of the same length (whether or not there is a linefeed or a ‘W’ between them), you can specify a record length. combine will then split the file into records that are each exactly number bytes long. If there are linefeeds or carriage returns between records, be sure to add them to the count. If the length is not specified, combine will assume the records have a delimiter between them.
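
As a rough model of fixed-length record splitting, the Python sketch below cuts a byte string into equal pieces. The function name split_fixed is hypothetical, and how combine itself treats a short trailing record is not stated here; the sketch simply returns whatever bytes are left.

```python
def split_fixed(data: bytes, record_length: int) -> list[bytes]:
    """Cut data into consecutive records of record_length bytes each.

    Hypothetical helper: a trailing piece shorter than record_length is
    returned as-is, which is an assumption of this sketch.
    """
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    return [data[i:i + record_length] for i in range(0, len(data), record_length)]


# Six data bytes plus one linefeed: the record length to specify is 7, not 6.
data = b"AAA001\nBBB002\nCCC003\n"
records = split_fixed(data, 7)
print(records)
```

Using 6 instead of 7 in this example would shift every record after the first by one byte, which is why the linefeed has to be included in the count.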

‘-p’

‘--match-optional’

Make it unnecessary to find a match to this file for inclusion in data-file-based output and for the further processing of a data record. All matches are required by default.

When applied to a reference file, this means that records based on a data record may be written without a match to a key in this reference file, and any output from fields in this file will be left blank (or substituted if specified with the ‘-O’ option).

When applied to a data file, this means that in the file based on the data file records, records will be written at the end of processing for reference file records that were not matched by a data file record. Any output specified from fields in the data file will be left blank (or substituted if specified with the ‘-O’ option). If there is more than one reference file, care should be taken, because the unmatched records from each reference file will each become a record in the output file. This has no effect on reference-file-based output.

‘-P’

‘--exclude-match’

Require no match to this file for inclusion in data-file-based output and for the further processing of a data record. All matches are required and included by default. This differs from ‘--match-optional’ in that records that do have a match are excluded from further consideration.

When applied to a reference file, this means that records based on a data record may only be written when there is no match to a key in this reference file, and any output from fields in this file will be left blank (or substituted if specified with the ‘-O’ option).

When applied to a data file, this means that in the file based on the data file records, records will only be written at the end of processing for reference file records that were not matched by a data file record. Any output specified from fields in the data file will be left blank (or substituted if specified with the ‘-O’ option). If there is more than one reference file, care should be taken, because the unmatched records from each reference file will each become a record in the output file. This has no effect on reference-file-based output.

2.3 Data Files

A data file is either stdin, one or more files, or both. If there is more than one source, they should all share a format that fits the specifications provided for output fields, fields holding numeric values to be summed, and fields to be used as keys for matching.

The following are the options that specifically affect the processing of the data file. Some share the same name with options that can also apply to reference files. Those must be on your command line before you present your first reference file. The others could appear anywhere on the command line; however, convention dictates that all the data-file options should be together before the first reference file is named.

‘-w’

‘--write-output’

Signals the program that output records should be written every time a data record satisfies the matching criteria with the reference files. The record written will combine all specified output fields from the data file record, the matching reference file record(s), and any specified constant values. This option is a positional option for all files, and must appear before the first reference file is named.

‘-t filename’

‘--output-file=filename’

If provided, write the data-based output to filename. Otherwise the output will go to stdout. This option only makes sense if you plan to write data-file-based output. This is a positional option for all files and must appear before the first reference file is named.

‘-o range_string’

‘--output-fields=range_string’

Write the fields specified by range_string as part of the record in any data-based output. The range specifications share a common format with all field specifications for combine. This option only makes sense if you plan to write data-file-based output. This is a positional option; to apply to the data file, it must appear before the first reference file is named.

‘-K string’

‘--output-constant=string’

Write string to the data-file-based output. This option only makes sense if you plan to write data-file-based output. This is a positional option; to apply to the data file, it must appear before the first reference file is named.

‘-s range_string’

‘--sum-fields=range_string’

In any reference-file-based output write a sum of the value of the fields specified by range_string. The sum for a given reference file record will come from all data file records that matched it. This option only makes sense if you have requested reference-file-based output.

If a precision is provided for a sum field, that precision is honored, and decimal fractions up to that many digits will be included in the sum. Any further precision on the input fields will be ignored (without rounding) in the sum. The resulting sum will be written with a decimal point and the number of fractional digits implied by the precision.
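
The truncation rule for sum fields can be modelled with integer arithmetic. The Python sketch below is an illustration only: the function names are hypothetical, the handling of negative values is an assumption, and it is not taken from combine's source.

```python
def to_scaled_int(text: str, precision: int) -> int:
    """Convert a decimal string to an integer scaled by 10**precision.

    Digits beyond the precision are dropped without rounding.
    """
    text = text.strip()
    negative = text.startswith("-")  # sign handling is an assumption
    if negative:
        text = text[1:]
    whole, _, frac = text.partition(".")
    frac = (frac + "0" * precision)[:precision]  # pad, then truncate
    value = int(whole or "0") * 10 ** precision + int(frac or "0")
    return -value if negative else value


def format_scaled(value: int, precision: int) -> str:
    """Write a scaled integer with exactly `precision` fractional digits."""
    if precision == 0:
        return str(value)
    sign = "-" if value < 0 else ""
    whole, frac = divmod(abs(value), 10 ** precision)
    return f"{sign}{whole}.{frac:0{precision}d}"


def sum_field(values: list[str], precision: int) -> str:
    """Sum the values of one field across matching data records."""
    total = sum(to_scaled_int(v, precision) for v in values)
    return format_scaled(total, precision)


print(sum_field(["1.239", "2.001", "3"], 2))
```

With a precision of 2, the input 1.239 contributes 1.23 rather than 1.24, so the three values above add up to 6.23.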

2.4 Reference Files

A reference file record is expected to match on a set of key fields to a data file record. The parts of a reference file that are necessary for processing are read entirely into memory. You can specify as many reference files as you want, depending only on the amount of memory your system can spare to hold them. For any reference file, it is minimally required that you specify a file name, a specification of the key fields in the reference file, and a specification of the matching key fields in the data file.

The following are the options that are related to reference files. They are all positional, and they apply to the processing of the previously named reference file. (Except of course for the reference file name itself, which applies to itself.)

‘-r filename’

‘--reference-file=filename’

Use filename as a reference file to match to the data file in processing. This option introduces a block of positional options that relate to this reference file’s processing in combine.

‘-k range_string’

‘--key-fields=range_string’

Use the fields specified by range_string as a key to match to a corresponding key in the data file.

‘-m range_string’

‘--data-key-fields=range_string’

Use the fields specified by range_string as the corresponding key to a key taken from a reference file.

‘-a range_string’

‘--hierarchy-key-fields=range_string’

Use the fields specified by range_string as a key to perform a recursive hierarchical match within the reference file. This key will be matched against values specified in the regular key on the reference file.

‘-u’

‘--unique’

Keep only one record for the reference file in memory for each distinct key. By default combine maintains all the records from the reference file in memory for processing. This default allows for cartesian products when a key exists multiple times in both the reference and data files.

‘-h number’

‘--hash-size=number’

Use the number provided as a base size for allocating a hash table to store the records from this reference file. If this number is too small, combine will fail when it tries to store a record it has no room for. If it is only a little bit too small, it will cause inefficiency as searching for open space in the hash table will be difficult.

‘-H keyword’

‘--hash-movement=keyword’

One of the keywords binary, number, beginning, or end, indicating how to turn the key into a number with the best variability and least overlap. The wise choice of this option can cut processing time significantly. The binary option is the default, and treats the last few bytes (8 on most computers) of the key string(s) as a big number. The number option converts the entire key to a number assuming it is a numeric string. The other two take the least significant 3 bits from each of the first or last few (21 where a 64-bit integer is available) bytes in the key strings and turn them into a number.
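
To make the four keywords more concrete, here is a loose Python model of turning a key into a number. It is a sketch under several assumptions: the function name key_to_number is hypothetical, and the byte order for binary and the order in which the 3-bit groups are packed are guesses, not taken from combine's source.

```python
def key_to_number(key: bytes, keyword: str, n_bytes: int = 21) -> int:
    """Model of the four hash-movement keywords (illustrative only)."""
    if keyword == "number":
        # Treat the whole key as a numeric string.
        return int(key.decode("ascii").strip() or "0")
    if keyword == "binary":
        # Treat the last 8 bytes as one big number (byte order assumed).
        return int.from_bytes(key[-8:], "big")
    if keyword == "beginning":
        chunk = key[:n_bytes]
    elif keyword == "end":
        chunk = key[-n_bytes:]
    else:
        raise ValueError(f"unknown keyword: {keyword!r}")
    value = 0
    for byte in chunk:
        # Keep the 3 least significant bits of each byte (packing order assumed).
        value = (value << 3) | (byte & 0b111)
    return value


print(key_to_number(b"0042", "number"))
print(key_to_number(b"AB", "beginning"))
```

Twenty-one bytes at 3 bits each give 63 bits, which fits in a 64-bit integer. Keys that differ mainly in their leading or trailing characters are the cases where beginning or end may spread values better than the default.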

‘-w’

‘--write-output’

Signals the program that output records should be written for every record stored for this reference file. This will either be one record for every record in the reference file or one record for every distinct set of keys in the reference file, depending on the setting of the option ‘--unique’. The record written will include all specified output fields from the reference file record, any specified constant value for this reference file, and any flag, counter, or sums requested.

‘-t filename’

‘--output-file=filename’

If provided, write the output based on this reference file to filename. Otherwise the output will go to stdout. This option only makes sense if you plan to write output based on this reference file.

‘-o range_string’

‘--output-fields=range_string’

Write the fields specified by range_string as part of the record in any reference-file- or data-file-based output. The range specifications share a common format with all field specifications for combine.

‘-K string’

‘--output-constant=string’

Write string to the reference- or data-file-based output.

‘-U’

‘--up-hierarchy’

When traversing the hierarchy from a given reference-file record, use the values on that record in the ‘--hierarchy-key-fields’ fields to connect to the ‘--key-fields’ fields of other records from the reference file. For most purposes, the presence of the connection on the first record suggests a single parent in a standard hierarchy. The hierarchy traversal stops when the ‘--hierarchy-key-fields’ fields are empty.

If this option is not set, the ‘--key-fields’ fields are used to search for the same values in the ‘--hierarchy-key-fields’ fields of other records in the same file. This allows multiple children of an initial record, and suggests going down in the hierarchy. The hierarchy traversal stops when no further connection can be made. The traversal is depth-first.
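
The two directions of traversal can be sketched in Python as follows. The record layout (a dictionary with a key and a parent entry standing in for the ‘--key-fields’ and ‘--hierarchy-key-fields’ values), the function names, and the choice to include the starting record in the result are all assumptions made for illustration.

```python
records = [
    {"key": "1", "parent": ""},
    {"key": "2", "parent": "1"},
    {"key": "3", "parent": "1"},
    {"key": "4", "parent": "2"},
]


def walk_up(records, start_key):
    """Follow the hierarchy key to the parent until it is empty."""
    by_key = {r["key"]: r for r in records}
    path, seen = [], set()
    current = by_key.get(start_key)
    while current is not None and current["key"] not in seen:
        seen.add(current["key"])  # guard against loops in the data
        path.append(current["key"])
        if not current["parent"]:
            break
        current = by_key.get(current["parent"])
    return path


def walk_down(records, start_key):
    """Depth-first search for records whose hierarchy key equals this key."""
    children = {}
    for r in records:
        children.setdefault(r["parent"], []).append(r["key"])
    order, seen = [], set()

    def visit(key):
        if key in seen:
            return
        seen.add(key)
        order.append(key)
        for child in children.get(key, []):
            visit(child)

    visit(start_key)
    return order


print(walk_up(records, "4"))
print(walk_down(records, "1"))
```

Going up from record 4 visits 4, 2, and 1 in that order; going down from record 1 visits 1, 2, 4, and then 3, because the subtree under 2 is finished before 3 is reached.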

‘-l’

‘--hierarchy-leaf-only’

When traversing a hierarchy, treat only the endpoints as matching records. Nodes that have onward connections are ignored except for navigating to the leaf nodes.

‘-F number’

‘--flatten-hierarchy=number’

When traversing a hierarchy, act as with ‘--hierarchy-leaf-only’, except save information about the intervening nodes. Repeat the ‘--output-fields’ fields number times (leaving them blank if there were fewer levels), starting from the first reference record matched.
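
A small Python model may help to compare leaf-only and flattened results. The children mapping, the function names, and the decision to cut off paths longer than number are assumptions of this sketch; the text above does not say what combine does when there are more levels than number.

```python
def paths_to_leaves(children, start):
    """Return every depth-first path from start down to a leaf."""
    kids = children.get(start, [])
    if not kids:
        return [[start]]
    return [[start] + rest for kid in kids for rest in paths_to_leaves(children, kid)]


def flatten(path, number, blank=""):
    """Repeat the per-level value number times, blank where levels run out."""
    return path[:number] + [blank] * (number - len(path))


# Hypothetical hierarchy: 1 has children 2 and 3, and 2 has child 4.
children = {"1": ["2", "3"], "2": ["4"]}
paths = paths_to_leaves(children, "1")

leaf_only = [path[-1] for path in paths]           # endpoints only
flattened = [flatten(path, 3) for path in paths]   # one row per leaf
print(leaf_only)
print(flattened)
```

Each flattened row has the same number of slots, so the output keeps a regular shape even when the branches of the hierarchy have different depths.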

2.5 Output Files

There are two basic kinds of output files: one based on the data records and reference records that match them, the other based on a full set of records from one reference file with flags, counts, or sums based on the aggregate of the matching data records.

The output file based on the data file consists of information from the data file records and any matching reference file records. The records that go into data-based output files can be figured out as follows:

no reference file: If there is no reference file, there will be one record for every record in the data file, with the exception of any records that were eliminated through an extension filter (see the section Extending combine).
reference files with ‘--unique’ and ‘--match-optional’ options: If all reference files are specified with the ‘--unique’ and ‘--match-optional’ options, then the records selected for insertion into the data-based output file will be the same as those that would be selected without a reference file.
reference files without the ‘--unique’ option: If a reference file is not given the ‘--unique’ option and there is more than one reference record that matches a given data record, then the data record will be represented more than once in the output file, each time combined with information from a different matching reference record. If there is more than one reference file where this is the case, the result will be multiplicative (e.g. 2 matches in each of 2 reference files will produce 4 records). This is the default setting for a reference file.
reference files without the ‘--match-optional’ option: If a reference file is not given the ‘--match-optional’ option, then any data record that does not have a match in the reference file will not be represented in the output file. This is the default setting.
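
The rules above can be modelled in a few lines of Python. This is a simplified sketch, not combine's algorithm: the function name and the dictionary layout are hypothetical, every reference file is matched on the same key, and ‘--unique’ is assumed to keep the first record for a key.

```python
from itertools import product


def data_based_output(data_records, reference_files):
    """Return one (data payload, matched reference records) pair per output record.

    data_records: list of (key, payload) pairs.
    reference_files: list of dicts with 'index' (key -> list of records),
    'unique' and 'optional' flags (all hypothetical names).
    """
    out = []
    for key, payload in data_records:
        per_file = []
        for ref in reference_files:
            matches = ref["index"].get(key, [])
            if ref["unique"]:
                matches = matches[:1]   # assumed: first record kept per key
            if not matches:
                if not ref["optional"]:
                    per_file = None     # required match missing: drop record
                    break
                matches = [None]        # optional: fields left blank
            per_file.append(matches)
        if per_file is None:
            continue
        for combo in product(*per_file):
            out.append((payload, combo))
    return out


refs = [
    {"index": {"k": ["a1", "a2"]}, "unique": False, "optional": False},
    {"index": {"k": ["b1", "b2"]}, "unique": False, "optional": False},
]
data = [("k", "d1"), ("x", "d2")]
print(len(data_based_output(data, refs)))
```

In this example the data record with key k produces four output records (two matches in each of two reference files), and the record with key x produces none because its matches are required.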

The fields that can appear in data-file-based output can come from the data-file record and any matching reference file records.

Reference-file-based output files are simpler. Depending on the existence or not of the ‘--unique’ option, the file will have an entry for each of the unique keys or for each of the records in the reference file, respectively.

The fields in the reference-file-based output are exclusively from the reference file, except for optional fields that can be summarized from fields on matching data-file records.

The order of the fields in an output record can either be according to the default or it can be explicitly specified by the user.

In data-file-based output, the standard field order is as follows. All the fields listed are printed in this order if they are specified. If an output field delimiter is specified, it is put between every pair of adjacent fields. If there is no match for a given reference file (and the ‘--match-optional’ option is set for the file), all the fields that would normally be provided by that file are filled with spaces for fixed-width fields or zero-length for delimited output.

All the data-file output fields (in order)
The constant string set for the data file
For each reference file:
  - The constant string set for the reference file
  - All the reference-file output fields

In reference-file-based output, the standard field order is as follows. All the fields listed are printed in this order if they are specified. If an output field delimiter is specified, it is put between every pair of adjacent fields.

All the reference-file output fields OR the key fields if no output fields are given
A 1/0 flag indicating whether there was any match
A counter of the number of data records matched
A sum of each of the data-file sum fields from each matching data-file record
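
The flag, counter, and sum columns can be pictured with the following Python sketch. It is a simplified model with hypothetical names: keys are assumed to be unique in the reference file, and a single integer sum field is used.

```python
def reference_based_output(reference_keys, data_records):
    """Build rows of (key, flag, count, sum) for each reference key.

    reference_keys: list of distinct keys from one reference file.
    data_records: list of (key, amount) pairs from the data file.
    """
    counts = {key: 0 for key in reference_keys}
    sums = {key: 0 for key in reference_keys}
    for key, amount in data_records:
        if key in counts:           # data records without a match are ignored
            counts[key] += 1
            sums[key] += amount
    return [
        (key, 1 if counts[key] else 0, counts[key], sums[key])
        for key in reference_keys
    ]


rows = reference_based_output(["a", "b"], [("a", 5), ("a", 7), ("c", 1)])
for row in rows:
    print(row)
```

Every reference key gets a row whether or not it was matched, which is what distinguishes this kind of output from data-file-based output.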

The order of the fields in any output file can be customized using the ‘--field-order’ (or ‘-O’) option. The argument for the option is a comma-separated list of field identifiers. Each field identifier has 2 parts, a source and a type, separated by a period (.).

The sources are composed of an ‘r’ for reference file or ‘d’ for data file followed by an optional number. The number indicates which reference file the field comes from and is ignored for data files. Without a number, the first of each is taken.

A third source ‘s’ represents a substitution in the event that the preceding reference file field could not be provided because there was no match between that reference file and the data file. The number following it, if blank or zero, tells combine to take the field from the data file. Any other number means the corresponding reference file. This allows the conditional update of fields from the data file, or a prioritization of selections from a variety of reference files. If you are working with fixed-width fields, you should ensure that the lengths of the various fields in the substitution chain are the same.

The types are composed similarly. The identifiers are listed below. The number is ignored for identifiers of string constants, flags, and counters. For output fields, a hyphen-separated range of fields can be written to avoid having to write a long list. Any number provided is the number of the field in the order it was specified in the ‘-o’ or ‘-s’ option on the command line. In delimited-field files this may differ from the field number used in those options.

‘o’: Output fields from either reference or data files.
‘k’: String constant for either reference or data files.
‘f’: Flag (1/0) for reference files.
‘n’: Counter for reference files.
‘s’: Sum field for reference files.

Here is an example:

--field-order d.o1,d.o2,d.o3,d.k,r1.o1,s2.o1,s0.o4

--field-order d.o1-3,d.k,r1.o1,s2.o1,s0.o4

In this case, the first three fields from the data file are followed by the constant string from the data file. Then, if there was a match to reference file 1, the first field from that file is taken, otherwise if there was a match to reference file 2, the first field from that file is taken. If neither file matched the data record, the fourth field from the data record is taken instead.

The second line is equivalent, using a range of fields for convenience.
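
The substitution chain in this example can be traced with a short Python model. It is a sketch only: the function names, the dictionary of field values, and the rule that an unresolved chain yields an empty field are assumptions for illustration rather than a description of combine's internals.

```python
import re

SPEC = re.compile(r"^([drs])(\d*)\.([okfns])(\d*)(?:-(\d+))?$")


def parse_field_order(text):
    """Expand a field-order list into (source, file_no, type, index) tuples."""
    order = []
    for part in text.split(","):
        m = SPEC.match(part.strip())
        if m is None:
            raise ValueError(f"bad field identifier: {part!r}")
        source, src_num, ftype, first, last = m.groups()
        if source == "d":
            file_no = 0                           # 0 stands for the data file
        elif source == "r":
            file_no = int(src_num) if src_num else 1
        else:
            file_no = int(src_num) if src_num else 0
        if ftype in "kfn" or not first:
            order.append((source, file_no, ftype, None))
            continue
        stop = int(last) if last else int(first)
        for index in range(int(first), stop + 1):
            order.append((source, file_no, ftype, index))
    return order


def build_record(order, fields, matched):
    """Pick values in order; 's' entries are used only after a failed field."""
    out, pending = [], False
    for source, file_no, ftype, index in order:
        if source == "s":
            if not pending:
                continue
        else:
            if pending:
                out.append("")                    # chain ended with no substitute
            pending = False
        if file_no == 0 or file_no in matched:
            out.append(fields.get((file_no, ftype, index), ""))
            pending = False
        else:
            pending = True
    if pending:
        out.append("")
    return out


fields = {(0, "o", 1): "A", (0, "o", 2): "B", (0, "o", 3): "C", (0, "o", 4): "D",
          (0, "k", None): "K", (1, "o", 1): "R1", (2, "o", 1): "R2"}
order = parse_field_order("d.o1-3,d.k,r1.o1,s2.o1,s0.o4")
print(build_record(order, fields, matched={2}))
```

With only reference file 2 matched, the fifth output field comes from that file; with no reference file matched, it falls back to the fourth data field.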

2.6 Field Specifiers

There are a number of options that require a list of fields from one of the files. All take an argument consisting of a comma-separated list of individual field specifications.

The individual field specifications are of the form ‘s-e.p(instruction)’. As a standard (with no field delimiter given for the file), ‘s’ is a number indicating the starting position of the field within the records in the given file, and ‘e’ is a number pointing out the ending position. If there is no hyphen, the single number ‘s’ represents a one-byte field at position ‘s’. If the first number and a hyphen exist without the number ‘e’, then the field is assumed to start at ‘s’ and extend to the end of the record. The numbering of the bytes in the record starts at 1.

If you do provide a delimiter for identifying the borders between fields in your file, then combine does not need to count bytes, and you just need to provide the field number (starting at 1) of the fields you want to declare. Currently, giving a range of fields results in combine acting as though you listed every field from the lower end to the upper end of the range individually. Any precision or extensibility information you provide will be applied to each field in the range.

The optional ‘.p’ gives the number of digits of precision with which the field should be considered. At present its only practical use is in fields to be summed from data files, where it is used in converting between string and number formats.

The optional ‘(instruction)’ is intended for extensibility. For output fields and key fields it is a scheme command that should be run when the field is read into the system. If the associated field specification is a key for matching reference and data records, this can affect the way the records are matched. If it is a field to be written to an output file, it will affect what the user sees.

To refer to the field actually read from the source data within the scheme command, use the following scheme variable in the appropriate place in the command: ‘input-field’

Scheme commands attached to individual fields need to return a character string and have available only the current text of the field as described above. In delimited output, feel free to alter the length of the return string at will. In fixed-width output, it will be your responsibility to ensure that you return a string of a useful length if you want to maintain the fixed structure.

In the following string,

1-10,15-20.2,30(HiMom input-field),40-

The first field runs from position 1 to 10 in the record. The second from 15 to 20, with a stated (but possibly meaningless) precision of 2. The third is the single byte at position 30, to be adjusted by replacing it with whatever the scheme function HiMom does to it. The fourth field starts at position 40 and extends to the end of the record.
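
The reading of this string can be checked with a small Python parser. It is a sketch with hypothetical names; it assumes that an instruction contains no commas or nested parentheses, and it only records the instruction text without running any scheme code.

```python
import re

FIELD = re.compile(r"(\d+)(?:(-)(\d*))?(?:\.(\d+))?(?:\(([^)]*)\))?")


def parse_spec(text):
    """Parse a fixed-width field list of the form s-e.p(instruction)."""
    fields = []
    for part in text.split(","):
        m = FIELD.fullmatch(part.strip())
        if m is None:
            raise ValueError(f"bad field specification: {part!r}")
        start, hyphen, end, precision, instruction = m.groups()
        if hyphen is None:
            end_pos = int(start)        # single number: one-byte field
        elif end == "":
            end_pos = None              # open range: to the end of the record
        else:
            end_pos = int(end)
        fields.append({
            "start": int(start),
            "end": end_pos,
            "precision": int(precision) if precision else None,
            "instruction": instruction,
        })
    return fields


def extract(record, field):
    """Cut one field out of a record; positions are 1-based and inclusive."""
    return record[field["start"] - 1:field["end"]]


specs = parse_spec("1-10,15-20.2,30(HiMom input-field),40-")
record = "".join(chr(ord("a") + i % 26) for i in range(50))
print([extract(record, spec) for spec in specs])
```

Because positions start at 1 and the end position is inclusive, the slice begins at start minus one and stops at end, which gives ten bytes for the specification 1-10.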

In a delimited example, the following two sets of options are equivalent.

-D ',' -o 4,5,6,7,8
-D ',' -o 4-8

In both cases the fourth, fifth, sixth, seventh, and eighth comma-delimited fields are each treated as an individual field for processing. Note that if you also provide a field order (with the ‘-O’ option), that order would refer to these fields as 1, 2, 3, 4, and 5 because of the order in which you specified them, not caring about the order in the original file.
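
The equivalence of the two forms, and the renumbering seen by a field order, can be illustrated in Python. The function name expand_ranges and the sample record are assumptions; open-ended ranges are not handled in this sketch.

```python
def expand_ranges(spec):
    """Expand a delimited-field list such as '4-8' into field numbers."""
    numbers = []
    for part in spec.split(","):
        first, _, last = part.partition("-")
        numbers.extend(range(int(first), int(last or first) + 1))
    return numbers


selected = expand_ranges("4-8")

# Position in the option (1-based) -> field number in the original file.
order_numbering = {position: number for position, number in enumerate(selected, start=1)}

record = "f1,f2,f3,f4,f5,f6,f7,f8,f9"
values = [record.split(",")[number - 1] for number in selected]
print(order_numbering)
print(values)
```

Position 1 in the field order therefore refers to the fourth field of the file, position 2 to the fifth, and so on.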

2.7 Emulation

Because of its similarity to join, combine has an emulation mode that allows you to use the syntax of the join command to start combine.

The emulation can be started by using the ‘--emulate join’ option as the first option after the command name. After that, you can use the same options you get with join.

For details of the join command, see the section ‘join invocation’ in the GNU Coreutils Manual. When tested against the test cases packaged with GNU Coreutils, results are identical to join with a couple of exceptions:

The sort order can be different. combine produces records in the order of the second input file, with any unmatched records from the first input file following in an arbitrary order.
I change the arguments to the ‘-o’ option in the test script to have quotes around the field order list when it is separated by spaces. combine can handle the space-delimited list, but the standard argument handler getopt_long does not interpret them as a single argument. I don’t see the need to overcome that.
There is not a specific test, but I have not yet implemented case-insensitive matching. It would have failed if tested.
Also not implemented in the emulation are the ‘-j1’ and ‘-j2’ methods of specifying separate keys for the two files. Use ‘-1’ and ‘-2’ for that. The two obsolete options would be interpreted as specifying field 1 or field 2 as the common join key, respectively.

There are also a number of features of combine that come through in the emulation. The main features relate to the keys: the sort order of the records in relation to the keys does not matter to combine. combine also allows you to specify a list of key fields (comma-delimited) rather than just one as arguments to ‘-1’, ‘-2’, and ‘-j’. You should make sure that the number of key fields is the same.

Another feature is that the second input file can actually be as many files as you want. That way you can avoid putting the records from several files together if not otherwise necessary.

This document was generated by Daniel P. Valentine on July 28, 2013 using texi2html 1.82.
