combine
combine is invoked with the command
combine [general options | data file options]… [reference file options]… [data files]
The various options are described in the following sections.
The arguments to the command are the data files: an optional list of file
names to be processed. If no file names are given, the input comes
from stdin.
2.1 General Options
2.2 File Options
2.3 Data Files
2.4 Reference Files
2.5 Output Files
2.6 Field Specifiers
2.7 Emulation
The general options are not specific to any particular file and apply to the entire current run of the program. They are listed below.
For each data record, check for a match with each of the reference files. This option is set by default. When turned off with ‘--no-check-all-reference’, processing of that data file record will be stopped as soon as a required match is not fulfilled. This is a shortcut for speed, but it will change the meaning of any flags or counters set in reference-file-based input.
In any output based on a reference file, print a flag with the value ‘1’ if the record was matched by a data record or ‘0’ otherwise.
In any output based on a reference file, print a new field which is a count of the number of data records that matched the current reference record.
If counters or sums of data fields are being written into reference-based output, make them number bytes long.
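The flag and counter described above can be pictured with a small Python sketch. This only illustrates the described semantics and is not combine's implementation; the function name `flag_and_count` and the sample keys are hypothetical.

```python
from collections import Counter

def flag_and_count(reference_keys, data_keys):
    """For each reference key, return (key, flag, count).

    flag is 1 if at least one data record matched the key, else 0;
    count is the number of data records that matched it.
    """
    hits = Counter(data_keys)
    return [(key, 1 if hits[key] else 0, hits[key]) for key in reference_keys]

# Reference file holds keys A, B, C; the data file holds A twice, C once, and Z.
print(flag_and_count(["A", "B", "C"], ["A", "A", "C", "Z"]))
# prints [('A', 1, 2), ('B', 0, 0), ('C', 1, 1)]
```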
At the end of the processing, write to stderr counts of the
number of records read and matched from each file processed. The option
is enabled by default. When it is turned off with ‘--no-statistics’,
no statistics are printed.
Write all the information that is possible about the processing to
stderr. This includes a summary of the options that have been
set and running counters of the number of records processed.
There are some attributes that apply to any file. They include such things as the way to tell one record from another, the way to identify fields within a record, and the fields in a file that you want to write out.
Print string between each pair of fields written to output files.
This can be a comma, a tab character, your mother’s maiden name, or the
full text of the GNU General Public License. It is your choice. Of course,
it is most useful for it to be something that would not normally occur in
your data fields and something that your target system will understand.
Because combine doesn’t care about such things, you can even use the
null character (\0) or the newline character (\n).
This means that there is to be no delimiter between fields in the output file or files. It will prevent the input field delimiter from being taken as the output field delimiter. In cases like this, it might behoove you either to ensure that the fields have a fixed length or to use Guile to insert your own delimiters before they are written. When specified along with ‘--output-field-delimiter’, the option that appears later will override the prior one.
Similarly here string is whatever string you need to use to tell
the records apart in output files. If you don’t specify it, combine
will assume that the usual text format applies, the newline on GNU
operating systems. (As of this writing, no port has been made to systems
where a newline is not the standard end-of-line character.)
Look for string in the associated input file to determine where
fields start and end. It can be any string, but it should be the string
that really separates fields in your file. Because combine doesn’t
care about such things, you can even use the null character (\0)
or the newline character (\n). At present, there is no interpretation
of quoted strings. As a result, you may get unexpected results if the delimiter
shows up in the middle of one of your fields.
When specified for the data file, the value specified is used as the input
field delimiter for all reference files and as the output field delimiter for
any output files, unless other action is taken. The options
‘--no-input-field-delimiter’ and ‘--no-output-field-delimiter’
will tell combine
that a given file should not have a delimiter. The
options ‘--input-field-delimiter’ and ‘--output-field-delimiter’
will use whatever delimiter you specify rather than the data file input field
delimiter.
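The caveat about quoted strings can be seen in a short Python sketch. It assumes a plain substring split, which is one reading of "no interpretation of quoted strings"; the function `split_fields` and the sample record are hypothetical and do not come from combine itself.

```python
def split_fields(record, delimiter):
    """Split a record on every occurrence of the delimiter.

    Quotes get no special treatment, so a delimiter inside a
    quoted value still ends the field.
    """
    return record.split(delimiter)

record = 'widget,"Smith, John",42'
print(split_fields(record, ","))
# prints ['widget', '"Smith', ' John"', '42']
```

The record was meant to hold three fields but yields four, which is the kind of unexpected result the text warns about.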
This means that there is no delimiter between fields in the specified input file or files. It will prevent the data file input field delimiter from being taken as the input field delimiter for a reference file. When specified along with ‘--input-field-delimiter’, the option that appears later will override the prior one.
Similarly here string is whatever string you need to use to tell
the records apart in input files. If you don’t specify it, combine
will assume that the usual text format applies, the newline on GNU
operating systems.
If the records are all of the same length (whether or not there is a
linefeed or a ‘W’ between them), you can specify a record length.
combine will then split the file into records, each exactly
NUMBER bytes long. If there are linefeeds or carriage returns
between records, be sure to add them to the count. If the length is not
specified, combine
will assume the records have a delimiter between
them.
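As a rough illustration of fixed-length records, the following Python sketch cuts a byte string into equal pieces. The function `split_fixed` and the sample data are hypothetical; the point is only that a trailing linefeed must be counted in the record length.

```python
def split_fixed(data: bytes, record_length: int):
    """Cut data into consecutive records of exactly record_length bytes."""
    return [data[i:i + record_length] for i in range(0, len(data), record_length)]

# Each record has 6 data bytes plus 1 linefeed, so the record length is 7.
data = b"AAA001\nBBB002\n"
print(split_fixed(data, 7))
# prints [b'AAA001\n', b'BBB002\n']
```

With a length of 6 instead, the linefeeds would drift into the following records and every record after the first would be misaligned.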
Make it unnecessary to find a match to this file for inclusion in data file based output and for the further processing of a data record. All matches are required by default.
When applied to a reference file, this means that records based on a data record may be written without a match to a key in this reference file, and any output from fields in this file will be left blank (or substituted if specified with the ‘-O’ option).
When applied to a data file, this means that in the file based on the data file records, records will be written at the end of processing for reference file records that were not matched by a data file record. Any output specified from fields in the data file will be left blank (or substituted if specified with the ‘-O’ option). If there is more than one reference file, care should be taken, because the unmatched records from each reference file will each become a record in the output file. This has no effect on reference-file-based output.
Require no match to this file for inclusion in data file based output and for the further processing of a data record. All matches are required and included by default. This differs from ‘--match-optional’ in that records that do have a match are excluded from further consideration.
When applied to a reference file, this means that records based on a data record may only be written when there is no match to a key in this reference file, and any output from fields in this file will be left blank (or substituted if specified with the ‘-O’ option).
When applied to a data file, this means that in the file based on the data file records, records will only be written at the end of processing for reference file records that were not matched by a data file record. Any output specified from fields in the data file will be left blank (or substituted if specified with the ‘-O’ option). If there is more than one reference file, care should be taken, because the unmatched records from each reference file will each become a record in the output file. This has no effect on reference-file-based output.
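The difference between a required match, an optional match, and the exclusive variant can be sketched in Python for a single reference file. This is a simplified model of the behavior described above, not combine's code; the mode labels `"required"`, `"optional"`, and `"exclusive"` and the function name are hypothetical.

```python
def data_based_output(data, reference, mode="required"):
    """Model data-file-based output against one reference file.

    data: list of (key, value) data records.
    reference: dict mapping key -> reference output field.
    mode: "required" keeps only matched records, "optional" keeps all
          records (blank reference field when unmatched), "exclusive"
          keeps only unmatched records.
    """
    output = []
    for key, value in data:
        matched = key in reference
        if mode == "required" and matched:
            output.append((key, value, reference[key]))
        elif mode == "optional":
            output.append((key, value, reference.get(key, "")))
        elif mode == "exclusive" and not matched:
            output.append((key, value, ""))
    return output

data = [("a", 1), ("b", 2)]
reference = {"a": "x"}
print(data_based_output(data, reference, "required"))   # [('a', 1, 'x')]
print(data_based_output(data, reference, "optional"))   # [('a', 1, 'x'), ('b', 2, '')]
print(data_based_output(data, reference, "exclusive"))  # [('b', 2, '')]
```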
A data file is either stdin, one or more files, or both. If
there is more than one source, they should all share a format that fits
the specifications provided for output fields, fields holding numeric
values to be summed, and fields to be used as keys for matching.
The following are the options that specifically affect the processing of the data file. Some share the same name with options that can also apply to reference files. Those must be on your command line before you present your first reference file. The others could appear anywhere on the command line; however, convention dictates that all the data-file options should be together before the first reference file is named.
Signals the program that output records should be written every time a data record satisfies the matching criteria with the reference files. The record written will combine all specified output fields from the data file record, the matching reference file record(s), and any specified constant values. This option is a positional option for all files, and must appear before the first reference file is named.
If provided, write the data-based output to filename. Otherwise
the output will go to stdout. This option only makes sense if
you plan to write data-file-based output. This is a positional option
for all files and must appear before the first reference file is named.
Write the fields specified by range_string as part of the record
in any data-based output. The range specifications share a common
format with all field specifications for combine. This option only
makes sense if you plan to write data-file-based output. This is a
positional option and to apply to the data file it must appear before
the first reference file is named.
Write string to the data-file-based output. This option only makes sense if you plan to write data-file-based output. This is a positional option and to apply to the data file it must appear before the first reference file is named.
In any reference-file-based output write a sum of the value of the fields specified by range_string. The sum for a given reference file record will come from all data file records that matched it. This option only makes sense if you have requested reference-file-based output.
If a precision is provided for a sum field, that precision is honored, and decimal fractions up to that many digits will be included in the sum. Any further precision on the input fields will be ignored (without rounding) in the sum. The resulting sum will be written with a decimal point and the number of fractional digits implied by the precision.
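The truncation rule for summed fields can be illustrated with integer arithmetic in Python. This is a sketch of the described behavior under the assumption that input fields are plain decimal strings; the helper names are hypothetical and combine's own conversion may differ in detail.

```python
def to_scaled_int(text: str, precision: int) -> int:
    """Convert a decimal string to an integer scaled by 10**precision.

    Digits beyond the requested precision are dropped, not rounded.
    """
    text = text.strip()
    negative = text.startswith("-")
    text = text.lstrip("+-")
    whole, _, fraction = text.partition(".")
    fraction = (fraction + "0" * precision)[:precision]  # pad, then cut
    value = int((whole or "0") + fraction)
    return -value if negative else value

def sum_with_precision(fields, precision: int) -> str:
    """Sum the fields and format the total with `precision` fractional digits."""
    total = sum(to_scaled_int(field, precision) for field in fields)
    sign = "-" if total < 0 else ""
    digits = str(abs(total)).rjust(precision + 1, "0")
    if precision == 0:
        return sign + digits
    return f"{sign}{digits[:-precision]}.{digits[-precision:]}"

print(sum_with_precision(["1.239", "2.5", "0.001"], 2))
# prints 3.73
```

Rounding each input first would have given 3.74; truncation gives 3.73, which matches the "ignored (without rounding)" wording above.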
A reference file record is expected to match on a set of key fields to a data file record. The parts of a reference file that are necessary for processing are read entirely into memory. You can specify as many reference files as you want, depending only on the amount of memory your system can spare to hold them. For any reference file, it is minimally required that you specify a file name, a specification of the key fields in the reference file, and a specification of the matching key fields in the data file.
The following are the options that are related to reference files. They are all positional, and they apply to the processing of the previously named reference file. (Except of course for the reference file name itself, which applies to itself.)
Use filename as a reference file to match to the data file in
processing. This option introduces a block of positional options that
relate to this reference file’s processing in combine.
Use the fields specified by range_string as a key to match to a corresponding key in the data file.
Use the fields specified by range_string as the corresponding key to a key taken from a reference file.
Use the fields specified by range_string as a key to perform a recursive hierarchical match within the reference file. This key will be matched against values specified in the regular key on the reference file.
Keep only one record for the reference file in memory for each distinct
key. By default combine
maintains all the records from the
reference file in memory for processing. This default allows for
cartesian products when a key exists multiple times in both the
reference and data files.
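A small Python sketch may help show why keeping every reference record allows cartesian products, while keeping one record per key does not. The function names are hypothetical, and the choice to keep the first record seen for a duplicate key is an assumption; the text above does not say which duplicate survives.

```python
from collections import defaultdict

def load_reference(records, unique=False):
    """Store reference records by key; with unique=True keep one per key."""
    table = defaultdict(list)
    for key, value in records:
        if unique and table[key]:
            continue  # assumption: the first record seen for a key is kept
        table[key].append(value)
    return table

def match(data, table):
    """Pair every data record with every stored reference record of its key."""
    return [(key, d, r) for key, d in data for r in table.get(key, [])]

reference = [("k", "r1"), ("k", "r2")]
data = [("k", "d1"), ("k", "d2")]

print(len(match(data, load_reference(reference))))               # 4
print(len(match(data, load_reference(reference, unique=True))))  # 2
```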
Use the number provided as a base size for allocating a hash table
to store the records from this reference file. If this number is too
small, combine will fail when it tries to store a record it has no
room for. If it is only a little bit too small, it will cause
inefficiency, as searching for open space in the hash table will be
difficult.
One of the keywords binary, number, beginning, or end, indicating how
to turn the key into a number with the best variability and least
overlap. A wise choice of this option can cut processing time
significantly. The binary option is the default, and treats the last few
bytes (8 on most computers) of the key string(s) as a big number. The
number option converts the entire key to a number, assuming it is a
numeric string. The other two take the least significant 3 bits from each
of the first or last few (21 where a 64-bit integer is available) bytes
in the key strings and turn them into a number.
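The four keywords can be approximated in Python as follows. This is only a sketch of the idea as described above: the byte order used for binary and the order in which the 3-bit groups are packed are assumptions, and the function names are hypothetical rather than taken from combine.

```python
def binary_number(key: bytes, width: int = 8) -> int:
    """Treat the last `width` bytes of the key as one big number.

    Big-endian byte order is an assumption made for this sketch.
    """
    return int.from_bytes(key[-width:], "big")

def numeric_number(key: bytes) -> int:
    """Convert the whole key to a number, assuming it is a numeric string."""
    return int(key.decode("ascii"))

def low_bits_number(key: bytes, from_end: bool, count: int = 21) -> int:
    """Pack the low 3 bits of each of the first or last `count` bytes."""
    chunk = key[-count:] if from_end else key[:count]
    number = 0
    for byte in chunk:
        number = (number << 3) | (byte & 0b111)
    return number

print(binary_number(b"\x01\x02"))                  # 258
print(numeric_number(b"0042"))                     # 42
print(low_bits_number(b"AB", from_end=False))      # 10
```

Taking 3 bits from each of 21 bytes yields 63 bits, which fits in a 64-bit integer and explains the figure of 21 in the text.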
Signals the program that output records should be written for every record stored for this reference file. This will either be one record for every record in the reference file or one record for every distinct set of keys in the reference file, depending on the setting of the option ‘--unique’. The record written will include all specified output fields from the reference file record, any specified constant value for this reference file, and any flag, counter, or sums requested.
If provided, write the output based on this reference file to
filename. Otherwise the output will go to stdout. This
option only makes sense if you plan to write output based on this
reference file.
Write the fields specified by range_string as part of the record
in any reference-file- or data-file-based output. The range
specifications share a common format with all field specifications for
combine.
Write string to the reference- or data-file-based output.
When traversing the hierarchy from a given reference-file record, use the values on that record in the ‘--hierarchy-key-fields’ fields to connect to the ‘--key-fields’ fields of other records from the reference file. For most purposes, the presence of the connection on the first record suggests a single parent in a standard hierarchy. The hierarchy traversal stops when the ‘--hierarchy-key-fields’ fields are empty.
If this option is not set, the ‘--key-fields’ fields are used to search for the same values in the ‘--hierarchy-key-fields’ fields of other records in the same file. This allows multiple children of an initial record, and suggests going down in the hierarchy. The hierarchy traversal stops when no further connection can be made. The traversal is depth-first.
When traversing a hierarchy, treat only the endpoints as matching records. Nodes that have onward connections are ignored except for navigating to the leaf nodes.
When traversing a hierarchy, act as with ‘hierarchy-leaf-only’, except save information about the intervening nodes. Repeat the ‘output-fields’ fields number times (leaving them blank if there were fewer levels), starting from the first reference record matched.
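The two traversal directions and the leaf-only rule can be sketched in Python. Here each reference record is reduced to a pair of its key and its hierarchy key (the parent); the function names are hypothetical, the sketch assumes the hierarchy contains no cycles, and it does not reproduce combine's actual data structures.

```python
def descend(records, start_key):
    """Depth-first walk downward: find records whose hierarchy key equals the current key."""
    children = {}
    for key, parent in records:
        children.setdefault(parent, []).append(key)
    order = []

    def visit(key):
        order.append(key)
        for child in children.get(key, []):
            visit(child)

    visit(start_key)
    return order

def ascend(records, start_key):
    """Walk upward through hierarchy keys until an empty one is found."""
    parent_of = dict(records)
    path = [start_key]
    while parent_of.get(path[-1]):
        path.append(parent_of[path[-1]])
    return path

def leaves_only(records, start_key):
    """Keep only the endpoints of the downward walk."""
    parents = {parent for _, parent in records}
    return [key for key in descend(records, start_key) if key not in parents]

records = [("root", ""), ("a", "root"), ("b", "root"), ("a1", "a")]
print(descend(records, "root"))      # ['root', 'a', 'a1', 'b']
print(ascend(records, "a1"))         # ['a1', 'a', 'root']
print(leaves_only(records, "root"))  # ['a1', 'b']
```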
There are two basic kinds of output files: one based on the data records and reference records that match them, the other based on a full set of records from one reference file with flags, counts, or sums based on the aggregate of the matching data records.
The output file based on the data file consists of information from the data file records and any matching reference file records. The records that go into data-based output files can be figured out as follows:
If there is no reference file, there will be one record for every record
in the data file, with the exception of any records that were eliminated
through an extension filter (see section Extending combine).
If all reference files are specified with the ‘--unique’ and ‘--match-optional’ options, then the records selected for insertion into the data-based output file will be the same as those that would be selected without a reference file.
If a reference file is not given the ‘--unique’ option and there is more than one reference record that matches a given data record, then the data record will be represented more than once in the output file, each time combined with information from a different matching reference record. If there is more than one reference file where this is the case, the result will be multiplicative (e.g. 2 matches in each of 2 reference files will produce 4 records). This is the default setting for a reference file.
If a reference file is not given the ‘--match-optional’ option, then any data record that does not have a match in the reference file will not be represented in the output file. This is the default setting.
The fields that can appear in data-file-based output can come from the data-file record and any matching reference file records.
Reference-file-based output files are simpler. Depending on the existence or not of the ‘--unique’ option, the file will have an entry for each of the unique keys or for each of the records in the reference file, respectively.
The fields in the reference-file-based output are exclusively from the reference file, except for optional fields that can be summarized from fields on matching data-file records.
The order of the fields in an output record can either be according to the default or it can be explicitly specified by the user.
In data-file-based output, the standard field order is as follows. All the fields listed are printed in this order if they are specified. If an output field delimiter is specified, it is put between every pair of adjacent fields. If there is no match for a given reference file (and the ‘--match-optional’ option is set for the file), all the fields that would normally be provided by that file are filled with spaces for fixed-width fields or zero-length for delimited output.
In reference-file-based output, the standard field order is as follows. All the fields listed are printed in this order if they are specified. If an output field delimiter is specified, it is put between every pair of adjacent fields.
The order of the fields in any output file can be customized using the ‘--field-order’ (or ‘-O’) option. The argument for the option is a comma-separated list of field identifiers. Each field identifier has 2 parts, a source and a type, separated by a period (.).
The sources are composed of an ‘r’ for reference file or ‘d’ for data file followed by an optional number. The number indicates which reference file the field comes from and is ignored for data files. Without a number, the first of each is taken.
A third source ‘s’ represents a substitution in the event that the
preceding reference file field could not be provided because there was
no match between that reference file and the data file. The number
following it, if blank or zero, tells combine
to take the field from
the data file. Any other number means the corresponding reference file.
This allows the conditional update of fields from the data file, or a
prioritization of selections from a variety of reference files. If you
are working with fixed-width fields, you should ensure that the lengths
of the various fields in the substitution chain are the same.
The types are composed similarly. The identifiers are listed below. The number is ignored for identifiers of string constants, flags, and counters. For output fields, a hyphen-separated range of fields can be written to avoid having to write a long list. Any number provided is the number of the field in the order it was specified in the ‘-o’ or ‘-s’ option on the command line. In delimited-field files this may differ from the field number used in those options.
Output fields from either reference or data files.
String constant for either reference or data files.
Flag (1/0) for reference files.
Counter for reference files.
Sum field for reference files.
Here is an example:
--field-order d.o1,d.o2,d.o3,d.k,r1.o1,s2.o1,s0.o4
--field-order d.o1-3,d.k,r1.o1,s2.o1,s0.o4
In this case, the first three fields from the data file are followed by the constant string from the data file. Then, if there was a match to reference file 1, the first field from that file is taken, otherwise if there was a match to reference file 2, the first field from that file is taken. If neither file matched the data record, the fourth field from the data record is taken instead.
The second line is equivalent, using a range of fields for convenience.
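The substitution chain in the example (r1.o1, then s2.o1, then s0.o4) can be modeled in Python as a first-match lookup. This is a sketch of the described fallback logic only; the function `resolve_chain` and its data layout are hypothetical and do not mirror combine's internals.

```python
def resolve_chain(chain, data_fields, reference_fields):
    """Return the first available field in a substitution chain.

    chain: list of (source_number, field_number); source 0 is the data file.
    data_fields: output fields of the data record, in the order specified.
    reference_fields: dict mapping reference file number to its output
        fields, or to None when that file had no match.
    """
    for source, field in chain:
        fields = data_fields if source == 0 else reference_fields.get(source)
        if fields is not None:
            return fields[field - 1]
    return ""

chain = [(1, 1), (2, 1), (0, 4)]  # r1.o1, s2.o1, s0.o4
data_fields = ["d1", "d2", "d3", "d4"]

print(resolve_chain(chain, data_fields, {1: ["r1a"], 2: ["r2a"]}))  # r1a
print(resolve_chain(chain, data_fields, {1: None, 2: ["r2a"]}))     # r2a
print(resolve_chain(chain, data_fields, {1: None, 2: None}))        # d4
```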
There are a number of options that require a list of fields from one of the files. All take an argument consisting of a comma-separated list of individual field specifications.
The individual field specifications are of the form ‘s-e.p(instruction)’. As a standard (with no field delimiter given for the file), ‘s’ is a number indicating the starting position of the field within the records in the given file, and ‘e’ is a number pointing out the ending position. If there is no hyphen, the single number ‘s’ represents a one-byte field at position ‘s’. If the first number and a hyphen exist without the number ‘e’, then the field is assumed to start at ‘s’ and extend to the end of the record. The numbering of the bytes in the record starts at 1.
If you do provide a delimiter for identifying the borders between fields
in your file, then combine
does not need to count bytes, and you
just need to provide the field number (starting at 1) of the fields you
want to declare. Currently, giving a range of fields results in
combine
acting as though you listed every field from the lower end
to the upper end of the range individually. Any precision or
extensibility information you provide will be applied to each field in
the range.
The optional ‘.p’ gives the number of digits of precision with which the field should be considered. At present its only practical use is in fields to be summed from data files, where it is used in converting between string and number formats.
The optional ‘(instruction)’ is intended for extensibility. For output fields and key fields it is a scheme command that should be run when the field is read into the system. If the associated field specification is a key for matching reference and data records, this can affect the way the records are matched. If it is a field to be written to an output file, it will affect what the user sees.
To refer to the field actually read from the source data within the scheme command, use the following scheme variable in the appropriate place in the command: ‘input-field’
Scheme commands attached to individual fields need to return a character string and have available only the current text of the field as described above. In delimited output, feel free to alter the length of the return string at will. In fixed-width output, it will be your responsibility to ensure that you return a string of a useful length if you want to maintain the fixed structure.
In the following string,
1-10,15-20.2,30(HiMom input-field),40-
The first field runs from position 1 to 10 in the record. The second runs
from 15 to 20, with a stated (but possibly meaningless) precision of 2. The
third is the single byte at position 30, to be adjusted by replacing it
with whatever the scheme function HiMom does to it. The fourth
field starts at position 40 and extends to the end of the record.
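To make the ‘s-e.p(instruction)’ form concrete, here is a small Python parser for fixed-position specifications. It is a sketch written from the description above, not combine's parser; the names `split_specs`, `parse_spec`, and `extract` are hypothetical, and the instruction text is returned unevaluated.

```python
import re

SPEC = re.compile(r"^(\d+)(-(\d*))?(?:\.(\d+))?(?:\((.*)\))?$")

def split_specs(range_string):
    """Split on commas that are not inside a parenthesized instruction."""
    parts, current, depth = [], [], 0
    for ch in range_string:
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        current.append(ch)
    parts.append("".join(current))
    return parts

def parse_spec(spec):
    """Return (start, end, precision, instruction); end None means end of record."""
    m = SPEC.match(spec)
    if not m:
        raise ValueError(f"bad field specification: {spec!r}")
    start = int(m.group(1))
    if m.group(2) is None:
        end = start            # no hyphen: one-byte field
    elif m.group(3) == "":
        end = None             # "s-": runs to the end of the record
    else:
        end = int(m.group(3))
    precision = int(m.group(4)) if m.group(4) else None
    return start, end, precision, m.group(5)

def extract(record, start, end):
    """Positions are 1-based and inclusive."""
    return record[start - 1:end]

for spec in split_specs("1-10,15-20.2,30(HiMom input-field),40-"):
    print(parse_spec(spec))
```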
In a delimited example, the following two sets of options are equivalent.
-D ',' -o 4,5,6,7,8
-D ',' -o 4-8
In both cases the fourth, fifth, sixth, seventh, and eighth comma-delimited fields are each treated as an individual field for processing. Note that if you also provide a field order (with the ‘-O’ option), that order would refer to these fields as 1, 2, 3, 4, and 5 because of the order in which you specified them, not caring about the order in the original file.
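The equivalence of the two option sets, and the renumbering that a field order sees, can be sketched in Python. The functions `expand` and `select` are hypothetical helpers for illustration and handle only plain field numbers and ranges, without precision or instructions.

```python
def expand(spec_list):
    """Expand "4-8" or "4,5,6,7,8" into the list of field numbers."""
    numbers = []
    for part in spec_list.split(","):
        low, dash, high = part.partition("-")
        if dash:
            numbers.extend(range(int(low), int(high) + 1))
        else:
            numbers.append(int(low))
    return numbers

def select(record, delimiter, spec_list):
    """Pick the listed fields (numbered from 1) out of a delimited record."""
    fields = record.split(delimiter)
    return [fields[n - 1] for n in expand(spec_list)]

record = "a,b,c,d,e,f,g,h,i"
chosen = select(record, ",", "4-8")
print(chosen)     # ['d', 'e', 'f', 'g', 'h']
print(chosen[0])  # d: output field 1 is field 4 of the original record
```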
Because of its similarity to join, combine has an emulation
mode that allows you to use the syntax of the join command to start
combine.
The emulation can be started by using the ‘--emulate join’ option as the first option after the command name. After that, you can use the same options you get with join.
For details of the join command, see section ‘Join Invocation’ in the GNU Coreutils Manual.
When tested against the test cases packaged with GNU Coreutils, results are
identical to join with a couple of exceptions:
combine produces records in the order of the second input file, with any
unmatched records from the first input file following in an arbitrary order.
combine can handle the space-delimited list, but the standard argument
handler getopt_long does not interpret it as a single argument. I don’t see
the need to overcome that.
There are also a number of features of combine
that come through in the
emulation. The main features relate to the keys: the sort order of the records
in relation to the keys does not matter to combine. combine also
allows you to specify a list of key fields (comma-delimited) rather than just
one as arguments to ‘-1’, ‘-2’, and ‘-j’. You should make
sure that the number of key fields is the same.
Another feature is that the second input file can actually be as many files as you want. That way you can avoid putting the records from several files together if not otherwise necessary.
This document was generated by Daniel P. Valentine on July 28, 2013 using texi2html 1.82.