The name mdiff stands for multi-diff: the program is meant to
encompass the functionality of a few other diff-type programs.
The prefix multi- also reflects the fact that the program is often
able to study more than two input files at once.
The theory of operation is simple. The program splits each input file
into a sequence of items, which may be lines or words. mdiff is
then said to operate either in line mode or in word mode.
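As an illustration only, the splitting step might be sketched in Python as follows. The function name split_items and the whitespace-based definition of a word are assumptions of this sketch; they do not describe how mdiff itself delimits items.

```python
def split_items(text, mode="line"):
    """Split text into items: whole lines in line mode,
    whitespace-separated words in word mode (an assumed definition)."""
    if mode == "line":
        return text.splitlines()
    if mode == "word":
        return text.split()
    raise ValueError(f"unknown mode: {mode!r}")


sample = "int a = 1;\nint b = 2;\n"
print(split_items(sample, "line"))  # ['int a = 1;', 'int b = 2;']
print(split_items(sample, "word"))  # ['int', 'a', '=', '1;', 'int', 'b', '=', '2;']
```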
It then tries to find sequences of items which are repeated in the
input files. Such common sequences are called clusters of items,
and each occurrence of a repetition is called a cluster member.
What remains, once all cluster members are conceptually removed from all
input files, is a set of differences. The role of mdiff is to
conveniently list cluster members and differences.
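To make the vocabulary concrete, here is a deliberately naive Python sketch that looks for repeated runs of a fixed number of items. It is not the algorithm used by mdiff, which is not described here; the name find_clusters and the fixed run length are assumptions made for the example. Each key of the result plays the role of a cluster, and each recorded (file, position) pair plays the role of a cluster member.

```python
from collections import defaultdict


def find_clusters(files, size=2):
    """Map each run of `size` consecutive items to the places where it occurs.

    `files` is a list of item lists. Only runs found more than once are
    kept. Runs have a fixed length here, a simplification of this sketch.
    """
    occurrences = defaultdict(list)
    for file_index, items in enumerate(files):
        for start in range(len(items) - size + 1):
            key = tuple(items[start:start + size])
            occurrences[key].append((file_index, start))
    return {key: places for key, places in occurrences.items() if len(places) > 1}


first = ["a", "b", "c", "x"]
second = ["y", "a", "b", "c"]
print(find_clusters([first, second], size=3))
# {('a', 'b', 'c'): [(0, 0), (1, 1)]}
```

In this toy input, the items "x" and "y" belong to no cluster member, so they are what remains as differences.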
When input files are very similar, it is likely that clusters will encompass many items (lines or words) and differences will be small. So, most listing options inhibit the printing of cluster members. However, one may ask for the first or last few items of cluster members to be printed nevertheless, as a way to provide some context for the difference. Those context items are sometimes said to be at the horizon of the difference. In merged listings, cluster members may be left unprinted, except possibly for a few context items at the beginning of the member (just after a difference) and a few context items at the end of the member (just before a difference).
When cluster members are short or, if you prefer, when the differences
are not far away from each other, it is quite possible that the required
context items cover the full extent of a cluster member, which is then
printed in full rather than inhibited. A run of differences intermixed
with such non-suppressed members is called a hunk.
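The way context items glue neighbouring differences into a single hunk can be sketched as follows. This is only an illustration under simplifying assumptions: the merged listing is modelled as a list of booleans (True for a difference item, False for a common item), and the function name hunks and its context parameter are invented for the example.

```python
def hunks(is_difference, context=3):
    """Return (start, end) index ranges, end exclusive, one per hunk.

    Each difference item is widened by `context` items on both sides.
    Ranges that overlap or touch are merged, which is the case where
    context covers a whole cluster member.
    """
    ranges = []
    for index, differs in enumerate(is_difference):
        if not differs:
            continue
        start = max(0, index - context)
        end = min(len(is_difference), index + context + 1)
        if ranges and start <= ranges[-1][1]:
            # Nothing would be elided between the two ranges: same hunk.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges


flags = [False, False, True, False, True, False, False, False, False, True, False]
print(hunks(flags, context=1))  # [(1, 6), (8, 11)]
```

In the sample run, the single common item between the first two differences is entirely covered by context, so both differences land in one hunk, while the third difference starts a new hunk.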
Some reports produced by mdiff are shown as a list of hunks,
and it is to be understood that common items are elided between hunks.
However, each hunk in itself has no item missing, and each item of the
hunk is analysed as pertaining either to only one of the input files or
to several of them. Each hunk is preceded by a header, which gives the
line position in every input file just before the hunk itself. By
comparing a hunk header with the previous hunk header, the user can get
an idea of how much printing was spared.
When two input files are quite similar, clusters usually appear in
the same order in both files. If a cluster member A in the first
file corresponds to a cluster member A in the second file, it is
likely that another cluster member B which appears after
A in the first file will correspond to a cluster member B
in the second file which appears after A as well. So, in
many cases, while producing a merged listing of files, cluster members
may be made to naturally correspond to one another. However, this is not
always true, in particular when the second file has been produced from
the first by moving a big chunk of code away from its original position.
In such cases, we say that members have crossed. When members are
crossed and mdiff has to make a merged listing, it selects one
cluster member as being naturally associated with its correspondent
(either the pair of A’s or the pair of B’s) and then considers
the other cluster as being part of a difference. The crossed nature of
the member may still be analysed and reported, or it may be ignored.
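Whether two clusters have crossed can be checked with a few lines of Python. The sketch below is an illustration, not the method mdiff uses. It assumes that every cluster has exactly one member in each of the two files and that each file is summarised by the order in which the cluster names appear; the name crossed_pairs is hypothetical.

```python
from itertools import combinations


def crossed_pairs(order_in_first, order_in_second):
    """Return pairs of cluster names whose relative order differs
    between the two files."""
    rank = {name: position for position, name in enumerate(order_in_second)}
    crossed = []
    for left, right in combinations(order_in_first, 2):
        # `left` comes before `right` in the first file; they have
        # crossed if `left` comes after `right` in the second file.
        if rank[left] > rank[right]:
            crossed.append((left, right))
    return crossed


print(crossed_pairs(["A", "B", "C"], ["A", "B", "C"]))  # []
print(crossed_pairs(["A", "B", "C"], ["B", "C", "A"]))  # [('A', 'B'), ('A', 'C')]
```

In the second call, the chunk A has been moved after B and C, so A has crossed both of them while B and C still correspond naturally.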
The standard diff program is meant for the case of exactly two
input files, for which crossed members should be ignored. The mdiff
output format has been designed in such a way that it should resemble
diff output for this precise case. However, diff formats are
not sufficient for representing all cases which mdiff may address,
and the formats for those other cases are not mature yet. That is why
mdiff, in its current state, still experiments with output formats,
which are subject to change.
When the input files are not very similar, merged listings are neither
very meaningful nor useful, and may even be rather confusing. The best
thing to do in such cases is to use mdiff to make an annotated
relisting of all input files, in which cluster members are properly
identified and cross-referenced to one another.
Statistics.
Read summary: 137 files, 41975 lines
Work summary: 439 clusters, 1608 members, 8837 duplicate lines
The summary lines, triggered by the -s option, say that about 8837
non-ignorable lines could be removed out of the 41975 which have been
read, by using functions, #include, #define, or similar devices.
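For a sense of proportion, the two summary lines can be turned into a duplication ratio. The Python sketch below merely parses the sample text quoted in this section; the helper name parse_summary is hypothetical, and the exact layout of the summary lines is assumed to be as shown.

```python
import re

summary = """Read summary: 137 files, 41975 lines
Work summary: 439 clusters, 1608 members, 8837 duplicate lines"""


def parse_summary(text):
    """Collect every 'count label' pair of the summary into a dict."""
    pairs = re.findall(r"(\d+) ([a-z ]+)", text)
    return {label.strip(): int(count) for count, label in pairs}


stats = parse_summary(summary)
ratio = stats["duplicate lines"] / stats["lines"]
print(stats["duplicate lines"], "of", stats["lines"], "lines")  # 8837 of 41975 lines
print(f"{ratio:.1%}")  # 21.1%
```

In this sample, roughly one line in five is reported as a duplicate.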
If one manages to execute mdiff within GNU Emacs so the output
described above is collected into the *compilation* buffer, the
command C-x ` (‘M-x next-error’) will proceed to the next
cluster member in the other window, and similarly for other compilation
mode commands. This is a useful way of handling mdiff output.
Each line in the hunk, after the header, comes from the compared files, but is shifted right so the first column (or the first few columns) of each line tells where the line comes from. A space indicates a line which is common to all files. When there are only two input files, a minus sign indicates a line from the first file and a plus sign a line from the second file. Otherwise, a letter from ‘a’ to ‘z’, or more than one letter if there are more than 26 files, indicates which file the line pertains to. If a line or a block of lines pertains to several files but not to all of them, the first column holds a vertical bar, and the line or block of lines is bracketed between ‘@/’ and ‘@\’ lines, which act as a kind of comment within the hunk. The initial bracket lists the letters of all files to which the bracketed lines pertain.
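The rules for the first column may be summarised by a small Python function. This is a sketch of the rules as stated in the paragraph above, not code taken from mdiff. The name marker is hypothetical, files are numbered from zero, and the multi-letter scheme used beyond 26 files is not specified here, so the sketch refuses that case rather than guess.

```python
import string


def marker(owners, file_count):
    """Pick the first-column marker for a line.

    `owners` is the set of 0-based indices of the files containing the
    line; `file_count` is the total number of input files.
    """
    owners = set(owners)
    if not owners:
        raise ValueError("a line must come from at least one file")
    if len(owners) == file_count:
        return " "  # common to all files
    if len(owners) > 1:
        return "|"  # several files, but not all: bracketed block
    (index,) = owners
    if file_count == 2:
        return "-" if index == 0 else "+"
    if index >= len(string.ascii_lowercase):
        raise ValueError("multi-letter markers are not covered by this sketch")
    return string.ascii_lowercase[index]


print(repr(marker({0, 1, 2}, 3)))  # ' '
print(repr(marker({1}, 3)))        # 'b'
print(repr(marker({0, 2}, 3)))     # '|'
print(repr(marker({0}, 2)))        # '-'
```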
I initially wrote mdiff specifically to help clean up a rather
large C++ project, in which many big monolithic classes
were derived from each other, most probably by rough copying followed by
local modifications. I intended to fragment most common clusters and
segregate the parts into virtual methods in outer classes, and override
these methods, as appropriate, with less common variants within inner
classes. mdiff was good at pointing me to exactly where I should
look. Of course, it never did the cleanup for me, but it helped with
the research about what should be done. Reusing mdiff on the
half-cleaned project gave me a more fine-grained analysis of what was
left to consider.
• mdiff invocation: Invoking mdiff
• Efficiency: Resource considerations and efficiency