Most people who make heavy use of Internet resources have a large
bookmark file with pointers to interesting web sites. It is impossible
to regularly check by hand if any of these sites have changed. A program
is needed to automatically look at the headers of web pages and tell
which ones have changed. URLCHK does the comparison after using GETURL
with the HEAD method to retrieve the header.
Like GETURL, this program first checks that it is called with exactly
one command-line parameter. URLCHK also takes the same command-line variables
Proxy and ProxyPort as GETURL,
because these variables are handed over to GETURL for each URL
that gets checked. The one and only parameter is the name of a file that
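contains one line for each URL. In the first column, we find the URL, and
the second and third columns hold the length of the URL’s body as measured
by the two most recent checks. Now, we follow this plan: read the URLs from
the file and remember their most recent lengths; overwrite the file; for each
URL, fetch its new length and write it to the file; and, if the most recent
and the new length differ, report the URL to the user.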
It may seem a bit peculiar to read the URLs from a file together with their two most recent lengths, but this approach has several advantages. You can call the program again and again with the same file. After running the program, you can list the changed URLs by extracting those lines that differ in their second and third columns. The program looks like this:
    BEGIN {
      if (ARGC != 2) {
        print "URLCHK - check if URLs have changed"
        print "IN:\n the file with URLs as a command-line parameter"
        print " file contains URL, old length, new length"
        print "PARAMS:\n -v Proxy=MyProxy -v ProxyPort=8080"
        print "OUT:\n same as file with URLs"
        print "JK 02.03.1998"
        exit
      }
      URLfile = ARGV[1]; ARGV[1] = ""
      if (Proxy != "") Proxy = " -v Proxy=" Proxy
      if (ProxyPort != "") ProxyPort = " -v ProxyPort=" ProxyPort
      while ((getline < URLfile) > 0)
        Length[$1] = $3 + 0
      close(URLfile) # now, URLfile is read in and can be updated
      GetHeader = "gawk " Proxy ProxyPort " -v Method=\"HEAD\" -f geturl.awk "
      for (i in Length) {
        GetThisHeader = GetHeader i " 2>&1"
        while ((GetThisHeader | getline) > 0)
          if (toupper($0) ~ /CONTENT-LENGTH/) NewLength = $2 + 0
        close(GetThisHeader)
        print i, Length[i], NewLength > URLfile
        if (Length[i] != NewLength) # report only changed URLs
          print i, Length[i], NewLength
      }
      close(URLfile)
    }
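The extraction of changed URLs mentioned before the listing can be sketched as a one-line awk filter wrapped in a shell function. This is only a minimal sketch: the function name changed_urls and the file name urls.txt are hypothetical and not part of URLCHK.

```shell
# Hypothetical helper: print the lines of a URL file whose old length
# (column 2) and new length (column 3) differ.
changed_urls() {
    awk '$2 != $3' "$1"
}

# Example call, assuming the URL file is named urls.txt:
# changed_urls urls.txt
```

Because both columns hold plain numbers, awk compares them numerically, so a line such as "http://b.example/ 200 250" is printed while "http://a.example/ 100 100" is not.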
Another thing that may look strange is the way GETURL is called.
Before calling GETURL, we have to check if the proxy variables need
to be passed on. If so, we prepare strings that will become part
of the command line later. In GetHeader, we store these strings
together with the longest part of the command line. Later, in the loop
over the URLs, the URL and a redirection operator are appended to
GetHeader to form the command that reads the URL’s header over the Internet.
GETURL always sends the headers to /dev/stderr. That is
the reason why we need the redirection operator to have the header
piped in.
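The effect of the redirection operator can be tried without network access. The following is a minimal sketch, not part of URLCHK: the function name header_length is hypothetical, and a subshell that writes a made-up Content-Length line to standard error stands in for GETURL.

```shell
# Hypothetical demonstration: read a header line that a command writes
# to stderr by appending " 2>&1" before piping it into getline.
header_length() {
    awk 'BEGIN {
        # Stand-in for GETURL: emits a fake header on stderr only
        Cmd = "( echo \"Content-Length: 1234\" >&2 )"
        ThisCmd = Cmd " 2>&1"      # merge stderr into the pipe
        while ((ThisCmd | getline) > 0)
            if (toupper($0) ~ /CONTENT-LENGTH/)
                NewLength = $2 + 0
        close(ThisCmd)
        print NewLength
    }'
}
```

Without the " 2>&1" suffix, the header line would bypass the pipe and getline would read nothing.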
This program is not perfect because it assumes that a page whose content has changed also has a changed length, which is not necessarily true. A more advanced approach is to look at some other header line that holds time information, such as Last-Modified. But, as always when things get a bit more complicated, this is left as an exercise to the reader.
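As a starting point for that exercise, the time information could be pulled out of the header much as the length is. The following is a rough sketch under stated assumptions: the function name last_modified is hypothetical, the header text is expected on standard input, and the server is assumed to send a Last-Modified line at all, which is not guaranteed.

```shell
# Hypothetical helper: print the value of the first Last-Modified
# header line found on standard input.
last_modified() {
    awk 'toupper($0) ~ /^LAST-MODIFIED:/ {
        sub(/^[^:]*:[ \t]*/, "")   # drop the header name and the colon
        sub(/\r$/, "")             # drop a trailing carriage return
        print
        exit
    }'
}
```

Storing this string in the URL file instead of the length, and comparing old and new values as strings, would report a page as changed whenever the server reports a different modification time.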