deadlinkcheck - Dead Link Check (v0.4.0)
deadlinkcheck [-help] [-verb | -Verb [-indicator]] [-proxy proxy | -Proxy] [[-timeCache value] | [-noCache]] [-Timeout value[:maxvalue]] [-later [percent]] [-userRedirect] [-Content [rule[:...]]] [[-output [filename]] [-splitOutput]] [-rawOutput | -detOutput] [-codeConversion] [-HTMLoutput] [-Dif] urls | filename
deadlinkcheck is still evolving. The current version is stable, but may not be fully functional.
The option -Content (introduced in v0.4.0) is considered beta.
Dead Link Check (DLC) is a Perl script designed to check the validity of HTTP references. The script may use and generate a cache file, so that network requests are not repeated when the user only wants to check newly added entries. The script reads entries from a file (or a list of links from the command line) and writes its results to one or more files (or to STDOUT).
DLC was created as an extension to Public Bookmark Generator (PBM), but can be used by itself.
To see the option values and their defaults, run deadlinkcheck -help.
-help : print a description of all the deadlinkcheck options.
-verb : run the script in verbose mode, printing advanced information on STDERR.
-Verb : run the script in maximum verbose mode, printing advanced run information on STDERR.
-indicator : show a progress indicator (in percent) at the beginning of the verbose output (used with -Verb).
-proxy proxy : set the http and ftp proxy to proxy.
-Proxy : get proxy information from the environment variables (http_proxy, ftp_proxy).
-timeCache value : set the number of days for which cached links are trusted to value. A value of 0 deletes every entry in the cache.
-noCache : check links without reading information from, or writing it to, the cache file.
-Timeout value[:maxvalue] : set the network timeout for requests to value. On a retry, the script may extend the timeout up to maxvalue.
-later [percent] : some requests to ftp links may not succeed at a given time. This option allows those links to be retried later in the processing. The optional percent argument forces the ftp request to be retried every percent (or so).
-userRedirect : follow user HTTP-EQUIV redirections. This forces a GET to check whether the page uses a META command to force a redirection.
-Content [rule[:...]] : parse the content of the checked web page for information that may indicate that the page is an error return (some web servers do not always return proper HTTP responses) and/or was moved. This option is based on rules, which are of two types : error or move. Use those type names to select all error or all move rules. The help rule prints information on all the selected rules.
-output [filename] : write results to filename. If the option is not used, results are printed to STDOUT. If no filename is provided, results are saved to the default option value.
-splitOutput : write results to different files according to the first digit of their HTTP return code.
-rawOutput : output only raw HTTP addresses. It is recommended to use this output mode only if output is saved into split files.
-detOutput : output detailed results.
-codeConversion : print text instead of the return code in detailed output mode.
-HTMLoutput : output results as HTML code (easier to follow links).
-Dif : do not print the ``DLC information Footer'' (added at the end of HTML outputs).
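As a rough illustration of the value[:maxvalue] syntax taken by -Timeout, the Python sketch below splits such an argument into its two parts. The function name parse_timeout is hypothetical, and the rules that maxvalue defaults to value and may not be smaller than it are assumptions, not behaviour taken from the deadlinkcheck source.

```python
def parse_timeout(arg):
    """Split a 'value[:maxvalue]' string into a (value, maxvalue) pair of integers."""
    value_text, sep, max_text = arg.partition(":")
    value = int(value_text)
    # Assumption: without an explicit maxvalue, the timeout is never extended.
    maxvalue = int(max_text) if sep else value
    if value <= 0 or maxvalue < value:
        raise ValueError(f"invalid timeout specification: {arg!r}")
    return value, maxvalue


print(parse_timeout("30"))      # (30, 30)
print(parse_timeout("30:120"))  # (30, 120)
```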
As of now only links starting with file:/, ftp://, and http:// are supported.
Links can be given to the script in two forms : on the command line, or from an input file. The input file may be provided by the user or created by Public Bookmark Generator. Each line of the input file may contain up to two fields, separated by a tab character; the second field is optional. The fields are HTTPlink and HTTPname, where the first is a fully qualified HTTP reference and the second is the name to be printed for this link.
In a file created by Public Bookmark Generator, the second field holds the fully qualified name of the entry in the bookmark list (folders are separated by |).
Example : |Work Informations|Developpements|Public Bookmark Generator
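The input format described above can be sketched in Python as follows. This is only an illustration, not DLC's own parser : the function name parse_entry is hypothetical, falling back to the link when no name is given is an assumption, and pairing the example name with the PBM web page is made up for the demonstration.

```python
SUPPORTED_PREFIXES = ("file:/", "ftp://", "http://")


def parse_entry(line):
    """Return (link, name, path) for one input line, or None if the link is unsupported."""
    link, _, name = line.rstrip("\n").partition("\t")
    if not link.startswith(SUPPORTED_PREFIXES):
        return None
    # Assumption: with no second field, the link doubles as its display name.
    name = name or link
    # PBM-style names start with '|' and use '|' between folders.
    path = [part for part in name.split("|") if part] if name.startswith("|") else []
    return link, name, path


entry = parse_entry(
    "http://pbm.sourceforge.net/\t|Work Informations|Developpements|Public Bookmark Generator\n"
)
print(entry[2])  # ['Work Informations', 'Developpements', 'Public Bookmark Generator']
```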
When the Maximum Verbose option is used, several pieces of information are printed, in this order :
[ xxx.xx % ] : a progress indicator (if indicator option is selected).
* : means that the script is retrying an ftp site (later option is selected).
(url) : indicates the url being checked at that time.
@ : indicates that the script is making a network connection (by using the cache, some already checked network connections can be avoided).
[return code and action] : indicates the return code for this request and, optionally, the action to be taken.
Example :
[ 80.00 % ] * (ftp://ftp.redhat.com/) @ [401 -> Retry later]
means that the script has already processed 80 % of the provided urls and is retrying an ftp site, whose url is ftp://ftp.redhat.com/, over a network connection; the result of this request is 401 (Unauthorized), so the action to be taken is to Retry later.
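For readers who want to post-process the maximum verbose output, the following Python sketch parses one such line. The regular expression is inferred from the single example above, so the exact spacing it accepts is an assumption, and urls containing a closing parenthesis are not handled. The name parse_verbose_line is hypothetical.

```python
import re

VERBOSE_LINE = re.compile(
    r"^(?:\[\s*(?P<percent>\d+(?:\.\d+)?)\s*%\s*\]\s*)?"  # optional progress indicator
    r"(?P<retry>\*\s*)?"                                   # optional retry marker
    r"\((?P<url>[^)]*)\)\s*"                               # url being checked
    r"(?P<network>@\s*)?"                                  # optional network marker
    r"\[(?P<code>\d{3})(?:\s*->\s*(?P<action>[^\]]+))?\]\s*$"  # return code and action
)


def parse_verbose_line(line):
    """Return the fields of one verbose line as a dict, or None if it does not match."""
    match = VERBOSE_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "percent": float(match["percent"]) if match["percent"] else None,
        "retry": match["retry"] is not None,
        "url": match["url"],
        "network": match["network"] is not None,
        "code": int(match["code"]),
        "action": match["action"],
    }


fields = parse_verbose_line("[ 80.00 % ] * (ftp://ftp.redhat.com/) @ [401 -> Retry later]")
print(fields["code"], fields["action"])  # 401 Retry later
```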
Dead Link Check uses a link cache file to speed up access to the same link across multiple runs. The cache file (stored in the directory of the run) contains time stamp information, so that after a certain time the cached links are no longer considered valid.
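The expiry idea can be sketched in Python as below. The on-disk format of the DLC cache is not described here, so the in-memory layout (a dict mapping url to a return code and a check time in epoch seconds) and the names is_trusted and prune are assumptions made for illustration only.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def is_trusted(checked_at, trusted_days, now=None):
    """True if an entry checked at `checked_at` (epoch seconds) is still trusted."""
    now = time.time() if now is None else now
    return (now - checked_at) < trusted_days * SECONDS_PER_DAY


def prune(cache, trusted_days, now=None):
    """Keep only trusted entries; `cache` maps url -> (code, checked_at)."""
    return {
        url: (code, checked_at)
        for url, (code, checked_at) in cache.items()
        if is_trusted(checked_at, trusted_days, now)
    }


cache = {"http://a/": (200, 0), "http://b/": (404, 9 * SECONDS_PER_DAY)}
print(prune(cache, 5, now=10 * SECONDS_PER_DAY))  # {'http://b/': (404, 777600)}
```

With trusted_days set to 0 no entry passes the test, which matches the documented effect of a -timeCache value of 0.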
RFC 2616 tells us that :
Informational 1xx 100 is Continue, 101 is Switching Protocols,
Successful 2xx 200 is OK, 201 is Created, 202 is Accepted, 203 is Non-Authoritative Information, 204 is No Content, 205 is Reset Content, 206 is Partial Content,
Redirection 3xx 300 is Multiple Choices, 301 is Moved Permanently, 302 is Moved Temporarily, 304 is Not Modified, 305 is Use Proxy, 307 is Temporary Redirect,
Client Error 4xx 400 is Bad Request, 401 is Unauthorized, 403 is Forbidden, 404 is Not Found, 405 is Method Not Allowed, 406 is Not Acceptable, 407 is Proxy Authentication Required, 408 is Request Time-out, 409 is Conflict, 410 is Gone, 411 is Length Required, 412 is Precondition Failed, 413 is Request Entity Too Large, 414 is Request-URI Too Large, 415 is Unsupported Media Type, 416 is Requested range not satisfiable, 417 is Expectation Failed,
Server Error 5xx 500 is Internal Server Error, 501 is Not Implemented, 502 is Bad Gateway, 503 is Service Unavailable, 504 is Gateway Time-out, 505 is HTTP Version not supported.
A special return code (399) has been introduced to handle user moved web pages (using HTTP-EQUIV in the HTML source code) when the option -userRedirect is set.
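To make the HTTP-EQUIV case concrete, here is a minimal Python sketch that looks for a META refresh target in a page. It is not the detection code used by DLC : the names MetaRefreshFinder and find_meta_refresh are hypothetical, only the common "delay; url=target" form of the content attribute is handled, and relative targets are not resolved.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "refresh":
            return
        # The content attribute looks like "5; url=http://host/new".
        _, _, rest = attributes.get("content", "").partition(";")
        key, _, url = rest.strip().partition("=")
        if key.strip().lower() == "url" and url.strip():
            self.target = url.strip().strip("'\"")


def find_meta_refresh(html_text):
    """Return the META refresh target found in `html_text`, or None."""
    finder = MetaRefreshFinder()
    finder.feed(html_text)
    return finder.target


page = '<html><head><META HTTP-EQUIV="Refresh" CONTENT="0; URL=http://dlc.sourceforge.net/"></head></html>'
print(find_meta_refresh(page))  # http://dlc.sourceforge.net/
```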
Another special return code (398) is used when detecting an infinite loop redirection.
A redirect code (397) may be seen when the option -Content is used with at least one rule of type move. It indicates that the web page may have moved.
A special page not found code (499) may be seen when the option -Content is used with at least one rule of type error. It indicates that the web page may not exist.
Some ftp sites may return 400 and 401 codes even though they exist (they may just be unavailable at the time of the request).
Some http sites may return a 500 code even though they exist (the site may not have answered before the timeout).
Note that if the script encounters return codes not defined in RFC 2616, it will output those links in a special section.
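The grouping of return codes described above can be summarised with a short Python sketch. It mirrors the code list and the special codes given in this document rather than the DLC source; the names classify and split_by_first_digit and the label strings are hypothetical.

```python
CLASSES = {1: "Informational", 2: "Successful", 3: "Redirection",
           4: "Client Error", 5: "Server Error"}

# Codes enumerated in the list above.
LISTED_CODES = (
    {100, 101}
    | set(range(200, 207))
    | {300, 301, 302, 304, 305, 307}
    | {400, 401}
    | set(range(403, 418))
    | set(range(500, 506))
)

SPECIAL_CODES = {
    397: "may have moved (-Content, move rule)",
    398: "infinite redirection loop",
    399: "user redirection (HTTP-EQUIV)",
    499: "may not exist (-Content, error rule)",
}


def classify(code):
    """Return a short section label for a return code."""
    if code in SPECIAL_CODES:
        return f"DLC special: {SPECIAL_CODES[code]}"
    if code in LISTED_CODES:
        return f"{CLASSES[code // 100]} {code // 100}xx"
    return "not in the list above"


def split_by_first_digit(results):
    """Group (url, code) pairs by the first digit of the code, as -splitOutput is described."""
    groups = {}
    for url, code in results:
        groups.setdefault(str(code)[0], []).append(url)
    return groups


print(classify(404))  # Client Error 4xx
print(classify(399))  # DLC special: user redirection (HTTP-EQUIV)
```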
A WWW cache server can be used through either the -proxy or the -Proxy option. The first reads the proxy server from the command line; the second extracts it from the environment variables (http_proxy and ftp_proxy).
If no proxy option is used, the script will run without proxy.
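The two ways of selecting a proxy can be sketched in Python as follows. The function names and the returned dict layout are hypothetical, and proxy.example is a placeholder host; only the variable names http_proxy and ftp_proxy come from the text above.

```python
import os


def proxies_from_argument(proxy):
    """-proxy style: one proxy given on the command line, used for http and ftp."""
    return {"http": proxy, "ftp": proxy}


def proxies_from_environment(environ=None):
    """-Proxy style: read http_proxy and ftp_proxy, skipping variables that are unset."""
    environ = os.environ if environ is None else environ
    proxies = {}
    for scheme in ("http", "ftp"):
        value = environ.get(f"{scheme}_proxy")
        if value:
            proxies[scheme] = value
    return proxies


print(proxies_from_environment({"http_proxy": "http://proxy.example:3128"}))
# {'http': 'http://proxy.example:3128'}
```

An empty result corresponds to running without a proxy.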
The -Content option is considered to be in beta status.
The rules are based on regular expression parsing of the content of the checked web page. This slows down the processing of DLC and may not return a proper result, so it is recommended to check the web page yourself to verify. Since it is based on text processing, it can only recognize entries for which it has rules, and may well miss some error or moved links.
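The general shape of rule-based content checking can be sketched in Python as below. The real rules live in the Perl source of deadlinkcheck and are not reproduced here : the two rules, their names and their patterns are invented for illustration, as are the function names. Only the two rule types (error and move), the colon-separated selection syntax and the codes 499 and 397 come from this document.

```python
import re

# Hypothetical rules: (name, type, compiled pattern).
RULES = [
    ("not_found_text", "error",
     re.compile(r"\b(?:page|file)\s+not\s+found\b", re.IGNORECASE)),
    ("has_moved_text", "move",
     re.compile(r"\b(?:page|site)\s+has\s+moved\b", re.IGNORECASE)),
]

CODE_FOR_TYPE = {"error": 499, "move": 397}


def select_rules(selection):
    """Pick rules by rule name or by type name from a 'rule[:...]' selection."""
    wanted = set(selection.split(":"))
    return [rule for rule in RULES if rule[0] in wanted or rule[1] in wanted]


def apply_rules(content, rules):
    """Return the special code of the first matching rule, or None if none match."""
    for _name, kind, pattern in rules:
        if pattern.search(content):
            return CODE_FOR_TYPE[kind]
    return None


print(apply_rules("Sorry, page not found.", select_rules("error")))  # 499
print(apply_rules("This page has moved.", select_rules("move")))     # 397
```

A page that matches no selected rule yields None, which is why a result without 397 or 499 does not prove that the page is fine.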
Extending the rules is an easy process if you know how to use regular expressions in Perl, and are willing to edit the code of ``deadlinkcheck''. Inside the source code is a section called ``How to create a rule ?'' which should help you create (or modify) some rules. If you do so, please send a diff (or simply the function) by e-mail to martial@users.sourceforge.net.
You can find the Dead Link Check homepage at : http://dlc.sourceforge.net/. You may also want to check Public Bookmark Generator, whose web page is http://pbm.sourceforge.net/.
For bug reporting please send an e-mail to the author at martial@users.sourceforge.net with [DLC] in the title.
The author wishes to thank the following people, who helped improve this script :
Marc Bednarek and Jimmy Graham for helping to solve tricky little response codes and providing me with some examples and information on how to resolve or do some http requests.
Wojciech Zwiefka for reporting a bug on infinite redirection loops.
Geoffrey Leach for reporting a bug on Timeout/maxTimeout interaction.
Josha Foust for reporting an http to ftp redirect bug.
Josha Foust for reporting a redirect bug, and a bug on url naming.
Olivier Galibert for reporting the lowercased URLs bug.
The sourceforge team for the fantastic job they are doing providing Open Source Coders such facilities.
Copyright (C) 1999 Martial MICHEL
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
More license information : http://www.gnu.org/copyleft/gpl.html
v0.0 : April 1st, 1999
v0.1 : April 12th, 1999
v0.1.1 : April 15th, 1999
v0.1.2 : April 16th, 1999
v0.2 : May 10th, 1999
v0.2.1 : May 11th, 1999
v0.3 : July 21st, 1999
v0.3.1 : October 4th, 1999
v0.3.2 : October 6th, 1999
v0.4.0 : December 7th, 1999
Martial MICHEL (martial@users.sourceforge.net)