Recently, as a result of our articles on checking email addresses, I received an inquiry from Kurt Sager of SWS (kurt.sager@sws.ch), the Robelle distributor in Switzerland, asking how to verify a large number of URL addresses automatically:
In some databases we have many Internet URLs, mostly links to home pages or to an HTML page inside a web site. It happens that such addresses contain typing errors, or the web site disappears ... we all know the problem!
We can easily and automatically create a flat file, say once a month, containing one URL per line, possibly many thousands of lines.
We need a utility to check the validity of all the addresses, and possibly write a new file with the invalid addresses for other actions to take.
I didn't know of a method to do this, so I posted the question on the HP3000-L (a great resource, browse recent messages). Mike Hornsby suggested I look at Lars Appel's port of the GNU "wget" utility:
www.editcorp.com/Personal/Lars_Appel/wget/
I downloaded and installed it, and it seems to work just fine.
/l testurls
1 http://www.robelle.com
2 http://www.robelle.com/tips/qedit-glue.html
3 http://www.robelle.com/bogus
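xeq sh.hpbin.sys "-L -c ""wget -nv -i /SYS/TESTING/TESTURLS -o /SYS/TESTING/RESULT -O /dev/null"""
Notes: -nv keeps the log terse, -i names the file of URLs to read, -o names the file that receives the log, and -O /dev/null discards the downloaded pages.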
/l result
1 10:47:22 URL:http://www.robelle.com:80/ [12296] -> "/dev/null" [1]
2 10:47:23 URL:http://www.robelle.com:80/tips/qedit-glue.html [4405] -> "/dev/null" [1]
3 http://www.robelle.com:80/bogus:
4 10:47:23 ERROR 404: Not Found.
5
6 FINISHED --10:47:23--
7 Downloaded: 16,701 bytes in 2 files
This should be reasonably easy to massage into a list of failed addresses with Qedit.
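For readers who would rather script this step than do it in Qedit, here is a minimal sketch in Python. It is an alternative to the Qedit approach, not part of the procedure above. It assumes the raw log file (without the Qedit line numbers shown in the listing) and assumes that every failure looks like the 404 above: a line holding the URL followed by a colon, then a timestamped ERROR line. Other kinds of failure, such as an unknown host, may be logged differently and are not handled. The function name failed_urls is made up for this illustration.

```python
import re

# Matches a wget error line such as "10:47:23 ERROR 404: Not Found."
ERROR_RE = re.compile(r"^\d{2}:\d{2}:\d{2} ERROR (\d+): (.*)$")


def failed_urls(log_lines):
    """Return (url, status, reason) tuples for each failed fetch.

    Assumes the two-line failure format seen in the sample log:
    a "URL:" line, then a timestamped "ERROR nnn: reason" line.
    """
    failures = []
    pending_url = None  # URL from the previous line, if it looked like "http...:"
    for raw in log_lines:
        line = raw.strip()
        match = ERROR_RE.match(line)
        if match and pending_url is not None:
            failures.append((pending_url, int(match.group(1)), match.group(2)))
            pending_url = None
        elif line.startswith("http") and line.endswith(":"):
            pending_url = line[:-1]  # drop the trailing colon
        elif line:
            pending_url = None  # any other non-blank line breaks the pairing
    return failures


# The log from the test run above, without the Qedit line numbers.
sample = """\
10:47:22 URL:http://www.robelle.com:80/ [12296] -> "/dev/null" [1]
10:47:23 URL:http://www.robelle.com:80/tips/qedit-glue.html [4405] -> "/dev/null" [1]
http://www.robelle.com:80/bogus:
10:47:23 ERROR 404: Not Found.

FINISHED --10:47:23--
Downloaded: 16,701 bytes in 2 files
""".splitlines()

for url, status, reason in failed_urls(sample):
    print(url, status, reason)
```

Note that wget writes the port number into the logged URL (www.robelle.com:80), so the reported addresses will not match the input file character for character.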
Hans.Hendriks@robelle.com
January 29, 2001