filter-html.cgi Homepage


Introduction

I have written a little CGI script to filter the annoying advert banners from sites like AltaVista, Yahoo! and HotBot. It is a Tcl script that works by converting the HTML code into very similar Tcl code, which it then executes to recreate the HTML. However, traps in the regeneration code can delete certain blocks according to simple patterns. For example, removing entire <A HREF=...>...</A> blocks whose HREF matches "*doubleclick.net*" removes many of the adverts placed by that organization.
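The pattern-matching idea can be illustrated outside Tcl. What follows is only a rough Python sketch of the effect, not the script's actual mechanism (which regenerates the page from Tcl code). The names BAD_HREF_PATTERNS and strip_bad_anchors are hypothetical, and the regular expression assumes anchors are not nested.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical pattern list for illustration; the real script keeps its
# patterns in Tcl definitions at the top of the file.
BAD_HREF_PATTERNS = ["*doubleclick.net*"]

# Matches one non-nested <A HREF=...>...</A> block, case-insensitively.
# Group 1 captures the HREF value, with or without surrounding quotes.
ANCHOR_RE = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)[^>]*>.*?</a>',
    re.IGNORECASE | re.DOTALL,
)


def strip_bad_anchors(html, patterns=BAD_HREF_PATTERNS):
    """Delete every anchor block whose HREF matches one of the glob patterns."""

    def replace(match):
        href = match.group(1).lower()
        if any(fnmatchcase(href, p.lower()) for p in patterns):
            return ""          # drop the whole block, contents included
        return match.group(0)  # keep every other anchor untouched

    return ANCHOR_RE.sub(replace, html)
```

Calling strip_bad_anchors on a page removes each matching anchor together with whatever it encloses (typically the banner image) and leaves other links alone.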

Trying it out

The filter works by passing an entire URL to the filter script. It uses the URL-extension paradigm: you append the URL in question to the end of the location of the CGI script. My installation is at:

http://crab.icsi.berkeley.edu:8080/~dpwe/cgi-bin/filter-html.cgi

To filter the front page of AltaVista, I append its URL to the end, like this:

http://crab.icsi.berkeley.edu:8080/~dpwe/cgi-bin/filter-html.cgi/http://altavista.digital.com
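Under the CGI convention, whatever follows the script name in the request path is handed to the script in the PATH_INFO environment variable. The Python sketch below shows both directions of the URL-extension trick under that assumption. The function names are hypothetical, and the slash repair is a guess at a common server quirk (some servers collapse "//" in the extra path), not something the script is known to do.

```python
import re


def filtered_url(script_url, target_url):
    """Append the target URL to the script location as extra path."""
    return script_url.rstrip("/") + "/" + target_url


def target_from_path_info(path_info):
    """Recover the target URL from a CGI PATH_INFO value."""
    target = path_info.lstrip("/")
    # Restore "scheme://" in case the server collapsed the double slash.
    return re.sub(r"^([A-Za-z][A-Za-z0-9+.-]*):/+", r"\1://", target)
```

For example, filtered_url applied to the script location and "http://altavista.digital.com" yields the combined address shown above, and target_from_path_info("/http://altavista.digital.com") gives back the original target.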

You can try it for the following sites:

Getting it

If you weren't able to try it out with the above links, it's because I've restricted access to certain sites; my poor machine can't filter pages for everybody! But provided you are able to install your own cgi-bin scripts (i.e. in ~/cgi-bin, like mine), you can probably have your own version of the filter. It needs version 7.5 or later of the Tcl shell, tclsh, to be installed. Although tclsh has been ported to the Macintosh and Windows, I only know about the CGI interface on Unix, so that's where it works.

You can get the compressed script here.

Customization

With your own version of the script, you can change the fields that are excluded by editing the patterns in the "badhrefpatterns" and "badimgpatterns" definitions at the top of the script. Even if you don't know Tcl, you should be able to figure out how to change these.

Known problems

This was a quick hack; it relies on a 'dirty' but amusing trick to convert HTML to executable Tcl (first seen by me in a review of an old Tcl workshop that I can't now find). The problem is that the explicit </TAG> end-of-block markers in HTML are converted willy-nilly into close-brackets ("]") in the Tcl source, so if the HTML doesn't strictly observe the nesting of its blocks, the meaning of the two forms will be different. Thus, the main problem is that it is easily broken by HTML that doesn't comply with this overly strict definition. That said, it works on a lot of pages I've tried it with. Since it's mainly for automatically-generated index pages of commercial search sites, they tend to be fairly consistent (although the current AltaVista front page is missing a </FONT> at the bottom).
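To see which pages would trip the filter, the strict-nesting requirement can be checked separately. Here is a minimal Python sketch using the standard html.parser module. It is not part of the Tcl script, the names NestingChecker and nesting_problems are hypothetical, and the VOID_TAGS set is an assumed (incomplete) list of tags that never take an end tag.

```python
from html.parser import HTMLParser

# Assumed list of tags that have no end tag; they are never pushed.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "base", "area"}


class NestingChecker(HTMLParser):
    """Record every place where tags fail to nest strictly."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, innermost last
        self.problems = []  # descriptions of nesting violations

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        elif self.stack:
            self.problems.append(
                f"</{tag}> closes while <{self.stack[-1]}> is innermost")
        else:
            self.problems.append(f"</{tag}> has no matching open tag")


def nesting_problems(html):
    """Return a list of nesting violations; empty means strictly nested."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    # Anything still on the stack at the end was never closed.
    return checker.problems + [f"<{t}> never closed" for t in checker.stack]
```

A page with a missing </FONT>, such as "<html><body><font size=2><b>hi</b></body></html>", produces a non-empty list, while the same page with the tag closed produces an empty one. Note that this check is as strict as the filter itself, so an unclosed <P> is also reported.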

Other restrictions:

Discussion

Is it (a) ethical or (b) wise to write tools that attempt to remove the advertising from these indexes? If they become widespread, the site organizations will disguise their adverts better. If it were successful, and everybody filtered, advertisers would lose interest in supporting these (very useful) websites. And by providing people with the tools to 'alter' the content of other people's intellectual property, am I skating on thin ice? Your comments are welcome to dpwe@icsi.berkeley.edu.


Updated: $Date: 1997/01/24 00:48:28 $
Dan Ellis <dpwe@icsi.berkeley.edu>
International Computer Science Institute, Berkeley CA