From: ger-u@impressive.net (Gerald Oskoboiny)
Newsgroups: comp.infosystems.www.authoring.html,comp.infosystems.www.servers.unix
Subject: Automagic .txt versions of HTML pages (was Re: [Q] How to force a html page to a text version)
Date: 18 Nov 1997 00:00:00 GMT
References: <346632EC.CF0@nouce.slip.net> <3466765b.189602562@news.ican.net> <645ri5$ffs$2@news.hal-pc.org>

On 10 Nov 1997 02:29:25 GMT, Shawn K. Quinn <skquinn@brokersys.com> wrote:
:
>|One thing you can do is:
>|1) view the page with Netscape or MSIE.
>|2) press CTRL A to select all text.
>|3) paste in your favorite text editor...
>
>And how do you do that in an automated fashion? You can't. The text
>version quickly becomes out of sync with the page itself.

If you're running Apache as your Web server, you can get automatic
text versions of HTML pages quite easily with a hack I thought up
a while ago.

Just put this:

    ErrorDocument 404 /cgi-bin/404error

in your httpd's conf files, then include something like this in the
"404error" CGI script:

#!/usr/local/bin/perl
#
# 404error: a cool 404 error handler
#
# Gerald Oskoboiny, 30 Jan 1997

$htdocs    = "/www/htdocs";
$logfile   = "/usr/log/404_error_log";
$html2txt  = "/usr/local/bin/lynx -cfg=/usr/local/lib/lynx.cfg -validate -dump";

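# Apache sets REDIRECT_URL to the originally requested path when it runs an
# ErrorDocument handler. Split it into the extension (after the last dot)
# and the rest of the path, minus the leading slash.
$extension = $ENV{REDIRECT_URL}; $extension =~ s/.*\.//;
$basename  = $ENV{REDIRECT_URL}; $basename  =~ s/\.[^.]*$//;
$basename  =~ s|^/||;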

#####
# Check if they were looking for a ".txt" file; if so, generate one for them.
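# The basename ends up in a shell command below, so only accept plain path characters.
if ( ( $extension eq "txt" ) && ( $basename =~ m|^[\w./-]+$| )
     && ( -f "$htdocs/${basename}.html" ) ) {
    print "Content-Type: text/plain\n\n";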
    open( HTML2TXT, "$html2txt http://www.hwg.org/${basename}.html |" ) ||
      die "couldn't run $html2txt with http://www.hwg.org/${basename}.html! $!";
    while (<HTML2TXT>) {
        print;
    }
    close( HTML2TXT ) || die "couldn't close $html2txt! $!";
    exit;
}
#####

# do other stuff here...


Et voila! Instant .txt versions of all your HTML pages.

For example:

    http://www.hwg.org/resources/html/validation.html  (HTML)
    http://www.hwg.org/resources/html/validation.txt   (plain text)

    http://www.hwg.org/index.html
    http://www.hwg.org/index.txt
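
To see how the handler maps the first .txt URL above back to its HTML
file, here is a small standalone sketch of the same substitutions.
(split_url is a name made up for this illustration; the script above
does the same thing inline on $ENV{REDIRECT_URL}.)

```perl
use strict;
use warnings;

# Split a requested URL path the way the 404 handler does.
# split_url is a hypothetical helper, not part of the original script.
sub split_url {
    my ($url) = @_;
    ( my $extension = $url ) =~ s/.*\.//;        # keep what follows the last dot
    ( my $basename  = $url ) =~ s/\.[^.]*$//;    # drop the final ".ext"
    $basename =~ s|^/||;                         # drop the leading slash
    return ( $extension, $basename );
}

my ( $ext, $base ) = split_url("/resources/html/validation.txt");
print "$ext $base\n";    # prints "txt resources/html/validation"
```

The handler then looks for "$htdocs/resources/html/validation.html" and,
if it exists, asks lynx to dump that page as text.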

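This isn't especially efficient (every .txt request runs a CGI script that
starts lynx, which in turn fetches the HTML page over HTTP), but it gets
decent results with extremely little effort.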

A better approach would be an Apache module, triggered by a handler for
.txt files, that caches each plain text version somewhere once it has
been generated.
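
Short of writing a module, the caching half of that idea can be tried
inside the CGI itself. What follows is only a sketch under assumptions:
cached_text, its arguments and the cache location are all made up for
illustration. It regenerates the text only when the cached copy is
missing or older than the HTML file.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(mkpath);

# Return the plain text for $html_file. $generate is a code ref that
# produces the text; it is called only when $cache_file is missing or
# older than $html_file. (Hypothetical helper, not part of the script above.)
sub cached_text {
    my ( $html_file, $cache_file, $generate ) = @_;

    my $html_mtime = ( stat $html_file )[9];
    defined $html_mtime or die "can't stat $html_file: $!";
    my $cache_mtime = ( stat $cache_file )[9];    # undef if no cache yet

    if ( !defined $cache_mtime || $cache_mtime < $html_mtime ) {
        my $text = $generate->();
        mkpath( dirname($cache_file) );           # no-op if it already exists
        open( my $out, '>', $cache_file ) || die "can't write $cache_file: $!";
        print $out $text;
        close($out) || die "can't close $cache_file: $!";
        return $text;
    }

    open( my $in, '<', $cache_file ) || die "can't read $cache_file: $!";
    local $/;                                     # slurp the whole file
    my $text = <$in>;
    close($in);
    return $text;
}

# Possible use in the 404 handler ($cachedir is an assumed extra setting):
# print cached_text( "$htdocs/$basename.html", "$cachedir/$basename.txt",
#     sub { scalar `$html2txt http://www.hwg.org/$basename.html` } );
```

The cache directory has to be writable by the user the Web server runs
CGI scripts as.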

Gerald
-- 
Gerald Oskoboiny
<ger-u@impressive.net>