From: ger-u@impressive.net (Gerald Oskoboiny)
Newsgroups: comp.infosystems.www.authoring.html,comp.infosystems.www.servers.unix
Subject: Automagic .txt versions of HTML pages (was Re: [Q] How to force a html page to a text version)
Date: 18 Nov 1997 00:00:00 GMT
References: <346632EC.CF0@nouce.slip.net> <3466765b.189602562@news.ican.net> <645ri5$ffs$2@news.hal-pc.org>

On 10 Nov 1997 02:29:25 GMT, Shawn K. Quinn wrote:
:
>|One thing you can do is:
>|1) view the page with Netscape or MSIE.
>|2) press CTRL A to select all text.
>|3) paste in your favorite text editor...
>
>And how do you do that in an automated fashion? You can't. The text
>version quickly becomes out of sync with the page itself.

If you're running Apache as your Web server, you can get automatic
text versions of HTML pages quite easily with a hack I thought up
a while ago.

Just put this:

    ErrorDocument 404 /cgi-bin/404error

in your httpd's conf files, then include something like this in the
"404error" CGI script:

    #!/usr/local/bin/perl
    #
    # 404error: a cool 404 error handler
    #
    # Gerald Oskoboiny, 30 Jan 1997

    $htdocs   = "/www/htdocs";
    $logfile  = "/usr/log/404_error_log";
    $html2txt = "/usr/local/bin/lynx -cfg=/usr/local/lib/lynx.cfg -validate -dump";

    # Apache sets REDIRECT_URL to the path that triggered the 404.
    $extension = $ENV{REDIRECT_URL};
    $extension =~ s/.*\.//g;        # keep only what follows the last "."

    $basename = $ENV{REDIRECT_URL};
    $basename =~ s/\.[^\.]*$//g;    # strip the extension
    $basename =~ s|^/||g;           # strip the leading "/"

    #####
    # Check if they were looking for a ".txt" file; if so, generate one for them.

    if ( ( $extension eq "txt" ) && ( -f "$htdocs/${basename}.html" ) ) {

        print "Content-Type: text/plain\n\n";
        open( HTML2TXT, "$html2txt http://www.hwg.org/${basename}.html |" )
            || die "couldn't run $html2txt with http://www.hwg.org/${basename}.html! $!";
        while ( <HTML2TXT> ) {
            print;
        }
        close( HTML2TXT ) || die "couldn't close $html2txt! $!";
        exit;

    }

    #####
    # do other stuff here...

et voila! Instant .txt versions of all your HTML pages.
For example:

    http://www.hwg.org/resources/html/validation.html    (HTML)
    http://www.hwg.org/resources/html/validation.txt     (plain text)

    http://www.hwg.org/index.html
    http://www.hwg.org/index.txt

This isn't especially efficient, but it gets decent results with
extremely little effort. Better would be to make it an Apache module
triggered by a .txt Handler that caches the automatically-generated
plain text versions somewhere after they're generated.

Gerald
-- 
Gerald Oskoboiny
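
The caching idea above could be approximated inside the CGI script itself, short of writing an Apache module. What follows is only a rough sketch under some assumptions: the helper names (cache_path, cache_is_fresh, cached_text) and the cache directory are made up for illustration and are not part of the original script. The text is regenerated only when the cached copy is missing or older than the HTML file, and it is written to a temporary file first so that a concurrent request never reads a half-written cache entry.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);

# Hypothetical helper: where the cached text for a page lives.
# $basename is the URL path minus extension, e.g. "resources/html/validation".
sub cache_path {
    my ( $root, $basename ) = @_;
    return "$root/$basename.txt";
}

# True if the cached text exists and is at least as new as the HTML source.
# Assumes the caller has already checked that $html_file exists.
sub cache_is_fresh {
    my ( $html_file, $txt_file ) = @_;
    return 0 unless -f $txt_file;
    return ( stat($txt_file) )[9] >= ( stat($html_file) )[9];   # [9] is mtime
}

# Return the plain text for $html_file. $render is a code ref that
# produces the text (for instance by running lynx); it is only called
# when the cache is missing or stale.
sub cached_text {
    my ( $html_file, $txt_file, $render ) = @_;

    unless ( cache_is_fresh( $html_file, $txt_file ) ) {
        make_path( dirname($txt_file) );
        my $text = $render->();

        # Write to a per-process temp file, then rename into place.
        my $tmp = "$txt_file.$$";
        open( my $out, ">", $tmp ) or die "couldn't write $tmp: $!";
        print $out $text;
        close($out) or die "couldn't close $tmp: $!";
        rename( $tmp, $txt_file ) or die "couldn't rename $tmp: $!";
        return $text;
    }

    # Cache hit: slurp the stored text.
    open( my $in, "<", $txt_file ) or die "couldn't read $txt_file: $!";
    local $/;
    my $text = <$in>;
    close($in);
    return $text;
}
```

In the 404error script, the open/while/close block could then become something like the following, with /var/tmp/txtcache standing in for whatever directory the Web server user can write to:

    print cached_text( "$htdocs/${basename}.html",
                       cache_path( "/var/tmp/txtcache", $basename ),
                       sub { `$html2txt http://www.hwg.org/${basename}.html` } );

This still pays for a CGI process on every .txt request, so it only saves the cost of running lynx, not the cost a real module would avoid.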