From gerald@impressive.net Fri Jan 15 00:23:10 1999
From: gerald@impressive.net (Gerald Oskoboiny)
Message-Id: <slrn79tk5u.276.gerald@devo.impressive.net>
Newsgroups: comp.infosystems.www.servers.unix,comp.infosystems.www.misc
Subject: Archiving http proxy cache?
Organization: impressive.net
Reply-To: gerald@impressive.net

I've been archiving my incoming and outgoing e-mail for the past
6 years or so, and now that disk space is basically free I'd like
to do the same for my personal HTTP traffic.

Does anyone have ideas on what software/configuration to use for
something like this?

I installed Squid and gave it a big cache to fill, but it doesn't
quite do what I want:

  - it stores HTTP response headers and other metadata inside the
    cached files (so the files are no longer valid GIF or HTML
    files on their own because there's extra stuff at the top);
    this data should be stored externally, IMO.

  - it doesn't keep previous revisions of documents, only the
    one that was most recently fetched (hmm, I could probably fix
    this just by replacing the unlinkd program with one that does
    nothing.)

  - its cache storage scheme makes sense for a general proxy
    cache system, but for archiving I'd prefer a directory/file
    structure more like:

        $cache_root/1999/01/15/http/www.w3.org/foo.html
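For concreteness, a layout like that could be computed with a
small helper along these lines (a Python sketch for illustration
since I haven't written the Perl yet; `archive_path` and its
arguments are names I'm making up):

```python
import os
from datetime import date
from urllib.parse import urlsplit

def archive_path(cache_root, url, when=None):
    """Map a URL onto a $cache_root/YYYY/MM/DD/scheme/host/path layout."""
    when = when or date.today()
    parts = urlsplit(url)
    # drop the leading "/" so os.path.join keeps the cache_root prefix
    relpath = parts.path.lstrip("/") or "index.html"
    return os.path.join(cache_root, "%04d" % when.year,
                        "%02d" % when.month, "%02d" % when.day,
                        parts.scheme, parts.netloc, relpath)
```

so archive_path("/cache", "http://www.w3.org/foo.html",
date(1999, 1, 15)) gives
"/cache/1999/01/15/http/www.w3.org/foo.html". (Query strings and
odd characters in paths would need escaping before this is safe
to use for real.)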

Any ideas? Would I be better off using Apache or Jigsaw for this?
(as a basis for hacking/customization, I mean; I doubt that
there's anything that does exactly what I want as-is.)

It would probably be easiest for me to just write a Perl script
that does what I want and install that as the root document of a
locally-running Apache httpd, but that would probably slow things
down too much.

(My environment is Red Hat Linux 5.1 with kernel 2.0.34 on a P133 :(
with 64M RAM and plenty of disk.)

-- 
Gerald Oskoboiny <gerald@impressive.net>
http://impressive.net/people/gerald/