Notes on my HTTP archives

What it is

This page documents my personal HTTP archive system, which permanently archives every web page I visit.

Why bother?

I'm doing this so I can visit web pages even after they've disappeared from the web (which shouldn't happen but often does due to careless and/or clueless information providers), and also so I can see what I was working on at any time in the past.

How it works

See the original problem description, if you like.

First I installed a caching HTTP proxy server called Squid on my Linux box from an RPM file.

Then I wrote a Perl script that converts files from Squid's cache format to a format more suitable for permanent archival.

The file names end up something like this:

/archives/http/1999/01/18/03:34:19/impressive.net,people,gerald,
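As a rough illustration of that naming scheme, here is a minimal Python sketch that maps a URL and a fetch time to a path of the same shape. The rule it uses (drop the scheme, then turn each "/" into ",") is only inferred from the one example above. The function name archive_path is made up for this sketch and is not taken from the squidcache2archive script, which is written in Perl.

```python
from datetime import datetime
from urllib.parse import urlsplit

ARCHIVE_ROOT = "/archives/http"  # root directory seen in the example path


def archive_path(url, fetched_at):
    """Build an archive file name for `url` fetched at `fetched_at`.

    Assumed rule (inferred from a single example): drop the scheme,
    then replace every "/" in host + path with ",".
    """
    parts = urlsplit(url)
    name = (parts.netloc + parts.path).replace("/", ",")
    stamp = fetched_at.strftime("%Y/%m/%d/%H:%M:%S")
    return f"{ARCHIVE_ROOT}/{stamp}/{name}"


example = archive_path("http://impressive.net/people/gerald/",
                       datetime(1999, 1, 18, 3, 34, 19))
print(example)
# prints /archives/http/1999/01/18/03:34:19/impressive.net,people,gerald,
```

Flattening the URL into a single file name keeps each fetch in one directory per timestamp, at the cost of very long names for deep URLs.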

See the script itself for documentation on how it works; I might write it up in further detail on this page someday, but not today.

Note: If you downloaded the squidcache2archive script prior to version 1.4, it probably looked goofy because it contained literal ^@ characters, which confused Netscape (and probably other browsers). I have now replaced these with \x00 escapes.

Todo

This setup doesn't archive quite enough, because Squid doesn't cache dynamic resources, among other things. I plan to install Jigsaw and make it archive a copy of everything it retrieves (or maybe just write a simple HTTP proxy/archiver in Perl or Python). Related stuff:

Archiver Proxy: a self-contained Python program. I would do things more like this if I did them over again.
Medusa sounds very handy for handling all the HTTP bits in a Python implementation.
Muffin is a filtering system written in Java, with hooks to external filters.
wwwoffle might also be useful.
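
For what the "simple HTTP proxy/archiver" idea might look like, here is a minimal Python sketch using only the standard library. It is not the Archiver Proxy listed above and not anything I actually run. The names save_copy, ArchivingProxy and serve, the port 8080, and the file naming rule (scheme dropped, "/" turned into ",") are all assumptions made for illustration. It handles plain GET requests over http only and stores response bodies without headers.

```python
import os
import urllib.request
from datetime import datetime
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlsplit

ARCHIVE_ROOT = "/archives/http"  # assumed: same root as the existing archives


def save_copy(root, url, body, when):
    """Write one response body to root/YYYY/MM/DD/HH:MM:SS/<flattened url>."""
    parts = urlsplit(url)
    name = (parts.netloc + parts.path).replace("/", ",")  # assumed naming rule
    directory = os.path.join(root, when.strftime("%Y/%m/%d/%H:%M:%S"))
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "wb") as out:
        out.write(body)
    return path


class ArchivingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # A browser talking to a proxy sends the absolute URL as the target.
        url = self.path
        if not url.startswith("http://"):
            self.send_error(400, "Only absolute http:// URLs are supported")
            return
        try:
            with urllib.request.urlopen(url, timeout=30) as upstream:
                status = upstream.status
                ctype = upstream.headers.get("Content-Type",
                                             "application/octet-stream")
                body = upstream.read()
        except Exception:
            # Includes upstream 4xx/5xx, which urlopen raises as HTTPError.
            self.send_error(502, "Upstream fetch failed")
            return
        # Archive everything that was retrieved, dynamic or not.
        save_copy(ARCHIVE_ROOT, url, body, datetime.now())
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the proxy on localhost until interrupted."""
    ThreadingHTTPServer(("127.0.0.1", port), ArchivingProxy).serve_forever()
```

To try it, call serve() and point the browser's HTTP proxy setting at 127.0.0.1:8080. Unlike the Squid setup, this archives at fetch time, so nothing depends on what the cache decides to keep.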

Disk usage

Here is the disk space my archives have taken up so far (after tarring and compressing):

    root@devo: /archives/http> du -h *gz
    23M     1999-01.tar.gz
    2.2M    1999-02.tar.gz
    5.9M    1999-03.tar.gz
    43M     1999-04.tar.gz
    26M     1999-05.tar.gz
    28M     1999-06.tar.gz
    3.4M    1999-07.tar.gz
    71M     1999-08.tar.gz    (got a cablemodem)
    51M     1999-09.tar.gz
    66M     1999-10.tar.gz
    140M    1999-11.tar.gz
    173M    1999-12.tar.gz
    117M    2000-01.tar.gz
    89M     2000-02.tar.gz
    145M    2000-03.tar.gz
    175M    2000-04.tar.gz
    43M     2000-05.tar.gz    (away on a business trip)
    84M     2000-06.tar.gz    (away on vacation)
    221M    2000-07.tar.gz    (started working from home)

As of the end of July 2000, there are 307,620 files in my HTTP archive. That might sound like a lot, but it costs less than $5 to store all that data on CDs.
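
For anyone who wants to check the arithmetic, here is a small Python sketch that adds up the monthly sizes listed above and estimates how many CDs they need. The 650 MB capacity per disc is my assumption for the sketch, and the sizes are taken as printed by du -h; the names MONTHLY_MB, total_mb and cds_needed are just illustrative.

```python
import math

# Compressed archive sizes in MB, copied from the du -h listing above.
MONTHLY_MB = {
    "1999-01": 23, "1999-02": 2.2, "1999-03": 5.9, "1999-04": 43,
    "1999-05": 26, "1999-06": 28, "1999-07": 3.4, "1999-08": 71,
    "1999-09": 51, "1999-10": 66, "1999-11": 140, "1999-12": 173,
    "2000-01": 117, "2000-02": 89, "2000-03": 145, "2000-04": 175,
    "2000-05": 43, "2000-06": 84, "2000-07": 221,
}

CD_CAPACITY_MB = 650  # assumption: one 650 MB CD-R per disc

total_mb = round(sum(MONTHLY_MB.values()), 1)
cds_needed = math.ceil(total_mb / CD_CAPACITY_MB)

print(f"total: {total_mb} MB on {cds_needed} CDs")
# prints total: 1506.5 MB on 3 CDs
```

So the whole archive through July 2000 is about 1.5 GB, which fits on three discs under that assumption.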
