Notes on my HTTP archives

by Gerald Oskoboiny


What it is

This page documents my personal HTTP archive system. It permanently archives any web pages I visit.

Why bother?

I'm doing this so I can visit web pages even after they've disappeared from the web (which shouldn't happen but often does due to careless and/or clueless information providers), and also so I can see what I was working on at any time in the past.

How it works

See the original problem description, if you like.

First, I installed Squid, a caching HTTP proxy server, on my Linux box from an RPM file.

Then I wrote a Perl script that converts files from Squid's cache format to a format more suitable for permanent archival.

The file names end up something like this:

/archives/http/1999/01/18/03:34:19/impressive.net,people,gerald,

See the script itself for documentation on how it works; I might write it up in further detail on this page someday, but not today.
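
To illustrate the naming scheme in the example path above, here is a minimal Python sketch. It is not the squidcache2archive script (which is written in Perl), and it rests on assumptions inferred from that single example: the archive directory is the fetch date and time, and the file name is the host plus path with each "/" replaced by ",". The function name `archive_path`, the constant `ARCHIVE_ROOT`, and the source URL `http://impressive.net/people/gerald/` are hypothetical, chosen only to reproduce the example.

```python
from datetime import datetime
from urllib.parse import urlsplit

# Root directory taken from the example path; everything else is an assumption.
ARCHIVE_ROOT = "/archives/http"


def archive_path(url, fetched_at):
    """Build an archive file name for `url` retrieved at `fetched_at`.

    Assumed layout: ROOT/YYYY/MM/DD/HH:MM:SS/host,path,components,
    """
    parts = urlsplit(url)
    # Drop the scheme, then flatten host + path by turning "/" into ",".
    # Query strings are ignored here; the real script may treat them differently.
    name = (parts.netloc + parts.path).replace("/", ",")
    stamp = fetched_at.strftime("%Y/%m/%d/%H:%M:%S")
    return f"{ARCHIVE_ROOT}/{stamp}/{name}"


example = archive_path(
    "http://impressive.net/people/gerald/",  # assumed source URL
    datetime(1999, 1, 18, 3, 34, 19),
)
print(example)
```

Putting the timestamp ahead of the flattened URL means that everything fetched in a given month sits under one directory, which would make per-month tarballs straightforward.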

Note: If you downloaded the squidcache2archive script prior to version 1.4, it probably looked goofy because it contained literal ^@ characters, which confused Netscape (and probably other browsers). I have since replaced them with \x00 escapes.

Todo

This setup doesn't archive quite enough, because Squid doesn't cache dynamic resources and the like. I plan to install Jigsaw and have it archive a copy of everything it retrieves (or maybe just write a simple HTTP proxy/archiver in Perl or Python).

Disk usage

Here is the disk space my archives have taken up so far (after tarring and compressing):

    root@devo: /archives/http> du -h *gz
    23M     1999-01.tar.gz
    2.2M    1999-02.tar.gz
    5.9M    1999-03.tar.gz
    43M     1999-04.tar.gz
    26M     1999-05.tar.gz
    28M     1999-06.tar.gz
    3.4M    1999-07.tar.gz
    71M     1999-08.tar.gz    (got a cablemodem)
    51M     1999-09.tar.gz
    66M     1999-10.tar.gz
    140M    1999-11.tar.gz
    173M    1999-12.tar.gz
    117M    2000-01.tar.gz
    89M     2000-02.tar.gz
    145M    2000-03.tar.gz
    175M    2000-04.tar.gz
    43M     2000-05.tar.gz    (away on a business trip)
    84M     2000-06.tar.gz    (away on vacation)
    221M    2000-07.tar.gz    (started working from home)

As of the end of July 2000, there are 307,620 files in my HTTP archive. That might sound like a lot, but it costs less than $5 to store all that data on CDs.
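
As a rough check on the storage figure, the sizes in the listing above can be added up. The Python sketch below copies the sizes from the du output and assumes a capacity of 650 MB per CD, which is not stated above. The names `DU_OUTPUT`, `CD_CAPACITY_MB`, and `total_megabytes` are hypothetical.

```python
import math

# Sizes copied from the du -h listing above (annotations omitted).
DU_OUTPUT = """\
23M     1999-01.tar.gz
2.2M    1999-02.tar.gz
5.9M    1999-03.tar.gz
43M     1999-04.tar.gz
26M     1999-05.tar.gz
28M     1999-06.tar.gz
3.4M    1999-07.tar.gz
71M     1999-08.tar.gz
51M     1999-09.tar.gz
66M     1999-10.tar.gz
140M    1999-11.tar.gz
173M    1999-12.tar.gz
117M    2000-01.tar.gz
89M     2000-02.tar.gz
145M    2000-03.tar.gz
175M    2000-04.tar.gz
43M     2000-05.tar.gz
84M     2000-06.tar.gz
221M    2000-07.tar.gz
"""

CD_CAPACITY_MB = 650  # assumption: one 650 MB CD-R per disc


def total_megabytes(du_text):
    """Sum the first column of du -h output, accepting only 'M' sizes."""
    total = 0.0
    for line in du_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        size = fields[0]
        if not size.endswith("M"):
            raise ValueError(f"unexpected size unit in {size!r}")
        total += float(size[:-1])
    return total


total = total_megabytes(DU_OUTPUT)
discs = math.ceil(total / CD_CAPACITY_MB)
print(f"{total:.1f} MB in total, which needs {discs} CDs at {CD_CAPACITY_MB} MB each")
```

Under that assumption the 19 monthly tarballs come to about 1.5 GB and fit on three discs, so the "less than $5" figure corresponds to roughly $1.67 or less per disc.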

Last modified: $Date: 2004/09/24 08:28:31 $
Gerald Oskoboiny