This page documents my personal HTTP archive system, which permanently archives every web page I visit.
I'm doing this so I can visit web pages even after they've disappeared from the web (which shouldn't happen but often does due to careless and/or clueless information providers), and also so I can see what I was working on at any time in the past.
First I installed a caching HTTP proxy server called Squid on my Linux box from an RPM file.
Then I wrote a Perl script that converts files from Squid's cache format to a format more suitable for permanent archival.
The file names end up something like this:
/archives/http/1999/01/18/03:34:19/impressive.net,people,gerald,
See the script itself for documentation on how it works; I might write it up in further detail on this page someday, but not today.
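As a rough illustration of the naming scheme (not the actual squidcache2archive code, which is Perl), here is a minimal Python sketch. It assumes the example path above came from http://impressive.net/people/gerald/ and that each "/" in the host-plus-path part of the URL is turned into a ","; both are guesses based on that one example. The function name archive_path is made up for this sketch.

```python
from datetime import datetime
from urllib.parse import urlsplit

ARCHIVE_ROOT = "/archives/http"  # root directory taken from the example path


def archive_path(url, fetched_at):
    """Build an archive file name from a URL and a retrieval time.

    Assumption: slashes in host + path become commas, as the single
    example path suggests. Query strings are ignored in this sketch.
    """
    parts = urlsplit(url)
    flattened = (parts.netloc + parts.path).replace("/", ",")
    stamp = fetched_at.strftime("%Y/%m/%d/%H:%M:%S")
    return f"{ARCHIVE_ROOT}/{stamp}/{flattened}"


print(archive_path("http://impressive.net/people/gerald/",
                   datetime(1999, 1, 18, 3, 34, 19)))
# prints /archives/http/1999/01/18/03:34:19/impressive.net,people,gerald,
```

Putting the date and time first in the path means a plain directory listing already answers "what was I looking at on this day?".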
Note: If you downloaded the squidcache2archive script prior to version 1.4, it probably looked goofy because it had literal ^@'s (NUL characters) in it, which confused Netscape (and probably other browsers). I have now replaced these with \x00's.
This setup doesn't archive quite enough, because Squid doesn't cache dynamic resources and the like. I plan to install Jigsaw and make it archive a copy of everything it retrieves (or maybe just write a simple HTTP proxy/archiver in Perl or Python).
Here is the disk space my archives have taken up so far (after tarring and compressing):
root@devo: /archives/http> du -h *gz
23M 1999-01.tar.gz
2.2M 1999-02.tar.gz
5.9M 1999-03.tar.gz
43M 1999-04.tar.gz
26M 1999-05.tar.gz
28M 1999-06.tar.gz
3.4M 1999-07.tar.gz
71M 1999-08.tar.gz (got a cablemodem)
51M 1999-09.tar.gz
66M 1999-10.tar.gz
140M 1999-11.tar.gz
173M 1999-12.tar.gz
117M 2000-01.tar.gz
89M 2000-02.tar.gz
145M 2000-03.tar.gz
175M 2000-04.tar.gz
43M 2000-05.tar.gz (away on a business trip)
84M 2000-06.tar.gz (away on vacation)
221M 2000-07.tar.gz (started working from home)
As of the end of July 2000, there are 307,620 files in my HTTP archive. That might sound like a lot, but it costs less than $5 to store all that data on CDs.
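For anyone who wants to check the arithmetic, here is a small Python sketch that adds up the monthly sizes from the listing above and works out how many CDs that is. The 650 MB capacity per disc is an assumption on my part (a common CD-R size), and the per-disc price is only what the "less than $5" figure implies, not a quoted price.

```python
import math

# Monthly archive sizes in MB, copied from the du listing above
# (1999-01 through 2000-07).
SIZES_MB = [23, 2.2, 5.9, 43, 26, 28, 3.4, 71, 51, 66, 140, 173,
            117, 89, 145, 175, 43, 84, 221]

CD_CAPACITY_MB = 650  # assumption: one 650 MB CD-R per disc

total_mb = sum(SIZES_MB)
discs = math.ceil(total_mb / CD_CAPACITY_MB)
max_price_per_disc = 5 / discs  # implied by "less than $5" in total

print(f"{total_mb:.1f} MB total, {discs} discs")
print(f"under $5 total means under ${max_price_per_disc:.2f} per disc")
# prints:
# 1506.5 MB total, 3 discs
# under $5 total means under $1.67 per disc
```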
Last modified: $Date: 2004/09/24 08:28:31 $
Gerald Oskoboiny, <gerald@impressive.net>