This page documents my personal HTTP archive system. It permanently archives every web page I visit.
I'm doing this so I can visit web pages even after they've disappeared from the web (which shouldn't happen but often does due to careless and/or clueless information providers), and also so I can see what I was working on at any time in the past.
First I installed a caching HTTP proxy server called Squid on my Linux box from an RPM file.
Then I wrote a Perl script that converts files from Squid's cache format to a format more suitable for permanent archival.
The file names end up something like this:
/archives/http/1999/01/18/03:34:19/impressive.net,people,gerald,
See the script itself for documentation on how it works; I might write it up in further detail on this page someday, but not today.
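To illustrate the naming scheme in the example above, here is a minimal Python sketch. It is not the squidcache2archive script: the rule it applies (drop the scheme, turn each "/" into ",", and use the retrieval date and time as directories) is only inferred from that one example file name, and the function name archive_path and the sample URL are hypothetical.

```python
from datetime import datetime

ARCHIVE_ROOT = "/archives/http"  # taken from the example path above


def archive_path(url, fetched_at, root=ARCHIVE_ROOT):
    """Build an archive file name from a URL and its retrieval time.

    Assumed rule, inferred from a single example: strip the scheme,
    then replace every "/" in the rest of the URL with ",".
    """
    flattened = url.split("://", 1)[-1].replace("/", ",")
    stamp = fetched_at.strftime("%Y/%m/%d/%H:%M:%S")
    return f"{root}/{stamp}/{flattened}"


# Hypothetical URL chosen so the result matches the example file name.
print(archive_path("http://impressive.net/people/gerald/",
                   datetime(1999, 1, 18, 3, 34, 19)))
```

Putting the timestamp in the directory part means two visits to the same URL at different times never overwrite each other.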
Note: If you downloaded the squidcache2archive script prior to version 1.4, it probably looked goofy because it had literal ^@'s in it, which confused Netscape (and probably other browsers). I have now replaced these with \x00's.
This setup doesn't quite archive enough stuff because Squid doesn't cache dynamic resources etc.; I plan to install Jigsaw and make it archive a copy of everything it retrieves (or maybe just write a simple HTTP proxy/archiver in Perl or Python).
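As a rough idea of what the "simple HTTP proxy/archiver" option might look like, here is a minimal Python sketch using only the standard library. It is an assumption-laden illustration, not an existing tool: the names archive_file, ArchivingProxy and make_proxy, the port 8080, and the relative archive root are all hypothetical, and the file naming rule is inferred from the single example path earlier on this page. It handles only plain-HTTP GET requests, stores bodies but not headers, and reports any upstream failure (including 4xx/5xx responses) as a 502.

```python
import os
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ARCHIVE_ROOT = "archives/http"  # hypothetical; relative so the sketch runs anywhere


def archive_file(url, body, root=ARCHIVE_ROOT, when=None):
    """Write body to root/YYYY/MM/DD/HH:MM:SS/<flattened URL>; return the path."""
    # Assumed naming rule: strip the scheme, replace "/" with ",".
    name = url.split("://", 1)[-1].replace("/", ",")
    stamp = time.strftime("%Y/%m/%d/%H:%M:%S", when or time.localtime())
    directory = os.path.join(root, stamp)
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(body)
    return path


class ArchivingProxy(BaseHTTPRequestHandler):
    """Forward GET requests and keep a copy of every body retrieved."""

    def do_GET(self):
        url = self.path  # a request sent to a proxy carries the absolute URL
        if not url.startswith("http://"):
            self.send_error(400, "only absolute http:// URLs are supported")
            return
        # An empty ProxyHandler keeps the upstream fetch from using any
        # proxy configured in the environment (such as this one).
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        try:
            with opener.open(url, timeout=30) as upstream:
                status = upstream.status
                content_type = upstream.headers.get(
                    "Content-Type", "application/octet-stream")
                body = upstream.read()
        except Exception:
            self.send_error(502, "upstream fetch failed")
            return
        # Archive first, then answer the client.
        archive_file(url, body, root=self.server.archive_root)
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_proxy(port=8080, root=ARCHIVE_ROOT):
    """Create (but do not start) a proxy bound to localhost."""
    server = ThreadingHTTPServer(("127.0.0.1", port), ArchivingProxy)
    server.archive_root = root
    return server
```

Calling make_proxy().serve_forever() would start it; a browser's HTTP proxy setting could then point at 127.0.0.1 on that port. Because nothing is served from a cache, dynamic resources get archived on every request, which is the gap described above.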
Here is the disk space my archives have taken up so far (after tarring and compressing):
root@devo: /archives/http> du -h *gz
23M     1999-01.tar.gz
2.2M    1999-02.tar.gz
5.9M    1999-03.tar.gz
43M     1999-04.tar.gz
26M     1999-05.tar.gz
28M     1999-06.tar.gz
3.4M    1999-07.tar.gz
71M     1999-08.tar.gz   (got a cablemodem)
51M     1999-09.tar.gz
66M     1999-10.tar.gz
140M    1999-11.tar.gz
173M    1999-12.tar.gz
117M    2000-01.tar.gz
89M     2000-02.tar.gz
145M    2000-03.tar.gz
175M    2000-04.tar.gz
43M     2000-05.tar.gz   (away on a business trip)
84M     2000-06.tar.gz   (away on vacation)
221M    2000-07.tar.gz   (started working from home)
As of the end of July 2000, there are 307,620 files in my HTTP archive. That might sound like a lot, but it costs less than $5 to store all that data on CDs.
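As a quick check of the CD claim, the monthly figures from the du listing above can be added up. The sketch below assumes 650 MB per disc, which is my assumption and not something stated on this page.

```python
import math

# Monthly archive sizes in MB, copied from the du listing above
# (1999-01 through 2000-07).
sizes_mb = [23, 2.2, 5.9, 43, 26, 28, 3.4, 71, 51, 66, 140, 173,
            117, 89, 145, 175, 43, 84, 221]

CD_CAPACITY_MB = 650  # assumption: one standard 650 MB CD-R

total_mb = sum(sizes_mb)
cds_needed = math.ceil(total_mb / CD_CAPACITY_MB)

print(f"{total_mb:.1f} MB total, {cds_needed} CDs")
```

Under that assumption the archive comes to about 1.5 GB and fits on three discs, so "less than $5" works out to roughly $1.66 or less per disc.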
Last modified: $Date: 2004/09/24 08:28:31 $
Gerald Oskoboiny, <gerald@impressive.net>