This is so freakin cool!
Internet Archive Wayback Machine
http://web.archive.org/
Here's a blast from the past (my old site):
http://web.archive.org/web/19970606215310/http://ugweb.cs.ualberta.ca/~gerald/
and some past versions of my current site (not much different than now):
http://web.archive.org/web/*/http://impressive.net/people/gerald/
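The two links above suggest a simple URL pattern: a prefix, then either a timestamp or "*", then the original URL. Here is a minimal sketch of building such links. It assumes the 14-digit number is a YYYYMMDDhhmmss timestamp (my reading of the example above, not something the archive's docs confirm here); the function name wayback_url is made up for illustration.

```python
from datetime import datetime

# Prefix taken from the example links above.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(original_url, when=None):
    """Build a Wayback Machine link for original_url.

    Assumption: the path segment after /web/ is a YYYYMMDDhhmmss
    timestamp. With when=None, use the "*" wildcard seen in the
    second example link, which lists all archived versions.
    """
    stamp = "*" if when is None else when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


# Rebuild the two links from the message above.
print(wayback_url("http://ugweb.cs.ualberta.ca/~gerald/",
                  datetime(1997, 6, 6, 21, 53, 10)))
print(wayback_url("http://impressive.net/people/gerald/"))
```
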
I have been trying to think of stuff that I lost in the last few
years to see if I could retrieve it from this archive, but I
haven't thought of anything yet. I think I became an archive nut
a few years before the Internet Archive went into production, so
I already have anything I really want. (and of course I have
been archiving my own clickstream for the last few years. [1])
Too bad their archive doesn't go back further; it would be really
cool to be able to surf around the early web.
Here's more info about their archive and the tech behind it:
http://web.archive.org/collections/web/faqs.html
> The Wayback Machine
> 1. What is the Wayback Machine?
> 2. Can I link to old pages on the Wayback Machine?
> 3. Are other sites available in the Wayback Machine?
> 4. What does it mean when a site's archive date has been "updated"?
> 5. Who was involved in creating the Wayback Machine?
> 6. How was the Wayback Machine made?
> 7. How large is the Archive?
> 8. Can I search the Wayback Machine?
> 9. What type of machinery is used in the Wayback Machine?
> 10. How do you archive dynamic pages?
> 11. Why are some sites harder to archive than others?
> 12. Some sites are not available because of Robots.txt or other
> exclusions. What does that mean?
> 13. How can I get my site included in the Archive?
> 14. How can I help?
>
>
> Answers
> 1. What is the Wayback Machine?
> The Wayback Machine is a service that allows people to visit
> archived versions of stored websites. Visitors to the Wayback
> Machine can type in a URL, select a date, and then begin surfing
> on an archived version of the web. Imagine surfing circa 1999 and
> looking at all the Y2K hype, or revisiting an older copy of your
> favorite website. The Wayback Machine can make all of this
> possible. See the [19]Press Release.
> 2. Can I link to old pages on the Wayback Machine?
> Yes! Alexa Internet has built the Wayback Machine so that it can
> be used and referenced by anybody and everybody. If you find an
> archived page that you would like to reference on your web page or
> in an article, you can copy the URL and share it with others. You
> can even use fuzzy URL matching and date specifications... but
> that's a [20]bit more advanced.
> 3. Are other sites available in the Wayback Machine?
> The Internet Archive is attempting to archive the entire publicly
> available web. Some sites may not be included because the
> automated crawlers were unaware of their existence at the time of
> the crawl. It's also possible that some sites were not archived
> because they were password protected or otherwise inaccessible to
> our automated systems.
> 4. What does it mean when a site's archive date has been "updated"?
> When our automated systems crawl the web every few months or so,
> we find that only about 50% of all pages on the web have changed
> from our previous visit. This means that much of the content in
> our archive is duplicate material. If you don't see "*" next to
> an archived document, then the content on the archived page is
> identical to the previously archived copy.
> 5. Who was involved in creating the Wayback Machine?
> The original idea for the Wayback Machine began in 1996, when the
> Internet Archive first began archiving the web. Now, five years
> later, with over 100 terabytes and a dozen web crawls completed,
> the Internet Archive has made the Wayback Machine available to the
> public. The Internet Archive has relied on donations of web
> crawls, technology and expertise from Alexa Internet. The Wayback
> Machine is owned and operated by the Internet Archive.
> 6. How was the Wayback Machine made?
> Over 100 terabytes of data are stored on several dozen modified
> servers situated in the basement of a former military building in
> the Presidio of San Francisco. Alexa Internet, in cooperation with
> the Internet Archive, has designed a three dimensional index that
> allows browsing of web documents over multiple time periods, and
> turned this unique feature into the Wayback Machine.
> 7. How large is the Archive?
> The Wayback Machine contains over 100 terabytes of data and is
> currently growing at a rate of 12 terabytes per month. The
> archive contains multiple copies of the entire publicly available
> web. This eclipses the amount of data contained in the world's
> largest libraries, including the Library of Congress. The Wayback
> Machine is the largest known database by a factor of 20. If you
> tried to place the entire contents of the archive onto floppy
> disks (I don't recommend this!) and laid them end to end, it would
> stretch from New York, past Los Angeles, and halfway to Hawaii.
> 8. Can I search the Wayback Machine?
> Using the Wayback Machine, it is possible to search for the names
> of sites contained in the collection and to specify date ranges
> for your search. However, we do not yet have an indexed text
> search of the documents in the collection. The collection is a bit
> too large and complicated for that. We continue to work on it and
> should have a full text search soon.
> 9. What type of machinery is used in the Wayback Machine?
> The Internet Archive is stored on dozens of slightly modified
> Hewlett Packard servers. The computers run on the FreeBSD
> operating system. Each computer has 512 MB of memory and can hold
> just over 300 gigabytes of data on IDE disks.
> 10. How do you archive dynamic pages?
> There are many different kinds of dynamic pages, some of which are
> easily stored in an archive and some of which fall apart
> completely. When a dynamic page renders standard html, the archive
> works beautifully. When a dynamic page contains forms, JavaScript,
> or other elements that require interaction with the originating
> host, the archive will not accurately reflect the original site's
> functionality.
> 11. Why are some sites harder to archive than others?
> If you look at our collection of archived sites, you will find
> some broken pages, missing graphics, and some sites that aren't
> archived at all. We have tried to create a complete archive, but
> have had difficulties with some sites. Here are some things that
> make it difficult to archive a web site:
> + Robots.txt -- If our robot crawler is forbidden from visiting
> a site, we can't archive it.
> + JavaScript -- JavaScript elements are often hard for us to
> archive, especially if they generate links without having
> the full name in the page. Plus, if JavaScript needs to
> contact the originating server in order to work, it
> will fail when archived.
> + Server side image maps -- Like any functionality on the web,
> if it needs to contact the originating server in order to
> work, it will fail when archived.
> + Unknown sites -- If Alexa doesn't know about your site, it
> won't be archived. Use the Alexa service, and we will know
> about your page. Or you can visit our [21]Archive Your Site
> page.
> + Orphan pages -- If there are no links to your pages, our
> robot won't find them (our robots don't enter queries in search
> boxes).
> As a general rule of thumb, simple html is the easiest to archive.
> 12. Some sites are not available because of Robots.txt or other
> exclusions.
> What does that mean?
> The Standard for Robot Exclusion (SRE) is a means by which web
> site owners can instruct automated systems not to crawl their
> sites. Web site owners can specify files or directories that are
> allowed or disallowed from a crawl, and they can even create
> specific rules for different automated crawlers. All of this
> information is contained in a file called robots.txt. While
> robots.txt has been adopted as the universal standard for robot
> exclusion, compliance with robots.txt is strictly voluntary. In
> fact most web sites do not have a robots.txt file, and many web
> crawlers are not programmed to obey the instructions anyway.
> However, Alexa, the company that crawls the web for the Internet
> Archive, does respect robots.txt instructions, and even does so
> retroactively. If a web site owner ever decides he / she prefers
> not to have a web crawler visiting his / her files and sets up
> robots.txt on the site, the Alexa crawlers will stop visiting
> those files and mark all files previously gathered as unavailable.
> This means that sometimes, while using the Internet Archive
> Wayback Machine, you may find a site that is unavailable due to
> robots.txt or other exclusions. Other exclusions? Yes, sometimes a
> web site owner will contact us directly and ask us to stop
> crawling or archiving a site. We comply with these requests.
> 13. How can I get my site included in the Archive?
> Alexa Internet has been crawling the web since 1996, which has
> resulted in a massive archive. If you have a web site, and you
> would like to ensure that it is saved for posterity in the Alexa
> Archive, chances are that it's already there. We make every effort
> to crawl the entire publicly available web. However, if you wish
> to take extra measures to ensure that we archive your site, you
> can visit the Alexa "[22]Archive Your Site" page.
> 14. How can I help?
> The Internet Archive actively seeks donations of digital materials
> for preservation. Alexa Internet provides access to a web-wide
> crawl that contains copies of the publicly accessible web. If you
> have digital materials that may be of interest to future
> generations, [23]let us know. The Internet Archive is also seeking
> additional funding to continue this important mission. Please
> [24]contact us if you wish to make a contribution.
>
> The Internet Archive Wayback Machine is a service created by [28]Alexa
> to enable people to surf an ongoing archive of the web.
> [
> References
>
> 19. http://web.archive.org/collections/e2k/press_release.html
> 20. http://web.archive.org/collections/web/advanced.html
> 21. http://www.alexa.com/help/webmasters/request_bot.html
> 22. http://www.alexa.com/help/webmasters/request_bot.html
> 23. http://www.archive.org/internet/proposal.html
> 24. mailto:info@archive.org
> 28. http://www.alexa.com/
> ]
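Answer 12 above says a crawler can honor robots.txt retroactively, hiding files it already gathered once the site owner disallows them. Here is a rough sketch of that idea using Python's standard urllib.robotparser. This is only an illustration, not how Alexa's crawler actually works: the crawler name "archive-bot", the example.com URLs, the robots.txt contents, and the helper names are all made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule set for a made-up crawler named
# "archive-bot", and a permissive default for everyone else.
ROBOTS_TXT = """\
User-agent: archive-bot
Disallow: /private/

User-agent: *
Disallow:
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def visible_captures(parser, agent, captured_urls):
    """Keep only the captures that the current rules still allow.

    This mimics the retroactive behavior described in the FAQ:
    anything the site owner now disallows is treated as unavailable,
    even though it was gathered earlier.
    """
    return [url for url in captured_urls if parser.can_fetch(agent, url)]


rules = load_rules(ROBOTS_TXT)
captures = [
    "http://example.com/index.html",
    "http://example.com/private/notes.html",
]
print(visible_captures(rules, "archive-bot", captures))
```

A different crawler name would fall through to the "*" rules, which in this made-up file allow everything.
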
[1] http://impressive.net/people/gerald/1999/01/http-archive/
--
Gerald Oskoboiny <gerald@impressive.net>
http://impressive.net/people/gerald/