the Internet Archive Wayback Machine

by Gerald Oskoboiny <gerald@impressive.net>

 Date:  Tue, 16 Oct 2001 04:21:55 -0400
 To:  fogo@impressive.net
 Replies:  aswartz hugo gerald hugo2 gerald2 hugo3 reagle ij

This is so freakin cool!

    Internet Archive Wayback Machine
    http://web.archive.org/

Here's a blast from the past (my old site):

    http://web.archive.org/web/19970606215310/http://ugweb.cs.ualberta.ca/~gerald/

and some past versions of my current site (not much different than now):

    http://web.archive.org/web/*/http://impressive.net/people/gerald/
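
Both URLs seem to follow the same pattern: a fixed prefix, then a
14-digit timestamp (or "*" for the list of all captures), then the
original URL. Here's a rough Python sketch of that pattern. The
YYYYMMDDhhmmss reading of the timestamp is my guess from the example
above, and wayback_url is just a name I made up for illustration, not
anything the Archive provides:

```python
from datetime import datetime

# Prefix taken from the example URLs above.
WAYBACK_PREFIX = "http://web.archive.org/web"


def wayback_url(original_url, when=None):
    """Build a Wayback Machine URL for original_url.

    when is a datetime for one specific capture; None yields the "*"
    form, which lists every capture of the page.
    Assumption: the timestamp field is YYYYMMDDhhmmss.
    """
    stamp = when.strftime("%Y%m%d%H%M%S") if when is not None else "*"
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


# Rebuilds the two URLs quoted above.
print(wayback_url("http://ugweb.cs.ualberta.ca/~gerald/",
                  datetime(1997, 6, 6, 21, 53, 10)))
print(wayback_url("http://impressive.net/people/gerald/"))
```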

I have been trying to think of stuff that I lost in the last few
years to see if I could retrieve it from this archive, but I
haven't thought of anything yet. I think I became an archive nut
a few years before the Internet Archive went into production, so
I already have anything I really want. (and of course I have
been archiving my own clickstream for the last few years. [1])

Too bad their archive doesn't go back further; it would be really
cool to be able to surf around the early web.

Here's more info about their archive and the tech behind it:

    http://web.archive.org/collections/web/faqs.html

>    The Wayback Machine
>    1. What is the Wayback Machine?
>    2. Can I link to old pages on the Wayback Machine?
>    3. Are other sites available in the Wayback Machine?
>    4. What does it mean when a site's archive date has been "updated"?
>    5. Who was involved in creating the Wayback Machine?
>    6. How was the Wayback Machine made?
>    7. How large is the Archive?
>    8. Can I search the Wayback Machine?
>    9. What type of machinery is used in the Wayback Machine?
>    10. How do you archive dynamic pages?
>    11. Why are some sites harder to archive than others?
>    12. Some sites are not available because of Robots.txt or other
>        exclusions. What does that mean?
>    13. How can I get my site included in the Archive?
>    14. How can I help?
>    
> 
>                                    Answers
>     1. What is the Wayback Machine?
>        The Wayback Machine is a service that allows people to visit
>        archived versions of stored websites.  Visitors to the Wayback
>        Machine can type in an URL, select a date, and then begin surfing
>        on an archived version of the web.  Imagine surfing circa 1999 and
>        looking at all the Y2K hype, or revisiting an older copy of your
>        favorite website.  The Wayback Machine can make all of this
>        possible. See the [19]Press Release.
>     2. Can I link to old pages on the Wayback Machine?
>        Yes! Alexa Internet has built the Wayback Machine so that it can
>        be used and referenced by anybody and everybody. If you find an
>        archived page that you would like to reference on your web page or
>        in an article, you can copy the URL and share it with others. You
>        can even use fuzzy URL matching and date specifications... but
>        that's a [20]bit more advanced.
>     3. Are other sites available in the Wayback Machine?
>        The Internet Archive is attempting to archive the entire publicly
>        available web.  Some sites may not be included because the
>        automated crawlers were unaware of their existence at the time of
>        the crawl.  It's also possible that some sites were not archived
>        because they were password protected or otherwise inaccessible to
>        our automated systems.
>     4. What does it mean when a site's archive date has been "updated"?
>        When our automated systems crawl the web every few months or so,
>        we find that only about 50% of all pages on the web have changed
>        from our previous visit.  This means that much of the content in
>        our archive is duplicate material.  If you don't see "*" next to
>        an archived document, then the content on the archived page is
>        identical to the previously archived copy.
>     5. Who was involved in creating the Wayback Machine?
>        The original idea for the Wayback Machine began in 1996, when the
>        Internet Archive first began archiving the web.  Now, five years
>        later, with over 100 terabytes and a dozen web crawls completed,
>        the Internet Archive has made the Wayback Machine available to the
>        public.  The Internet Archive has relied on donations of web
>        crawls, technology and expertise from Alexa Internet.  The Wayback
>        Machine is owned and operated by the Internet Archive.
>     6. How was the Wayback Machine made?
>        Over 100 terabytes of data are stored on several dozen modified
>        servers situated in the basement of a former military building in
>        the Presidio of San Francisco. Alexa Internet, in cooperation with
>        the Internet Archive, has designed a three dimensional index that
>        allows browsing of web documents over multiple time periods, and
>        turned this unique feature into the Wayback Machine.
>     7. How large is the Archive?
>        The Wayback Machine contains over 100 terabytes of data and is
>        currently growing at a rate of 12 terabytes per month.  The
>        archive contains multiple copies of the entire publicly available
>        web.  This eclipses the amount of data contained in the world's
>        largest libraries, including the Library of Congress.  The Wayback
>        Machine is the largest known database by a factor of 20.  If you
>        tried to place the entire contents of the archive onto floppy
>        disks (I don't recommend this!) and laid them end to end, it would
>        stretch from New York, past Los Angeles, and halfway to Hawaii.
>     8. Can I search the Wayback Machine?
>        Using the Wayback Machine, it is possible to search for the names
>        of sites contained in the collection and to specify date ranges
>        for your search. However, we do not yet have an indexed text
>        search of the documents in the collection. The collection is a bit
>        too large and complicated for that. We continue to work on it and
>        should have a full text search soon.
>     9. What type of machinery is used in the Wayback Machine?
>        The Internet Archive is stored on dozens of slightly modified
>        Hewlett Packard servers. The computers run on the FreeBSD
>        operating system. Each computer has 512 MB of memory and can hold
>        just over 300 gigabytes of data on IDE disks.
>    10. How do you archive dynamic pages?
>        There are many different kinds of dynamic pages, some of which are
>        easily stored in an archive and some of which fall apart
>        completely. When a dynamic page renders standard html, the archive
>        works beautifully. When a dynamic page contains forms, JavaScript,
>        or other elements that require interaction with the originating
>        host, the archive will not accurately reflect the original site's
>        functionality.
>    11. Why are some sites harder to archive than others?
>        If you look at our collection of archived sites, you will find
>        some broken pages, missing graphics, and some sites that aren't
>        archived at all. We have tried to create a complete archive, but
>        have had difficulties with some sites. Here are some things that
>        make it difficult to archive a web site:
>           + Robots.txt -- If our robot crawler is forbidden from visiting
>             a site, we can't archive it.
>           + Javascript -- Javascript elements are often hard for us to
>             archive, especially if they generate links without having
>             the full name in the page. Plus, if Javascript needs to
>             contact the originating server in order to work, it
>             will fail when archived.
>           + Server side image maps -- Like any functionality on the web,
>             if it needs to contact the originating server in order to
>             work, it will fail when archived.
>           + Unknown sites -- If Alexa doesn't know about your site, it
>             won't be archived. Use the Alexa service, and we will know
>             about your page. Or you can visit our [21]Archive Your Site
>             page.
>           + Orphan pages -- If there are no links to your pages, our
>             robot won't find them (our robots don't enter queries in
>             search boxes).
>        As a general rule of thumb, simple html is the easiest to archive.
>    12. Some sites are not available because of Robots.txt or other
>        exclusions. What does that mean?
>        The Standard for Robot Exclusion (SRE) is a means by which web
>        site owners can instruct automated systems not to crawl their
>        sites. Web site owners can specify files or directories that are
>        allowed or disallowed from a crawl, and they can even create
>        specific rules for different automated crawlers. All of this
>        information is contained in a file called robots.txt. While
>        robots.txt has been adopted as the universal standard for robot
>        exclusion, compliance with robots.txt is strictly voluntary. In
>        fact most web sites do not have a robots.txt file, and many web
>        crawlers are not programmed to obey the instructions anyway.
>        However, Alexa, the company that crawls the web for the Internet
>        Archive, does respect robots.txt instructions, and even does so
>        retroactively. If a web site owner ever decides he / she prefers
>        not to have a web crawler visiting his / her files and sets up
>        robots.txt on the site, the Alexa crawlers will stop visiting
>        those files and mark all files previously gathered as unavailable.
>        This means that sometimes, while using the Internet Archive
>        Wayback Machine, you may find a site that is unavailable due to
>        robots.txt or other exclusions. Other exclusions? Yes, sometimes a
>        web site owner will contact us directly and ask us to stop
>        crawling or archiving a site. We comply with these requests.
>    13. How can I get my site included in the Archive?
>        Alexa Internet has been crawling the web since 1996, which has
>        resulted in a massive archive. If you have a web site, and you
>        would like to ensure that it is saved for posterity in the Alexa
>        Archive, chances are that it's already there. We make every effort
>        to crawl the entire publicly available web. However, if you wish
>        to take extra measures to ensure that we archive your site, you
>        can visit the Alexa "[22]Archive Your Site" page.
>    14. How can I help?
>        The Internet Archive actively seeks donations of digital materials
>        for preservation. Alexa Internet provides access to a web-wide
>        crawl that contains copies of the publicly accessible web. If you
>        have digital materials that may be of interest to future
>        generations, [23]let us know. The Internet Archive is also seeking
>        additional funding to continue this important mission. Please
>        [24]contact us if you wish to make a contribution.
> 
>    The Internet Archive Wayback Machine is a service created by [28]Alexa
>           to enable people to surf an ongoing archive of the web.
> [
> References
> 
>   19. http://web.archive.org/collections/e2k/press_release.html
>   20. http://web.archive.org/collections/web/advanced.html
>   21. http://www.alexa.com/help/webmasters/request_bot.html
>   22. http://www.alexa.com/help/webmasters/request_bot.html
>   23. http://www.archive.org/internet/proposal.html
>   24. mailto:info@archive.org
>   28. http://www.alexa.com/
> ]
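
The robots.txt behaviour described in answer 12 can be roughed out
with Python's standard urllib.robotparser. This is only a sketch, not
how Alexa's crawler actually works: the "ia_archiver" user-agent name,
the example.com URLs and the sample rules are my assumptions, not
something the FAQ states.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep one crawler out of /private/ and let
# every other robot fetch anything (an empty Disallow allows all).
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /private/

User-agent: *
Disallow:
"""


def may_crawl(user_agent, url, robots_txt=SAMPLE_ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ia_archiver" is assumed here to be the name of Alexa's crawler.
print(may_crawl("ia_archiver", "http://example.com/index.html"))
print(may_crawl("ia_archiver", "http://example.com/private/notes.html"))
print(may_crawl("SomeOtherBot", "http://example.com/private/notes.html"))
```

With these sample rules the first and third checks come back True and
the second False. A retroactive exclusion like the one the FAQ
describes would presumably run the same check against captures that
are already stored and hide the ones that fail.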

[1] http://impressive.net/people/gerald/1999/01/http-archive/

-- 
Gerald Oskoboiny <gerald@impressive.net>
http://impressive.net/people/gerald/

HURL: fogo mailing list archives, maintained by Gerald Oskoboiny