the Internet Archive Wayback Machine

from Gerald Oskoboiny <gerald@impressive.net>, Tue, 16 Oct 2001 04:21:55 -0400

Replies:

Parents:

None.

This is so freakin cool!

Internet Archive Wayback Machine
http://web.archive.org/

Here's a blast from the past (my old site):

http://web.archive.org/web/19970606215310/http://ugweb.cs.ualberta.ca/~gerald/

and some past versions of my current site (not much different than now):

http://web.archive.org/web/*/http://impressive.net/people/gerald/
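
The two links above follow one pattern: a fixed prefix, then either a
14-digit capture timestamp or "*" (all captures), then the original
address. Below is a minimal sketch of splitting such a link back into
its parts. The function name parseWaybackUrl is made up for
illustration, and reading the 14 digits as YYYYMMDDhhmmss in UTC is an
assumption, not something the FAQ quoted below states.

```javascript
// Hypothetical helper: split a Wayback Machine URL into its parts.
// Assumes the 14-digit field is YYYYMMDDhhmmss in UTC.
function parseWaybackUrl(url) {
  const match = /^https?:\/\/web\.archive\.org\/web\/(\d{14}|\*)\/(.+)$/.exec(url);
  if (!match) return null; // not a recognised archive URL
  const [, timestamp, original] = match;
  if (timestamp === "*") {
    // "*" asks for the list of all captures, so there is no single date
    return { timestamp, original, date: null };
  }
  const [y, mo, d, h, mi, s] = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/
    .exec(timestamp)
    .slice(1)
    .map(Number);
  // JavaScript months are zero-based, hence mo - 1
  return { timestamp, original, date: new Date(Date.UTC(y, mo - 1, d, h, mi, s)) };
}

const parsed = parseWaybackUrl(
  "http://web.archive.org/web/19970606215310/http://ugweb.cs.ualberta.ca/~gerald/"
);
console.log(parsed.original);           // http://ugweb.cs.ualberta.ca/~gerald/
console.log(parsed.date.toISOString()); // 1997-06-06T21:53:10.000Z
```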

I have been trying to think of stuff that I lost in the last few
years to see if I could retrieve it from this archive, but I
haven't thought of anything yet. I think I became an archive nut
a few years before the Internet Archive went into production, so
I already have anything I really want. (and of course I have
been archiving my own clickstream for the last few years. [1])

Too bad their archive doesn't go back further; it would be really
cool to be able to surf around the early web.

Here's more info about their archive and the tech behind it:

http://web.archive.org/collections/web/faqs.html

> The Wayback Machine
> 1. What is the Wayback Machine?
> 2. Can I link to old pages on the Wayback Machine?
> 3. Are other sites available in the Wayback Machine?
> 4. What does it mean when a site's archive date has been "updated"?
> 5. Who was involved in creating the Wayback Machine?
> 6. How was the Wayback Machine made?
> 7. How large is the Archive?
> 8. Can I search the Wayback Machine?
> 9. What type of machinery is used in the Wayback Machine?
> 10. How do you archive dynamic pages?
> 11. Why are some sites harder to archive than others?
> 12. Some sites are not available because of Robots.txt or other
> exclusions. What does that mean?
> 13. How can I get my site included in the Archive?
> 14. How can I help?
>
>
> Answers
> 1. What is the Wayback Machine?
> The Wayback Machine is a service that allows people to visit
> archived versions of stored websites. Visitors to the Wayback
> Machine can type in an URL, select a date, and then begin surfing
> on an archived version of the web. Imagine surfing circa 1999 and
> looking at all the Y2K hype, or revisiting an older copy of your
> favorite website. The Wayback Machine can make all of this
> possible. See the [19]Press Release.
> 2. Can I link to old pages on the Wayback Machine?
> Yes! Alexa Internet has built the Wayback Machine so that it can
> be used and referenced by anybody and everybody. If you find an
> archived page that you would like to reference on your web page or
> in an article, you can copy the URL and share it with others. You
> can even use fuzzy URL matching and date specifications... but
> that's a [20]bit more advanced.
> 3. Are other sites available in the Wayback Machine?
> The Internet Archive is attempting to archive the entire publicly
> available web. Some sites may not be included because the
> automated crawlers were unaware of their existence at the time of
> the crawl. It's also possible that some sites were not archived
> because they were password protected or otherwise inaccessible to
> our automated systems.
> 4. What does it mean when a site's archive date has been "updated"?
> When our automated systems crawl the web every few months or so,
> we find that only about 50% of all pages on the web have changed
> from our previous visit. This means that much of the content in
> our archive is duplicate material. If you don't see "*" next to
> an archived document, then the content on the archived page is
> identical to the previously archived copy.
> 5. Who was involved in creating the Wayback Machine?
> The original idea for the Wayback Machine began in 1996, when the
> Internet Archive first began archiving the web. Now, five years
> later, with over 100 terabytes and a dozen web crawls completed,
> the Internet Archive has made the Wayback Machine available to the
> public. The Internet Archive has relied on donations of web
> crawls, technology and expertise from Alexa Internet. The Wayback
> Machine is owned and operated by the Internet Archive.
> 6. How was the Wayback Machine made?
> Over 100 terabytes of data are stored on several dozen modified
> servers situated in the basement of a former military building in
> the Presidio of San Francisco. Alexa Internet, in cooperation with
> the Internet Archive, has designed a three dimensional index that
> allows browsing of web documents over multiple time periods, and
> turned this unique feature into the Wayback Machine.
> 7. How large is the Archive?
> The Wayback Machine contains over 100 terabytes of data and is
> currently growing at a rate of 12 terabytes per month. The
> archive contains multiple copies of the entire publicly available
> web. This eclipses the amount of data contained in the world's
> largest libraries, including the Library of Congress. The Wayback
> Machine is the largest known database by a factor of 20. If you
> tried to place the entire contents of the archive onto floppy
> disks (I don't recommend this!) and laid them end to end, it would
> stretch from New York, past Los Angeles, and halfway to Hawaii.
> 8. Can I search the Wayback Machine?
> Using the Wayback Machine, it is possible to search for the names
> of sites contained in the collection and to specify date ranges
> for your search. However, we do not yet have an indexed text
> search of the documents in the collection. The collection is a bit
> too large and complicated for that. We continue to work on it and
> should have a full text search soon.
> 9. What type of machinery is used in the Wayback Machine?
> The Internet Archive is stored on dozens of slightly modified
> Hewlett Packard servers. The computers run on the FreeBSD
> operating system. Each computer has 512Mb of memory and can hold
> just over 300 gigabytes of data on IDE disks.
> 10. How do you archive dynamic pages?
> There are many different kinds of dynamic pages, some of which are
> easily stored in an archive and some of which fall apart
> completely. When a dynamic page renders standard html, the archive
> works beautifully. When a dynamic page contains forms, JavaScript,
> or other elements that require interaction with the originating
> host, the archive will not accurately reflect the original site's
> functionality.
> 11. Why are some sites harder to archive than others?
> If you look at our collection of archived sites, you will find
> some broken pages, missing graphics, and some sites that aren't
> archived at all. We have tried to create a complete archive, but
> have had difficulties with some sites. Here are some things that
> make it difficult to archive a web site:
> + Robots.txt -- If our robot crawler is forbidden from visiting
> a site, we can't archive it.
> + Javascript -- Javascript elements are often hard for us to
> archive, but especially if it generates links without having
> the full name in the page. Plus, if javascript needs to
> contact the originating server in order to work, it
> will fail when archived.
> + Server side image maps -- Like any functionality on the web,
> if it needs to contact the originating server in order to
> work, it will fail when archived.
> + Unknown sites -- If Alexa doesn't know about your site, it
> won't be archived. Use the Alexa service, and we will know
> about your page. Or you can visit our [21]Archive Your Site
> page.
> + Orphan pages -- If there are no links to your pages, our
> robot won't find it (our robots don't enter queries in search
> boxes.)
> As a general rule of thumb, simple html is the easiest to archive.
> 12. Some sites are not available because of Robots.txt or other
> exclusions. What does that mean?
> The Standard for Robot Exclusion (SRE) is a means by which web
> site owners can instruct automated systems not to crawl their
> sites. Web site owners can specify files or directories that are
> allowed or disallowed from a crawl, and they can even create
> specific rules for different automated crawlers. All of this
> information is contained in a file called robots.txt. While
> robots.txt has been adopted as the universal standard for robot
> exclusion, compliance with robots.txt is strictly voluntary. In
> fact most web sites do not have a robots.txt file, and many web
> crawlers are not programmed to obey the instructions anyway.
> However, Alexa, the company that crawls the web for the Internet
> Archive, does respect robots.txt instructions, and even does so
> retroactively. If a web site owner ever decides he / she prefers
> not to have a web crawler visiting his / her files and sets up
> robots.txt on the site, the Alexa crawlers will stop visiting
> those files and mark all files previously gathered as unavailable.
> This means that sometimes, while using the Internet Archive
> Wayback Machine, you may find a site that is unavailable due to
> robots.txt or other exclusions. Other exclusions? Yes, sometimes a
> web site owner will contact us directly and ask us to stop
> crawling or archiving a site. We comply with these requests.
> 13. How can I get my site included in the Archive?
> Alexa Internet has been crawling the web since 1996, which has
> resulted in a massive archive. If you have a web site, and you
> would like to ensure that it is saved for posterity in the Alexa
> Archive, chances are that it's already there. We make every effort
> to crawl the entire publicly available web. However, if you wish
> to take extra measures to ensure that we archive your site, you
> can visit the Alexa "[22]Archive Your Site" page.
> 14. How can I help?
> The Internet Archive actively seeks donations of digital materials
> for preservation. Alexa Internet provides access to a web-wide
> crawl that contains copies of the publicly accessible web. If you
> have digital materials that may be of interest to future
> generations, [23]let us know. The Internet Archive is also seeking
> additional funding to continue this important mission. Please
> [24]contact us if you wish to make a contribution.
>
> The Internet Archive Wayback Machine is a service created by [28]Alexa
> to enable people to surf an ongoing archive of the web.
> [
> References
>
> 19. http://web.archive.org/collections/e2k/press_release.html
> 20. http://web.archive.org/collections/web/advanced.html
> 21. http://www.alexa.com/help/webmasters/request_bot.html
> 22. http://www.alexa.com/help/webmasters/request_bot.html
> 23. http://www.archive.org/internet/proposal.html
> 24. mailto:info@archive.org
> 28. http://www.alexa.com/
> ]
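
FAQ answer 12 above describes robots.txt as per-crawler rules naming
files or directories that are off limits. Below is a minimal sketch of
how a crawler might check a path against such a file. It handles only
User-agent and Disallow lines with plain prefix matching and exact
(case-insensitive) agent names, which is a simplification of the real
convention. The function names are made up for illustration, and the
agent name "ia_archiver" is an assumption about what the Alexa crawler
calls itself, not something the FAQ states.

```javascript
// Simplified robots.txt reader: groups of User-agent lines followed by
// Disallow lines. No Allow rules, no wildcards inside paths.
function parseRobots(text) {
  const groups = []; // each entry: { agents: [...], disallow: [...] }
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    if (!line) continue;
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === "user-agent") {
      // consecutive User-agent lines share one group of rules
      if (!lastWasAgent) {
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else {
      // an empty Disallow value means "nothing is disallowed"
      if (field === "disallow" && current && value !== "") {
        current.disallow.push(value);
      }
      lastWasAgent = false;
    }
  }
  return groups;
}

function isAllowed(robotsText, agent, path) {
  const groups = parseRobots(robotsText);
  const name = agent.toLowerCase();
  // a group naming this agent wins over the catch-all "*" group
  const group =
    groups.find((g) => g.agents.includes(name)) ||
    groups.find((g) => g.agents.includes("*"));
  if (!group) return true; // no applicable rules: everything is allowed
  return !group.disallow.some((prefix) => path.startsWith(prefix));
}

const sampleRobots = [
  "User-agent: ia_archiver   # assumed name of the Alexa crawler",
  "Disallow: /private/",
  "",
  "User-agent: *",
  "Disallow:",
].join("\n");

console.log(isAllowed(sampleRobots, "ia_archiver", "/private/notes.html")); // false
console.log(isAllowed(sampleRobots, "ia_archiver", "/index.html"));         // true
```

The retroactive behaviour the FAQ mentions (marking earlier captures
as unavailable) would be a separate step on the archive side and is
not modelled here.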

[1] http://impressive.net/people/gerald/1999/01/http-archive/

--
Gerald Oskoboiny <gerald@impressive.net>
http://impressive.net/people/gerald/

Re: the Internet Archive Wayback Machine

from Aaron Swartz <aswartz@upclink.com>, Tue, 16 Oct 2001 08:32:35 -0500

Replies:

None.

Parents:

gerald

On Tuesday, October 16, 2001, at 03:21 AM, Gerald Oskoboiny wrote:

> This is so freakin cool!
>
> Internet Archive Wayback Machine
> http://web.archive.org/

Oh neat, didn't realize they'd put a tidy interface on it... I
was still going in the back door. Perhaps then you'll be
interested in:

http://www.televisionarchive.org/
TelevisionArchive
A library of world perspectives...

Perhaps not as good as free satellite television, but still pretty cool.

--
[ "Aaron Swartz" ; <mailto:me@aaronsw.com> ; <http://www.aaronsw.com/> ]

Re: the Internet Archive Wayback Machine

from Hugo Haas <hugo@larve.net>, Wed, 17 Oct 2001 08:40:28 -0400

Replies:

Parents:

gerald

* Gerald Oskoboiny <gerald@impressive.net> [2001-10-16 04:21-0400]
> This is so freakin cool!
>
> Internet Archive Wayback Machine
> http://web.archive.org/
[..]

Very good, I managed to find the Mutt propaganda page I wrote 3.5
years ago and that I erased at some point (never delete anything...):

http://archive1.alexa.com/web/19991007065525/http://www.via.ecp.fr/~hugo/mutt/

I was also amused to see that I wrote at the time:

* Last but not least, it is open-source software (GNU General Public
License), which is not Pine's case...

I had been really surprised when a few months ago, people realized
that Pine/Pico's licenses sucked[2]. This basically wasn't news.

I was trying to see what version I started using, and I found thanks
to Google a message[1] that I sent in comp.mail.mutt about Mutt
0.91.1.

1. http://groups.google.com/groups?q=group:comp.mail.mutt+author:Hugo+author:Haas&start=20&hl=en&scoring=r&rnum=26&selm=6hd10v%24qqm%241%40smilodon.ecp.fr
2. http://slashdot.org/article.pl?sid=01/07/03/1529226&mode=thread
--
Hugo Haas <hugo@larve.net> - http://larve.net/people/hugo/
Y'avait blumaize, en 8 lettres.

Re: the Internet Archive Wayback Machine

from Gerald Oskoboiny <gerald@impressive.net>, Thu, 18 Oct 2001 01:50:30 -0400

Replies:

Parents:

gerald
hugo

On Wed, Oct 17, 2001 at 08:40:28AM -0400, Hugo Haas wrote:
> * Gerald Oskoboiny <gerald@impressive.net> [2001-10-16 04:21-0400]
> > This is so freakin cool!
> >
> > Internet Archive Wayback Machine
> > http://web.archive.org/
> [..]
>
> Very good, I managed to find the Mutt propaganda page I wrote 3.5
> years ago and that I erased at some point (never delete anything...):
>
> http://archive1.alexa.com/web/19991007065525/http://www.via.ecp.fr/~hugo/mutt/

hmm... that URI doesn't work for me (host not found), but this
one does:

http://web.archive.org/web/19991007065525/http://www.via.ecp.fr/~hugo/mutt/

I made a bookmarklet [2] to use to access this archive, so whenever
I get a 404 when trying to find something on the Web, I'm just a
couple clicks away from the most recently archived version! Cooool.

I think this is the biggest upgrade to the Web since Google...
For example, previously it wasn't possible to create persistent
links to Dilbert cartoons, because they disappear after a while.
But now instead of linking to them as:

http://www.comics.com/comics/dilbert/archive/dilbert-20011016.html

you can just link to:

http://web.archive.org/http://www.comics.com/comics/dilbert/archive/dilbert-20011016.html

(and whenever you access something and expect it to disappear
soon, you just need to access the bookmarklet once to cause Alexa
to archive a copy of it.)

I wonder what Alexa will do if e.g. the Dilbert folks complain
about copyright violations; I was surprised that kind of thing
wasn't already covered in their FAQs.

> I was trying to see what version I started using, and I found thanks
> to Google a message[1] that I sent in comp.mail.mutt about Mutt
> 0.91.1.

I started with version 0.93.2 (on Nov 29, 1998, as documented [3] :)

One of these days I would like to figure out when I started using
Linux. I remember running 1.2.8 at home for a while, but don't
remember if that was the first one I used. The other day I was in
an internet cafe in Edmonton, and a guy noticed I was running Linux
and asked how long I had been using it (and if I wanted to set up
a server for him :) I said about 6 years, but wasn't really sure.

[2] javascript:void(self.location='http://web.archive.org/web/'+self.location);
[3] http://impressive.net/people/gerald/1998/#11mutt
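
Bookmarklet [2] simply prepends the archive prefix to the address of
the current page. Below is a sketch of the same idea as a plain
function, with an optional timestamp so it can also produce dated
links like the Mutt one above. The name buildWaybackUrl and the check
that a timestamp is 14 digits or "*" are illustrative assumptions, not
part of the bookmarklet.

```javascript
const WAYBACK_PREFIX = "http://web.archive.org/web/";

// Hypothetical helper: build an archive link for a page address.
// With no timestamp it mirrors the bookmarklet: prefix + address.
function buildWaybackUrl(pageUrl, timestamp) {
  if (timestamp === undefined) return WAYBACK_PREFIX + pageUrl;
  if (timestamp !== "*" && !/^\d{14}$/.test(timestamp)) {
    throw new Error("timestamp must be 14 digits or '*'");
  }
  return WAYBACK_PREFIX + timestamp + "/" + pageUrl;
}

console.log(buildWaybackUrl("http://www.via.ecp.fr/~hugo/mutt/", "19991007065525"));
// http://web.archive.org/web/19991007065525/http://www.via.ecp.fr/~hugo/mutt/
```

In a browser, the bookmarklet's effect corresponds to
self.location = buildWaybackUrl(String(self.location)).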

--
Gerald Oskoboiny <gerald@impressive.net>
http://impressive.net/people/gerald/

Re: the Internet Archive Wayback Machine

from Hugo Haas <hugo@larve.net>, Thu, 18 Oct 2001 07:39:16 -0400

Replies:

Parents:

* Gerald Oskoboiny <gerald@impressive.net> [2001-10-18 01:50-0400]
> > Very good, I managed to find the Mutt propaganda page I wrote 3.5
> > years ago and that I erased at some point (never delete anything...):
> >
> > http://archive1.alexa.com/web/19991007065525/http://www.via.ecp.fr/~hugo/mutt/
>
> hmm... that URI doesn't work for me (host not found), but this
> one does:
>
> http://web.archive.org/web/19991007065525/http://www.via.ecp.fr/~hugo/mutt/

That's interesting. archive1.alexa.com is what I was getting, and I
couldn't validate the page (I wanted to see if my HTML was valid :-)
because validator couldn't resolve the name... Weird.

> I made a bookmarklet [2] to use to access this archive, so whenever
> I get a 404 when trying to find something on the Web, I'm just a
> couple clicks away from the most recently archived version! Cooool.

Coooool indeed.

[..]
> One of these days I would like to figure out when I started using
> Linux. I remember running 1.2.8 at home for a while, but don't
> remember if that was the first one I used. The other day I was in
> an internet cafe in Edmonton, and a guy noticed I was running Linux
> and asked how long I had been using it (and if I wanted to set up
> a server for him :) I said about 6 years, but wasn't really sure.

I documented that (I had to write about that: something that I
documented and that Gerald didn't had to be recorded).

A long time ago, I added myself[3] to the Linux counter[4]. I started
with Slackware in February 1996, to quickly move to Debian in August.
Life has never been the same since. ;-)

3. http://counter.li.org/cgi-bin/runscript/display-person.cgi?user
4. http://counter.li.org/
--
Hugo Haas <hugo@larve.net> - http://larve.net/people/hugo/
Kids, just because I don't care doesn't mean I'm not listening. --
Homer J. Simpson

Linux use start date (was Re: the Internet Archive Wayback Machine)

from Gerald Oskoboiny <gerald@impressive.net>, Thu, 18 Oct 2001 11:14:46 -0400

Replies:

reagle
ij

Parents:

On Thu, Oct 18, 2001 at 07:39:16AM -0400, Hugo Haas wrote:
> * Gerald Oskoboiny <gerald@impressive.net> [2001-10-18 01:50-0400]
> [..]
> > One of these days I would like to figure out when I started using
> > Linux. I remember running 1.2.8 at home for a while, but don't
> > remember if that was the first one I used. The other day I was in
> > an internet cafe in Edmonton, and a guy noticed I was running Linux
> > and asked how long I had been using it (and if I wanted to set up
> > a server for him :) I said about 6 years, but wasn't really sure.
>
> I documented that (I had to write about that: something that I
> documented and that Gerald didn't had to be recorded).

I didn't say I didn't document it... I just don't know where :)

> A long time ago, I added myself[3] to the Linux counter[4]. I started
> with Slackware in February 1996, to quickly move to Debian in August.
> Life has never been the same since. ;-)

Oh yeah, good idea. Hmm... I tried to login to the linux counter site
using gerald@cs.ualberta.ca, and no worky. So I tried gerald@pobox.com,
and it said "sending your password to that address"; unfortunately I
let that address lapse a few years ago and someone else took it over.

But I just noticed it's available again, so I applied for it and got
the linux counter to resend me my password, and found my record:

http://counter.li.org/cgi-bin/runscript/display-person.cgi?user=22562

("Started: sep 1995", i.e. exactly six years ago. Good guess :)

I think that's probably pretty close to when I started using it;
I don't remember if I filled out the linux counter info during my
first install, or when.

Oh... once I went through all that, I remembered that I could
probably find out the info using a Google search. [5]

I want to make a web page to keep track of this kind of thing
(various public profiles of myself), e.g. my profiles on slashdot,
advogato, sourceforge, etc.

> 3. http://counter.li.org/cgi-bin/runscript/display-person.cgi?user
> 4. http://counter.li.org/

that should be:
http://counter.li.org/cgi-bin/runscript/display-person.cgi?user=28230

BTW, the Linux counter site misuses HTTP GET [6]; I'll have to
flame the maintainer about that. (tsk, the grand poobah of the
IETF [7] should know better!)

[5] http://www.google.com/search?q=oskoboiny+%22linux+counter%22
[6] http://mail.python.org/pipermail/mailman-developers/2001-July/009086.html
[7] http://www.alvestrand.no/ietf/

--
Gerald Oskoboiny <gerald@impressive.net>
http://impressive.net/people/gerald/

Re: Linux use start date (was Re: the Internet Archive Wayback Machine)

from Joseph Reagle <reagle@mit.edu>, Thu, 18 Oct 2001 11:38:57 -0400

Replies:

Parents:

On Thursday 18 October 2001 11:14, Gerald Oskoboiny wrote:
> On Thu, Oct 18, 2001 at 07:39:16AM -0400, Hugo Haas wrote:
> > * Gerald Oskoboiny <gerald@impressive.net> [2001-10-18 01:50-0400]
> > A long time ago, I added myself[3] to the Linux counter[4]. I started
> > with Slackware in February 1996, to quickly move to Debian in August.
> > Life has never been the same since. ;-)

Ah, you newbies! <grin/> I first used linux back in 93/94 when my friend
Kevin (on the same floor in our dorm) installed it on his PC. It was sweet
because we could have one tty for emacs, and the other for the compiler,
and we didn't have to use the crappy vt100 terminals in the lounge, or go
all the way to CS to use the nice X boxes (HP9000s or SGIs). Before the
winter break I created my pile of ~42 floppies and my brother and I
installed it at home over the break, though we never could get the video
driver to work without giving us (literal) headaches.

That's also the time I got serious RSI and had to back off geeking
altogether, and only eased back in very timidly by writing policy papers in
Word on my Win95 box (which was much less tempting to tweak, just had to
limit myself to email and Word).

--
Regards, http://www.mit.edu/~reagle/
Joseph Reagle E0 D5 B2 05 B6 12 DA 65 BE 4D E3 C1 6A 66 25 4E

* This email is from an independent academic account and is
not necessarily representative of my affiliations.

Re: Linux use start date (was Re: the Internet Archive Wayback Machine)

from "Ian B. Jacobs" <ij@w3.org>, Fri, 16 Nov 2001 18:51:04 -0500

Replies:

None.

Parents:

Joseph Reagle wrote:
>
> On Thursday 18 October 2001 11:14, Gerald Oskoboiny wrote:
> > On Thu, Oct 18, 2001 at 07:39:16AM -0400, Hugo Haas wrote:
> > > * Gerald Oskoboiny <gerald@impressive.net> [2001-10-18 01:50-0400]
> > > A long time ago, I added myself[3] to the Linux counter[4]. I started
> > > with Slackware in February 1996, to quickly move to Debian in August.
> > > Life has never been the same since. ;-)
>
> Ah, you newbies! <grin/> I first used linux back in 93/94 when my friend
> Kevin (on the same floor in our dorm) installed it on his PC.

I, too, started with Slackware in 1994. I had just moved back from
France to the US. In France, I had been working with Sparc stations
at INRIA, and I wanted to stick with Unix. I had something like v 0.9
of slackware, which I installed with no assistance on a Zeos desktop.

Years later, I migrated to RedHat on advice from various people at W3C.
Later, I migrated to Debian on advice from those same people.

Leave me alone. :)

_ Ian

--
Ian Jacobs (ij@w3.org) http://www.w3.org/People/Jacobs
Tel: +1 718 260-9447

Re: the Internet Archive Wayback Machine

from Hugo Haas <hugo@larve.net>, Thu, 18 Oct 2001 11:21:04 -0400

Replies:

None.

Parents:

* Hugo Haas <hugo@larve.net> [2001-10-18 07:39-0400]
> A long time ago, I added myself[3] to the Linux counter[4]. I started
> with Slackware in February 1996, to quickly move to Debian in August.
> Life has never been the same since. ;-)
>
> 3. http://counter.li.org/cgi-bin/runscript/display-person.cgi?user

Hmmm... I truncated the URI while pasting:

http://counter.li.org/cgi-bin/runscript/display-person.cgi?user=28230

--
Hugo Haas <hugo@larve.net> - http://larve.net/people/hugo/
Everybody gotta wear clothes, and if you don't, you get arrested. --
Mr. T