Feed: Brewster Kahle & The Largest Library In History

http://www.feedmag.com/re/re392_master.html

> You don't really, truly understand Brewster Kahle until you've
> had him show you the server farm in Alexa Internet's basement.
> Walk down a flight of outdoor steps at the side of an old military
> personnel processing building in San Francisco's Presidio, and
> you'll see an entire universe of data -- or at least a bank of
> dark-toned Linux servers arrayed along a twenty-foot wall. The
> room itself -- moldy concrete, with a few spare windows gazing
> out at foot-level -- might have held a lawn mower and some spare
> file cabinets a few decades ago. Now it houses what may well be
> the most accurate snapshot of The Collective Intelligence
> anywhere in the world: thirty terabytes of data, archiving both
> the web itself, and the patterns of traffic flowing through it.
>
> As the creator of the WAIS (Wide Area Information Server) system,
> Kahle was already an Internet legend when he launched Alexa in
> 1996. Described as a "surf engine," the Alexa software used
> collaborative-filtering-like technology to build connections
> between sites based on user traffic. The results from its
> technology are showcased in the "Related sites" menu option found
> in most browsers today. Amazon.com acquired Alexa Internet in
> 1999, but the company remains happily ensconced in its low-tech
> Presidio offices, WWII temporary structures filled with the smell
> of the nearby eucalyptus trees.
>
> During our half-hour conversation in Alexa's makeshift conference
> room, Kahle jumps up excitedly at several points to sketch a
> graph out on the whiteboard. We speak about the large-scale
> trends in Web traffic, the history of libraries, and how to build
> a business model for small publishers. After our talk, he takes
> me down to the basement to see the servers. "In just three years
> we got bigger than the Library of Congress, the biggest library
> on the planet," he says, arms outstretched, smiling. "So the
> question is: What do we do now?"
>
> -- Steven Johnson

http://www.feedmag.com/re/re392_master2.html

> FEED: What has changed since you started Alexa in your perception
> of what the overall Web looks like? You probably know more about
> the overall distribution of things -- because of what Alexa does
> -- than just about anyone. Are you surprised by what it looks
> like?
>
> BREWSTER KAHLE: Continuously amazed, surprised, bewildered by
> what's going on -- it's completely fun and interesting to just be
> in the soup. By being here at Alexa, we've got the biggest
> collection of what the current Web looks like now, and in the
> past, as well as where millions of people are surfing. We don't
> know who's who and we don't care, but we can get kind of an idea
> of how the use of the Net is evolving, as opposed to what's just on
> the Net. So one of the things that I find really astounding is
> this graph that came out of the Alexa Research Group, which is a
> graph of the number of different Web sites and then what
> percentage of traffic is going to those top-end Web sites. And
> it's amazingly linear.
>
> [Jumping out of his chair to draw on the whiteboard.] So if you
> put it on a semi-log graph where on the x-axis there's 10 Web
> sites, 100 Web sites, 1,000 Web sites, 10,000 Web sites, 100,000
> -- and you put percentage of all traffic worldwide on the y-axis.
> (We're using the 500,000 people that use Alexa on a day-to-day
> basis.) If you have 20 percent, 40 percent, 60 percent, 80
> percent, 100 percent -- it's amazingly linear. So the top ten Web
> sites get 20 percent of the traffic on the Net. The top 100 get
> 40 percent. The top 1,000 get 60 percent. The top 10,000 get 80
> percent and then it tails off, because by this way of counting
> what a Web site is, there are about 7 million Web sites. But it's
> almost linear. Just astounding.
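
The figures quoted above fit a straight line on a semi-log plot: each tenfold increase in the number of sites adds about 20 percentage points of traffic. A minimal Python sketch of that reading follows. The function name and the 20-points-per-decade slope are assumptions inferred from the four quoted data points, not Alexa's actual model.

```python
import math

# The four (number of sites, percent of traffic) points quoted in the interview.
QUOTED_POINTS = [(10, 20), (100, 40), (1_000, 60), (10_000, 80)]

def cumulative_share(top_n_sites):
    """Approximate percent of all traffic going to the top N sites.

    Assumed model: 20 percentage points per factor of ten, capped at 100.
    """
    return min(100.0, 20.0 * math.log10(top_n_sites))

for sites, quoted in QUOTED_POINTS:
    model = cumulative_share(sites)
    print(f"top {sites:>6,} sites: model {model:5.1f}%, quoted {quoted}%")
```

The straight line would reach 100 percent at 100,000 sites, so it can only describe the head of the curve. Past the top 10,000 or so the real curve has to flatten into the long tail that runs out to roughly 7 million sites.
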
>
> Now why is that interesting? I think there's three interesting
> things about the graph. First, there's the concentration. Second,
> there's the long tail. And third, that it's flat. Now, why those
> three?  The top ten Web sites -- by controlling 20 percent of
> what everybody on the Net sees -- is an astounding concentration
> of power that we probably haven't seen since the Roman Empire, in
> the sense that this is worldwide -- we have as many people using
> our panel in Japan as Media Metrix has in the United States. So
> worldwide people are looking at 10 Web sites. Those companies
> have an astounding ability to put things in front of people that
> can influence them -- whether they do or not, who knows -- so
> it's not like there's just CBS, NBC, and ABC in the United
> States. These top ten are worldwide. So it's an astounding
> concentration. And, you know, you can extend that group -- maybe
> not just the top ten, but the top hundred is forty percent --
> that's a lot. There's a tremendous concentration.
>
> Then there's the long tail. That's a sign that there are people
> who do make niches out of things -- the top one millionth Web
> site might be the absolute best Brazilian stamp collector site.
> That there are niche players that are still important, that are
> down in the hundred thousand to million ranking of Web sites,
> which is kind of the original dream of the Web -- that, if
> you have something good to say, you'll find your audience, and
> they'll find you.
>
> FEED: That's the story of FEED!
>
> KAHLE: (laughs) Exactly. FEED is an example of something that
> probably couldn't have existed in the land of the print
> distribution nightmare. So the tail is astoundingly long. So
> we're not in a world where you have to be in the top ten or you
> lose completely. The other point is that the line is flat -- now,
> why is that interesting? It means that there's class mobility,
> that if you are the 100,000th Web site there's nothing really
> startling to stop you from becoming the top 100th. You just
> have to be better to more people. So having a flat curve means
> that there's class mobility, I think. And we've studied Web sites
> that have broken into the top 100. And where did they come from?
> There are some portals in foreign countries that are breaking
> into the big time. So -- Italy and Korea have had three portals
> break into the top 100 in the last eight months -- coming from
> pretty far down. So welcome to the Net, Italy. We can see them
> turning on, and we can see when different countries really start
> to rival other countries in terms of their Net penetration.
>
> FEED: One of the things that's always been amazing about Alexa,
> and I think that people are increasingly realizing the power, is
> not just that you're able to see all this information about
> traffic patterns but that information slightly processed is being
> fed back to the users.
>
> KAHLE: It's a big give and take.
>
> FEED: And is there more stuff you'd want to do in that way? In a
> way, that's kind of what cities do: They say, "Look, there's this
> pattern over the last 50 or 100 years of all the artists tending
> to move to this neighborhood." What a city does is make that
> pattern visible to people who are visiting for the first time --
> and that pattern changes their behavior accordingly because they
> either want to be around the artists or they want to avoid them.
> Can the Web do more of that sort of thing?
>
> KAHLE: Oh, yes. Auto-cataloging is the only way to scale. It
> costs forty-five dollars to catalog a book in terms of just
> taking author, title, when it was copyrighted, what subject index
> should it go into. Forty-five dollars! The Web is about twenty
> million different sites in terms of content areas that sort of
> make sense to catalog -- and it's growing at an astounding rate.
> That would mean if you tried to catalog it by hand and tried to
> scale it to the size of the Net, that you'd have to spend almost a
> billion dollars to catalog the Web today. And a year from now,
> you'll have to spend another billion dollars. And it won't be up
> to date. So we needed new techniques. The search engine work that
> was done in the sixties by Gerard Salton and Mike Lesk --
> phenomenal work. It's been doing great. But if we're really going
> to get an idea of what the Net looks like -- when every suburb of
> Denpasar is on the Web and their soccer team schedules are
> on the Net, and you're trying to find where the game is, how are
> you going to find it by typing a few key words? You're not.
> You're going to have to have these tools that go and say: "these
> are the suburbs of Denpasar on our Web sites." It has to be
> automated, otherwise my worst nightmare is it all becomes five
> thousand channels of nothing on the Web.
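
The "almost a billion dollars" figure follows directly from the two numbers quoted above. A small Python sketch of the arithmetic, assuming (as the interview does) that cataloging one site by hand costs the same as cataloging one book; the constant and function names are illustrative only.

```python
COST_PER_ITEM_USD = 45          # quoted cost to catalog one book by hand
SITES_TO_CATALOG = 20_000_000   # quoted number of sites worth cataloging

def manual_catalog_cost(items, cost_per_item=COST_PER_ITEM_USD):
    """Total cost in dollars of cataloging `items` entries by hand."""
    return items * cost_per_item

total = manual_catalog_cost(SITES_TO_CATALOG)
print(f"${total:,}")  # nine hundred million dollars, i.e. "almost a billion"
```
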
>
> FEED: Do you think that Yahoo gets this?
>
> KAHLE: I like Yahoo a lot. I think Yahoo has done a consistently
> great job of packaging the Web experience and making it
> accessible to everybody. And they've stayed pure to the Web in
> many of the aspects that I love about the Web, like directing
> people off-site. It doesn't feel like I'm in a cage when I'm on
> Yahoo. They do inject technology into their world, but they seem
> to be mostly a portal, a medium. And as these technologies
> mature, I'm sure they'll start leveraging it if it's useful to
> people. But they don't seem to quite take advantage of trying to
> trap you in Yahoo world as much as one could imagine.
>
> FEED: What are you doing to support all this? What's the
> infrastructure now that you have?
>
> KAHLE: Most of the machines are actually in this building. All
> the service machines are in a colocation facility. So that those
> are more reliable in the sense that you don't get hit by the
> blackouts that can knock out the couple of connections that we have
> into this building. So the data mining goes on here, but the
> actual service is operated out of the colocation facilities. And
> it's just banks and banks and banks of machines.
>
> We now have about thirty terabytes of archival material that we
> data mine. And that's 1.5 times the size of all of the books in
> the Library of Congress. So we're now at an interesting point,
> we're now beyond the largest collection of information ever
> accumulated by humans. We've gotten somewhere! [laughs] We use as
> our original inspiration the Library of Alexandria. Because they
> were the first people that tried to collect it all. And they
> started to actually understand the intersection between
> completely different self-consistent belief systems. They knew
> what the Egyptians, Romans and Greeks, Hebrews, Hittites,
> Sumerians, Babylonians -- they knew the mythologies, because they
> had it all in one place. And they had the scholars to stare at it
> and try to make the disjunctions conjunctions and start to get an
> idea of what humans are. The dream is that we're in another one
> of those positions. They got up to five hundred thousand books.
> Of course, they were scrolls. The Library of Congress -- the
> largest library now -- is seventeen million. Only thirty-four
> times more than what we had in 300 B.C. It indicates that the
> technology hasn't scaled. But now we've broken through into a new
> technology that allows us to bypass the Library of Congress in
> very little time, and the sky's the limit. What can we discover
> about ourselves as a species? As different peoples? Are we couch
> potatoes or do we actually have independent will? Do we have
> interests that go beyond the fifteen demographics of slotted
> marketing hell? And what we're finding is, people are
> interesting, diverse and peculiar. They are constantly looking
> for new things that are of interest to them.
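
The library comparisons above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the input figures are the ones quoted in the interview, the variable names are made up for illustration, and the per-book size is an implication of those figures (using decimal terabytes), not something stated in the text.

```python
ALEXANDRIA_SCROLLS = 500_000   # quoted size of the Library of Alexandria
LOC_BOOKS = 17_000_000         # quoted size of the Library of Congress
ARCHIVE_TB = 30                # quoted size of the Alexa archive
ARCHIVE_TO_LOC_RATIO = 1.5     # quoted: archive is 1.5 times the LoC's books

# How much bigger the largest library got in roughly 2,300 years.
growth_factor = LOC_BOOKS / ALEXANDRIA_SCROLLS

# Size of the LoC's books implied by the 1.5x claim.
implied_loc_tb = ARCHIVE_TB / ARCHIVE_TO_LOC_RATIO

# Average size per book implied by the two figures (1 TB = 1,000,000 MB).
implied_mb_per_book = implied_loc_tb * 1_000_000 / LOC_BOOKS

print(growth_factor, implied_loc_tb, round(implied_mb_per_book, 2))
```
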
>
> FEED: How do you measure that?
>
> KAHLE: The number of different Web sites people go to. It's the
> long tail, and it's a growing tail -- people don't just find the
> five Web sites in general. People think they do -- if you ask
> people, what are the five Web sites you go to, there'll be five
> of them on that list 'cause that's kind of all you can remember,
> plus or minus two or something. But, in general, people do stray
> around. And especially if we can keep the diversity and the
> quality of the Internet in the public sphere, we can develop a
> really much more interesting culture -- just because there's more
> available for people to build on and grow from.
>
> FEED: Is there anything in the basic architecture of the Web
> that's missing that you wish had been put in?
>
> KAHLE: Oh yeah, absolutely.
>
> FEED: What would be at the top of that list?
>
> KAHLE: A business model -- at a small-scale publisher's site.
> Minitel is the system in France that was absolutely fantastic in
> the early eighties, where they put in all these terminals. They
> were trying to build these terminals into five to six million
> homes in France, and they made it really drop-dead easy to make a
> service. You could basically take an IBM PC, you get this special
> card from Minitel, and you could be a server. There were sixteen
> thousand servers in 1988. And if people went to those servers,
> they got charged, kind of like a 900 number. But the prices often
> can be quite low. The popular sites made money. And when we came
> out with ways that...the Web, it all came out of the wrong
> places. And the people that did have an ability to put in a
> business model didn't extend it to the Web. And a lot of the
> economics had turned into something quite bizarre -- in which the
> advertising world tends to benefit the large-scale publishers.
> And you tend to have a collapse of the number of those publishers
> over time, based on the dynamics of ad sales and the like. But
> the royalty system of books has preserved a diversity of book
> publishing that is unparalleled in magazines, newspapers, video.
>
> FEED: Is it too late to insert that somehow?
>
> KAHLE: No, but it will be difficult. It will have to be seen as
> in the interest of the big people. But then I think you can cause
> another level of renaissance. But with the invention of the book,
> the royalty structure took till 1600. It took a hundred and fifty
> years; you get all these complaints of Voltaire not making money
> or Cervantes dying a pauper even though he published the most
> popular book of his time.
>
> FEED: Generally the venture capitalists don't like it when you
> tell them that it's about a hundred and fifty year cycle that we
> have to go through until the model works.
>
> KAHLE: I hoped we could have learned from it and done things a
> little bit faster. But we screwed up. We didn't make it easy for
> small-scale publishers to get paid. And I think the right place
> to tax is the ISPs. Because the ISPs provide the end-user access
> for a fee. They've got a billing relationship and if we set up
> something like an ASCAP, public clearance center, that would
> allow the distribution of some percentage of what's collected
> from the users back to the content, that makes it such that
> people want to be online. Right now, people are paying all of
> their money to use ISPs but the ISPs don't have to pay for the
> content.
>
> FEED: So how would you get to that? Would you regulate it or
> would you just start blocking, would you organize all the content
> sites and block ISPs that didn't participate?
>
> KAHLE: I don't know. And it's gruesome. The development of ASCAP
> is a union-style story -- there were, you know, windows being
> broken, arms being broken, it was a bad news sort of situation to
> get it going. It's usually a lot easier to do it early on. AOL is
> in probably the best position to start it up. But why should
> anybody be first? If the content is free, then why pay for it?
> In fact, AOL goes the next step of, "Shouldn't they pay us to get
> to our users?" So I don't think it's going to come from them.
>
> FEED: What can you tell me about what you're working on now?
>
> KAHLE: Extending Alexa into the realm of helping people with
> products. So if they're shopping on the Net or doing information
> on a product, so instead of just information about Web sites and
> Web pages, extend it so that it's information about products that
> are on Web sites and on Web pages. Because that's a very sticky
> feature: If we can help save you money or have you make a better
> purchase faster by using our free widget, then people will like
> us. And so we've been spending a lot of time trying to understand
> what are our products and where are they. So we just did a data
> mining pass, we've got this parallel cluster with the archive of
> the whole Net. We looked for all the ISBN numbers all over the Net,
> and there were about 550 million unique pages in the collection
> we were looking over. Those are unique pages. There were about 56
> million instances of an ISBN number, and if 56 million pages had
> some ISBN number, a vast majority of those were either Amazon
> pages or pages that point to Amazon. But there are about 10
> million other ISBN numbers all over the Net. We can help people
> when they're on those pages be able to find whether it's
> available on Amazon, Barnes & Noble -- you know, books anywhere
> -- and compare the prices. So the idea of Alexa is you can
> basically find information about the products on the Web page.
> We're in alpha-test now, and we're going to be launching much
> more actively in October, November, December.
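
The interview does not say how the ISBN pass was implemented. As a rough illustration of what scanning page text for ISBNs can look like, here is a minimal Python sketch that finds 10-digit ISBN candidates with a regular expression and keeps only those that pass the standard ISBN-10 check digit. The function names and the sample page text are hypothetical, and nothing here is claimed to be Alexa's method.

```python
import re

# Nine digits followed by a digit or X, optionally split by hyphens or spaces.
ISBN10_CANDIDATE = re.compile(r"\b(?:\d[- ]?){9}[\dXx]\b")

def is_valid_isbn10(candidate):
    """Return True if the string passes the ISBN-10 check-digit test."""
    chars = re.sub(r"[- ]", "", candidate).upper()
    if not re.fullmatch(r"\d{9}[\dX]", chars):
        return False
    total = 0
    for position, char in enumerate(chars):
        value = 10 if char == "X" else int(char)
        # Weights run from 10 down to 1 across the ten characters.
        total += (10 - position) * value
    return total % 11 == 0

def find_isbn10s(page_text):
    """Return the set of valid ISBN-10s in the text, with separators removed."""
    found = set()
    for match in ISBN10_CANDIDATE.finditer(page_text):
        if is_valid_isbn10(match.group()):
            found.add(re.sub(r"[- ]", "", match.group()).upper())
    return found

sample_page = "Buy it here: ISBN 0-306-40615-2 (hardcover). Order no. 1234567890."
print(find_isbn10s(sample_page))
```

The checksum step matters on a crawl of this size: any ten-digit number, such as the order number in the sample, looks like an ISBN to the regular expression alone, and the check digit filters most of those out.
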
>
> FEED: Are you having fun being part of Amazon?
>
> KAHLE: Love it. They're really good people. At some point you
> have to love your work and you have to love your coworkers. And
> the people at Amazon -- they have this gonzo, go for it, you
> know, "how hard could it be?" attitude. And we love being here in
> the Presidio. A setting helps. If you're going to think big
> thoughts and new thoughts, putting your company in a national
> park in the middle of San Francisco is going to make you think.

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

HURL: fogo mailing list archives, maintained by Gerald Oskoboiny