On Wed, Aug 09, 2000 at 09:44:40AM -0500, Dan Connolly wrote:
> Joseph Reagle wrote:
> > At 02:54 8/9/2000 -0400, Gerald Oskoboiny wrote:
> [...]
> > >My theory has been that since the W3C HTML validator has so many
> > >incoming links (tens of thousands, maybe hundreds of thousands),
> > >and my personal home page is only about three clicks away from
> > >there, I benefit from the validator's high ranking.
> >
> > I assumed the length of google's decay was 1!
>
> I think Google uses Kleinberg's algorithm. cf
>
> J. Kleinberg. Authoritative sources in a hyperlinked environment.
> In Proceedings of the 9th ACM-SIAM Symposium on Discrete
> Algorithms, 1998. See http://www.cs.cornell.edu/home/kleinber.
>
> cited from
>
> http://www.w3.org/1998/11/05/WC-workshop/Papers/kleinber1.html
I didn't get around to following most of these references when we
discussed this last month, but this topic came up on the 'robots'
list [1] again today and I found a few more interesting bits,
especially:
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
Computer Science Department, Stanford University
http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
http://www-db.stanford.edu/~backrub/google.html (another copy of same)
> But I don't see any admission of that on google's site.
>
> Aha... searching for "google Kleinberg"
> http://www.google.com/search?q=google+Kleinberg&btnG=Google+Search
> yields:
>
> " As Kleinberg explains, Google first ranks and
> then searches, whereas CLEVER searches and then ranks."
>
> -- SEARCH AND SEARCHABILITY
> Originally published in Release 1.0,
> January 15, 1999 (24 pages in print)
> By Kevin Werbach
> http://www.edventure.com/release1/0199text.html
>
> So pagerank is sorta like Kleinberg's algorithm,
> but not quite the same.
The paper I mentioned above is hit #2 for this query, now that I
actually bother to follow that link.
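For contrast with the PageRank formula below, here's a rough sketch of Kleinberg's hub/authority iteration (the HITS algorithm behind CLEVER), written from my reading of the paper. Toy code, not anything Google or IBM actually runs; the graph representation is my own choice:

```python
def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores pointing at it.
        auth = {p: sum(hub[q] for q in links if p in links[q])
                for p in pages}
        # A page's hub score is the sum of the authorities it points to.
        hub = {p: sum(auth[t] for t in links.get(p, ()))
               for p in pages}
        # Normalize each vector so the scores converge instead of growing.
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth
```

The key difference from PageRank: HITS keeps two mutually reinforcing scores per page and is typically run on a query-specific subgraph ("searches and then ranks"), whereas PageRank assigns one query-independent score to the whole web ("ranks and then searches").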
Details on PageRank [2]:
| 2.1.1 Description of PageRank Calculation
|
| Academic citation literature has been applied to the web, largely
| by counting citations or backlinks to a given page. This gives
| some approximation of a page's importance or quality. PageRank
| extends this idea by not counting links from all pages equally,
| and by normalizing by the number of links on a page. PageRank is
| defined as follows:
|
| We assume page A has pages T1...Tn which point to it
| (i.e., are citations). The parameter d is a damping factor
| which can be set between 0 and 1. We usually set d to
| 0.85. There are more details about d in the next section.
| Also C(A) is defined as the number of links going out of
| page A. The PageRank of a page A is given as follows:
|
| PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
|
| Note that the PageRanks form a probability distribution
| over web pages, so the sum of all web pages' PageRanks
| will be one.
|
| PageRank or PR(A) can be calculated using a simple iterative
| algorithm, and corresponds to the principal eigenvector of the
| normalized link matrix of the web. Also, a PageRank for 26
| million web pages can be computed in a few hours on a medium size
| workstation. There are many other details which are beyond the
| scope of this paper.
I still haven't read most of this stuff carefully; I'm just
sending this so I can find it again later.
[1] (web) robots discussion list:
http://info.webcrawler.com/mailing-lists/robots/info.html
http://www.mccmedia.com/html/discussion.html
[2]
http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm#pr
--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/