Re: Google's site ranking algorithm

At 02:54 8/9/2000 -0400, Gerald Oskoboiny wrote:
>Google (http://www.google.com/) is by far the best web search
>engine out there: I almost always find whatever I'm looking for
>within a few seconds. Altavista and the others don't even come
>close.

I agree! I love to use google.

>Anyway, I've been noticing from my site's referer logs that
>some of my pages are showing up near the top of all kinds of
>random Google searches; for example, it's the #1 site returned
>for "new york city photos" (or "pics"):
>
>    http://www.google.com/search?q=new+york+city+photos
>    http://www.google.com/search?q=new+york+city+pics

My favorite is bouncing boobs. I once noted that query in my logs, found it
amusing and wrote about it [1]. That entry is now #4 on google.yahoo.com [2]
and my logs are full of bouncing boob queries! The poor suckers who follow
the link!

[1] http://goatee.net/9912#13mo
[2] http://google.yahoo.com/bin/query?p=bouncing+boob&hc=0&hs=0

BTW: google.yahoo.com and google.com give different results, though I only
notice one Google crawler. Also, Google stomps all other search referrers to
my site. In the past four days Google has referred over 130 requests to me;
the other search engines are just noise.

  83: http://google.yahoo.com/bin/query
  41: http://www.google.com/search
   6: http://search.netscape.com/google.tmpl
   4: http://google.yahoo.com/bin/query_uk
   3: http://www.altavista.com/cgi-bin/query
   1: http://google.yahoo.com/bin/query_asia
   1: http://search.dogpile.com/texis/search
   1: http://google.yahoo.com/bin/query_sg
   1: http://www.lycos.com/srch/
   1: http://fantomaster.com/faregister.html
   1: http://www.ask.com/main/metaAnswer.asp
   1: http://google.yahoo.com/bin/query_ca

It usually isn't this dramatic, but it's not that uncommon.
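
A tally like the one above can be produced by dropping the query string
from each referrer and counting what is left. A minimal Python sketch,
assuming the referrer URLs have already been extracted from the server
log (tally_referrers is a made-up name for illustration, not the script
actually used here; the sample URLs are the ones quoted in this thread):

```python
from collections import Counter
from urllib.parse import urlsplit


def tally_referrers(referrers):
    """Count referrers by scheme, host and path, ignoring the query string."""
    counts = Counter()
    for ref in referrers:
        parts = urlsplit(ref)
        counts[f"{parts.scheme}://{parts.netloc}{parts.path}"] += 1
    return counts


# Sample input: referrer URLs that appear earlier in this thread.
sample = [
    "http://google.yahoo.com/bin/query?p=bouncing+boob&hc=0&hs=0",
    "http://www.google.com/search?q=new+york+city+photos",
    "http://www.google.com/search?q=new+york+city+pics",
]

# Print the most frequent referrers first, in the same "count: url" layout.
for url, n in tally_referrers(sample).most_common():
    print(f"{n:4d}: {url}")
```

Grouping on the path as well as the host keeps the regional Yahoo
front ends (query_uk, query_ca, ...) on separate lines, as in the list
above.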

>My theory has been that since the W3C HTML validator has so many
>incoming links (tens of thousands, maybe hundreds of thousands),
>and my personal home page is only about three clicks away from
>there, I benefit from the validator's high ranking.

I assumed the length of google's decay was 1! And I've always been surprised
by how kind google is to goatee.net (sending me folks who sometimes didn't
get what they were looking for, as is obvious from the query string). However,
I have no similar explanation, as I don't think there are any links to
goatee.net from any W3C page or any other mega-traffic site. It's usually
other zine/blog/journal type sites. My only explanation is that it has lots
of words. <grin>

>This puts me in kind of a weird position of power -- I can write a
>page on any obscure topic, and anyone who searches for that word
>using Google (or Yahoo, which uses Google) will find my page.

Hrmm.. mind linking to goatee.net? <grin> I say that jokingly because while
everyone likes a full log, I feel sorry for people who land on my page which
is really the wrong end of a query string.
_______________________    
Regards,          http://reagle.org/joseph/
Joseph Reagle     E0 D5 B2 05 B6 12 DA 65  BE 4D E3 C1 6A 66 25 4E
"Life shrinks or expands in proportion to one's courage." - Anais Nin

Re: Google's site ranking algorithm

ObAllPraiseGoogle; that said...

Joseph Reagle wrote:
>
> At 02:54 8/9/2000 -0400, Gerald Oskoboiny wrote:
[...]
>  >My theory has been that since the W3C HTML validator has so many
>  >incoming links (tens of thousands, maybe hundreds of thousands),
>  >and my personal home page is only about three clicks away from
>  >there, I benefit from the validator's high ranking.
>
> I assumed the length of google's decay was 1!

I think Google uses Kleinberg's algorithm. cf.

J. Kleinberg. Authoritative sources in a hyperlinked environment. In
Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
See http://www.cs.cornell.edu/home/kleinber.

cited from

http://www.w3.org/1998/11/05/WC-workshop/Papers/kleinber1.html


But I don't see any admission of that on google's site.

Aha... searching for "google Kleinberg"
http://www.google.com/search?q=google+Kleinberg&btnG=Google+Search
yields:

" As Kleinberg explains, Google first ranks and
then searches, whereas CLEVER searches and then ranks."

-- SEARCH AND SEARCHABILITY
Originally published in Release 1.0,
January 15, 1999 (24 pages in print)
By Kevin Werbach
http://www.edventure.com/release1/0199text.html

So pagerank is sorta like Kleinberg's algorithm,
but not quite the same.

--
Dan Connolly, W3C http://www.w3.org/People/Connolly/

Re: Google's site ranking algorithm

There is a lot of useful information on the Search Engine Showdown[1]
(well, maybe I'm the only one who didn't know about this site).

 1. http://www.searchengineshowdown.com/

--
Hugo Haas <[email protected]> - http://larve.net/people/hugo/
What kind of side dishes will we be enjoying this evening with our
frozen waffles?

Re: Google's site ranking algorithm

At 09:44 8/9/2000 -0500, Dan Connolly wrote:
>J. Kleinberg. Authoritative sources in a hyperlinked environment. In
>Proceedings of the 9th ACM-SIAM
>     Symposium on Discrete Algorithms., 1998. See
>http://www.cs.cornell.edu/home/kleinber.

I get a 404, though the following works (from MIT):
http://www.acm.org/pubs/citations/journals/jacm/1999-46-5/p604-kleinberg/

_______________________    
Regards,          http://reagle.org/joseph/
Joseph Reagle     E0 D5 B2 05 B6 12 DA 65  BE 4D E3 C1 6A 66 25 4E
"Life shrinks or expands in proportion to one's courage." - Anais Nin

Re: Google's site ranking algorithm

On Wed, Aug 09, 2000 at 09:44:40AM -0500, Dan Connolly wrote:
> Joseph Reagle wrote:
> > At 02:54 8/9/2000 -0400, Gerald Oskoboiny wrote:
> [...]
> >  >My theory has been that since the W3C HTML validator has so many
> >  >incoming links (tens of thousands, maybe hundreds of thousands),
> >  >and my personal home page is only about three clicks away from
> >  >there, I benefit from the validator's high ranking.
> >
> > I assumed the length of google's decay was 1!
>
> I think Google uses Kleinberg's algorithm. cf
>
> J. Kleinberg. Authoritative sources in a hyperlinked environment. In
> Proceedings of the 9th ACM-SIAM
>      Symposium on Discrete Algorithms., 1998. See
> http://www.cs.cornell.edu/home/kleinber.
>
> cited from
>
> http://www.w3.org/1998/11/05/WC-workshop/Papers/kleinber1.html

I didn't get around to following most of these references when we
discussed this last month, but this topic came up on the 'robots'
list [1] again today and I found a few more interesting bits,
especially:

   The Anatomy of a Large-Scale Hypertextual Web Search Engine
   Sergey Brin and Lawrence Page
   Computer Science Department, Stanford University

   http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
   http://www-db.stanford.edu/~backrub/google.html (another copy of same)

> But I don't see any admission of that on google's site.
>
> Aha... searching for "google Kleinberg"
> http://www.google.com/search?q=google+Kleinberg&btnG=Google+Search
> yields:
>
> " As Kleinberg explains, Google first ranks and
> then searches, whereas CLEVER searches and then ranks."
>
> -- SEARCH AND SEARCHABILITY
> Originally published in Release 1.0,
> January 15, 1999 (24 pages in print)
> By Kevin Werbach
> http://www.edventure.com/release1/0199text.html
>
> So pagerank is sorta like Kleinberg's algorithm,
> but not quite the same.

The paper I mentioned above is hit #2 for this query, now that I
actually bother to follow that link.

Details on PageRank [2]:

| 2.1.1 Description of PageRank Calculation
|
| Academic citation literature has been applied to the web, largely
| by counting citations or backlinks to a given page.  This gives
| some approximation of a page's importance or quality. PageRank
| extends this idea by not counting links from all pages equally,
| and by normalizing by the number of links on a page. PageRank is
| defined as follows:
|
|    We assume page A has pages T1...Tn which point to it
|    (i.e., are citations). The parameter d is a damping factor
|    which can be set between 0 and 1. We usually set d to
|    0.85. There are more details about d in the next section.
|    Also C(A) is defined as the number of links going out of
|    page A. The PageRank of a page A is given as follows:
|
|       PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
|
|    Note that the PageRanks form a probability distribution
|    over web pages, so the sum of all web pages' PageRanks
|    will be one.
|
| PageRank or PR(A) can be calculated using a simple iterative
| algorithm, and corresponds to the principal eigenvector of the
| normalized link matrix of the web. Also, a PageRank for 26
| million web pages can be computed in a few hours on a medium size
| workstation. There are many other details which are beyond the
| scope of this paper.
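
The "simple iterative algorithm" in the excerpt above can be sketched in
a few lines of Python. This is only an illustration of the quoted
formula, not Google's implementation: the function name, the toy link
graph, the fixed iteration count, and the choice to let pages with no
outgoing links pass nothing on are all assumptions made here. Taken
literally, the quoted formula gives values that add up to the number of
pages rather than to one (when every page has at least one outgoing
link), so the sketch does not normalize.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # assumption: a page with no links passes nothing on
            share = d * pr[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Hypothetical three-page web, loosely modelled on the validator example.
toy_web = {
    "validator": ["home"],
    "home": ["photos", "validator"],
    "photos": ["home"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{score:.3f}  {page}")
```

In this toy graph "home" comes out on top because both other pages
give it their whole share, while "home" splits its own share in two.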

I still haven't read most of this stuff carefully, I'm just
sending this so I can find it again later.

[1] (web) robots discussion list:
   http://info.webcrawler.com/mailing-lists/robots/info.html
   http://www.mccmedia.com/html/discussion.html

[2] http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm#pr

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

Re: Google's site ranking algorithm

As a follow-up on this, GeekPress had an interesting story[1] about how to
cheat with Google's ranking, linked from Slashdot.

We should publish something about the W3C Google scam. :-)

 1. http://www.geekpress.com/stories/google.shtml

--
Hugo Haas <[email protected]> - http://larve.net/people/hugo/
Ok children, today we're going to learn about a Japanese poem called
haiku. A haiku is just like an American poem, except that it doesn't
rhyme and it's totally stupid. -- Mr. Garrison

Re: Google's site ranking algorithm

On Tue, Oct 31, 2000 at 11:46:57PM -0500, Hugo Haas wrote:
> As a follow-up on this, GeekPress had an interesting story[1]
> about how to cheat with Google's ranking, linked from Slashdot.

I would love to see search engines start putting sites
like this on permanent blacklists, and then publicizing and
sharing their blacklists with other search engine vendors.
(a la RBL for email spam, http://mail-abuse.org/rbl/ )

It would probably take a fair bit of human effort to research
and verify which sites are doing this though.

> We should publish something about the W3C Google scam. :-)

That's not a scam, just a nice side effect of the way their
ranking algorithm works. :)

>   1. http://www.geekpress.com/stories/google.shtml

--
Gerald Oskoboiny <[email protected]>
http://impressive.net/people/gerald/

Re: Google's site ranking algorithm

On Wed, Nov 01, 2000, Gerald Oskoboiny wrote:
> On Tue, Oct 31, 2000 at 11:46:57PM -0500, Hugo Haas wrote:
> > As a follow-up on this, GeekPress had an interesting story[1]
> > about how to cheat with Google's ranking, linked from Slashdot.
>
> I would love to see search engines start putting sites
> like this on permanent blacklists, and then publicizing and
> sharing their blacklists with other search engine vendors.
> (a la RBL for email spam, http://mail-abuse.org/rbl/ )
>
> It would probably take a fair bit of human effort to research
> and verify which sites are doing this though.
>
> > We should publish something about the W3C Google scam. :-)
>
> That's not a scam, just a nice side effect of the way their
> ranking algorithm works. :)
>
> >   1. http://www.geekpress.com/stories/google.shtml
>
> --
> Gerald Oskoboiny <[email protected]>
> http://impressive.net/people/gerald/

--
Hugo Haas <[email protected]> - http://larve.net/people/hugo/
Marge, it takes two to lie. One to lie and one to listen. -- Homer J.
Simpson

Oops (was Re: Google's site ranking algorithm)

This is what happens when you start writing something, then you change
your mind, try to cancel your mail and press 'y' without reading the
question... Sorry.

--
Hugo Haas <[email protected]> - http://larve.net/people/hugo/
I love it when a plan comes together! -- John "Hannibal" Smith

Re: Google's site ranking algorithm

At 23:46 10/31/2000 -0500, Hugo Haas wrote:
>As a follow-up on this, GeekPress had an interesting story[1] about how to
>cheat with Google's ranking, linked from Slashdot.
>
>We should publish something about the W3C Google scam. :-)

I'm nonplussed by Diana's taste in women (Liv Tyler and Heather Locklear;
it's all about Brittany out there! <grin>) and by her concern with this
problem. I agree with Google. I've long recognized one could do this (and I
would actually use a couple of domains, since the easiest way for Google to
address this is to deprecate the value of links from a site to its own
domain (for which W3C must be the most dense site in the world)).

If the algorithms work well for 99.9% of the queries, I doubt I'd be
inclined to change them for this <0.1% screw case unless it improved all
searches.
__
Regards,          http://www.mit.edu/~reagle/
Joseph Reagle     E0 D5 B2 05 B6 12 DA 65  BE 4D E3 C1 6A 66 25 4E
MIT LCS Research Engineer at the World Wide Web Consortium.

* This email is from an independent academic account and is
not necessarily representative of my affiliations.

HURL: fogo mailing list archives, maintained by Gerald Oskoboiny