Reputation and Trust

by Gerald Oskoboiny

(this is very drafty; it became a bit unfocused and kind of got away from me. Last major edit was Sep 2005; have only been adding links since then)

Reputation and trust in real life

In real life, we trust people we know personally somehow: friends, friends of friends, and people and businesses they use or recommend.

We should use online relationships/connections to decide who to trust online as well.

@@ needs work

The (only) solution to spam and phishing

Spam filtering technology has come a long way in the last few years, and email administrators are now able to block 99.9+% of incoming junk mail with few false positives. Efforts are underway to improve this further, by allowing the sender of messages to be authenticated, e.g. using SPF, Sender-ID, DKIM, or traditional PGP signatures.

But even if we know where mail is coming from, we still need to be able to decide if a given source is trustworthy or not. Those of us who have been online for years have various tricks to try to figure that out, but most people don't know how.

Various phishing attacks like IDN spoofing come and go, but the basic problem of reputation and trust will always be there -- most users have no way of knowing that paypal.com is more trustworthy than paypal-security.com or paypa1.com (paypa one dot com) or pаypal.com (paypal.com spelled with the Unicode character U+0430 (а), "CYRILLIC SMALL LETTER A").
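The look-alike character case can be sketched in a few lines of Python, using only the standard unicodedata module. The function names are made up for illustration, and the "ASCII and non-ASCII letters mixed in one label" rule is an assumed heuristic, not an established standard.

```python
import unicodedata

def suspicious_characters(domain):
    """Return (character, Unicode name) pairs for each non-ASCII character."""
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in domain
        if ord(ch) > 127
    ]

def looks_spoofed(domain):
    """Crude heuristic: a label that mixes ASCII letters with non-ASCII characters."""
    for label in domain.split("."):
        has_ascii_letter = any(ch.isascii() and ch.isalpha() for ch in label)
        has_non_ascii = any(not ch.isascii() for ch in label)
        if has_ascii_letter and has_non_ascii:
            return True
    return False

real = "paypal.com"
fake = "p\u0430ypal.com"  # second letter is U+0430, not the Latin "a"

print(real == fake)                 # the two strings only look the same
print(suspicious_characters(fake))
print(looks_spoofed(real), looks_spoofed(fake))
```

A check like this also flags some legitimate internationalized names, and it does nothing about paypal-security.com or paypa1.com, so it cannot stand in for reputation data.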

The only way to solve that problem is for browsers and e-mail clients to display some kind of trust meter whenever they present web pages or email messages to users.

But... how should they decide which sites are trustworthy?

Decentralized trust

Some users may be happy with a centralized source of trust information, but the whole world is never going to agree on a single source of that info. Even within a certain geographical region or interest group, people will have wildly differing views on which sources of data are trustworthy.

Many users might be comfortable trusting Microsoft to tell them which sites are legitimate or not, and MSN could probably afford to build such a system just for their users, but I wouldn't want to trust it myself.

Many people currently place a lot of trust in Google because of their generally good track record of filtering good sites from bad and their "Don't be evil" policy, but like any public corporation they will only stay free from evil as long as doing so is profitable. So I wouldn't want to trust them indefinitely either.

So it is clear that web browsers and email user agents will need to allow users to select from multiple independent sources of trust data. Ideally they would allow trust data to be aggregated from a variety of sources, possibly based on some hierarchy or mathematical model. I may want to configure my browser to trust Google's notion of a site's reputation by default, but override Google's data in some cases, for example to exclude specific search engine spammers or phishing sites as they become known.
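One possible shape for that kind of aggregation is a weighted average across sources, with local overrides taking priority. This is only a sketch under assumptions: the source names, the 0.0 to 1.0 score scale, and the weighting scheme are all hypothetical, and a real user agent might use a very different model.

```python
def combined_trust(site, sources, weights, overrides=None):
    """Weighted average of the scores that sources report for a site.

    sources:   {source_name: {site: score between 0.0 and 1.0}}
    weights:   {source_name: how much this user trusts that source}
    overrides: {site: score} set locally; wins over every source
    Returns None when no weighted source has an opinion.
    """
    if overrides and site in overrides:
        return overrides[site]
    total = 0.0
    weight_sum = 0.0
    for name, scores in sources.items():
        weight = weights.get(name, 0)
        if site in scores and weight > 0:
            total += weight * scores[site]
            weight_sum += weight
    if weight_sum == 0:
        return None
    return total / weight_sum

# Hypothetical data: two feeds plus one local override.
sources = {
    "search_engine": {"shop.example": 0.75, "spam.example": 0.5},
    "friends": {"shop.example": 0.25},
}
weights = {"search_engine": 1.0, "friends": 3.0}
overrides = {"spam.example": 0.0}

print(combined_trust("shop.example", sources, weights, overrides))
print(combined_trust("spam.example", sources, weights, overrides))
print(combined_trust("unknown.example", sources, weights, overrides))
```

Returning None for unknown sites keeps "no data" distinct from "known bad", so a trust meter could show the two differently.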

Email administrators already do this -- typically, sites will subscribe to one or more blacklists of known bad guys (e.g. via DNSBLs), but override that with a local whitelist of sites they never want blocked. The popular SpamAssassin mail filtering system calculates spam scores for messages using hundreds of tests with weighted scores; the most useful of these have proven to be URI blacklist and email message checksum data published by various sites. The final decision of which of these sites to trust is left to the site's email administrator, and may be overridden by individual users.
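The blacklist-plus-local-whitelist pattern can be sketched as follows. A DNSBL is queried by reversing the octets of an IPv4 address and appending the list's zone; an address is listed if that name resolves. The zone dnsbl.example and the function names here are hypothetical, and the lookup function is passed in so the logic can be tried without touching the network.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets, then the list's zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def resolves(name):
    """True if the name has an A record, which is how a DNSBL says "listed"."""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False

def should_block(ip, zones, whitelist, lookup=resolves):
    """The local whitelist wins; otherwise block if any subscribed list has the address."""
    if ip in whitelist:
        return False
    return any(lookup(dnsbl_query_name(ip, zone)) for zone in zones)

# Offline demonstration with a fake lookup table instead of real DNS.
fake_listed = {"1.2.0.192.dnsbl.example"}
zones = ["dnsbl.example"]

print(dnsbl_query_name("192.0.2.1", "dnsbl.example"))
print(should_block("192.0.2.1", zones, set(), lookup=fake_listed.__contains__))
print(should_block("192.0.2.1", zones, {"192.0.2.1"}, lookup=fake_listed.__contains__))
```

SpamAssassin itself combines many weighted tests rather than making a single yes/no decision; this sketch only covers the subscribe-and-override part.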

Use Cases

When considering how to build a system to determine reputation info for a given individual or organization, it is useful to keep in mind specific use cases that we would like to handle.

Short term, deployable immediately

Is paypal.com a trustworthy web site?
gerald@impressive.net is trying to send me email. Do I want to receive it?
lists.w3.org is trying to send/relay some mail. Do I want to receive it?
A new Internet user comes online and wants to be able to send email to her friends.

Medium term, deployable with a bit of new infrastructure

Is this software really something I should be installing on my system? (examples: Debian packages, SpamAssassin tests, Exim ACLs, Perl libraries, Firefox extensions, Greasemonkey scripts, Apache modules)
I want an applet on my desktop that displays the current temperature in my city (these exist, but probably rely on fixed screen scraping code and/or data feeds from specific web sites; should be updated to be more robust)
When I browse the web, I want pages to be filtered and modified for usability according to the preferences of people I trust (cf. Greasemonkey)
I want to read the best 10% of a mailing list, as rated by people I trust

Long term

Is this hotel recommended by people who have stayed here in the past 3 months?
Is this restaurant well-reviewed?
Does this person have a history of paying their rent on time?
Can I trust this person to sell me something? (It should be possible to implement ebay's "feedback history" in a decentralized way)
I'm hitchhiking in a foreign country; can I trust this person to pick me up?
Can I trust this hitchhiker?
Does this politician have a track record of keeping their commitments?
Is this person trying to enter my country a terrorist?
Does this person have a good employment history?
Whuffie-pinging a la Down and Out in the Magic Kingdom

@@ move examples to some section below with more detail?

Abuse cases to keep in mind

Shouldn't be possible for bad guys to affect someone's reputation in a negative way, e.g. by sending out a bunch of spam with someone else's URI in the body.
weev's claim of gaming amazon's rep system leading to #amazonfail etc.
Schneier on related issues (bad actors framing others)
don't allow people to exploit loopholes (e.g. claim to be Anonymous on mefi to reap its 'favorited by others' karma)
@@ others

How to get there from here

bootstrap using existing data sources, screen scraping; use our own system to maintain libraries of screen scraping code

Existing sources of reputation information

google pagerank
alexa site ratings
amazon reviewer info/ratings
slashdot
FOAF
advogato
k5
sourceforge
ebay
yahoogroups (could expose data a la "gerald@impressive.net has been a group admin with 99% positive feedback since 1996")
livejournal, orkut, friendster, myspace, others
CPAN module maintainers
debian package maintainers
resellerratings
epinions
epicurious, recipezaar
del.icio.us
blogdex
technorati authority
better business bureau
TRUSTe
URIBLs, RBLs
evite (knows about friend networks)
paypal.com has Reputation Numbers, which include the number of verified buyers to date, account creation date, and length of time the person has been a paypal member. (Not sure if this info is publicly available, or only during specific transactions.)
O'Reilly and Gates mentioned using MS Outlook's knowledge of social networks as a source of reputation data
metafilter publishes counts of 'favorited by others', number of users linking to people, number of posts, comments, etc. See also: mefi contribution index

Next steps

things to do, projects for the enthusiastic:

Google should implement a negative pagerank, to allow anyone to publish links that convey negative pagerank to other sites. This would effectively decentralize the process of dealing with search engine spammers. (rel=nofollow to assign zero pagerank is a start, but negative pagerank is needed as well)
scrape reputation data from existing sources, publish in semweb-happy and/or DNS queryable formats
firefox extension to display trust level a la pagerankstatus
GMail should display a trust level along with email messages (if they don't, maybe someone else can hack it to do so?)
hack other email clients, browsers to display trust levels
once we have some form of reputation data, implement spamassassin rules to take advantage of it
run whois on domain names that appear within incoming email (headers and body), check the reputation of domains that appear within
figure out a way to prove that an identity has been around for a long time. (possibilities: 'Created' info in whois records? google groups searches for old posts from that email address?) Potential problem: just because an address has been in use for a long time doesn't mean it's a good guy, or that it has been in continuous use by that person. But exceptions should be rare enough that this info would still be useful?
spamassassin rule that checks the pagerank of domain(s) of incoming email?
keep track of how much signal vs noise is received by W3C mail hubs from various IP addresses and/or networks. (and expose this info to the public?) implementation notes: grep mail hub logs? rejecting spam at SMTP time provides us with less data to incorporate into our calculations; use fakerejects instead? or reject but also keep track of interesting data (relaying IP, envelope sender, visible From:)
implementations of communities like orkut/friendster based on FOAF (probably several exist already, but how to do it in a way that gets widespread use?)
business cards should have machine-readable id info (email addr, PGP sig); exchanging cards adds each other to your web of trust; banks can give people machine-readable cards they can scan to establish a high degree of trust with their web site(s)
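As a rough illustration of the item above about checking domains that appear in incoming email, the sketch below pulls domain names out of a message's headers and body and scores the message by its least-trusted domain. It uses Python's standard email and re modules; the regular expression is deliberately simple, the reputation table is invented, and the whois step is left out entirely.

```python
import re
from email import message_from_string

# Simplistic pattern: dot-separated labels ending in an alphabetic TLD.
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)

def domains_in_message(raw_message):
    """Collect every domain-like string from the headers and a plain-text body."""
    msg = message_from_string(raw_message)
    parts = [str(value) for value in msg.values()]
    payload = msg.get_payload()
    if isinstance(payload, str):  # multipart bodies are skipped in this sketch
        parts.append(payload)
    return {d.lower() for part in parts for d in DOMAIN_RE.findall(part)}

def least_trusted(domains, reputation, default=0.5):
    """Score a message by the worst reputation among the domains it mentions."""
    if not domains:
        return default
    return min(reputation.get(d, default) for d in domains)

raw = (
    "From: Alice <alice@good.example>\n"
    "Subject: hello\n"
    "\n"
    "See http://shop.bad.example/offer for details.\n"
)
reputation = {"good.example": 0.9, "shop.bad.example": 0.1}  # hypothetical data

found = domains_in_message(raw)
print(sorted(found))
print(least_trusted(found, reputation))
```

A SpamAssassin rule built on this idea would need a real reputation source behind the lookup, which is the hard part.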

software libraries, configuration updates

Anyone who wants to use stock quotes in their apps has to implement screen-scraping code to grab stock quotes from Yahoo or somewhere; whenever Yahoo changes their page layout or URIs, thousands of people have to update their code. For Perl, there are libraries in CPAN to do this, but such libraries typically aren't updated often enough and quickly enough to be relied upon. It should be possible for anyone to publish a stable API for such a service ("current stock quotes"), and for anyone else to subscribe to this code using RSS and a web of trust to autoupdate their code. (Dapper sounds like it aims to solve related problems; haven't looked at it closely)

This general idea can be applied to many other projects besides Perl libraries:

spam filters (spamassassin rules)
greasemonkey scripts
mailman code updates, new features
screen scraping code (to extract data from web pages that are not semweb-happy)

Almost all of W3C's servers run Debian GNU/Linux, and auto-upgrade themselves twice a day. We have effectively delegated the maintenance of our core systems to the Debian organization; we trust them to decide which specific revisions of thousands of software packages are the most secure and free of bugs. This may sound dangerous, but we have been doing it for years (and did the same with Redhat RPMs for years before that), and the Debian project has an excellent track record. Basically, we have decided that we trust them to do this job better than we can, given our limited resources. (or rather, we choose to spend our resources elsewhere.) It doesn't matter that we don't know who maintains each software package.
I have discovered lots of good music using Amazon's "people who like this artist also like ..." features
slashdot would be unusable if not for distributed moderation
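One way the auto-update idea above might work: a client accepts a new version of some shared code only if enough publishers it already trusts vouch for that exact version. The sketch below identifies a version by its SHA-256 digest; the publisher names, the endorsement format, and the threshold of two are all assumptions, and a real system would also have to authenticate the endorsements themselves (e.g. with PGP signatures), which this does not do.

```python
import hashlib

def digest(code):
    """Identify a specific revision of some code by its SHA-256 hex digest."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()

def accept_update(code, endorsements, trusted, required=2):
    """Accept only if at least `required` trusted publishers vouch for this revision.

    endorsements: {publisher: set of digests that publisher vouches for}
    trusted:      set of publishers this user has chosen to trust
    """
    d = digest(code)
    vouchers = {
        publisher
        for publisher, digests in endorsements.items()
        if publisher in trusted and d in digests
    }
    return len(vouchers) >= required

new_code = "def current_quote(symbol): ..."
endorsements = {
    "alice": {digest(new_code)},
    "bob": {digest(new_code)},
    "mallory": {digest(new_code)},
}
trusted = {"alice", "bob"}

print(accept_update(new_code, endorsements, trusted))
print(accept_update(new_code + "  # tampered", endorsements, trusted))
```

Endorsements from publishers outside the trusted set are ignored, so an attacker cannot push an update just by vouching for it many times.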

Bad ideas

examples of the wrong way to do things:

the "locked" icon in web browsers is not very useful, and may actually have the effect of convincing users that a phishing site is legitimate just because the connection there happens to be encrypted.
using many different email addresses for the same person (e.g. in spam avoidance techniques, or whois registries) causes karma to be diluted among many different identities; @@ write this up
Yahoo News message boards have no real user community/moderation features; as a result, they are cesspools.
in general, any community that lets users publish files without restriction ends up full of spam and porn.

Todo

@@ semweb motivation, techniques

other things to cover somewhere:

incentive for users to care about their reputation?
incentive for users to care about having stable identities
incentive for music groups to care about their identities (I want to be able to configure my agent to notify me whenever certain artists are nearby) Related: tourb.us, others
Google should expose their pagerank data as reputation info (related?: a tour of the google blacklist)
incentive for publishers of reputation data? offer reputation info as a paid service? Or common phishing targets could get together to fund a free trust network? Or my bank could license a trust feed from Google, and in turn license that feed to me as a web service? Or a govt org could offer that service for its citizens? Or a company or nonprofit could be set up that aggregates reputation data from other sites (paying for certain feeds) and licenses the results back to users.
unresolved issues:
- how to establish trust while being anonymous?
- different axes of trust: just because I trust someone's advice on software updates doesn't mean I trust them on music or restaurants
- @@ left whuffie and right whuffie
- how to announce to the world that one's identity has been stolen? how to recover from identity theft?
- privacy issues
bitzi... which of the available encodings of this movie is the best quality?
John McCarthy (of LISP fame) on vote delegation (@@ reference? Mentioned by DanC once)
re-read Templeton on ending spam
openid
mailing list co-moderation
PGP key signings
firefox anti-phishing stuff

References

this article inspired in part by...

old stuff in FoRK archives
meng's reputation stuff, AGUPI
PGP web of trust
google pagerank
Paul Ford's writings on the Semantic Web
sandro's single signon (@@ public version?)
Cringely's I'm With Stupid: How Having Friends Might Be the Key to Both Privacy and Identity
Salon: You are who you know
k5's mojo (and successors?)
conversations with other W3Cers, esp. DanC, Danbri, Sandro, TimBL

things I haven't read but probably have ideas worth stealing:

raph's thesis on trust
brin's transparent society
Trust and Reputation in Web-based Social Networks
Outfoxed
Netscape's Trust Rating System (has problems)

various related articles:

FOAF Plus OpenID
Wired article on Greasemonkey
Yahoo's DomainKeys docs mentions building and sharing reputation profiles for email senders
Anonymity and Online Community: Identity Matters discusses registered users and reputation
Firefox 2 To Have Anti-Phishing Technology
Crowdsourcing harnesses the wisdom of crowds
this review of Google Co-Op mentions using Google Co-Op to filter web spam sites much along the lines of what I wrote above
Wired UK: Kevin Kelly on the new socialism looks at online collaborative sites reshaping society and the way things are made
Aardvark lets people ask friends and friends of friends questions on various topics; leveraging facebook connections, seems to have good integration with twitter.

existing reputation data/services:

Karmasphere from Meng Wong (seems discontinued as of Nov 2009)
Rapleaf aggregates reputation info from a variety of sources
Launchpad tracks karma for various contributions, see e.g. Mark Shuttleworth's
Stack Overflow seems to have an interesting karma system where having more reputation points gains more power in the community (wonder how well it works)
OpenDNS warns its users about phishing sites
SiteAdvisor looks excellent (Feb 2006)
Google to use TrustRank for News, Possibly More (April 2005)
CipherTrust's TrustedSource
Bonded Sender
Habeas RepCheck
Verisign Trust Mark (was in the news in Nov 2003; current status?)
SpoofStick browser extension to help detect spoofed URIs
petname is a browser extension that lets people save reminder notes about a relationship with a site, keyed to the certificate
ohloh tracks stats about open source projects and developers
not really a general service, but yelp's elite squad is interesting: people who make lots of positive contributions to this online community get elevated status (with perks like free vip access to a kings of leon concert, lucky bums)
WOT firefox add-on

Last modified: $Date: 2011/05/09 21:54:42 $
Gerald Oskoboiny, <gerald@impressive.net>
