Reputation and Trust
by Gerald Oskoboiny
(this is very drafty; it became a bit unfocused and kind of got away
from me. Last major edit was Sep 2005; have only been adding links since
then)
Reputation and trust in real life
In real life, we trust people we know personally somehow: friends,
friends of friends, and people and businesses they use or recommend.
We should use online relationships/connections to decide who to
trust online as well.
@@ needs work
The (only) solution to spam and phishing
Spam filtering technology has come a long way in the last few years, and
email administrators are now able to block 99.9+% of incoming junk mail
with few false positives. Efforts are underway to improve this further,
by allowing the sender of messages to be authenticated, e.g. using SPF,
Sender-ID, DKIM, or traditional PGP signatures.
But even if we know where mail is coming from, we still need to be able
to decide if a given source is trustworthy or not. Those of us who have
been online for years have various tricks to try to figure that out, but
most people don't know how.
Various phishing attacks like IDN spoofing come and go, but the basic
problem of reputation and trust will always be there -- most users have
no way of knowing that paypal.com is more trustworthy than
paypal-security.com or paypa1.com (paypa one dot com) or pаypal.com
(paypal.com with a unicode character U+0430 (а), "CYRILLIC SMALL LETTER A").
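To make the last spoof concrete, here is a minimal sketch (my own illustration, not something any browser is known to do in this form) that lists the non-ASCII characters in a domain name and guesses which scripts its letters come from. The function names are made up for this example, and the "script" is just the first word of the Unicode character name, which is a rough heuristic.

```python
import unicodedata


def suspicious_characters(domain):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    findings = []
    for ch in domain:
        if ord(ch) > 127:
            name = unicodedata.name(ch, "UNKNOWN")
            findings.append((ch, "U+%04X" % ord(ch), name))
    return findings


def scripts_used(domain):
    """Rough guess at the scripts in use: first word of each letter's Unicode name."""
    scripts = set()
    for ch in domain:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


if __name__ == "__main__":
    for domain in ["paypal.com", "paypa1.com", "p\u0430ypal.com"]:
        print(domain, suspicious_characters(domain), scripts_used(domain))
```

Note that a check like this only catches the mixed-script case; paypa1.com and paypal-security.com are plain ASCII and pass untouched, which is why character tricks alone can't replace reputation data.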
The only way to solve that problem is for browsers and e-mail clients to
display some kind of trust meter whenever they present web pages or email
messages to users.
But... how should they decide which sites are trustworthy?
Some users may be happy with a centralized source of trust information,
but the whole world is never going to agree on a single source of that
info. Even within a certain geographical region or interest group, people
will have wildly differing views on which sources of data are
trustworthy.
Many users might be comfortable trusting Microsoft to tell them which
sites are legitimate or not, and MSN could probably afford to build such
a system just for their users, but I wouldn't want to trust it myself.
Many people currently place a lot of trust in Google because of their
generally good track record of filtering good sites from bad and
their "Don't be evil" policy, but like any public corporation they
will only stay free from evil as long as doing so is profitable. So I
wouldn't want to trust them indefinitely either.
So it is clear that web browsers and email user agents will need to allow
users to select from multiple independent sources of trust data. Ideally
they would allow trust data to be aggregated from a variety of sources,
possibly based on some hierarchy or mathematical model. I may want to
configure my browser to trust Google's notion of a site's reputation by
default, but override Google's data in some cases, for example to
exclude specific search engine spammers or phishing sites as they become
known.
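One very simple "mathematical model" for this would be a weighted average of whatever sources the user subscribes to, with local overrides taking priority. The sketch below assumes each source publishes a score between 0 and 1 per domain; the function name, the weights, and the default score for unknown sites are all invented for illustration.

```python
def trust_score(domain, sources, overrides=None, default=0.5):
    """Combine per-domain scores in [0, 1] from several sources.

    sources:   list of (weight, {domain: score}) pairs chosen by the user
    overrides: {domain: score} set locally; always wins over the sources
    default:   score returned when no source knows the domain (an assumption)
    """
    overrides = overrides or {}
    if domain in overrides:
        return overrides[domain]

    weighted_sum = 0.0
    total_weight = 0.0
    for weight, scores in sources:
        if domain in scores:
            weighted_sum += weight * scores[domain]
            total_weight += weight

    if total_weight == 0:
        return default
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical feeds: one big search engine, one small community list.
    search_engine = {"example.org": 0.9, "spam.example": 0.8}
    community = {"example.org": 0.6}
    sources = [(2.0, search_engine), (1.0, community)]
    my_overrides = {"spam.example": 0.0}  # known search engine spammer

    print(trust_score("example.org", sources, my_overrides))
    print(trust_score("spam.example", sources, my_overrides))
```

A real system would probably want something smarter than an average (e.g. per-topic weights, or decay over time), but the shape is the same: many sources, user-chosen weights, local overrides on top.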
Email administrators already do this -- typically, sites will subscribe
to one or more blacklists of known bad guys (e.g. via DNSBLs), but override
them with a local whitelist of sites they never want blocked. The popular
SpamAssassin mail filtering system calculates spam scores for messages
using hundreds of tests with weighted scores; the most useful of these
have proven to be URI blacklist and email message checksum data published
by various sites. The final decision of which of these sites to trust is
left to the site's email administrator, and may be overridden
by individual users.
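As a rough sketch of the two mechanisms just described: a DNSBL lookup for an IPv4 address is done by reversing the octets and appending the list's zone, and a SpamAssassin-style filter adds up the scores of whichever tests matched. The zone name, test names, weights and threshold below are all made up for illustration; this is not SpamAssassin's actual code or rule set, and no DNS query is actually sent.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return reversed_octets + "." + zone


def spam_score(hits, weights):
    """Sum the weights of the tests that matched; unknown tests count as 0."""
    return sum(weights.get(name, 0.0) for name in hits)


def is_spam(sender_domain, hits, weights, whitelist, threshold=5.0):
    """The local whitelist overrides everything; otherwise compare to threshold."""
    if sender_domain in whitelist:
        return False
    return spam_score(hits, weights) >= threshold


if __name__ == "__main__":
    # Hypothetical zone and test names.
    print(dnsbl_query_name("192.0.2.1", "dnsbl.example"))
    weights = {"URI_BLACKLISTED": 3.5, "CHECKSUM_MATCH": 2.0, "HTML_ONLY": 0.5}
    hits = ["URI_BLACKLISTED", "CHECKSUM_MATCH"]
    print(is_spam("unknown.example", hits, weights, whitelist={"lists.w3.org"}))
    print(is_spam("lists.w3.org", hits, weights, whitelist={"lists.w3.org"}))
```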
When considering how to build a system to determine reputation info for a
given individual or organization, it is useful to keep in mind specific
use cases that we would like to handle.
- Is paypal.com a trustworthy web site?
- gerald@impressive.net is trying to send me email. Do I want to receive
it?
- lists.w3.org is trying to send/relay some mail. Do I want to receive
it?
- A new Internet user comes online and wants to be able to send email to
her friends.
- Is this software really something I should be installing on my
system? (examples: Debian packages, SpamAssassin tests, Exim ACLs, Perl
libraries, Firefox extensions, Greasemonkey scripts, Apache modules)
- I want an applet on my desktop that displays the current temperature in
my city (these exist, but probably rely on fixed screen scraping code
and/or data feeds from specific web sites; should be updated to be more robust)
- When I browse the web, I want pages to be filtered and modified for
usability according to the preferences of people I trust (cf. Greasemonkey)
- I want to read the best 10% of a mailing list, as rated by people I
trust
- Is this hotel recommended by people who have stayed here in the past 3
months?
- Is this restaurant well-reviewed?
- Does this person have a history of paying their rent on time?
- Can I trust this person to sell me something? (It should be possible to
implement ebay's "feedback history" in a decentralized way)
- I'm hitchhiking in a foreign country; can I trust this person to pick
me up?
- Can I trust this hitchhiker?
- Does this politician have a track record of keeping their commitments?
- Is this person trying to enter my country a terrorist?
- Does this person have a good employment history?
- Whuffie-pinging a la Down and Out in the Magic Kingdom
@@ move examples to some section below with more detail?
bootstrap using existing data sources, screen scraping; use our own system
to maintain libraries of screen scraping code
@@
- google pagerank
- alexa site ratings
- amazon reviewer info/ratings
- slashdot
- FOAF
- advogato
- k5
- sourceforge
- ebay
- yahoogroups (could expose data a la "gerald@impressive.net has been a
group admin with 99% positive feedback since 1996")
- livejournal, orkut, friendster, myspace, others
- CPAN module maintainers
- debian package maintainers
- resellerratings
- epinions
- epicurious, recipezaar
- del.icio.us
- blogdex
- technorati authority
- better business bureau
- TRUSTe
- URIBLs, RBLs
- evite (knows about friend networks)
- paypal.com has Reputation Numbers, which include the number of
verified buyers to date, account creation date, and length of time the
person has been a paypal member. (Not sure if this info is publicly
available, or only during specific transactions.)
- O'Reilly and Gates mentioned using MS Outlook's knowledge of social
networks as a source of reputation data
- metafilter publishes counts of 'favorited by others', number of users
linking to people, number of posts, comments, etc. See also: mefi
contribution index
things to do, projects for the enthusiastic:
- Google should implement a negative pagerank, to allow anyone to publish
links that convey negative pagerank to other sites. This would effectively
decentralize the process of dealing with search engine spammers. (rel=nofollow to
assign zero pagerank is a start, but negative pagerank is needed as
well)
- scrape reputation data from existing sources, publish in semweb-happy
and/or DNS queryable formats
- firefox extension to display trust level a la pagerankstatus
- GMail should display a trust level along with email messages (if they
don't, maybe someone else can hack it to do so?)
- hack other email clients, browsers to display trust levels
- once we have some form of reputation data, implement spamassassin rules to
take advantage of it
- run whois on domain names that appear within incoming email (headers and
body), and check the reputation of those domains
- figure out a way to prove that an identity has been around for a long time.
(possibilities: 'Created' info in whois records? google groups searches for
old posts from that email address?) Potential problem: just because an
address has been in use for a long time doesn't mean it's a good guy, or
that it has been in continuous use by that person. (But exceptions should be
rare enough that this info would still be useful?)
- spamassassin rule that checks the pagerank of domain(s) of incoming
email?
- keep track of how much signal vs noise is received by W3C mail hubs from
various IP addresses and/or networks. (and expose this info to the public?)
implementation notes: grep mail hub logs? rejecting spam at SMTP time
provides us with less data to incorporate into our calculations; use
fakerejects instead? or reject but also keep track of interesting data
(relaying IP, envelope sender, visible From:)
- implementations of communities like orkut/friendster based on FOAF
(probably several exist already, but how to do it in a way that gets
widespread use?)
- business cards should have machine-readable id info (email addr, PGP sig);
exchanging cards adds each other to your web of trust; banks can give
people machine-readable cards they can scan to establish a high degree of
trust with their web site(s)
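The signal-vs-noise item in the list above could start out as something as simple as a per-IP tally. This is only a sketch: the class and method names are invented, and how each message gets classified as spam or not (log grepping, fakerejects, etc.) is left out entirely.

```python
from collections import defaultdict


class RelayStats:
    """Tally ham and spam counts per relaying IP address."""

    def __init__(self):
        self.counts = defaultdict(lambda: {"ham": 0, "spam": 0})

    def record(self, relay_ip, is_spam):
        """Count one message from relay_ip as spam or ham."""
        self.counts[relay_ip]["spam" if is_spam else "ham"] += 1

    def signal_ratio(self, relay_ip):
        """Fraction of mail from relay_ip that was ham, or None if none seen."""
        seen = self.counts.get(relay_ip)
        if seen is None:
            return None
        total = seen["ham"] + seen["spam"]
        return seen["ham"] / total if total else None


if __name__ == "__main__":
    stats = RelayStats()
    # Addresses from the documentation range, standing in for real relays.
    for is_spam in (False, False, False, True):
        stats.record("192.0.2.10", is_spam)
    print(stats.signal_ratio("192.0.2.10"))
    print(stats.signal_ratio("192.0.2.99"))
```

The same tally could be keyed on network, envelope sender or visible From: instead of (or in addition to) the relaying IP.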
Anyone who wants to use stock quotes in their apps has to implement
screen-scraping code to grab stock quotes from Yahoo or somewhere;
whenever Yahoo changes their page layout or URIs, thousands of people
have to update their code. For Perl, there are libraries
in CPAN to do this, but such libraries typically aren't updated often
enough and quickly enough to be relied upon. It should be possible for
anyone to publish a stable API for such a service ("current stock
quotes"), and for anyone else to subscribe to this code using RSS and a web
of trust to autoupdate their code. (Dapper sounds like it aims to solve
related problems; haven't looked at it closely)
This general idea can be applied to many other projects besides Perl
libraries:
- spam filters (spamassassin rules)
- greasemonkey scripts
- mailman code updates, new features
- screen scraping code (to extract data from web pages that are not
semweb-happy)
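The "subscribe using RSS and a web of trust" part could be sketched roughly like this: before auto-applying an update from a feed, check whether its publisher is reachable from me within a few hops of "trusts" links. Everything here is hypothetical (the data layout, the names, the hop limit), and it ignores the hard parts such as verifying that the feed item really came from that publisher.

```python
from collections import deque


def trust_distance(web, me, publisher, max_hops=3):
    """Breadth-first search over 'trusts' edges.

    web: {person: [people they trust]}
    Returns the number of hops from me to publisher, or None if the
    publisher is not reachable within max_hops.
    """
    if me == publisher:
        return 0
    seen = {me}
    queue = deque([(me, 0)])
    while queue:
        person, hops = queue.popleft()
        if hops == max_hops:
            continue  # don't look further than max_hops away
        for friend in web.get(person, ()):
            if friend == publisher:
                return hops + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None


def should_autoupdate(web, me, item, max_hops=2):
    """Accept a feed item only if its publisher is close enough in my web."""
    return trust_distance(web, me, item["publisher"], max_hops) is not None


if __name__ == "__main__":
    web = {"me": ["alice"], "alice": ["bob"], "bob": ["carol"]}
    update = {"title": "stock quote scraper, new page layout", "publisher": "bob"}
    print(should_autoupdate(web, "me", update))
    print(should_autoupdate(web, "me", {"title": "x", "publisher": "carol"}))
```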
use rss and web of trust to create/publish/use exim config info.
e.g. if I see an idea like "reject mail whose HELO does not match"
or "reject mail whose subject contains raw 8bits", how do I know
whether to trust that advice?
also use RSS+WoT to create/publish/use spamassassin rules, a la SARE
sa-update provides automatic updates for spamassassin rules.
also use RSS+WoT to distribute/select greasemonkey scripts for various sites
semweb way may be slow, but let's optimize later.
- Almost all of W3C's servers run Debian
GNU/Linux, and auto-upgrade themselves twice a day. We have effectively
delegated the maintenance of our core systems to the Debian organization;
we trust them to decide which specific revisions of thousands of software
packages are the most secure and free of bugs. This may sound dangerous,
but we have been doing it for years (and did the same with Redhat RPMs for
years before that), and the Debian project has an
excellent track record. Basically, we have decided that we trust them to do
this job better than we can, given our limited resources. (or rather, we
choose to spend our resources elsewhere.) It doesn't matter that we don't
know who maintains each software package.
- I have discovered lots of good music using Amazon's "people who like
this artist also like ..." features
- slashdot would be unusable if not for distributed moderation
examples of the wrong way to do things:
- the "locked" icon in web browsers is not very useful, and may actually
have the effect of convincing users that a phishing site is legitimate
just because the connection there happens to be encrypted.
- using many different email addresses for the same person (e.g. in spam
avoidance techniques, or whois registries)
causes karma to be diluted among many different identities;
@@ write this up
- Yahoo News message boards have no real user community/moderation
features; as a result, they are cesspools.
- in general, any community that lets users publish files without
restriction ends up full of spam and porn.
@@ semweb motivation, techniques
other things to cover somewhere:
- incentive for users to care about their reputation?
- incentive for users to care about having stable identities
- incentive for music groups to care about their identities (I want to be
able to configure my agent to notify me whenever certain artists are
nearby) Related: tourb.us, others
- Google should expose their pagerank data as reputation info (related?: a tour of the google blacklist)
- incentive for publishers of reputation data? offer
reputation info as a paid service? Or common phishing targets could get
together to fund a free trust network? Or my bank could license a trust
feed from Google, and in turn license that feed to me as a web service? Or
a govt org could offer that service for its citizens? Or a company or
nonprofit could be set up that aggregates reputation data from other sites
(paying for certain feeds) and licenses the results back to users.
- unresolved issues:
- how to establish trust while being anonymous?
- different axes of trust: just because I trust someone's advice on
software updates doesn't mean I trust them on music or restaurants
- @@ left whuffie and right whuffie
- how to announce to the world that one's identity has been stolen?
how to recover from identity theft?
- privacy issues
- bitzi... which of the available encodings of this movie is the best
quality?
- John McCarthy (of LISP fame) on vote delegation (@@ reference?
Mentioned by DanC once)
- re-read Templeton on ending spam
- openid
- mailing list co-moderation
- PGP key signings
- firefox anti-phishing stuff
this article inspired in part by...
things I haven't read but probably have ideas worth stealing:
various related articles:
existing reputation data/services:
Last modified: $Date: 2011/05/09 21:54:42 $
Gerald Oskoboiny, <gerald@impressive.net>