Adventures in Technology and Data Science

November 05, 2008

Last night, we saw history in the making as Barack Obama won the race for the White House, becoming the first African-American President of the United States after a long and arduous campaign that the NYT calls "near-flawless".

Although consistent hard work, strong organizational skills, and an uncanny ability to stay cool under pressure all played a big part in his win, I suspect that one of the core qualities that brought him so far so fast is his ability to bridge differences and to bring people together, even over heavily divisive issues. [For example, check out these articles from Slate magazine about his work at the Harvard Law Review and his position on abortion.]

Enough politics! Let's get to technology - well, the technology of politics, at any rate. Specifically, let's look at applying the Wisdom of Crowds, with all the attendant risks, to predicting the outcome of the Presidential race.

Professor Sam Wang of Princeton has been tracking WoC data to do just that. Although I don't understand the mathematics of it, his final predictions were very good - check it out!
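I won't attempt to reproduce Wang's actual methodology, but the basic trick behind this style of poll aggregation - take a robust summary (such as the median) of the independent polls in each state, then roll up electoral votes - can be sketched with entirely made-up numbers:

```python
import statistics

# Hypothetical poll margins (Obama minus McCain, in points) per state.
# These numbers are invented purely for illustration.
polls = {
    "OH": [2.0, 4.0, 1.0, 3.0],
    "FL": [1.0, -2.0, 2.0, 0.5],
    "AZ": [-8.0, -6.0, -10.0],
}
electoral_votes = {"OH": 20, "FL": 27, "AZ": 10}

# The median is robust to a single outlier pollster, which is one reason
# aggregators prefer it to the mean.
obama_ev = sum(
    ev for state, ev in electoral_votes.items()
    if statistics.median(polls[state]) > 0
)
print(obama_ev)  # OH and FL lean positive in this toy data, so 47
```

Each poll is an independent noisy estimate, so the aggregate is far more stable than any single pollster - the same Wisdom of Crowds logic discussed below.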

January 20, 2007

Kathy Sierra recently had a fascinating post in the Creating Passionate Users blog: The "Dumbness of Crowds", in which she carefully analyzes the popular notion of The Wisdom of Crowds. Given the technology community's current Web 2.0-startup craze, with its heavy reliance on the concepts of community and crowd-sourcing, this is a very relevant and timely discussion. In her post, Kathy makes a distinction between two related scenarios: on the one hand, aggregating knowledge from a collection of individuals working independently (wisdom), and on the other, a group of people acting together, such as the behavior of a crowd or the consensus decision of a committee (dumbness).

Clearly, there are specific constraints that need to be satisfied in order to ensure that the aggregate results of the crowd actually produce collective wisdom. These constraints can be catalogued and analyzed (see the "Failures of Crowd Intelligence" section in this Wikipedia article); violating any of them will severely degrade the quality of the results.
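The core claim - that aggregating many independent, unbiased guesses beats almost any individual guesser - is easy to demonstrate with a toy simulation, in the spirit of Galton's famous ox-weighing story (all numbers made up):

```python
import random
import statistics

random.seed(42)

TRUE_WEIGHT = 1198  # the quantity the crowd is trying to estimate

# Each participant guesses independently, with unbiased random error.
guesses = [TRUE_WEIGHT + random.gauss(0, 100) for _ in range(800)]

crowd_estimate = statistics.mean(guesses)
typical_individual_error = statistics.mean(abs(g - TRUE_WEIGHT) for g in guesses)

print(f"crowd error: {abs(crowd_estimate - TRUE_WEIGHT):.1f}")
print(f"typical individual error: {typical_individual_error:.1f}")
```

Because the individual errors are independent, they largely cancel in the average; every failure mode listed below breaks that independence (or the diversity behind it) in some way.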

These pitfalls also affect Web 2.0 communities. In order for the aggregated information to represent "collective wisdom", the community must avoid the following common scenarios:

"Everyone agrees"

Does the community provide enough diversity in viewpoints? The value of the outliers cannot be overstated. If the size of the community is simply too small, or the community is too homogeneous, then everyone drinks the same Kool-Aid and there is not enough disagreement.

"Ms. Expert says so!"

If strong players within the community have the ability to influence others' votes, then the positions taken by participants are no longer independent.

"Gaming"

Is the voting fair - does it reduce (or prevent) malicious votes? If participants have the incentive and the ability to subvert the results for their own personal gain, then the collective solution is not going to be very meaningful.

"The rich get richer!"

Are there network effects that affect the outcome? Quite often, the process starts democratically enough, but once a single solution or viewpoint starts to gain relative traction, positive feedback pushes it to overwhelming adoption. In other words, the system exhibits an unstable equilibrium.
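This lock-in dynamic is easy to simulate. The sketch below (a simple Pólya-urn-style model with made-up parameters) compares fully independent voters against voters who mostly copy the crowd:

```python
import random
import statistics

def final_share(n_users, herding, seed):
    """Fraction of users ending up on option A after n_users sequential choices."""
    random.seed(seed)
    counts = [1, 1]  # two equally good options, seeded with one vote each
    for _ in range(n_users):
        if herding and random.random() < 0.95:
            # Positive feedback: copy the crowd in proportion to popularity.
            pick = 0 if random.random() < counts[0] / sum(counts) else 1
        else:
            # Independent choice: ignore the crowd entirely.
            pick = random.randrange(2)
        counts[pick] += 1
    return counts[0] / sum(counts)

herd = [final_share(2000, True, s) for s in range(30)]
indep = [final_share(2000, False, s) for s in range(30)]

# Independent voters hover near 50/50; herding runs get locked into
# whichever option happened to pull ahead early.
print(f"spread with herding:     {statistics.pstdev(herd):.3f}")
print(f"spread when independent: {statistics.pstdev(indep):.3f}")
```

The options are identical in quality, yet under herding the final market share is essentially decided by early random noise - the "rich get richer" failure in miniature.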

"Lack of Participation"

Are users actively participating? This is somewhat different from the earlier point about community size; as a practical matter, for the system to work, it must be in the voter's self-interest to vote and vote fairly, leading to the best results.

"Voting Format"

Is the voting format implicit or explicit? An explicit system requires more work from participants and is also more susceptible to spam or gaming.

Popular Web 2.0 Communities: Search Wisdom or Dumbness?

With these constraints in mind, let us evaluate some of the popular Web 2.0 text search/information findability solutions based on "crowd-sourcing", to see which applies better: Collective Wisdom or Collective Dumbness?

[Note: The solutions examined here are among the leading lights in their respective genres; most of the discussion applies to other similar engines in each space.]

Google:

Google has an incredibly efficient algorithm for harnessing Distributed Collective Intelligence from the global community to improve findability (aka search); this is one of the most successful implementations of this concept.

Properties:
- Data Collection: Implicit
- Summary: Google's approach can be summarized simplistically as: "On any topic, the information that most people refer to is the most important, and is what everyone wants to find"
- Approach: Uses static links as a proxy for user votes
- Gaming: Susceptible to spamming and SEO, with no community check on gaming (by design); thus there is a strong incentive to vote unfairly for marketing advantage
- Targeting: Heavily targeted for undue influence, due to the strong financial motivation for the participants involved
- Network Effects: Very strong; voting starts off democratically for any new topic, but once these effects kick in, it is really hard for new entrants to gain traction, regardless of their quality

Conclusion: Wisdom, but converging towards big-budget marketing output; outliers are progressively less likely to see the light of day
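The "links as votes" idea can be made concrete with a toy PageRank-style computation. This is a hedged sketch - Google's production ranking is vastly more complex - but it shows how each page's "vote" is weighted by its own popularity:

```python
# Toy link graph: each page "votes" for the pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],  # d links out, but nobody links to d
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85  # standard damping factor from the PageRank paper

# Power iteration: repeatedly redistribute each page's rank along its links.
for _ in range(50):
    new = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        for target in outlinks:
            new[target] += damping * rank[page] / len(outlinks)
    rank = new

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page "c", with the most inbound links, ends up on top, while "d", which nobody links to, sinks to the bottom - exactly the "most-referred-to wins" behavior described above, and also exactly why link spam pays.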

Wikipedia:
Wikipedia is also one of the most successful implementations for capturing collective intelligence; the approach itself is not very efficient, since it depends on manual edits, but the collective efforts of a large community largely overcome this limitation.

Properties:
- Data Collection: Explicit
- Simple Summary: "Anyone can edit the information, so that solutions revert to the mean, which is accuracy (since different users make different mistakes)"
- Approach: Relies on direct edits to represent voting, and on volunteer editors for direct control of content
- Gaming / Network Effects: Much less susceptible to spamming and network effects
- But the process is not democratic enough, since editors exercise significant authority; it's hard for outliers to make their way in
- Targeting: Heavily targeted for undue influence
- Strong incentive to vote correctly "for the good of all"

del.icio.us:

Another highly efficient algorithm for capturing Distributed Collective Intelligence; this is arguably the most successful player in the crowded online bookmarking space.

Properties:
- Data Collection: Implicit
- Simple Summary: "Everyone tags and stores their own links, and everyone benefits from the aggregate knowledge that can be extracted"
- Approach: Relies on users implicitly contributing to the creation of a taxonomy and categorization of content
- Gaming / Targeting: Not very susceptible to spamming, nor heavily targeted (except through good copywriting!)
- Network Effects: Some network effects, but the tagging process is completely democratic, and outliers can easily make their way in
- Users vote for their own self-interest, and votes are likely to be very fair, although individual accuracy may vary widely

Conclusion: Wisdom, the true "Wisdom of Crowds"
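The tag-aggregation mechanism is simple enough to sketch: each user tags links purely for their own retrieval, and counting tags across users yields a shared categorization as a side effect. (Hypothetical data, of course.)

```python
from collections import Counter

# Hypothetical bookmarks: (user, url, tags the user chose for themselves).
bookmarks = [
    ("alice", "http://example.com/wisdom", ["crowds", "statistics"]),
    ("bob",   "http://example.com/wisdom", ["crowds", "social"]),
    ("carol", "http://example.com/wisdom", ["crowds", "statistics", "toread"]),
]

# Aggregating the independently chosen tags yields a de facto categorization.
tags = Counter(tag for _, _, user_tags in bookmarks for tag in user_tags)
print(tags.most_common(2))  # [('crowds', 3), ('statistics', 2)]
```

Because every tagger is serving their own self-interest, there is little incentive to game the vocabulary, and idiosyncratic tags (the outliers) coexist harmlessly with the consensus ones.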

Technorati Search:

Properties for Technorati are very similar to those for del.icio.us, except that users search for blog posts rather than web pages.

Properties:
- Data Collection: Implicit
- Simple Summary: "Tags from blog posts bubble up, and are grouped together to form a folksonomy"
- Approach: Relies on users implicitly contributing to content categorization
- Gaming / Targeting: Not particularly susceptible to spamming or manipulation, nor heavily targeted
- Network Effects: There are some network effects, especially the "echo-chamber" effect
- In general, users vote in their own self-interest and votes are reasonably fair

Conclusion: Wisdom

Digg:

Digg uses an interesting approach to find articles/web pages of interest. Its algorithm is based on aggregating the active voting patterns of users for harnessing collective intelligence.

Properties:
- Data Collection: Explicit
- Simple Summary: "Everyone votes on whether a given article is interesting"
- Approach: Relies on users to submit articles and mark them positively or negatively; results are rolled up to find the most interesting articles
- Gaming: Very susceptible to spamming/gaming (there have been many articles written about it); collaborative voting and the reputed "gangs of diggers" undermine the independence of votes
- Targeting: Heavily targeted for undue influence
- Network Effects: Very strong network effects, based on both article and author
- A recent change to the algorithm subverted the democratic principle of "one user, one vote"
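The rollup itself is trivial - which is exactly why it is so easy to game. A minimal sketch (invented usernames and stories) of explicit vote aggregation, followed by a colluding "gang" flipping the result:

```python
from collections import defaultdict

# (user, article, vote) triples; +1 is a digg, -1 a bury.
votes = [
    ("u1", "story-a", +1), ("u2", "story-a", +1), ("u3", "story-a", -1),
    ("u4", "story-b", +1), ("u5", "story-b", +1), ("u6", "story-b", +1),
]

scores = defaultdict(int)
for _user, article, vote in votes:
    scores[article] += vote

ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # story-b (net +3) beats story-a (net +1)

# The weakness described above: a small colluding "gang" of sockpuppet
# accounts is enough to flip the ranking.
for _gang_member in ("g1", "g2", "g3"):
    scores["story-a"] += 1

ranked_after = sorted(scores, key=scores.get, reverse=True)
print(ranked_after)  # now story-a comes out on top
```

With explicit one-user-one-vote scoring, the cost of manipulation is just the cost of creating accounts, so the independence-of-votes assumption is the first casualty.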

January 07, 2007

I recently had the opportunity to discuss Prediction Markets with John T. Maloney of Colabria. The KM and Colabria Clusters® are open, federated action/research networks - you can find more about them here.

Q - So let's say that I'm convinced that my company should set up some PMs to improve prediction accuracy and tease out common (but hidden) knowledge - how do I convince higher-level management of that?

JTM - Well, you need to make a fairly routine business case. This is the heart of the issue, and the main reason for crafting an industry consortium, the PM Cluster. So far, PMs have been interesting research tools for scholars, academics and corporate research scientists. These particular populations are not equipped to make compelling business cases to management or to create 'best practices'; hence the consortium.

Q - How do you quantify the benefits of the Prediction Market?

JTM - I do not assign these sorts of qualitative measures; rather, it is just part of the toolkit. Tools are heavily dependent on context. What is accurate in one setting may collapse in another. This has been the problem since time immemorial -- the flawed focus on tools rather than context. The tricky part is finding the right 'contract' or future. The Web-based tools are simple and easy. You need to 'find the pain', like anything else. Examples are Microsoft and their rather bogus release schedules, Intel and where/when to make a new chip foundry, or HP and how much memory to buy in a month... These are billion-dollar questions.
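To make the idea of trading a 'contract' concrete: one commonly used automated market maker for prediction markets is Hanson's logarithmic market scoring rule (LMSR). The sketch below uses made-up share quantities and a hypothetical two-outcome contract:

```python
import math

def cost(q, b=100.0):
    """LMSR cost function over outstanding shares q, one entry per outcome."""
    return b * math.log(sum(math.exp(qi / b) for qi in q))

def price(q, i, b=100.0):
    """Current market probability of outcome i (prices sum to 1)."""
    denom = sum(math.exp(qi / b) for qi in q)
    return math.exp(q[i] / b) / denom

# Hypothetical contract: "product ships on schedule" vs "date slips".
q = [0.0, 0.0]  # no shares outstanding: market opens at 50/50
print(f"p(on schedule) = {price(q, 0):.2f}")

# A trader who believes the date will slip buys 50 'slip' shares;
# the trader is charged the change in the cost function.
q_new = [0.0, 50.0]
charge = cost(q_new) - cost(q)
print(f"trade cost: {charge:.2f}")
print(f"p(on schedule) now = {price(q_new, 0):.2f}")
```

The market maker always quotes a price, so the market stays liquid even with few participants, and the current prices can be read directly as the crowd's probability estimates - which is what makes these tools usable for questions like the release-schedule and capacity examples above.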

Q - Once you get a prediction market going, how do you maintain/increase participation?

JTM - Focus on intangible benefits, reputation, experience, outcomes.

Q - Could you expand on that?

JTM - Yes, people are motivated by, and support, what they create and value, which isn't always apparent or measurable. They operate in complex social value networks that are hard to see or understand with a conventional mindset... it is why value network analysis is rising so fast. Prediction markets and intangible value are closely linked, but only a few have discovered it yet.

Q - What are the best applications initially, within a company, for a Prediction Market? Sales forecasting? Project planning? Supply chain predictions?