...Log on to the Chowhound message board for the San Francisco Bay Area and you'll find lengthy threads about where to find, say, the most decadent slice of chocolate cake or the best pajeon (Korean seafood pancakes) in the East Bay. You'll find highly technical analyses of the roasting and brewing methodology of local coffee purveyors.

...Up until fairly recently, one thing you wouldn't find on Chowhound was the kind of star ratings system favored by almost every other restaurant guide, whether in print or on the web — from Frommer's to Zagat to Yelp. On Chowhound, you couldn't give a restaurant any kind of quantitative rating.

In short, it was message boards about food, for and by chowhounds: self-selected folks who liked to go off the beaten track to find something interesting to eat. Specifically, they sought out the places that were unrated, or rated poorly on other sites, hunting for diamonds in the rough and especially unusual items.

It had no ads, no ratings, no shills (thanks to strong moderation), and no membership fees. It bootstrapped as a contribution-financed community site. Eventually it was sold to CNET, which was sold to CBS, which has added ads and ratings in an attempt to capture revenue.

...Jacquilynne Schlesier, the site's community manager, has been helping to moderate Chowhound since the pre-CNET all-volunteer days. "Our users are incredibly passionate and incredibly knowledgeable," Schlesier says. "But it can be a little daunting if you're someone who's not a long-term chowhound." To help make the process less intimidating, they've revamped the site's restaurant listings — individual pages that have all the basic information about a particular restaurant along with links to relevant discussions on the message boards. It's on these pages that the star-rating feature appears.

Generating revenue is a good goal. Most food sites that make money have ratings. Your typical product manager would get this far in their reasoning and implement an industry-standard 5-star rating system. This is apparently what CBS/Chowhound did.

But according to many of the site's devotees, the latest set of changes is particularly "unchowish," in large part because of the star-rating feature. ... Among other criticisms, [the founder of Chowhound] questions how it's possible to "rate a bakery that is horrendous except for one item so great it's worth a 100-mile trip along the same rating scale as a pretty-good diner, an inconsistent high-end sushi place, and an exemplary Italian-ice cart."

This is an excellent point. There is a context mismatch between the discussions (interesting food items) and the rating for a restaurant overall.

Why bother asking Chowhound users for a star rating? It's not like they were clamoring for this feature. This looks like Yelp envy to me. I saw similar lazy product design while at Yahoo! around the time Digg originally exploded in growth - property after property wanted to add "Thumbs up" buttons to everything from the weather to search results. [This was a bad design choice for almost all of them - fortunately, during this me-too frenzy, the legal mess from the posting of the DVD crack key helped most Yahoo! product managers figure out that the Yahoo! audience and Digg's were almost mutually exclusive.]

After spending some time on Chowhound, I've noticed that those participating in the discussions aren't rating much. I couldn't find a restaurant in my area with more than 5 reviews, and five is probably the absolute minimum number of ratings required for an average rating to mean anything. Even then, the overall average is going to be around 4.5 stars: familiar to outside users, sure, but in the end pretty useless as a gauge of quality. And unless CBS is going to buy ratings from someone else, they will never have enough to be useful in a regional search. Bootstrapping 5-star ratings from scratch is a big mistake.

If not 5-star overall ratings, what else?

Clearly the staff needs to find revenue, and advertising is what they've bet the farm on - so increasing the number of users and user engagement is required. They had to do something.

But, given just the things discussed in this post, there are several other reputation-based things they could try instead...

1) Let the active board posters determine the context! If it's Best Pastrami Sandwich or Most Exotic Menu - let them give the awards to the restaurant. The simplest implementation of this is tagging, but allowing users to create award categories makes search-ranking easier.

2) Allow discussions/posts to be tagged as well - both with the name of the places that are discussed as well as the same user-generated topics...

3) Allow users to mark a place as a "favorite," which both increases the popularity of the place and puts that place on their profile. Combined with tagging, this is an advertiser's dream!

4) Implement a karma system for contributors to discussions, increasing the search-rank value of the businesses they discuss, tag, favorite, etc.
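A minimal sketch of how these four mechanisms could fit together. All names and weights here are hypothetical illustrations, not Chowhound's actual design:

```python
from collections import defaultdict

class ChowBoard:
    """Sketch of suggestions 1-4: user-created award tags, favorites,
    and contributor karma feeding a search-rank score."""

    def __init__(self):
        self.awards = defaultdict(set)     # restaurant -> {award tags}
        self.favorites = defaultdict(set)  # restaurant -> {user ids}
        self.karma = defaultdict(int)      # user -> contribution karma

    def give_award(self, user, restaurant, award):
        self.awards[restaurant].add(award)
        self.karma[user] += 1              # contributing raises karma

    def favorite(self, user, restaurant):
        self.favorites[restaurant].add(user)
        self.karma[user] += 1

    def search_rank(self, restaurant):
        # Weight each favorite by the karma of the user who gave it
        # (suggestion 4): trusted contributors move the needle more.
        return sum(1 + self.karma[u] for u in self.favorites[restaurant])
```

The point of the sketch is that none of these inputs asks anyone to compress their opinion into a star count; the context stays with the award or the discussion.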

All of these techniques are discussed in detail in our upcoming O'Reilly/Yahoo! Press book: Building Web Reputation Systems, also available in searchable draft form on our wiki.

The chowhounds are sharing valuable expertise; they deserve better tools than a poor copy of every other restaurant site!

November 11, 2009

5-Star Failure?

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's entry confirms that poorly chosen reputation inputs will indeed yield poor results.

Pity the poor, beleaguered 5-Star rating. Not so very long ago, it was the belle of the online ratings ball: its widespread adoption by high-profile sites like Amazon, Yahoo!, and Netflix influenced a host of imitators, and—at one point—star-ratings were practically an a priori choice for site designers when considering how best to capture their users' opinions. Their no-brainer inclusion had almost reached cargo cult design status.

This has subsided in recent years, as stars have received stiff competition from hot, upstart mechanisms like "Digg-style" voting (what we, when contributing to the Yahoo! Pattern Library, rechristened Vote to Promote) and Facebook's "Like" action (which, ahem, was "inspired by" FriendFeed, though let's not forget that Facebook also flirted, for a time, with Thumbs Up & Down ratings of feed items). Within the past 2 or 3 years, stars' "obvious" appeal as the ratings mechanism of choice has become no longer so obvious.

Even more recently, the 5-Star rating's fall from grace has become almost complete. YouTube fired the first volley, declaring that people on YouTube overwhelmingly give 5 stars to videos on that site. (Readers of this site will recall that we blogged about similar J-curve distributions that are prevalent on Yahoo! as well.)

One of the Web's little secrets is that when consumers write online reviews, they tend to leave positive ratings: The average grade for things online is about 4.3 stars out of five.

And, just like that, as quickly as 'stars are it' rose to prominence, 'stars are dead' is rapidly becoming the accepted wisdom. (Don't believe me? Read the comments when TechCrunch covered the YouTube discovery, and you'll see folks all-but-rushing to prop up a variety of their 'preferred rating mechanism' in stars' place.)

Are stars dead?

This is, of course, the wrong way to frame the question. Stars, thumbs, favorites, or sliders: any of these rating input mechanisms is dead on arrival if it's not carefully considered within its context of use. 5-star ratings require a little more cognitive investment than a simple 'I Like This' statement, so--before designing 5-star ratings into your system--consider the following.

Will it be clear to users what you're asking them to assess? It's not entirely surprising that YouTube's ratings overwhelmingly tend toward the positive. That's a long-observed and well understood phenomenon in the social sciences called Acquiescence Bias. It is "the tendency of a respondent to agree with a statement when in doubt." And 5-star ratings, in the case of YouTube, are nothing but doubt. What, exactly, is a fair and accurate quantitative assessment for a video on YouTube? The input mechanism does provide some clues, in the form of text hints for the various ratings levels (ranging from 'Poor' to 'Awesome!') but these are highly subjective and - themselves - way too open to interpretation.

Is a scale necessary? If the primary decision you're asking users to make is 'good vs. bad' or 'I liked it' or 'I didn't', then are multiple steps of decisioning really adding anything to their evaluation?

Are comparisons being made? Should I, as a user, rate videos in comparison to other similar videos on YouTube? What, exactly, distinguishes a 5-star football-to-the-groin video from a 2-star one? Am I rating against like videos? Or all videos on YouTube? (Or every video I've ever seen!?)

Have they watched the video? One way to encourage more-thoughtful ratings is to place the input mechanism at the proper juncture: make some attempt, at least, to ensure that the user is rating the thing only after having experienced it. YouTube's 5-star mechanism is fixed and always-present, encouraging drive-by ratings, premature ratings or just general sloppiness of assessment.

So, are stars inappropriate for YouTube, at least in the way that they've designed them? Probably, yes.

To wrap up, a quick link. Check out the elegant and innovative design that the folks at Steepster recently rolled out, and think about the ways it cleverly addresses all four of the concerns listed above.

October 28, 2009

Ebay's Merchant Feedback System

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week, we explore, in some depth, one of the Web's longest-running and highest-profile reputation systems. (We also test-drive our new Google Maps-powered zoomable diagrams. Wheee!)

EBay contains the Internet's most well-known and studied user reputation or karma system: seller feedback. Its reputation model, like most others that are several years old, is complex and continuously adapting to new business goals, changing regulations, improved understanding of customer needs, and the never-ending need to combat reputation manipulation through abuse.

Rather than detail the entire feedback karma model here, we'll focus on claims that are from the buyer and about the seller. An important note about eBay feedback is that buyer claims exist in a specific context: a market transaction (a successful bid at auction for an item listed by a seller). This specificity leads to a generally higher-quality karma score for sellers than they would get if anyone could just walk up and rate a seller without even demonstrating that they'd ever done business with them; see Chapter 1, Implicit Reputation.

The scrolling/zooming diagram below shows how buyers influence a seller's karma scores on eBay. Though the specifics are unique to eBay, the pattern is common to many karma systems. For an explanation of the graphical conventions used, see Chapter 2.

We have simplified the model for illustration, specifically by omitting the processing for the requirement that only buyer feedback and Detailed Seller Ratings (DSR) provided over the previous 12 months are considered when calculating the positive feedback ratio, DSR community averages, and–by extension–power seller status. Also, eBay reports user feedback counters for the last month and quarter, which we are omitting here for the sake of clarity. Abuse mitigation features, which are not publicly available, are also excluded.

This diagram illustrates the seller feedback karma reputation model, which is made up of typical model components: two compound buyer input claims (seller feedback and detailed seller ratings) and several roll-ups of the seller's karma: community feedback ratings (a counter), feedback level (a named level), positive feedback percentage (a ratio), and the power seller rating (a label).

The context for the buyer's claims is a transaction identifier: the buyer may not leave any feedback before successfully placing a winning bid on an item listed by the seller in the auction market. Presumably, the feedback primarily describes the quality and delivery of the goods purchased. A buyer may provide two different sets of complex claims, and the limits on each vary:

1. Typically, when a buyer wins an auction, the delivery phase of the transaction starts and the seller is motivated to deliver the goods of the quality advertised in a timely manner. After either a timer expires or the goods have been delivered, the buyer is encouraged to leave feedback on the seller: a compound claim in the form of a three-level rating (positive, neutral, or negative) and a short text-only comment about the seller and/or transaction. The ratings make up the main component of seller feedback karma.

2. Once each week in which a buyer completes a transaction with a seller, the buyer may leave detailed seller ratings, a compound claim of four separate 5-star ratings in these categories: item as described, communications, shipping time, and shipping and handling charges. The only use of these ratings, other than aggregation for community averages, is to qualify the seller as a power seller.

EBay displays an extensive set of karma scores for sellers: the amount of time the seller has been a member of eBay; color-coded stars; percentages that indicate positive feedback; more than a dozen statistics that track past transactions; and lists of testimonial comments from past buyers or sellers. Even this is only a partial list of the seller reputations that eBay puts on display.

The full list of displayed reputations almost serves as a menu of reputation types present in the model. Every process box represents a claim displayed as a public reputation to everyone, so to provide a complete picture of eBay seller reputation, we'll simply detail each output claim separately:

3. The feedback score counts every positive rating given by a buyer as part of seller feedback, a compound claim associated with a single transaction. This number is cumulative for the lifetime of the account, and it generally loses its value over time: buyers tend to notice it only if it has a low value.

It is fairly common for a buyer to change this score, within some time limitations, so this effect must be reversible. Sellers spend a lot of time and effort working to change negative and neutral ratings to positive ratings to gain or to avoid losing a power seller rating.
When this score changes, it is then used to calculate the feedback level.

4. The feedback level claim is a graphical representation (in colored stars) of the feedback score. This process is usually a simple data transformation and normalization process; here we've represented it as a mapping table, illustrating only a small subset of the mappings. This visual system of stars on eBay relies, in part, on the assumption that users will know that a red shooting star is a better rating than a purple star. But we have our doubts about the utility of this representation for buyers. Iconic scores such as these often mean more to their owners, and they might represent only a slight incentive for increasing activity in an environment in which each successful interaction equals cash in your pocket.
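As an illustration of the mapping-table process, here's a sketch in Python. The thresholds and star names below are approximate stand-ins from memory, not eBay's official table:

```python
# Illustrative subset of the feedback-score -> star mapping table.
# Thresholds and colors are approximations; consult eBay's published
# table for the real values.
FEEDBACK_LEVELS = [
    (1_000_000, "silver shooting star"),
    (100_000, "red shooting star"),
    (10_000, "yellow shooting star"),
    (1_000, "red star"),
    (500, "purple star"),
    (100, "turquoise star"),
    (50, "blue star"),
    (10, "yellow star"),
]

def feedback_level(score):
    """Simple data transformation: map a cumulative feedback score
    to its display icon."""
    for threshold, star in FEEDBACK_LEVELS:
        if score >= threshold:
            return star
    return None  # below the lowest threshold: no star displayed
```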

5. The community feedback rating is a compound claim containing the historical counts for each of the three possible seller feedback ratings (positive, neutral, and negative) over the last 12 months, so that the totals can be presented in a table showing the results for the last month, 6 months, and year. Older ratings are decayed continuously, though eBay does not disclose how often this data is updated if new ratings don't arrive. One possibility would be to update the data whenever the seller posts a new item for sale.

The positive and negative ratings are used to calculate the positive feedback percentage.
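The rolling tallies described above can be sketched as follows. This is a simplification: eBay's real decay and update scheduling is not publicly disclosed:

```python
from datetime import datetime, timedelta

def community_feedback_rating(ratings, now=None):
    """Tally (timestamp, rating) pairs -- rating is 'positive',
    'neutral', or 'negative' -- into the month/6-month/year table."""
    now = now or datetime.utcnow()
    windows = {"1 month": 30, "6 months": 182, "12 months": 365}
    table = {name: {"positive": 0, "neutral": 0, "negative": 0}
             for name in windows}
    for when, rating in ratings:
        for name, days in windows.items():
            # A rating counts toward every window it falls inside.
            if now - when <= timedelta(days=days):
                table[name][rating] += 1
    return table
```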

6. The positive feedback percentage claim is calculated by dividing the positive feedback ratings by the sum of the positive and negative feedback ratings over the last 12 months. Note that the neutral ratings are not included in the calculation. This is a recent change reflecting eBay's confidence in the success of updates deployed in the summer of 2008 to prevent bad sellers from using retaliatory ratings against buyers who are unhappy with a transaction (known as tit-for-tat negatives). Initially this calculation included neutral ratings because eBay feared that negative feedback would be transformed into neutral ratings. It was not.

This score is an input into the highly coveted power seller rating. This means that each and every individual positive and negative rating given on eBay is critical: it can mean the difference between a seller acquiring the coveted power seller status, or not.
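The percentage calculation itself is simple; a sketch:

```python
def positive_feedback_percentage(positive, negative):
    """Positive ratings over positive plus negative, as a percentage.
    Neutral ratings are excluded from the calculation entirely."""
    total = positive + negative
    if total == 0:
        return None  # no positive or negative ratings yet
    return 100.0 * positive / total
```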

7. The Detailed Seller Ratings community averages are simple reversible averages for each of the four ratings categories: item as described, communications, shipping time, and shipping and handling charges. There is a limit on how often a buyer may contribute DSRs.

EBay only recently added these categories as a new reputation model because including them as factors in the overall seller feedback ratings diluted the overall quality of seller and buyer feedback. Sellers could end up in disproportionate trouble just because of a bad shipping company or a delivery that took a long time to reach a remote location. Likewise, buyers were bidding low prices only to end up feeling gouged by shipping and handling charges. Fine-grained feedback allows one-off small problems to be averaged out across the DSR community averages instead of being translated into red-star negative scores that poison trust overall. Fine-grained feedback for sellers is also actionable by them and motivates them to improve, since these DSR scores make up half of the power seller rating.

8. The power seller rating, appearing next to the seller's ID, is a prestigious label that signals the highest level of trust. It includes several factors external to this model, but two critical components are the positive feedback percentage, which must be at least 98%, and the DSR community averages, which each must be at least 4.5 stars (around 90% positive). Interestingly, the DSR scores are more flexible than the feedback average, which tilts the rating toward overall evaluation of the transaction rather than the related details.
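A sketch of the two reputation gates just described. The real rating also depends on factors outside this model (sales volume, account standing, and so on), which are omitted here:

```python
def qualifies_as_power_seller(pos_feedback_pct, dsr_averages):
    """Check the two karma-based power seller gates: positive feedback
    percentage of at least 98%, and every DSR community average at
    least 4.5 stars."""
    return (pos_feedback_pct >= 98.0
            and all(avg >= 4.5 for avg in dsr_averages.values()))
```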

Though the context for the buyer's claims is a single transaction or history of transactions, the context for the aggregate reputations that are generated is trust in the eBay marketplace itself. If the buyers can't trust the sellers to deliver against their promises, eBay cannot do business. When considering the roll-ups, we transform the single-transaction claims into trust in the seller, and–by extension–that same trust rolls up into eBay. This chain of trust is so integral and critical to eBay's continued success that they must continuously update the marketplace's interface and reputation systems.

Time Decay in Reputation Systems

Time leeches value from reputation: the section called “First Mover Effects” discussed how simple reputation systems let early contributions be disproportionately valued over time, but there's also the simple problem that ratings become stale over time as their target reputable entities change or become unfashionable: businesses change ownership, technology becomes obsolete, cultural mores shift.

The key insight to dealing with this problem is to remember the expression “What did you do for me this week?” When you're considering how your reputation system will display reputation and use it indirectly to modify the experience of users, remember to account for time value. A common method for compensating for time in reputation values is to apply a decay function: subtract value from the older reputations as time goes on, at a rate that is appropriate to the context. For example, digital camera ratings for resolution should probably lose half their weight every year, whereas restaurant reviews should only lose 10% of their value in the same interval.
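One simple way to realize such a decay function is exponential annual retention; a sketch:

```python
def decayed_weight(original_weight, age_years, annual_retention):
    """Exponential decay of a rating's weight over time. A camera
    rating that should lose half its weight per year uses
    annual_retention=0.5; a restaurant review losing 10% per year
    uses annual_retention=0.9."""
    return original_weight * annual_retention ** age_years
```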

Here are some specific algorithms for decaying a reputation score over time:

Linear Aggregate Decay

Every score in the corpus is decreased by a fixed percentage per unit of time elapsed, whenever it is recalculated. This is high performance, but rarely updated reputations will retain disproportionately high values. To compensate, a timer input can perform the decay process at regular intervals.
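A sketch of this method (the decay rate here is an arbitrary example value):

```python
def linear_decay(score, elapsed_units, rate_per_unit=0.01):
    """Subtract a fixed percentage per unit of elapsed time, applied
    at recalculation time (or by a periodic timer input). The score
    floors at zero rather than going negative."""
    factor = max(0.0, 1.0 - rate_per_unit * elapsed_units)
    return score * factor
```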

Dynamic Decay Recalculation

Every time a score is added to the aggregate, recalculate the value of every contributing score. This method provides a smoother curve, but it tends to become computationally expensive, O(n²), over time.

Window-based Decay Recalculation

The Yahoo! Spammer IP reputation system has used a window-based decay calculation: a fixed-time or fixed-size window of previous contributing claim values is kept with the reputation for dynamic recalculation when needed. New values push old values out of the window, and the aggregate reputation is recalculated from those that remain. This method produces a score from the most recent information available, but the information for low-liquidity aggregates may still be old.
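A fixed-size version of this window can be sketched with a bounded queue:

```python
from collections import deque

class WindowedAverage:
    """Keep only the N most recent claim values; new values push old
    ones out, and the aggregate is recalculated from what remains."""

    def __init__(self, window_size=30):
        self.values = deque(maxlen=window_size)  # old values fall off

    def add(self, value):
        self.values.append(value)
        return self.average()

    def average(self):
        if not self.values:
            return None
        return sum(self.values) / len(self.values)
```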

Time-limited Recalculation

This is the de facto method that most engineers use to present any information in an application: fetch all of the ratings in a time range from the database and compute the score just in time. This is the most costly method, because it involves hitting the database every time an aggregate reputation is considered (say, for a ranked list of hotels), when 99% of the time the value is exactly the same as it was the last time it was calculated. This method also may throw away still contextually valid reputation. We recommend trying some of the higher-performance suggestions above.
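A sketch of the just-in-time approach, where `fetch_ratings` stands in for whatever database query your application uses:

```python
def just_in_time_score(fetch_ratings, since):
    """The de facto approach: pull every rating in the time range from
    the store and average it on each request -- simple, but it hits
    the database even when the answer hasn't changed."""
    ratings = fetch_ratings(since)  # e.g. a SELECT over a timestamp index
    if not ratings:
        return None
    return sum(ratings) / len(ratings)
```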

Consider the following problem with simple averages: it is mathematically unreasonable to compare two similar targets with averages made from significantly different numbers of inputs. For the first target, suppose that there are only three ratings averaging 4.667 stars, which after rounding displays as 5 stars, and you compare that average score to a target with a much greater number of inputs, say 500, averaging 4.4523 stars, which after rounding displays as only 4.5 stars. The second target, the one with the lower average, better reflects the true consensus of the inputs, since there just isn't enough information on the first target to be sure of anything. Most simple-average displays with too few inputs shift the burden of evaluating the reputation to users by displaying the number of inputs alongside the simple average, usually in parentheses, like this: (142).

But pawning off the interpretation of averages on users doesn't help when you're ranking targets on the basis of averages: a lone rating on a brand-new item will put the item at the top of any ranked results it appears in. This effect is inappropriate and should be compensated for.

We need a way to adjust the ranking of an entity based on the quantity of ratings. Ideally, an application performs these calculations on the fly so that no additional storage is required.

We provide the following solution: a high-performance liquidity compensation algorithm to offset variability in very small sample sizes. It's used on Yahoo! sites to which many new targets are added daily, with the result that, often, very few ratings are applied to each one.

This formula produces a curve seen in the figure below. Though a more mathematically continuous curve might seem appropriate, this linear approximation can be done with simple nonrecursive calculations and requires no knowledge of previous individual inputs.

AdjustmentFactor

a = 0.10

This constant is the fractional amount to remove from the score before adding back in effects based on input volume. For many applications, such as 5-star ratings, it should be within the range of integer rounding error: in this example, if the AdjustmentFactor is set much higher than 10%, a lot of 4-star entities will be ranked before 5-star ones. If it's set too much lower, it may not have the desired effect.

LiquidityFloor

f = 10

This constant is the threshold for which we consider the number of inputs required to have a positive effect on the rank. In an ideal environment, this number is between 5 and 10, and our experience with large systems indicates that it should never be set lower than 3. Higher numbers help mitigate abuse and get better representation in consensus of opinion.

LiquidityCeiling

c = 60

This constant is the threshold beyond which additional inputs will not get a weighting bonus. In short, we trust the average to be representative of the optimum score. This number must not be lower than 30, which in statistics is the minimum required for a t-score. Note that the t-score cutoff is 30 for data that is assumed to be unmanipulated (read: random).
We encourage you to consider other values for a, c, and f, especially if you have any data on the characteristics of your sources and their inputs.
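Putting the three constants together, here is one reading of the liquidity compensation algorithm consistent with the descriptions above. This is a hypothetical reconstruction for illustration; the production formula may differ in its details:

```python
def liquidity_compensated_rank(simple_average, num_ratings,
                               a=0.10, f=10, c=60):
    """Remove fraction `a` of the score, then restore it in proportion
    to how far num_ratings has climbed from the floor `f` toward the
    ceiling `c`. A simple, nonrecursive linear approximation that
    needs no knowledge of previous individual inputs."""
    weight = min(1.0, max(0.0, (num_ratings - f) / (c - f)))
    return simple_average * (1.0 - a) + simple_average * a * weight
```

With the defaults, a target with 3 ratings averaging 4.667 ranks below a target with 500 ratings averaging 4.4523, which is exactly the ordering the preceding discussion calls for.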

August 12, 2009

Ratings Bias Effects

Reputation Wednesday is an ongoing series of essays about reputation-related matters. This week's essay is excerpted from Chapter 4: Building Blocks and Reputation Tips. It uses our experience with Yahoo! data to share some thoughts surrounding user ratings bias, and how to overcome it. You may be surprised by our recommendations.

Figure: Some Yahoo! Sites' Ratings Distributions: "One of these things is not like the other. One of these things just doesn't belong."

This figure shows the graphs of 5-star ratings from nine different Yahoo! sites, with all the volume numbers redacted. We don't need them, since we only want to talk about the shapes of the curves.

Eight of these graphs have what is known to reputation system aficionados as J-curves: the far-right point (5 stars) has the very highest count, 4 stars the next highest, and 1 star a little more than the rest. Generally, a J-curve is considered less than ideal for several reasons. The average aggregate scores all clump together between 4.5 and 4.7, so they all display as 4 or 5 stars and are not so useful for visually sorting between options. Also, this sort of curve raises the question: why use a 5-point scale at all? Wouldn't you get the same effect with a simpler thumbs-up/down scale, or maybe even just a super-simple favorite pattern?
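To see the clumping concretely, here's a quick computation over a hypothetical J-curve histogram (the vote counts are invented for illustration):

```python
def average_from_histogram(counts_by_star):
    """counts_by_star maps each star value (1-5) to its vote count."""
    total = sum(counts_by_star.values())
    return sum(star * n for star, n in counts_by_star.items()) / total

# A made-up J-curve: 5-star votes dominate, 4-star is next,
# and 1-star slightly beats 2- and 3-star.
j_curve = {1: 50, 2: 15, 3: 35, 4: 150, 5: 750}
```

Despite 750 of 1,000 votes being 5 stars, the average lands in the familiar 4.5-4.7 clump, illustrating why J-curve averages sort so poorly.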

The outlier among the graphs is Yahoo! Autos Custom (now shut down), where users rated the car-profile pages created by other users: it has a W-curve, with lots of 1-, 3-, and 5-star ratings and a healthy share of 2- and 4-star ratings as well. This is a healthy distribution, and it suggests that a 5-point scale is good for this community.

But why were Autos Custom's ratings so very different from those on Shopping, Local, Movies, and Travel?

The biggest difference is most likely that Autos Custom users were rating each other's content. The other sites had users evaluating static, unchanging, or feed-based content in which they had no vested interest.

In fact, if you look at the curves for Shopping and Local, they are practically identical and have the flattest J-hook, giving the lowest share of 1-stars. This is a direct result of the overwhelming use pattern for those sites: users come to find a great place to eat or a vacuum to buy. They search, and the results with the highest ratings appear first. If a user has experienced one of those objects, they may well also rate it, if it is easy to do so, and will most likely give 5 stars (see the section called “First Mover Effects”). If they see an object that isn't rated but that they like, they may also rate and/or review it, usually giving 5 stars (otherwise, why bother?) so that others may share in their discovery. People don't think that mediocre objects are worth the bother of seeking out and rating on the internet. So the curves are the direct result of the product design intersecting with the users' goals. This pattern (I'm looking for good things, so I'll help others find good things) is a prevalent form of ratings bias. An even stronger example happens when users are asked to rate episodes of TV shows: every episode is rated 4.5 stars, plus or minus 0.5 stars, because only the fans bother to rate the episodes, and no fan is ever going to rate an episode below a 3. Look at any popular running TV show on Yahoo! TV or [another site].

Looking more closely at how Autos Custom ratings worked and how the content was being evaluated showed why 1-stars were given out so often: users were providing feedback to other users in order to get them to change their behavior. Specifically, you would get one star if you 1) didn't upload a picture of your ride, or 2) uploaded a dealer stock photo of your ride. The site is Autos Custom, after all! The 5-star ratings were reserved for the best of the best. Two through four stars were actually used to evaluate the quality and completeness of the car's profile. Unlike on the other sites graphed here, the 5-star scale truly represented a broad range of sentiment, and people worked to improve their scores.

There is one ratings curve not shown here: the U-curve, where 1 and 5 stars are disproportionately selected. Some highly controversial objects on Amazon see this rating curve. Yahoo!'s now-defunct personal music service also saw this kind of curve when introducing new music to established users: 1 star came to mean "never play this song again" and 5 meant "more like this one, please." If you are seeing U-curves, consider that 1) users may be telling you that something other than what you wanted to measure is important, and/or 2) you might need a different rating scale.