This page in a nutshell: We analyzed the volume of feedback posted via the AFTv5 widget and ran statistical tests to assess the significance of the differences that we found. The results show that: among the feedback form designs, option 1 and option 2 generated the highest proportion of feedback with text; combining the widget with a feedback link significantly increases the volume of comments; AFTv5 outperforms AFTv4 in terms of conversions (feedback per impression); the vast majority of feedback is from anonymous readers, consistent with what we observed in AFTv4.

The present analysis is based on data collected from a random set of articles from the English Wikipedia. AFT5 was initially launched on December 20 on 11,611 articles randomly sampled from enwiki (0.3% of all articles, excluding redirects and disambiguation pages). On January 4 the sample was doubled in size (0.6%), reaching a total of 22,480 articles. A second set of approximately 100 hand-picked articles was selected to study the nature and volume of feedback generated by "anomalous articles", such as high-traffic, trending or controversial articles. The selection of these hand-picked articles was arbitrary, and this sample cannot be used to make any inference about the whole project. The present analysis is therefore limited to the random sample.

We consider different designs, placements and design-specific options as "treatments" applied to the same population of articles. We build series of daily snapshots for each treatment and use statistical tests to assess whether the difference between the means of these series is significant. We use a t-test when comparing differences between two treatments and one-way analysis of variance when comparing jointly the differences of multiple treatments. Pairwise differences for treatments from the ANOVA are then assessed via Tukey's range test. Significance levels in what follows are represented as (*), (**) or (***) for p-values smaller than .05, .01 and .001 respectively. Unless otherwise stated the analyses are based on series spanning the entire observation period from 19 December 2011 to 24 January 2012.
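As a minimal sketch of the notation and the two-treatment comparison described above, the following Python maps a p-value to the star notation and computes a t statistic for two daily series. The function names are hypothetical, and the choice of Welch's unequal-variance form is an assumption: the page does not say which t-test variant was used.

```python
from math import sqrt
from statistics import mean, variance


def significance_stars(p_value):
    """Map a p-value to the (*), (**), (***) notation used on this page."""
    if p_value < 0.001:
        return "(***)"
    if p_value < 0.01:
        return "(**)"
    if p_value < 0.05:
        return "(*)"
    return ""  # not significant at the .05 level


def welch_t(series_a, series_b):
    """Return Welch's t statistic and its approximate degrees of freedom.

    Assumption: Welch's form is used for illustration only; the original
    analysis may have used a different t-test variant.
    """
    n_a, n_b = len(series_a), len(series_b)
    se_a = variance(series_a) / n_a  # squared standard error of mean A
    se_b = variance(series_b) / n_b  # squared standard error of mean B
    t = (mean(series_a) - mean(series_b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df
```

Turning the statistic into a p-value requires the t distribution, which a statistics package would normally provide.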

Over the first 37 days of the pilot (Dec 19 - Jan 24) we collected a total of 9,063 pieces of feedback, 6,034 of which (66%) included text. The plot below represents the volume of feedback collected daily (darker line), along with the portion of feedback including text. Three markers indicate milestones at which major events occurred that affected the volume of feedback collected: the increase of the sample size to 0.6% (A, 4 January 2012), the introduction of a feedback link and the overlay widget (B, 11 January 2012) and the SOPA blackout (C, 18 January 2012).[1]

8,656 pieces of feedback (95.5% of all comments) were submitted by anonymous readers (or editors who were not logged in), consistent with AFT4, which showed a nearly identical proportion of feedback by unregistered users. Among these users, the vast majority (93.6% and 95.6% for the bottom and overlay widget respectively) submitted only one piece of feedback, indicating that multiple comments by the same user are so far very infrequent. At the current stage of the deployment, however, this may be due to the very small size of the article sample on which AFT5 was enabled.

The boxplot below represents the distribution of daily feedback with text by design. We found a significant difference in daily feedback between option 1 and option 3 (**) and between option 2 and option 3 (*), while the difference between option 1 and option 2 was not found to be significant.
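For a joint comparison of three designs, the one-way ANOVA mentioned in the methodology reduces to a ratio of between-group to within-group variance. The sketch below computes that F statistic in plain Python; the function name and the daily counts are hypothetical and are not the study's data.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic and (between, within) degrees of freedom."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical daily counts of feedback with text, for illustration only.
daily = {
    "option 1": [40, 42, 38, 44],
    "option 2": [39, 41, 37, 43],
    "option 3": [20, 22, 18, 24],
}
f_stat, df_b, df_w = one_way_anova_f(list(daily.values()))
```

A significant F only says that at least one design differs; the pairwise statements (option 1 vs option 3, and so on) come from a post-hoc procedure such as Tukey's range test. In practice a library routine such as `scipy.stats.f_oneway` would also return the p-value.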

AFT5 Option 1 asked users to signal whether they "found" or "didn't find" what they were looking for. Text is not mandatory. The plot below represents the volume of feedback with text collected daily and flagged with "found" (blue) and "not found" (red) respectively.

Since January 4, 2012[2] option 1 generated 1,482 and 1,368 posts flagged as "found" and "not found" respectively (617 and 1,050 of which included text). The boxplot below represents the distribution of daily feedback with or without text for the two flags used in option 1. A comparison of all feedback (with or without text) submitted with the "found" or "not found" flag didn't yield any significant difference. Similarly, no significant difference could be found in the volume of feedback with text between the two treatments until the random sample was doubled in size. After doubling the sample, a significant difference (***) was found between "not found" with text and "found" with text.
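The totals reported for option 1 already hint at the difference: the share of posts that include text can be derived per flag. A small sketch using only the counts quoted above (variable names are illustrative):

```python
# Counts reported for option 1 since January 4, 2012.
posts = {"found": 1482, "not found": 1368}
with_text = {"found": 617, "not found": 1050}

# Share of posts under each flag that include text.
text_share = {flag: with_text[flag] / posts[flag] for flag in posts}
for flag, share in text_share.items():
    print(f"{flag}: {share:.1%} of posts include text")
```

Roughly 42% of "found" posts and 77% of "not found" posts carried text. These are shares of totals, not the daily series on which the significance tests were run.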

AFT5 Option 2 asked users to submit feedback by selecting one of four categories: "suggestion", "praise", "problem" or "question". Selecting one category is required to submit feedback and text is mandatory (as no useful feedback is captured by submitting a category alone). The order of the 4 categories was not randomized and the "suggestion" category was pre-selected as a default for all users. The plot below represents the volume of feedback with text collected daily via AFT5 option 2 and filed under "suggestion" (blue), "praise" (green), "question" (orange) and "problem" (red) respectively.

Since January 4, 2012[2] option 2 generated the following volume of feedback with text: "suggestion": 1,018, "praise": 240, "problem": 291, "question": 311. The boxplot below represents the distribution of daily feedback for each of the four categories used in option 2. The use of a fixed default resulted in a significantly higher volume of feedback filed under "suggestion" than under any of the other 3 categories (***). No significant difference was found among the remaining three categories.

AFT5 Option 3 asked users to rate the article on a single 5-star scale. Selecting a rating is not required.

Since January 9, 2012[3] option 3 generated the following volume of feedback with text per rating value: "r5": 193, "r4": 79, "r3": 56, "r2": 41, "r1": 109. The analysis of the breakdown of feedback by rating value indicates that extreme values (r5 and r1) produced the highest proportion of feedback (consistent with what was previously observed in AFT4). The difference in volume of feedback between r5 and each of r4, r3 and r2 is significant (***), while the difference in volume between r1 and r4, r3 or r2 was found to be significant at a slightly lower level (**). No other pair of rating values displayed a significant difference.
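The polarization toward extreme ratings can be read directly from the reported counts. The following sketch computes each rating's share of feedback with text and the combined share of r5 and r1; it uses only the figures above, and the variable names are illustrative.

```python
# Feedback with text per rating value, as reported since January 9, 2012.
with_text = {"r5": 193, "r4": 79, "r3": 56, "r2": 41, "r1": 109}

total = sum(with_text.values())
shares = {rating: n / total for rating, n in with_text.items()}

# Combined share of the two extreme rating values.
extreme_share = shares["r5"] + shares["r1"]
print(f"r5 + r1 account for {extreme_share:.1%} of {total} posts with text")
```

On these totals the two extreme values account for about 63% of feedback with text.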

AFT5 was initially launched using the same default placement as AFT4, i.e. as a relatively positioned widget at the bottom of the article. This condition offers a useful comparison for conversions with AFT4. We collected data using the bottom placement until January 11, 2012, when a new overlay placement was introduced, combined with a fixed-position feedback link at the bottom right corner of the screen.

The introduction of a feedback link that opens an overlay with the AFT widget, in addition to the bottom widget, increased the overall volume of feedback by a factor of 1.7. However, the breakdown of feedback originating from the bottom widget compared to the overlay indicates that the majority of feedback is still being generated via the bottom widget. The overlay widget alone, triggered by the feedback link, generates a smaller amount of feedback. Since January 11, 2012[4], the bottom widget generated 3,295 feedback posts, the overlay widget 2,363 posts and the two widgets combined 5,658 posts. The difference between the bottom and overlay widget series, measured after January 11, 2012, is significant (**) and so is the difference between the combined volume and either of the two placements (***).
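One way to arrive at a figure close to the reported factor is to compare the combined total with the bottom widget's total over the same period. This is a sketch under the assumption that the bottom widget would have produced the same volume without the feedback link; the page does not state how the factor was computed.

```python
# Feedback posts since January 11, 2012, as reported above.
bottom, overlay = 3295, 2363

combined = bottom + overlay
lift = combined / bottom            # combined volume relative to bottom alone
overlay_share = overlay / combined  # fraction of posts coming from the overlay

print(f"combined: {combined}, lift: {lift:.2f}x, overlay share: {overlay_share:.1%}")
```

The overlay contributes roughly 42% of the combined volume, which matches the observation that most feedback still comes from the bottom widget.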

We compared the number of feedback records collected via AFT5 in 5 days (January 12, 2012 - January 16, 2012) with the number of ratings collected via AFT4 during the same period on a random sample of the same size (N=22,480) as the current AFT5 random sample. To measure conversions, we collected pageviews for articles in both samples and compared their distribution.

With the exception of the top 1‰ articles by traffic in both samples (22 articles, which show exceptionally high peaks of traffic), the remaining 999‰ display a remarkably similar distribution in traffic, as shown by the quantile plot below (top 1‰ removed).
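The trimming step can be sketched as follows: sort the per-article pageviews, drop the top 1‰, and compare quantiles of what remains. For a sample of 22,480 articles this removes 22 articles. The function names are hypothetical, and deciles are used here only as a stand-in for the quantile plot.

```python
from statistics import quantiles


def drop_top_permille(pageviews):
    """Return pageviews sorted ascending with the top 1 per mille removed."""
    n_drop = len(pageviews) // 1000  # floor of 0.1% of the sample
    ranked = sorted(pageviews)
    return ranked[: len(ranked) - n_drop]


def trimmed_deciles(pageviews):
    """Return the nine decile cut points of the trimmed distribution."""
    return quantiles(drop_top_permille(pageviews), n=10)
```

Calling `trimmed_deciles` on the pageview lists of both samples and comparing the results side by side gives a rough numerical counterpart to the visual comparison.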

The following plot compares the daily feedback volume for articles in the two samples above.

During this test, AFT5 outperformed AFT4 by generating 3x more conversions on a random sample of articles of equal size, controlled for pageviews. AFT4 was also outperformed by AFT5 option 3 (bottom placement), i.e. the option most similar in design and placement to AFT4, which generated 40% more conversions than the latter.
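Conversions are defined on this page as feedback per impression, so the comparison boils down to a ratio of two rates. The sketch below uses pageviews as the impression count and made-up numbers; neither the function names nor the figures come from the study.

```python
def conversion_rate(feedback_count, impressions):
    """Feedback posts (or ratings) per impression."""
    return feedback_count / impressions


def relative_lift(rate_new, rate_old):
    """How many times larger rate_new is than rate_old."""
    return rate_new / rate_old


# Hypothetical counts for two samples with equal pageviews.
aft4_rate = conversion_rate(100, 1_000_000)
aft5_rate = conversion_rate(300, 1_000_000)
option3_rate = conversion_rate(140, 1_000_000)

print(f"AFT5 vs AFT4: {relative_lift(aft5_rate, aft4_rate):.1f}x")
print(f"Option 3 vs AFT4: {relative_lift(option3_rate, aft4_rate) - 1:.0%} more")
```

With these illustrative inputs the first ratio is 3.0 and the second corresponds to 40% more conversions; this reading of "3x more" as a ratio of 3 is itself an assumption.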

The AFT5 data dashboards hosted on the Toolserver present real-time AFT5 metrics for articles in the random and hand-picked samples, as well as streams of comments generated by different designs.

During this first test phase, we collected data that helped us understand the volume generated by different designs of AFT5 and by a first attempt at using a more prominent placement. A complementary analysis was conducted to study the quality of feedback collected. The main findings of the quantitative analysis of feedback volume are the following:

Design

Of all designs, tested on three populations of randomly selected users, Option 1 ("Did you find what you were looking for") generated the highest volume of feedback.

A significantly higher proportion of comments posted via Option 1 was flagged under the "not found" category, suggesting that users are more likely to submit feedback when they cannot find the information they are looking for.

Option 2 ("Make a suggestion") produced a slightly smaller volume of comments than Option 1, but still significantly larger than Option 3 ("Rate this article"). The lack of randomization on the order of the 4 sub-options and the use of "suggestion" as a default don't allow us to establish the difference in volume between these categories.

Option 3 produced the smallest volume of feedback of all three designs and generated rating data displaying a typical polarization for extreme values.

Placement

The combined effects of the overlay widget and the bottom-positioned widget significantly outperformed the default placement (bottom). However, the feedback link by itself was not prominent enough to significantly increase the volume of feedback when compared with the bottom widget alone: replacing the bottom widget with the overlay widget alone will actually reduce the total volume of feedback.

Usage

More than nine out of ten people who post feedback are anonymous readers or unregistered editors. This is consistent with what we observed in AFT4.

Conversions

An analysis of feedback by page views on a controlled sample of articles showed that AFT5 is converting at a much higher rate (3x more) than AFT4. We also observed a slight but significant increase in volume between AFT4 and the most closely matching AFT5 design and placement (option 3, bottom placement).

↑ On January 3, 2012 the size of the random sample was doubled. This data series starts on the day after this change was made.

↑ Due to a bug in the initial release, feedback submitted with a null value was not stored and the form failed silently. We estimated that this caused a loss on the order of 25% of all comments submitted via option 3. A patch was released on January 9, 2012 to fix this issue; as a result, the analysis is restricted to data collected after this date.

↑ The "optionD" feedback link and the overlay widget were deployed on January 11, 2012.