Stagewise learning for noisy k-ary preferences

Abstract

The aggregation of k-ary preferences is a novel ranking problem that plays an important role in several aspects of daily life, such as ordinal peer grading, online image rating, meta-search and online product recommendation. Meanwhile, crowdsourcing is increasingly emerging as a way to provide a plethora of k-ary preferences for these types of ranking problems, owing to the convenience of crowdsourcing platforms and their low cost. However, preferences from crowd workers are often noisy, which inevitably degrades the reliability of conventional aggregation models. In addition, traditional inference usually incurs massive computational cost, which limits the scalability of aggregation models. To address both of these challenges, we propose a reliable CrowdsOUrced Plackett–LucE (COUPLE) model combined with an efficient Bayesian learning technique. To ensure reliability, we introduce an uncertainty vector for each crowd worker in COUPLE, which recovers the ground truth of the noisy preferences with a certain probability. Furthermore, we propose an Online Generalized Bayesian Moment Matching (OnlineGBMM) algorithm, which ensures that COUPLE is scalable to large-scale datasets. Comprehensive experiments on four large-scale synthetic datasets and three real-world datasets show that COUPLE with OnlineGBMM achieves substantial improvements in reliability and noisy-worker detection over other well-known approaches.
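For context, it may help to recall the classical Plackett–Luce likelihood that COUPLE builds on; the statement below is the textbook form, not the paper's noisy-worker extension, whose exact parameterization (including the per-worker uncertainty vectors) is defined in the main text. For a k-ary preference over objects \(O_{\sigma(1)} \succ O_{\sigma(2)} \succ \cdots \succ O_{\sigma(k)}\),

\[ P\big(O_{\sigma(1)} \succ O_{\sigma(2)} \succ \cdots \succ O_{\sigma(k)}\big) \;=\; \prod_{i=1}^{k} \frac{v_{\sigma(i)}}{\sum_{j=i}^{k} v_{\sigma(j)}}, \]

where \(v_i > 0\) denotes the latent utility of object \(O_i\): the object ranked first is chosen from all k objects, the second from the remaining \(k-1\), and so on.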

Statistics for the PeerGrading datasets

Table 8 summarizes some of the key statistics of the Poster (PO) and Final Report (FR) datasets. More information can be found in Raman and Joachims (2014).

Table 8
Statistics for the Poster (PO) and Final Report (FR) datasets from the classroom experiment

Dataset | Objects (L) | Workers (W) | TAs
PO      | 42          | 148         | 7
FR      | 44          | 153         | 9

Table 9
Number of k-ary preferences in the PO and FR datasets

Dataset           | 2-ary | 3-ary | 4-ary | 5-ary | 6-ary | 7-ary | Total number
Poster (PO)       | 165   | 348   | 309   | 85    | 4     | 1     | 912
Final report (FR) | 153   | 146   | 17    | 0     | 0     | 0     | 316

Data preprocessing

All grading is done on a 10-point (cardinal) Likert scale, where 10 denotes “perfect”, 8 “good”, 5 “borderline”, 3 “deficient” and 1 “unsatisfactory”. For instance, \(\rho_w: O_5\,(9.0)\ \tilde{>}\ O_3\,(7.0)\ \tilde{>}\ O_2\,(1.0)\) denotes a preference \(\rho_w\) annotated by crowd worker w, with the assigned grades in parentheses. However, students are not trained graders, so they may grade on different scales; in particular, students from different universities may have different preconceptions of what constitutes a score of 8. Therefore, we only use the ordinal information provided by each grader. In addition, as ties are not handled by COUPLE and CrowdBT, preferences containing ties (equal scores) are split into independent preferences. Specifically, the preference \(\rho_w: O_5\ \tilde{>}\ O_3\ \tilde{=}\ O_4\ \tilde{>}\ O_2\) is split into the two preferences \(\rho_{1,w}: O_5\ \tilde{>}\ O_3\ \tilde{>}\ O_2\) and \(\rho_{2,w}: O_5\ \tilde{>}\ O_4\ \tilde{>}\ O_2\). Detailed information about the two post-processed datasets appears in Table 9.
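To make the preprocessing concrete, the following is a minimal sketch (our illustration, not the authors' released code) of the two steps above: discarding cardinal scores in favor of ordered tie groups, and expanding each tie into independent strict preferences by taking one object per tie group, which reproduces the paper's example. The function names are hypothetical.

```python
from itertools import product

def scores_to_tie_groups(grading):
    """Keep only ordinal information: map cardinal grades such as
    {'O5': 9.0, 'O3': 7.0, 'O4': 7.0, 'O2': 1.0} to tie groups ordered
    best-to-worst, e.g. [['O5'], ['O3', 'O4'], ['O2']]."""
    by_score = {}
    for obj, score in grading.items():
        by_score.setdefault(score, []).append(obj)
    return [sorted(by_score[s]) for s in sorted(by_score, reverse=True)]

def split_ties(groups):
    """Expand a preference with ties into independent strict preferences,
    one object per tie group (Cartesian product): O5 > {O3 = O4} > O2
    yields O5 > O3 > O2 and O5 > O4 > O2, matching the example above."""
    return [list(choice) for choice in product(*groups)]

grading = {'O5': 9.0, 'O3': 7.0, 'O4': 7.0, 'O2': 1.0}
print(split_ties(scores_to_tie_groups(grading)))
# [['O5', 'O3', 'O2'], ['O5', 'O4', 'O2']]
```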

The Poster (PO) dataset contains 912 distinct preferences, most of which are 3-ary or 4-ary. The Final Report (FR) dataset is much smaller (316 preferences), and almost \(95\%\) of its preferences (299 of 316) are 2-ary or 3-ary. Overall, PO and FR are consistent with the definition of crowdsourced k-ary preferences.