Ranking Problems

Practical machine learning applications such as information retrieval and recommendation internally solve a ranking problem: they generate and return a ranked list of items. Hivemall provides several ways to tackle such problems.

This page focuses on evaluation of the results from such ranking problems.

Caution

In order to obtain a ranked list of items, this page introduces queries using to_ordered_map(), such as map_values(to_ordered_map(score, itemid, true)). However, this kind of usage has a potential issue: multiple itemid-s (i.e., values) that have exactly the same score (i.e., key) will be aggregated into a single arbitrary itemid, because to_ordered_map() creates a key-value map in which the (possibly duplicated) score is used as the key.

Hence, if a map key can be duplicated across more than one map value, we recommend using to_ordered_list(value, key, '-reverse') instead of map_values(to_ordered_map(key, value, true)). The alternative approach is available from Hivemall v0.5-rc.1 or later.
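For instance, the two approaches can be written as follows, using the dummy_rec table introduced later on this page; only the to_ordered_list() variant is safe when several items share the same score:

```sql
-- Potentially lossy: items sharing the same score collapse into one map entry
SELECT
  userid,
  map_values(to_ordered_map(score, itemid, true)) AS rec_items
FROM dummy_rec
GROUP BY userid
;

-- Recommended: keeps all items, ordered by score in descending order
SELECT
  userid,
  to_ordered_list(itemid, score, '-reverse') AS rec_items
FROM dummy_rec
GROUP BY userid
;
```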

Binary Response Measures

In the context of ranking problems, binary response means that binary labels are assigned to items, and positive items are considered to be the truth observations.

In a dummy_truth table, we assume that there are three users (userid = 1, 2, 3) who have exactly the same three truth items (itemid = 1, 2, 4) chosen from six existing items:

| userid | itemid |
|:---:|:---:|
| 1 | 1 |
| 1 | 2 |
| 1 | 4 |
| 2 | 1 |
| 2 | 2 |
| 2 | 4 |
| 3 | 1 |
| 3 | 2 |
| 3 | 4 |

Additionally, here is a dummy_rec table we obtained as a result of prediction:

| userid | itemid | score |
|:---:|:---:|:---:|
| 1 | 1 | 10.0 |
| 1 | 3 | 8.0 |
| 1 | 2 | 6.0 |
| 1 | 6 | 2.0 |
| 2 | 1 | 10.0 |
| 2 | 3 | 8.0 |
| 2 | 2 | 6.0 |
| 2 | 6 | 2.0 |
| 3 | 1 | 10.0 |
| 3 | 3 | 8.0 |
| 3 | 2 | 6.0 |
| 3 | 6 | 2.0 |
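If you want to run the queries on this page yourself, one possible way to materialize the dummy tables is sketched below; it assumes Hive 0.14 or later, which supports INSERT INTO ... VALUES, so adjust to your environment as needed:

```sql
CREATE TABLE dummy_truth (userid int, itemid int);
INSERT INTO TABLE dummy_truth VALUES
  (1, 1), (1, 2), (1, 4),
  (2, 1), (2, 2), (2, 4),
  (3, 1), (3, 2), (3, 4);

CREATE TABLE dummy_rec (userid int, itemid int, score double);
INSERT INTO TABLE dummy_rec VALUES
  (1, 1, 10.0), (1, 3, 8.0), (1, 2, 6.0), (1, 6, 2.0),
  (2, 1, 10.0), (2, 3, 8.0), (2, 2, 6.0), (2, 6, 2.0),
  (3, 1, 10.0), (3, 3, 8.0), (3, 2, 6.0), (3, 6, 2.0);
```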

How can we compare dummy_rec with dummy_truth to figure out the accuracy of dummy_rec?

To be more precise, suppose we built a recommender system. Let $u \in \mathcal{U}$ be a target user, $\mathcal{I}$ the set of all items, $I_k(u) \subset \mathcal{I}$ the ordered set of top-$k$ recommended items, and $\mathcal{I}^+_u$ the set of truth items. Hence, when we launch top-2 recommendation for the above tables, $\mathcal{U} = \{1, 2, 3\}$, $\mathcal{I} = \{1, 2, 3, 4, 5, 6\}$, $I_2(u) = \{1, 3\}$, which consists of the two highest-scored items, and $\mathcal{I}^+_u = \{1, 2, 4\}$.
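In SQL terms, $I_k(u)$ and $\mathcal{I}^+_u$ can be materialized as per-user arrays. Below is a minimal sketch built on the tables above; the CTE and column names are illustrative:

```sql
WITH rec_list AS (
  -- Ranked item list per user, ordered by score in descending order;
  -- the top-k prefix of rec_items corresponds to I_k(u)
  SELECT
    userid,
    to_ordered_list(itemid, score, '-reverse') AS rec_items
  FROM dummy_rec
  GROUP BY userid
),
truth_list AS (
  -- Truth items per user, i.e., I^+_u
  SELECT
    userid,
    collect_set(itemid) AS truth_items
  FROM dummy_truth
  GROUP BY userid
)
SELECT
  r.userid,
  r.rec_items,
  t.truth_items
FROM rec_list r
JOIN truth_list t ON r.userid = t.userid
;
```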

Here, we introduce six measures for evaluating a ranked list of items. Importantly, each metric has a different concept behind its formulation, and the accuracy measured by these metrics can take different values even for exactly the same input, such as the tables demonstrated above. Thus, evaluation using multiple ranking measures is more convincing, and Hivemall makes such evaluation easy.
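Once the per-user arrays are prepared as in the previous sketch, multiple measures can be computed side by side in a single query. The following is a sketch only: it assumes the per-user arrays are stored in a table or view named per_user (a hypothetical name), and that the array-taking evaluation UDAFs average_precision(), auc(), mrr(), and ndcg() accept (recommended items, truth items) in that order; check the function reference of your Hivemall version for the exact names and signatures:

```sql
-- per_user(userid, rec_items, truth_items) is assumed to be built
-- as in the previous sketch and saved as a table or view.
SELECT
  average_precision(rec_items, truth_items) AS map_score,  -- MAP
  auc(rec_items, truth_items)               AS auc_score,  -- AUC
  mrr(rec_items, truth_items)               AS mrr_score,  -- MRR
  ndcg(rec_items, truth_items)              AS ndcg_score  -- NDCG
FROM per_user
;
```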

Caution

Before Hivemall v0.5-rc.1, recall_at() and precision_at() were registered as recall() and precision(), respectively. However, since precision is a reserved keyword as of Hive v2.2.0, the functions were renamed. If you are still using recall() and/or precision(), we strongly recommend upgrading to the latest version of Hivemall and replacing them with the newer function names.

Mean Average Precision (MAP)

While the original Precision@$k$ provides a score for a fixed-length recommendation list $I_k(u)$, mean average precision (MAP) computes an average of the scores over all recommendation sizes from 1 to $|\mathcal{I}|$. MAP is formulated with an indicator function for $i_n$ (the $n$-th item of $I(u)$) as:
$$
\mathrm{MAP} = \frac{1}{|\mathcal{I}^+_u|} \sum_{n = 1}^{|\mathcal{I}|} \mathrm{Precision@}n \cdot \left[ i_n \in \mathcal{I}^+_u \right].
$$
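As a quick sanity check, let us apply the formula above to the dummy tables. Every user's ranked list is $I(u) = (1, 3, 2, 6)$ with $\mathcal{I}^+_u = \{1, 2, 4\}$, so the indicator is nonzero only at positions 1 and 3; the truth item 4 is never recommended and thus contributes nothing:

$$
\mathrm{MAP} = \frac{1}{3} \left( \mathrm{Precision@}1 + \mathrm{Precision@}3 \right) = \frac{1}{3} \left( \frac{1}{1} + \frac{2}{3} \right) = \frac{5}{9} \approx 0.56.
$$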

Area Under the ROC Curve (AUC)

The ROC curve and the area under the ROC curve (AUC) are generally used for evaluating classification problems, as described before. However, these concepts can also be interpreted in the context of ranking problems.

Basically, the AUC metric for ranking considers all possible pairs of truth and non-truth items, denoted by $i^+ \in \mathcal{I}^+_u$ and $i^- \in \mathcal{I}^-_u$ respectively, and it expects that the best recommender ranks every $i^+$ higher than every $i^-$. A score is finally computed as the portion of correctly ordered pairs $(i^+, i^-)$ among all $|\mathcal{I}^+_u| \times |\mathcal{I}^-_u|$ possible combinations.
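As a worked example based on the dummy tables, $\mathcal{I}^+_u = \{1, 2, 4\}$ and $\mathcal{I}^-_u = \{3, 5, 6\}$, giving $3 \times 3 = 9$ pairs. Assuming, purely for illustration, that the items missing from the recommendation list are placed at its tail so that the complete ranking is $(1, 3, 2, 6, 5, 4)$ (how unranked items are treated depends on the implementation), item 1 is ranked above all of $\{3, 5, 6\}$, item 2 above $\{5, 6\}$ but below 3, and item 4 above none:

$$
\mathrm{AUC} = \frac{3 + 2 + 0}{3 \times 3} = \frac{5}{9} \approx 0.56.
$$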

Mean Reciprocal Rank (MRR)

If we are only interested in the first true positive, mean reciprocal rank (MRR) could be a reasonable choice to quantitatively assess the recommendation lists. For $n_{\mathrm{tp}} \in \left[ 1, |\mathcal{I}| \right]$, the position of the first true positive in $I(u)$, MRR simply returns its inverse:

$$
\mathrm{MRR} = \frac{1}{n_{\mathrm{tp}}}.
$$

MRR can be zero if and only if $\mathcal{I}^+_u$ is empty.

In our dummy tables depicted above, the first true positive sits at the first position of the ranked list of items. Hence, $\mathrm{MRR} = 1/1 = 1$, the best possible result on this metric.

Normalized Discounted Cumulative Gain (NDCG)

Normalized discounted cumulative gain (NDCG) computes a score for $I(u)$ that places emphasis on higher-ranked true positives. In addition to being a more well-formulated measure, the difference between NDCG and MRR is that NDCG allows us to specify an expected ranking within $\mathcal{I}^+_u$; that is, the metric can incorporate $\mathrm{rel}_n$, a relevance score which suggests how likely the $n$-th sample is to be ranked at the top of a recommendation list, and it directly corresponds to an expected ranking of the truth samples.
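For reference, one common textbook formulation of the discounted cumulative gain (DCG) and its normalized variant is sketched below; with binary labels, $\mathrm{rel}_n$ is simply 1 for truth items and 0 otherwise:

$$
\mathrm{DCG} = \sum_{n=1}^{|\mathcal{I}|} \frac{\mathrm{rel}_n}{\log_2 (n + 1)}, \qquad
\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}},
$$

where $\mathrm{IDCG}$ denotes the DCG of the ideal ranking in which all truth items are placed at the top.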