Wallenius Bayes

Abstract

This paper introduces a new event model appropriate for classifying (binary) data generated by a “destructive choice” process, such as certain human behavior. In such a process, making a choice removes that choice from future consideration yet does not influence the relative probability of other choices in the choice set. The proposed Wallenius event model is based on a somewhat forgotten non-central hypergeometric distribution introduced by Wallenius (Biased sampling: the non-central hypergeometric probability distribution. Ph.D. thesis, Stanford University, 1963). We discuss its relationship with models of how human choice behavior is generated, highlighting a key (simple) mathematical property. We use this background to describe specifically why traditional multivariate Bernoulli naive Bayes and multinomial naive Bayes are each suboptimal for such data. We then present an implementation of naive Bayes based on the Wallenius event model, and show experimentally that, for data where we would expect the features to be generated via destructive choice behavior, Wallenius Bayes indeed outperforms the traditional versions of naive Bayes for prediction based on these features. We also show that it is competitive with non-naive methods (in particular, support-vector machines). In contrast, Wallenius Bayes underperforms when the data generating process is not based on destructive choice.

Acknowledgements

Thank you very much to Michal Kosinski, David Stillwell and Thore Graepel for sharing the Facebook Likes data set. Thanks to our reviewers for helpful feedback. David thanks the Flemish Research Council (FWO) for financial support (Grant G.0827.12N). Foster thanks NEC and Andre Meyer for Faculty Fellowships. We thank the Moore and Sloan Foundations for their generous support of the Moore-Sloan Data Science Environment at NYU.

Appendix: Event models and choice axiom

The choice axiom is stated in terms of the probabilities of choices. In this section we adopt the notation of Luce (1959): R, S, T, U denote sets of choices, x, y denote individual choices, and P(x, y) denotes the preference probability, indicating how probable it is that x will be chosen over y in a pairwise comparison. A subscript, as in \(P_S(\cdot )\), indicates the preference probability within a set S, and the set probability P(R) indicates the probability that any choice in R would be picked over the complement set \(\overline{R}\). The relationship under investigation is then one that satisfies Axiom 1, presented below.

Axiom 1

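[The Choice Axiom; Luce (1959)] Let T be a finite subset of U such that, for every \(S \subset T\), \(P_S\) is defined.

Case 1: If \(P(x,y) \ne 0, 1\) for all \(x,y \in T\), then for \(R \subset S \subset T\):

$$\begin{aligned} P_T(R) = P_S(R) P_T(S). \end{aligned}$$

Case 2: If \(P(x,y) = 0\) for some \(x,y \in T\), then for every \(S \subset T\):

$$\begin{aligned} P_T(S) = P_{T-\{x\}} (S - \{x\}). \end{aligned}$$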

In what follows, we drop the conditional dependency on class, since it is not needed for our argument and omitting it simplifies the notation. For brevity, we also assume that case 2 never occurs (i.e. there are no elements that will always be picked over other elements with deterministic certainty). This simplifying assumption has negligible influence, and the proof for case 2 is straightforward should it be of concern.
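As an illustration of the product rule in Luce's choice axiom, \(P_T(R) = P_S(R) P_T(S)\) for \(R \subset S \subset T\), the following minimal sketch checks the identity numerically for a choice model in which each choice is picked with probability proportional to a fixed weight. The weights, the choice labels and the helper `set_probability` are hypothetical and serve only this example; they are not taken from the paper.

```python
from fractions import Fraction

def set_probability(weights, subset, choice_set):
    """P_choice_set(subset): probability that a pick from choice_set lands in subset,
    assuming picks are proportional to the (hypothetical) weights."""
    total = sum(weights[c] for c in choice_set)
    return sum(weights[c] for c in subset) / total

# Hypothetical choice weights; exact fractions avoid floating-point noise.
weights = {"a": Fraction(1), "b": Fraction(2), "c": Fraction(3), "d": Fraction(4)}

T = {"a", "b", "c", "d"}   # full choice set
S = {"a", "b", "c"}        # subset of T
R = {"a", "b"}             # subset of S

lhs = set_probability(weights, R, T)                                   # P_T(R)
rhs = set_probability(weights, R, S) * set_probability(weights, S, T)  # P_S(R) * P_T(S)

assert lhs == rhs
print(lhs, rhs)
```

Restricting the choice set from T to S only renormalizes the weights, so the normalizing sum over S cancels in the product on the right-hand side.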

The probability of picking a member of set R over a member in its complement set \(\overline{R}\) in S is easily found as the sum of the individual multinomial probabilities of each element in R in the bag of choices S.

with \(l < k\). We know from Eq. (9) that the probability of picking any choice depends on two counters, the number of non-zeroes for a particular choice (column) and the total number of instances (rows) in the data set, which without prior distribution leads to:

More importantly, the denominator in probabilities involving the subset S, \(P_S(\cdot )\), potentially changes because not all samples belong to the member set under consideration. That is, there exist records for which we have no observed values in S, so the total number of observed instances under consideration (\(n'\)) is smaller than the original number of observations (n):

Note how we changed notation from \(n'\) to n in Eq. (15b) by taking into consideration only those samples for which \(x_{i,j}\) is observed for at least one \(X_j \in S\). The indicator operator selects exactly these samples, since the sum is greater than zero whenever one of the elements is active. In the numerator we can drop this condition, since any term for which \(x_{i,j}\) is zero is cancelled anyway by the multiplication with \(x_{i,j}\); this leads to the final formulation in Eq. (15c).

Just as for the multinomial event model, the first part of the axiom follows by filling in these values. Matters become somewhat more complicated when multiple choices are selected, since the permutations need to be taken into account. These again factor out when multiplied with each other, confirming the axiom.
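The count-based argument above can be checked on a small example. The sketch below is only an illustration under stated assumptions: the binary matrix `rows` is made-up toy data with exactly one active choice per instance (the simple case, without permutations), and the helper `estimate` is a hypothetical name. It estimates \(P_S(R)\) as the number of non-zeroes in the columns of R divided by \(n'\), the number of instances with at least one active column in S.

```python
from fractions import Fraction

# Toy binary data (hypothetical): rows are instances, columns are choices X_0..X_3.
# Assumption: exactly one active choice per instance.
rows = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

def estimate(rows, R, S):
    """Estimate P_S(R) from counts, for R a subset of S (both lists of column indices)."""
    # Indicator: keep only instances with at least one active column in S.
    observed = [row for row in rows if sum(row[j] for j in S) > 0]
    n_prime = len(observed)
    # Numerator: non-zeroes in the columns of R; rows outside S contribute zero anyway.
    hits = sum(row[j] for row in rows for j in R)
    return Fraction(hits, n_prime)

T = [0, 1, 2, 3]   # all choices
S = [0, 1, 2]      # subset of T
R = [0, 1]         # subset of S

p_T_R = estimate(rows, R, T)   # denominator is n
p_S_R = estimate(rows, R, S)   # denominator is n' < n
p_T_S = estimate(rows, S, T)

assert p_T_R == p_S_R * p_T_S
print(p_T_R, p_S_R, p_T_S)
```

Here \(n'\) appears as the denominator of \(P_S(R)\) and as the numerator of \(P_T(S)\), so it cancels in the product, which is why the first part of the axiom holds in this single-choice setting.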