What hasn't been well described in the past is just how simple the method really is. In most of the applications above, it is important to find useful features in large amounts of data. The method at the heart of all of these algorithms is to use a score to analyze counts of events, particularly counts of when events occur together. The counts that you usually have in these situations are the number of times two events have occurred together, the number of times that they have occurred with or without each other and the number of times anything has occurred.

To compute the score, it is handy to restate these counts slightly as the number of times the events occurred together (let's call this k_11), the number of times each has occurred without the other (let's call these k_12 and k_21) and the number of times something has been observed that was neither of these events (let's call this k_22). In this notation, I am using row or column 1 to be the event of interest and row or column 2 to be everything else.

Here is the table we need:

| | Event A | Everything but A |
|---|---|---|
| Event B | A and B together (k_11) | B, but not A (k_12) |
| Everything but B | A without B (k_21) | Neither A nor B (k_22) |

Once you have these counts, the hard part is over. Computing the log-likelihood ratio score (also known as G^2) is very simple:

LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))

where H is Shannon's entropy written without its usual leading minus sign, computed as the sum of (k_ij / sum(k)) log (k_ij / sum(k)). In R, this function is defined as

H = function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0))))}

The trick of adding k==0 handles the case where some element of k is zero. Since we are multiplying by that same zero, we can drop those terms, but it helps not to try to compute log(0). The Java or C versions of this (ask me for a copy) are almost as simple, but not quite as pretty.
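For readers who do not use R, the same computation can be sketched in Python. This is only an illustration of the formula above, not the Java or C version mentioned in the post; the function names `h` and `llr` and the example counts are my own choices.

```python
import math

def h(counts):
    """Sum of p * log(p) over the counts (the post's H, with no leading minus sign)."""
    total = sum(counts)
    # Zero counts contribute nothing, so skip them instead of evaluating log(0).
    return sum(c / total * math.log(c / total) for c in counts if c > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) score for a 2x2 table of counts."""
    n = k11 + k12 + k21 + k22
    cells = [k11, k12, k21, k22]
    row_sums = [k11 + k12, k21 + k22]
    col_sums = [k11 + k21, k12 + k22]
    return 2.0 * n * (h(cells) - h(row_sums) - h(col_sums))

# Rows are exactly proportional, so the two events look independent.
independent = llr(10, 90, 90, 810)
# The two events always occur together, so the score is large.
associated = llr(100, 0, 0, 100)
print(independent, associated)
```

The first score comes out at zero up to floating-point round-off, and the second equals 400 log 2, which is about 277.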

There are several approaches. Probably the simplest is a multi-pass approach where you first find good bigrams, then re-interpret those as single tokens and find interesting bigrams where one component is itself composite. You can obviously repeat that any number of times.

The other approach is to build a log-likelihood ratio test based on an order-2 Markov model versus an order-1 Markov model. The Markov models in question predict a word based on the preceding two words or the preceding one word of context. In my dissertation, I do a complete derivation of this, but I now think that it would be easier to use the mutual information trick to see how much information gain there is for using the larger context.

I can send you a copy of my dissertation if you like. My gmail address is ted.dunning.

Hi, I tried your log-likelihood score in Mahout and in practice it works very well for many recommendation tasks. Still, I would like to discuss an irregular case that can occur. In the case k1=9, k2=3000, n1=3000, n2=1000000 (p1=p2=p=0.003), twoLogLambda converges to zero. Around that point, the score increases as k1 (the cross-viewed count) decreases. This behavior does not seem good, and it might be caused by the model used in the test formula. Is there an existing solution or discussion about this problem?

In that case the score should be zero since the frequencies are identical. Any deviations will cause a positive score.

If you want the score to go above or below zero according to whether the value of k1 is above or below the expected value relative to k2 then you can use the variant of the score that I call the signed root log likelihood ratio. The basic idea is that you take the square root and apply a sign according to whether k1 is above or below the expected value. Mahout has an implementation of this.

The theory is actually fairly simple. A random variable distributed as $\chi^2$ will have a square root that is normally distributed except for sign. The restoration of the sign here is symmetrical only in the asymptotic case, but for practical applications that doesn't matter.
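The case from the question can be checked numerically. The sketch below is my own Python illustration, not Mahout's code: the helper `table` (hypothetical) turns the question's k1, n1, k2, n2 into a 2x2 table, and `signed_root_llr` applies the sign by comparing k11 with its expected value under independence. The clamp to zero is an assumption I added to guard against tiny negative round-off before taking the square root.

```python
import math

def h(counts):
    """Sum of p * log(p) over the non-zero counts."""
    total = sum(counts)
    return sum(c / total * math.log(c / total) for c in counts if c > 0)

def llr(k11, k12, k21, k22):
    n = k11 + k12 + k21 + k22
    score = 2.0 * n * (h([k11, k12, k21, k22])
                       - h([k11 + k12, k21 + k22])
                       - h([k11 + k21, k12 + k22]))
    return max(0.0, score)  # clamp tiny negative round-off

def signed_root_llr(k11, k12, k21, k22):
    """Square root of LLR, negated when k11 is below its expected value."""
    n = k11 + k12 + k21 + k22
    expected_k11 = (k11 + k12) * (k11 + k21) / n
    root = math.sqrt(llr(k11, k12, k21, k22))
    return root if k11 >= expected_k11 else -root

def table(k1, n1=3000, k2=3000, n2=1000000):
    """Hypothetical helper: k1 hits out of n1 versus k2 hits out of n2."""
    return (k1, n1 - k1, k2, n2 - k2)

at_expectation = signed_root_llr(*table(9))   # identical rates
below = signed_root_llr(*table(3))            # fewer hits than expected
above = signed_root_llr(*table(20))           # more hits than expected
print(at_expectation, below, above)
```

With k1=9 the score is essentially zero. The plain LLR rises on both sides of that point, while the signed root is negative for k1=3 and positive for k1=20.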

Your definition of mutual information is a bit off and is very subject to errors with small counts. I would worry a bit. If you use the correct definition then I think that multiplying by the number of samples is a good thing since that gives you a measure of anomaly rather than association. For a threshold, anomaly is really what you are after ... other mechanisms are usually used for weighting anyway.
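To make the distinction between association and anomaly concrete, here is a small Python sketch using the textbook definition of mutual information for a 2x2 table. I am assuming that definition is the one meant by "the correct definition"; the function name and the two example tables are mine.

```python
import math

def mutual_information(k11, k12, k21, k22):
    """Mutual information (in nats) between the row and column variables."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    mi = 0.0
    for count, r, c in cells:
        if count > 0:  # empty cells contribute nothing
            mi += count / n * math.log(count * n / (rows[r] * cols[c]))
    return mi

small = (2, 0, 0, 2)        # 4 observations
large = (200, 0, 0, 200)    # same proportions, 400 observations

mi_small = mutual_information(*small)
mi_large = mutual_information(*large)
llr_small = 2 * sum(small) * mi_small
llr_large = 2 * sum(large) * mi_large
print(mi_small, mi_large, llr_small, llr_large)
```

Both tables have the same mutual information, but after multiplying by the number of samples the larger table scores 100 times higher, reflecting how much more evidence it carries.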

Regarding trigrams, one of the easiest implementations is to simply make a pass looking for significant bigrams. Then, considering those bigrams as single words, make another pass looking for bigrams that include bigrams from the first pass. These composite bigrams are really trigrams. You can repeat this process as you like.

Mathematically, this is a bit off, but the ease of implementation really outweighs the value of orthodoxy in this case.
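The multi-pass idea can be sketched in a few lines of Python. Everything here is illustrative: the toy corpus, the threshold of 40, the underscore used to join tokens, and the choice to keep only pairs that occur more often than expected are all my assumptions, not part of the original description.

```python
import math
from collections import Counter

def h(counts):
    total = sum(counts)
    return sum(c / total * math.log(c / total) for c in counts if c > 0)

def llr(k11, k12, k21, k22):
    n = k11 + k12 + k21 + k22
    return 2.0 * n * (h([k11, k12, k21, k22])
                      - h([k11 + k12, k21 + k22])
                      - h([k11 + k21, k12 + k22]))

def significant_bigrams(tokens, threshold):
    """Adjacent pairs that occur more often than expected and score above threshold."""
    pairs = Counter(zip(tokens, tokens[1:]))
    first = Counter(tokens[:-1])    # how often each token starts a pair
    second = Counter(tokens[1:])    # how often each token ends a pair
    n = len(tokens) - 1             # total number of adjacent pairs
    found = set()
    for (a, b), k11 in pairs.items():
        k12 = first[a] - k11        # a followed by something other than b
        k21 = second[b] - k11       # b preceded by something other than a
        k22 = n - k11 - k12 - k21
        expected = first[a] * second[b] / n
        if k11 > expected and llr(k11, k12, k21, k22) > threshold:
            found.add((a, b))
    return found

def merge(tokens, bigrams):
    """Rewrite the stream left to right, joining each selected pair into one token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Toy corpus: a repeated phrase plus some filler text.
tokens = ("new york times and " * 10 + "and so and on " * 10).split()
pass1 = significant_bigrams(tokens, threshold=40.0)
pass2 = significant_bigrams(merge(tokens, pass1), threshold=40.0)
print(pass1, pass2)
```

On this toy input the first pass selects ("new", "york") and ("york", "times"), and the second pass finds ("new_york", "times"), which is really a trigram. Because the merge is greedy, an overlapping pair such as ("york", "times") is simply skipped once "york" has been consumed.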

Thanks for the blog post and paper -- it's great that you've gone through the trouble of explaining the usefulness of LLR as a similarity score.

I'm having some practical problems with using item similarities when the number of (binary) coratings for different pairs of items varies over several orders of magnitude.

Since LLR (or signed-sqrt-LLR) is a test statistic, its value for each item pair scales linearly (or as the square root) with the number of coratings N.

The problem is the following: for each item I am picking out some number of 'most similar items'. If I use the statistic that depends on N, the 'most similar items' are items that have a ton of coratings but aren't necessarily all that similar. If I normalize so that they don't depend on N, then the 'most similar items' tend to be those with small numbers of coratings, since those similarity scores will be very noisy.

What I'm doing right now is an awful kludge where I'm 'partially' normalizing by some arbitrary fractional power of N and seeing what looks the best by eye. This is dumb for a bajillion reasons.

I bet this is a standard problem and that there is a standard way of dealing with it, especially within the context of a simple recommendation system, but I'm not a statistician or ML expert and have no idea what it is. Thoughts?

The problem that you are encountering is that LLR is purely a test of association rather than a measure of same. It is best used in a thresholding style that gates the use of other association measures. In that usage, it provides a stop against the over-fitting that simple measures of association tend to exhibit.

As an example, I have used it very effectively to determine whether TF-IDF scores should be used in a document classifier. The TF-IDF scores were not really even a measure of association with the desired topic, but when gated by LLR, the results were stunningly good.

If LLR isn't a measure of similarity, then what exactly does Mahout's implementation of LogLikelihoodSimilarity do?

I've been trying to get my head around this concept for a while, and your blog post is the closest I get to understanding the relationship between LLR and similarity...but you say that they are not quite the same thing.

I appreciate your patience in responding to my questions -- I've been looking for a connection between LLR and similarity, but while there are a lot of tutorials on LLR, there's nothing that ties that concept to that of similarity between two items.

Hi, I am working on speech enhancement. I want to compare the results of Method A and Method B. Which method is considered better for what value of LLR? (Example: what if the LLR of Method A is greater than that of Method B?)

I am not at all clear about what you will be doing. Generally, tasks like speech recognition have their own figures of merit such as perplexity. With speech enhancement, I would guess that you have more subjective metrics.

Hi Ted, I need a small input here. I am working with data similar to user-item data in a recommendation scenario, i.e.,,,etc.. My aim is to compare the different similarity measures between each pair of items i.e. , etc and form a graph to perform clustering and see which one does a better job of giving me meaningful clusters. I already have normalized mutual information as the similarity measure for the graph. I want to check LLR for the same. As per your comments, the only modification I would require is multiplying the "total count" into MI. Is the "total count" the total number of users in the system, or is it the total count of users across i1 (event A) and i2 (event B)?

Sorry. I see that part of my question was not posted well. The data I was referring to (in the 2nd line): u1,i1; u2,i2; u2,i1; etc. I hope this makes it clear.

I am not sure if I got it correct from your reply. But here is what I am understanding right now:

The "total count" is the total number of users in i1 (event A) and i2 (event B). Since MI (mutual information) can be very high for two items that co-occurred by chance, multiplying by the total count would help remove those chance biases. For example: Case 1: i1 is liked by 70 users, and i2 is liked by 60 (of the same 70 users). Case 2: i1 and i2 are liked by the same 2 users.

MI(case 2) > MI (case 1). Which is a wrong signal for similarity. Whereas when multiplied by "total count" to both cases then it becomes a more correct signal.

Let me know if I am understanding it right. Or if I am missing a point here.

Currently, I am using the raw cooccurrence of both events (e.g. (View^T * Purchases) * userViewVector) and get quite satisfying results.

Nevertheless, I wonder how to apply LLR to this problem. Let's say event A is the view of an item and event B is the purchase of this item. k11 would be the cooccurrence of both, k21 would be a view without a purchase afterwards, k12 might never happen since one must first view an item before purchasing it, and k22 would be all observations without interactions related to this item. Is this correct?

Furthermore, if I now computed a matrix of LLR values indicating which cooccurrence might be useful, would it be useful to multiply the LLR value with the cooccurrence to scale the cooccurrence according to its "confidence"?

where f(A,B) is a count-dependent scaling factor, which has been discussed here already, and H[A,B] and H[A] and H[B] are the entropy of the joint probability and marginals, respectively.

Here comes the question: Isn't the term in brackets actually the *negative* mutual information between A and B, because MI[A,B] = H[A] + H[B] - H[A,B]? Or are you using a different convention here? (See response from May 20, 2012 at 3:17 PM).

Clarification would help me a lot in better shaping my intuition of how to use LLR as the beautifully simple measure to detect and interpret similarities that it certainly is.

Hi Ted,I'm having difficulty in giving a probabilistic interpretation to LLR in the case of collaborative filtering. Can you state what do events A and B correspond to if we build a contingency table for two users in the context of collaborative filtering.

I would love to help, but I don't quite understand why a probabilistic interpretation of LLR would do you any good. The LLR stuff in terms of collaborative filtering is a totally deterministic way to get indicators which are then also used deterministically. Theoretical analysis is essentially worthless here because of the complex nature of the actual data.

Empirical analysis has been done by Schelter et al in

http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf

In any case, the A and B in this case do not refer to two users, but refer to two actions that users might do. The k_11 cell is the number of users who have done both A and B. k_12 is the number of users who have done A but not B and so on. The goal here is to find events which are linked by cooccurrence in user histories.

The paper above has an excellent description of the way to do this as well as some pragmatic things such as down-sampling.

You can ask questions like this on the Mahout mailing list or directly to my email address which you can find on many of my slide shows on slideshare.

Thanks for sharing this, and pointing to the original article –both great!

I have a doubt. I can see perfectly well how to use the LLR for word bigrams. But I have a question about how to use it on, say, book purchases. It looks to me as if the formula is not enough, though I might have misunderstood something.

I understand that if two rare words appear in the same bigram, the "signal" is clearly high. Always. But for book purchases the situation is different. If two rare books appear on the shelves of two customers who bought the majority of books, the signal is not high anymore. At least in one case, the signal was low.

As we have no "bigrams" with books, unless we observe when book B is bought after book A, the formula should also take into consideration the number of books bought by both users. Am I missing something?

The idea of cooccurrence can be substantially expanded in the case of products. With words, you consider how many times each word occurs anywhere and how many times they occur in sequence. With items purchased, you typically consider how many times each item was purchased by any user as well as how often the two items are purchased by the same user. That, combined with the total number of users, gives you all four of the numbers you need to use the LLR test.
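As a rough illustration of how those four numbers come out of purchase data, here is a Python sketch. The input layout (a dict from user id to a list of purchased items), the function name, and the sample histories are all made up for the example. The cells follow the convention k12 = "bought the first item but not the second"; swapping k12 and k21 does not change the LLR score.

```python
from collections import Counter
from itertools import combinations

def contingency_tables(histories):
    """Map each cooccurring item pair (a, b) to its (k11, k12, k21, k22) user counts."""
    n_users = len(histories)
    item_users = Counter()   # number of users who bought each item
    pair_users = Counter()   # number of users who bought both items of a pair
    for items in histories.values():
        unique = sorted(set(items))          # count each user once per item
        item_users.update(unique)
        pair_users.update(combinations(unique, 2))
    tables = {}
    for (a, b), k11 in pair_users.items():
        k12 = item_users[a] - k11            # bought a but not b
        k21 = item_users[b] - k11            # bought b but not a
        k22 = n_users - k11 - k12 - k21      # bought neither
        tables[(a, b)] = (k11, k12, k21, k22)
    return tables

histories = {
    "u1": ["book_a", "book_b"],
    "u2": ["book_a", "book_b", "book_c"],
    "u3": ["book_c"],
    "u4": ["book_a"],
    "u5": [],
}
tables = contingency_tables(histories)
print(tables[("book_a", "book_b")])
```

Each resulting tuple can be passed straight to whatever LLR function you are using.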

Clearly. The information provided by a rare book is higher than the information provided by Harry Potter.

> how often the two items are purchased by the same user

Still, the information provided by this number is not uniform. Let's take two books which were each bought just once, by a single user.

If that user only bought a dozen books in total, this indicates a high similarity between the two books.

If that user is a big customer, who bought 90% of all books sold on the site, this does not add much to the picture.

Still, in both cases, the weight added into the cooccurrence matrix would be the same, whichever customer (big or small) bought the two books together.

In a real case, we have business accounts who would create similarity between two books, although it's clear that they have rare books in their shelves only because they own thousands of books. I'd then prefer a higher weight when the customer owns very few books. I might be wrong, but I saw the problem as the "two people with the same birthday" and computed the information according to a negative binomial distribution. If I have correctly understood, the LLR would not lead to the same result, correct?

Whatever the correctness of my approach above –thanks again for your answer, it was very much appreciated!

If I understand LLR, 0 means zero multiples of random chance, and I find that if I give it 13 in this context, that's pretty close to the expected value of 12.7.

Now, notice I commented out the line with the sum(k) formula. Intuitively I thought it would be that. I'm a CS guy, not a math guy, so my intuition is worth about 10 cents. But is my EV formula right? And why? Why is my denominator k[2,2] and not sum(k)?

Mostly want to ensure I have the EV formula right, but curious about any intuition why it is right if it is easy enough to explain.

Next question. LLR naturally gives weight to any strong deviation from chance. Here I present something close to LLR = 0, as well as two strong deviants (note per my other posting, I'm adding a sign using expected value):

Now, I'm using this as a product recommendation engine. As a constraint of my business problem, I'm interested in recommending only common products, so does that mean I want the high positive LLR values only?

Seems to me, this is analogous to a one-tail versus two-tail problem. If I want *any* deviant event, I want either "tail". But if I want common products for high-volume sales, I want high positive values. Conversely, if I were doing something like fraud detection, I might want only the strong negative values--because I'm looking for rare deviations.

Thoughts? In short, if I'm recommending products for the masses, is it safe to just use high positive values?

Hi Ted! Is LLR chi-square distributed with 1 degree of freedom? If so, then I can use tables for the chi-square distribution to select a threshold for significant co-occurrences with the desired level of confidence. Right? The other question is about the formula. In some articles I saw 2*log(MatrixEntropy - RowEntropy - ColumnEntropy) and you have 2*(MatrixEntropy - RowEntropy - ColumnEntropy). It doesn't matter for ranking, but it matters for statistical analysis. Which one is right? Thanks in advance!

Yes, LLR is chi^2 distributed exactly as Pearson's test is distributed (asymptotically in both cases). For 2x2 tables, this gives one degree of freedom. With n x m tables, there will be (n-1) (m-1) degrees of freedom.

And yes, you can use standard tables. The only proviso is that the square root of LLR is commonly used with a sign that reflects whether the upper left entry in the contingency table is larger or smaller than expected. Root LLR is normally distributed, ideally.
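If you would rather compute the tail probability than look it up, the one-degree-of-freedom case has a closed form in terms of the complementary error function. The Python sketch below uses only the standard library; the function names are mine.

```python
import math

def chi2_1df_tail(x):
    """P(X > x) for X distributed as chi^2 with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def normal_two_sided_tail(z):
    """P(|Z| > z) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# 3.841 is the usual table value for the 5% level with one degree of freedom.
p_at_table_value = chi2_1df_tail(3.841)

# An LLR of 9 and a root LLR of 3 describe the same two-sided tail.
p_from_llr = chi2_1df_tail(9.0)
p_from_root = normal_two_sided_tail(3.0)
print(p_at_table_value, p_from_llr, p_from_root)
```

The last two values agree because the square root of a chi^2(1) variable is a half-normal variable, which is the relationship described above.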

The formula should be 2 N (MatrixEntropy - RowEntropy - ColumnEntropy), but you have to watch out because un-normalized implementations of entropy are sometimes used which incorporate the factor of N.
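The point about un-normalized entropy can be checked directly. In the Python sketch below, `unnormalized_entropy` is my own name for "N times the entropy, computed from raw counts", and the example table is arbitrary. Note the sign conventions: with true (non-negative) entropies the combination is row + column - matrix, while the matrix - row - column form goes with entropies written as a sum of p log p without the leading minus sign.

```python
import math

def xlogx(x):
    """x * log(x), with the limit value 0 at x == 0."""
    return 0.0 if x == 0 else x * math.log(x)

def unnormalized_entropy(counts):
    """N times the Shannon entropy of the counts, without dividing by N."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr_from_counts(k11, k12, k21, k22):
    """LLR from un-normalized entropies; the factor of N is already inside them."""
    row = unnormalized_entropy([k11 + k12, k21 + k22])
    col = unnormalized_entropy([k11 + k21, k12 + k22])
    mat = unnormalized_entropy([k11, k12, k21, k22])
    return 2.0 * (row + col - mat)

def plogp(counts):
    """Sum of p * log(p): entropy without the leading minus sign."""
    total = sum(counts)
    return sum(c / total * math.log(c / total) for c in counts if c > 0)

def llr_from_proportions(k11, k12, k21, k22):
    """LLR as 2 N (matrix - row - column) using normalized p log p sums."""
    n = k11 + k12 + k21 + k22
    return 2.0 * n * (plogp([k11, k12, k21, k22])
                      - plogp([k11 + k12, k21 + k22])
                      - plogp([k11 + k21, k12 + k22]))

a = llr_from_counts(10, 5, 3, 100)
b = llr_from_proportions(10, 5, 3, 100)
print(a, b)
```

Both forms give the same score up to round-off, so an implementation with no visible factor of N is not necessarily missing it.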

Can you point me to articles that use each formula? (especially if I have been inconsistent?)

In this article: http://wortschatz.uni-leipzig.de/~sbordag/papers/BordagMC08.pdf I saw the formula L=(MatrixEntropy - RowEntropy - ColumnEntropy) and LLR = 2 log(L). It seems to be wrong.

If I'm building a recommender system, then 'N' is the total number of users, right? If N is big enough, like several million, then almost every co-occurrence seems to be significant with high confidence, since the chi-square critical value even at alpha=0.005 is only 7.879 (http://sites.stat.psu.edu/~mga/401/tables/Chi-square-table.pdf). That seems a little strange to me.

Regarding your comment about the increasing sensitivity of the test, yes, smaller and smaller effects become visible. Of course, as N increases, you also typically get more and more things that you are testing (more words, more word pairs, more products, whatever). That makes any traditional statistical interpretation quite difficult.

What I recommend is that you simply rank all the results you get rather than choosing a threshold from a table. The biggest ones will tend to be the most interesting.

Thanks for the post. I am trying to apply this method to a movie recommendation system. Each movie has a rating between 1 and 5, so I think the matrix k in my case is:
k_11 = number of users who 'like' both movie A and movie B
k_12 = number of users who 'like' movie A and 'dislike' movie B
k_21 = number of users who 'dislike' movie A and 'like' movie B
k_22 = number of users who 'dislike' both movie A and movie B
where 'like' means a rating > 3 and 'dislike' means a rating <= 3. What do you think? Thanks in advance.

I think that it is better to use "did not rate as having liked" as the opposite of "liked", however. In fact, the simple act of rating, regardless of value may be just as valuable a feature.

What I would recommend is that you try using this in the context of a multi-modal recommender. Check out Pat Ferrel's blog on the topic: http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/

Thanks so much. I have another quick question about predicting the ratings. Assuming that I build the matrix with all the log-likelihood ratios (llr) for the items, the so-called recommendation vector is r = h_p*llr, where h_p is the history of the user. Unfortunately, the vector r does not contain predicted ratings, but it can be used to rank the items to recommend. Is there a way to extract predicted ratings from r? Do you think that, interpreting the llr entries as weights, I can use a weighted average to predict ratings? For example, r_1 = (h_p*llr_{...,1})/sum(llr_{...,1} if h_p_i >0), where llr_{...,1} is the first column of the llr matrix and the if statement sums only the weights corresponding to ratings >0 (i.e. the user has seen the movies). Thanks again.

I hope I am not too emphatic here, but why in the world would you really want to predict ratings? Prediction of ratings is just a left-over from academic work in the 90's and has no real validity in the real world.

When you build a recommendation system in the world, there is only one goal. That is whether it made your users happy. Presenting them with content that they want to see and causing them to engage with what you showed them is what makes them happy. Your users could not possibly care less about whether you predicted what rating they might put on content.

Not only is prediction of the rating not useful, it is counter-productive because it is matching a very odd behavior (rating your content) that is done by a minute part of your audience (typically a few percent unless you force ratings). The results are, not surprisingly, very odd.

It is much better to try to measure whether users actually engage with content you offer. This doesn't mean rating it. It might not even mean consuming it. With videos, I have found 30 second watches to be good surrogates for engagement. With products, simple measures of product page engagement such as scrolling or clicking to view more are good surrogates. My experience with clicks was that they teach the recommender to spam users, and my experience with ratings is that you get very little, very odd data that seems to have little to do with normal user behavior.

By just looking into the numbers it seems that items in S2 should be more similar than the items in S1 as in S1 they occurred together only once compared to S2 where the items occurred together 101 times. In both the cases k11 + k12 + k21 + k22 is same.

The issue is that LLR is not a measure of similarity. It is a measure of anomaly. It tells you where there is likely to be a non-zero interaction, but doesn't tell you even the sign of the interaction.

In your case, S1 has anomalously low cooccurrence of the two products. You often see this with items with strong brand loyalty like, say, razor blades. In S2, you have anomalously high cooccurrence.

To make this easier to see and deal with, I sometimes use the square root of the LLR score and add a sign according to whether k11 is larger or smaller than you might expect. Since LLR is asymptotically $\chi^2(1)$ distributed, the square root will be half-normal distributed. With the sign, you now have a measure which is (very) roughly calibrated in units of standard deviations above or below expectations. The Mahout code you mention has an implementation of this.

In the dissertation, note that lots of the stuff isn't required for all readers. There are 5 chapters that describe particular applications of the technique to different domains. Each of those is relatively short, stands alone and can be read separately if you have a similar interest.

Ted, thank you very much again for the Python code for LLR! It is really my pleasure to read your book PracticalMachineLearning MAPR.pdf. Only it would be very kind of you to share some educational code in a simple language (Python, C, ...) to understand how LLR works for recommender systems. You did something with Pig at https://github.com/tdunning/ponies (sample recommender flow for search as recommendation), but it is a black box. By the way, regarding https://github.com/tdunning/sequencemodel (the sequence anomaly detector from our second in the Practical Machine Learning Series): I am looking forward to reading the second book.

I'm not sure that you can help me, but... I don't have any problem using LLR to find similar items when I have users and items. I use spark-itemsimilarity and I get results as item, item, LLR. I use the chi-squared distribution with one degree of freedom, so I know which pairs of items are significantly similar.

But I want to use spark-rowsimilarity, where I have data like: item, and 4 attributes (for example author, subject, etc.). I read on the Mahout page (https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html) that it's LLR too, but I'm not sure how it's calculated, so I don't know exactly which test it is. Is it still chi-squared? If yes, is it with one degree of freedom because we compare 2 items, or, for example, (4-1)(4-1)=9 because we have 4 attributes?

I do worry that you seem to be looking at LLR scores too much in the lens of a significance test. That is fine when you are looking at a single test, but when you are doing cooccurrence testing, you have millions to billions of potential tests that you are doing. Furthermore, any upstream downsampling will make significance even harder to interpret. Also, you mention using the chi^2 distribution. I find it easier to compare the signed square root of the LLR to the normal distribution. This allows me to differentiate over and under representation and a scale denominated in standard deviations is easier to talk about with many people. In any case, I usually set the global cutoff to something like 3 to 5 standard deviations (AKA chi^2 of 9 - 25). I typically set this limit by looking at the indicators produced and picking a limit such that there is a large, but not overwhelming number of garbage indicators being produced.

The limit to 100 indicators often implies local cutoffs much larger (say 20-50 standard deviations which are comparable to chi^2 scores 400-2500). These cutoffs are very high in terms of single test significance levels, but may still be lower than what you might need to use if you used something like the Bonferroni correction. In any case, we don't care about *whether* there is structure that violates the null hypothesis. We *know* that there is. We want to predict behavior in the future.

I think that it is very useful to step outside of traditional hypothesis testing. What I find more useful for many situations today is to think more about building models and trying to understand how they will perform on data that we haven't yet seen. If you look at Breiman's famous paper at https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726, you can see a really good description of the situation and the culture gap.

So what I think is better for building models is to use a global cutoff for all scores that is estimated from first principles, but then in each specific case to have a secondary cutoff in terms of the number of interesting connections (typically limited to 50 to 100).
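A minimal Python sketch of that two-stage selection follows. The input format (triples of target, indicator, signed root LLR), the function name, and the default values are assumptions made for the example; only the idea of a global cutoff followed by a per-item limit comes from the discussion above.

```python
from collections import defaultdict

def select_indicators(scores, global_cutoff=3.0, max_per_target=100):
    """Keep scores at or above the global cutoff, then the best few per target.

    scores: iterable of (target, indicator, signed_root_llr) triples.
    """
    by_target = defaultdict(list)
    for target, indicator, score in scores:
        if score >= global_cutoff:           # global cutoff, in standard deviations
            by_target[target].append((score, indicator))
    selected = {}
    for target, candidates in by_target.items():
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        selected[target] = [ind for _, ind in ranked[:max_per_target]]  # local limit
    return selected

scores = [
    ("t1", "a", 12.0),
    ("t1", "b", 2.5),    # below the global cutoff
    ("t1", "c", 7.0),
    ("t1", "d", 4.0),    # passes the cutoff but loses to the per-target limit
    ("t2", "a", -6.0),   # strongly under-represented, so not an indicator
    ("t2", "e", 3.5),
]
selected = select_indicators(scores, global_cutoff=3.0, max_per_target=2)
print(selected)
```

Using the signed root rather than the raw LLR means that anomalously low cooccurrence is dropped by the same cutoff.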

For your second question about cooccurrence with attributes, this is most easily done by simply pretending that the characteristics themselves are specially labeled items in the user history and throwing them into the mix for cooccurrence. This isn't a great way to do it, but it can be done very quickly. Better still is to use code specifically designed for cross-occurrence.

For cross-occurrence, it is still the same 2x2 test as with cooccurrence and the same distributions still hold asymptotically in the case of the null hypothesis. The same objections arise about using a threshold test as a test of significance as well.

Do you think that LLR can be used in conjunction with association rules? For example, can we somehow make use of the significance test results of LLR to refine/filter/back up the most significant rules given based on lift or confidence?

Let me give you a simple example. Consider two rules A->B and A->C where B is an extremely popular product in my dataset and of course more popular than C. Based on confidence metric, it is likely to have a higher value for the first rule as opposed to the second one. In such cases, can the LLR come in and provide a strong indication of independence for the first rule and dependence of the second?

I have concerns that popular items that happen to appear along with others a lot will have high scores and overshadow other strong correlations. I am confident that LLR can help with this, right?

Yes, LLR is a good way to select pairs like this. These aren't really association rules as such any more though. The name that I use for this is "indicator". I would write the association rule as indicator -> target.

Commonly, the way that this works is to pick the highest N indicators for a particular target by LLR score. If the indicator is very common in general, it will take a whole lot of cooccurrences to get a high LLR score. The frequency of the target doesn't really matter all that much since all of the indicators for that target have the same target prevalence (by definition).

In your example, you have two rules A->B and A->C which indicates you are thinking about things in terms of a common indicator. It is usually better to be a bit more target centric in your thinking (that is, consider A->C, B->C instead). After all, once you have your rules, you can always sort them by indicator instead of target.