The Data Mining Forum. This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post calls for papers, data mining job ads, links to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!

Is it possible to do the same thing using the AprioriRare (http://www.philippe-fournier-viger.com/spmf/index.php?link=documentation.php#example17) and AprioriInverse (http://www.philippe-fournier-viger.com/spmf/index.php?link=documentation.php#example18) algorithms?

I am sorry if this question is a result of my poor understanding of the algorithms.

Actually, there are two main types of Apriori-style algorithms: those based on Apriori, which scan the database to calculate the support of patterns, and those based on AprioriTID, which keep the transaction identifiers in memory to avoid scanning the database.

CORI is an AprioriTID based algorithm. Since it is based on AprioriTID, it keeps the transaction IDs of each pattern in memory. For this reason, it was easy to add the feature of showing the transaction IDs in the output for this algorithm.

AprioriInverse and AprioriRare are based on Apriori instead of AprioriTID. For this reason, they do not keep the transaction identifiers in memory for each pattern. Thus, the information is not available when it is time to write the patterns to the output file.

It would be possible to modify AprioriInverse and AprioriRare to show the transaction identifiers, but it would require more work than for CORI. It would require re-implementing these algorithms based on AprioriTID instead of Apriori.

Because it is more complicated to add the feature of showing the transaction IDs in the output of AprioriInverse and AprioriRare, I have not implemented that feature for these algorithms. But do you really need that feature? If you really need it for your research, I could implement it. It would perhaps require one day of work.

I appreciate your clear explanation of the difference between CORI and AprioriRare/AprioriInverse.

In my ongoing research project, I am trying to find rare/interesting/unexpected patterns and the transactions that contain such patterns. Comparing the three algorithms CORI, AprioriRare, and AprioriInverse is important for me because they have different definitions of interestingness.

To obtain TIDs for AprioriRare and AprioriInverse, I tried a naive method: after running these algorithms, I searched all transactions again for each itemset found. The method worked but was very slow. Indexing the transaction database before searching could probably make the process faster. However, if AprioriTID-based versions of AprioriRare and AprioriInverse were available, they would likely have the best performance.
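The indexing idea mentioned above could be sketched as follows, in Python rather than SPMF's Java. This is only a minimal sketch under assumptions: the function names `build_index` and `tids_of` and the toy database are hypothetical and are not part of SPMF. It builds an inverted index from each item to the set of transaction IDs containing it, then intersects those sets for each itemset reported by the miner.

```python
from collections import defaultdict


def build_index(transactions):
    """Map each item to the set of 0-based transaction IDs that contain it."""
    index = defaultdict(set)
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            index[item].add(tid)
    return index


def tids_of(itemset, index):
    """Return the sorted IDs of the transactions containing every item of itemset."""
    # Start from the rarest item so that intermediate sets stay small.
    items = sorted(itemset, key=lambda item: len(index.get(item, ())))
    if not items:
        return []
    result = set(index.get(items[0], ()))
    for item in items[1:]:
        result &= index.get(item, set())
        if not result:
            break  # no transaction can match any more
    return sorted(result)


# Toy database (hypothetical): one list of items per transaction.
database = [[1, 2, 3], [2, 3], [1, 3, 4], [2, 3, 4]]
index = build_index(database)
print(tids_of([2, 3], index))  # [0, 1, 3]
```

The index is built in a single pass over the database, so each itemset lookup touches only the transactions of its items instead of the whole database.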

For those reasons, I really need the TID feature. If you can make time to implement it, it would be a great help for my project. And if you need details about my research project, I am happy to explain them to you by email.

I see. This is great. I have thus implemented the feature for you today, and uploaded a new version of the SPMF software on the website. If you download it again from the download page, you will get the new version.

It includes two new algorithms: AprioriRare_TID and AprioriInverse_TID

For both of them, I have added the option of showing the transaction identifiers.

Note that these new versions of AprioriRare and AprioriInverse do not use the bitset optimization that is used in AprioriTID_bitset. If the performance is slow, I could add the bitset optimization to these algorithms. This optimization consists of representing the list of transactions of each pattern as a bit vector. It can improve the speed for some datasets, but not for all. If the dataset is very sparse, it will not help and may even consume more memory. This is the reason why I have not added this optimization.

I really appreciate your very prompt implementation of AprioriRare_TID and AprioriInverse_TID!

I tried both algorithms on my 3 million transactions. They successfully obtained the results with TIDs.

I noticed that AprioriRare_TID and AprioriInverse_TID are much faster than AprioriRare and AprioriInverse, respectively. I compared them using the same dataset and the same version of SPMF (v2.18). If you have any comments on these results, could you let me/us know?

In general, I would expect AprioriTID to be faster than the regular Apriori, but whether it actually is depends mostly on whether your data is sparse or dense. Let me explain this with an example.

Let's say that we have a pattern X. To calculate the support of X, the Apriori algorithm will scan the database and compare each transaction with X. This is really costly. If there are 3 million transactions, each pattern would be compared with each transaction. Of course, some optimizations are possible. For example, we could sort transactions by their size. Then, we can compare a pattern X with only the transactions that contain |X| items or more, since a smaller transaction cannot contain X. This would avoid scanning the whole database. There are a few optimizations like this that can be done to reduce the cost of scanning the database.
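To make the scanning cost concrete, here is a minimal Python sketch of support counting by database scan. It is an illustration only, not SPMF's Java implementation, and the function name `support_by_scan` and the toy database are hypothetical. The size check relies on the fact that a pattern with |X| items can only occur in transactions with at least |X| items.

```python
def support_by_scan(pattern, transactions):
    """Count the transactions that contain every item of pattern (full scan)."""
    pattern = set(pattern)
    count = 0
    for transaction in transactions:
        # A transaction with fewer items than the pattern cannot contain it.
        if len(transaction) < len(pattern):
            continue
        if pattern.issubset(transaction):
            count += 1
    return count


# Toy database (hypothetical): T1..T4 contain a; T2..T4 also contain b.
database = [["a"], ["a", "b"], ["a", "b", "c"], ["a", "b", "d"], ["c"]]
print(support_by_scan(["a", "b"], database))  # 3
```

Every call walks the whole database, which is why the cost grows with the number of transactions multiplied by the number of candidate patterns.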

For the AprioriTID algorithm, a different process is used to calculate the support of a given pattern X. Rather than scanning the database, AprioriTID uses lists of transaction IDs. For example, if we have two patterns {a} and {b}, AprioriTID could have lists like these:

{a}: T1, T2, T3, T4
{b}: T2, T3, T4

Then, if we want to calculate the support of {a,b}, we can simply intersect the list of transactions of {a} with that of {b}, as follows:

T1, T2, T3, T4 INTERSECTION T2, T3, T4 = T2 T3 T4

So we have that {a,b} appears in T2, T3 and T4 and thus has a support of 3.
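The intersection above can be sketched in Python as a two-pointer merge over sorted lists of transaction IDs. This is a minimal illustration under assumptions, not the code used in SPMF; the function name `intersect_sorted` is hypothetical, and the IDs 1 to 4 stand for T1 to T4.

```python
def intersect_sorted(tids_x, tids_y):
    """Intersect two ascending lists of transaction IDs with a two-pointer merge."""
    result = []
    i = j = 0
    while i < len(tids_x) and j < len(tids_y):
        if tids_x[i] == tids_y[j]:
            result.append(tids_x[i])
            i += 1
            j += 1
        elif tids_x[i] < tids_y[j]:
            i += 1
        else:
            j += 1
    return result


tids_a = [1, 2, 3, 4]  # {a} appears in T1, T2, T3, T4
tids_b = [2, 3, 4]     # {b} appears in T2, T3, T4
tids_ab = intersect_sorted(tids_a, tids_b)
print(tids_ab, len(tids_ab))  # [2, 3, 4] 3
```

The work done is proportional to the combined length of the two lists, not to the size of the database.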

Now, will it be faster to calculate the support using these lists of transactions, instead of scanning the database?
It will be faster if the lists of transactions are small, that is, if each item does not appear in many transactions. For example, if you have 3 million transactions, but each item appears in only about 500 transactions, then computing the support will be quite fast. Instead of scanning 3 million transactions, you can expect to just have to intersect two lists of about 500 transactions. This should be much faster.

Thus, to summarize, AprioriTID should be faster for sparse datasets (datasets where each item has a short list of transactions). For dense datasets, Apriori could perhaps be faster.

Up to now, I have only discussed runtime. But in terms of memory, AprioriTID will likely require more memory than Apriori. For Apriori, we just need to keep the database in memory (or in a file). For AprioriTID, we need to keep a list of transactions for each itemset currently in memory. Thus, the memory usage of AprioriTID could be much higher than that of Apriori in some cases.

To address this problem, there is an optimization of AprioriTID where the lists of transactions are implemented as bit vectors (bitsets) instead of lists of integers. This can greatly reduce the memory usage, especially for datasets that are more dense. Besides, calculating the intersection of two bit vectors can be much faster than computing the intersection of two lists of integers.
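As a rough illustration of the bit vector idea, the Python sketch below encodes each list of transaction IDs as an integer whose bit i is set when transaction i contains the pattern. It is not the AprioriTID_bitset implementation of SPMF; the function names are hypothetical, and the memory estimate assumes 4 bytes per integer ID and 1 bit per transaction, ignoring any object overhead.

```python
def to_bitset(tids):
    """Encode 0-based transaction IDs as an integer used as a bit vector."""
    bits = 0
    for tid in tids:
        bits |= 1 << tid
    return bits


def support(bits):
    """Number of set bits, i.e. number of transactions in the list."""
    return bin(bits).count("1")


bits_a = to_bitset([0, 1, 2, 3])  # {a} appears in transactions 0..3
bits_b = to_bitset([1, 2, 3])     # {b} appears in transactions 1..3
bits_ab = bits_a & bits_b         # intersection is a single bitwise AND
print(support(bits_ab))           # 3


def int_list_bytes(n_tids, bytes_per_id=4):
    """Estimated size of a list of integer IDs (assumed 4 bytes each)."""
    return n_tids * bytes_per_id


def bitset_bytes(n_transactions):
    """Estimated size of a bit vector with one bit per transaction."""
    return (n_transactions + 7) // 8


# Sparse case from this thread: 3 million transactions, about 500 per item.
print(int_list_bytes(500))      # 2000
print(bitset_bytes(3_000_000))  # 375000
# Dense case (hypothetical): an item appearing in 2 million transactions.
print(int_list_bytes(2_000_000))  # 8000000
```

Under these assumptions, the bit vector costs the same for every pattern regardless of how many transactions it covers, which is why it pays off for dense data and can cost more memory than integer lists for sparse data.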

Another way of reducing the memory usage of AprioriTID is to combine Apriori and AprioriTID. In the original Apriori paper, this approach is called AprioriHybrid. It consists of applying the regular Apriori for small itemsets, which usually have long lists of transactions, so that memory usage is reduced. Then, for larger itemsets, the algorithm uses the lists of transactions of AprioriTID to reduce the execution time. I have not implemented AprioriHybrid. I just mention it as another variation of Apriori.

If you have other questions about this, you can let me know. I can give more information about this. Also, if you want to discuss something more specific related to your project, we can also discuss by e-mail. Besides, you may perhaps be interested to read my survey about itemset mining, which gives some general introduction to itemset mining and discusses some of these issues:

Thank you very much again for your detailed explanation of Apriori, Apriori_TID and their optimization methods. I will also peruse your newest review paper.

My transaction dataset is certainly sparse. For the same dataset and the same MinSup, AprioriRare_TID and AprioriRare take 275 seconds (2.3 GB) and 13 hours (1.4 GB), respectively. Even without the bitset optimization and AprioriHybrid, the performance of AprioriRare_TID and AprioriInverse_TID is sufficient for my dataset in terms of functionality, memory usage, and runtime.

I will email you when my research project progresses or when I have further questions to discuss.