The Data Mining Forum This forum is about data mining, data science and big data: algorithms, source code, datasets, implementations, optimizations, etc. You are welcome to post call for papers, data mining job ads, link to source code of data mining algorithms or anything else related to data mining. The forum is hosted by P. Fournier-Viger. No registration is required to use this forum!.

And notice that order of items in itemset matters.
E.g itemset {1,2,3} and itemset (2,1,3} will generate different frequent sequences.
Itemset {1,2,3} will generate {1,2} as subsequence while itemset (2,1,3} will generate {2,1}.

Previously, I used an R package called arulesequence and it will sort items in itemset internally. E.g {1,2,3} and {2,1,3} will both become {1,2,3}.

Yes, in sequential pattern mining the order in itemsets should not be important. However, for practical purposes, all itemsets should be sorted according to some order in your input file, as explained in the documentation:

QuoteNote that it is assumed that items are sorted according to a total order in each itemset and that no item appears twice in the same itemset.

That order can be any order such as increasing order or decreasing order but it must be the same for all itemsets in your file.

Now, having said that, I want to also tell you that this particular algorithm for valued items may not alway generate the same result. The reason is that this algorithm applies k-means internally. Since k-means is a randomized algorithm, each time your run it, it may generate different clusters. Thus for this reason, if you apply this algorithm several times, you may get different results, even with the same input file.

My current way of defining itemset is that for a sequence of events that a customer took in the history, an itemset is the events happened in the same day. So in the final frequent sequence, I am able to know this sequence covers how many days. But I also want to reserve the original order of events in an itemset. So what if I reserve the order in an itemset anyway, will this cause some problem? Or destroy the robustness of algorithm?

And for your last comment about randomized result, is this just for valued items, which mean item is like A{3}, B{4}? (referred from your paper)
If my dataset only contains A, B and no values following, will the result be fixed?
Since I ran your sample data several times and found the results are the same.

> My current way of defining itemset is that for a
> sequence of events that a customer took in the
> history, an itemset is the events happened in the
> same day. So in the final frequent sequence, I am
> able to know this sequence covers how many days.
> But I also want to reserve the original order of
> events in an itemset. So what if I reserve the
> order in an itemset anyway, will this cause some
> problem? Or destroy the robustness of algorithm?
>

Yes, in sequential pattern mining, there is the assumption that there is no order between the items in an itemset appearing in a sequence. In order words, it is assume that items in an itemset occur at the same time or are simultaneous.

In SPMF, many algorithms further assume that all itemsets are sorted according to some total order such as the increasing order of items, or any other order. But this order must always be the same. The reason for this is that the algorithm will follow that order to explore the search space of all possible patterns. If there is no total order on items in itemsets, then it is possible that some patterns will not be found, or that the support will be incorrectly calculated. For example, consider this database of two sequences:

Sequence 1: (a b)(c) -- This means that A B appeared at the same time and were followed by C
Sequence 2: (a b) (c)

In that database, the pattern (a b） appears two times and has a support of 2.

Now consider another database:

Sequence 1: (b a )(c)
Sequence 2: (a b) (c)

This database violates the assumption that itemsets are sorted according to some order for processing purposes. In that case, if you apply PrefixSpan, it would probably find that the pattern (b a) has a support of 1 and that the pattern (a b) has a support of 1. This is obviously incorrect.

So that order between items for processing purposes is quite important.If you do not always use the same order inside the itemsets, it is possible that you miss some patterns or that the support is incorrectly calculated in the above example.

Now, there are several algorithms in SPMF. I cannot remember the behavior of each one of them. But I think that for most of them, that problem would occurs because all algorithms typically follow some order to process items in an itemset.

So, if you want to keep the order between items for some application, maybe you can find another way to do it. You could for example, save the information about the order in a separated file and keep it for the purpose of your application, and then give a file where itemsets are sorted to SPMF?

>
> And for your last comment about randomized result,
> is this just for valued items, which mean item is
> like A{3}, B{4}? (referred from your paper)
> If my dataset only contains A, B and no values
> following, will the result be fixed?
> Since I ran your sample data several times and
> found the results are the same.

Yes, you are right. It is only when using the valued items that K-Means will be used and that the results will be randomized.

Hope this won't bother you much! I just found some more questions about this algorithm.

1. In the document of Fournier08-Closed+time algorithm: http://www.philippe-fournier-viger.com/spmf/ClosedSequentialPatterns_TimeConstraints.php,

for maximum time interval, you mentioned: Note : It is recommended to set max_whole_interval to a very large value because if it is used with closed sequential pattern mining, the algorithm become approximate and may not return all closed itemsets respecting the time constraint (the reason is that this constraint is not compatible with the "backScan pruning" of the BIDE+ algorithm).
So how large this large should be set to make it exact not approximate algorithm? Does it depends on data size?

2. For the input data, should timestamp always start with timestamp 0? Or can I pass in a sequence like: <4> 48 <7> 18 23 31 45 48 -2? And the algorithm will convert first timestamp to 0 internally?

3. Now we know sequence ids for output patterns. But since the timestamps in output patterns are relative, it seems pretty hard to pinpoint the exact timestamps of certain pattern in the original sequence for each seq_id?

> So how large this large should be set to make it
> exact not approximate algorithm? Does it depends
> on data size?

The problem with the maximal time interval constraint is that if you apply this constraint when doing closed sequential pattern mining, you may miss some patterns. If you don't care about missing a few patterns maybe, you can still use that constraint.

Now, how to avoid this? The solution is to not use that constraint. If you set the maximal time interval to a value that is larger than the time length of all your sequences (for example 1 million), it will be as if this constraint is not used. In that case, there will be no problem.

If you do closed sequential pattern mining and set the maximal time interval to a small value, then you may miss patterns. And yes, it depends on your data.

Otherwise, if you want to use that constraint and not miss any patterns, you can do sequential pattern mining but not find the closed patterns. Then, you will not miss patterns and be able to use that constraint without missing patterns.

>
> 2. For the input data, should timestamp always
> start with timestamp 0? Or can I pass in a
> sequence like: <4> 48 <7> 18 23 31 45 48 -2? And
> the algorithm will convert first timestamp to 0
> internally?

It does not matter because the algorithm calculates the relative time between the itemsets. So you could start from 0, 1 or any other numbers, and I think that it should work.

>
> 3. Now we know sequence ids for output patterns.
> But since the timestamps in output patterns are
> relative, it seems pretty hard to pinpoint the
> exact timestamps of certain pattern in the
> original sequence for each seq_id?
>

Yes, you could take the file containing the patterns and sequence IDs, and then do a post-processing step where you would compare the patterns with the sequences to get the real time stamps. I think if you write a small program to do that, it should not be too difficult.

Cool!
Since I only want to use min_time_interval and max_time_interval but not min_whole_interval and max_whole_interval, so I can still get the whole list of closed sequences if I set max_whole_interval to a larger value than my whole time range.

Thanks for your reply!

BTW, there is a tiny typo in the document for this algorithm (http://www.philippe-fournier-viger.com/spmf/ClosedSequentialPatterns_TimeConstraints.php)
For instance, the last pattern S7 indicates that the items 1 and 2 were followed by item 1 one time unit after. This pattern has a support of 75 % because it appears in S1, S2 and S3. It is important to note that the timestamps in the sequential patterns found are relative. For example, the pattern S16 is considered to appear in S1, S2 and S3 because {1} appears one time unit after {1, 2} in all of these sequences, even though the timestamps do not need to be the same in all of these sequences.
Here it should be S6 instead of S16

Well.. I don't know if my understanding of max_time_interval is wrong but I try to run java -jar spmf.jar run Fournier08-Closed+time spmf_input_ex.txt output1.txt 60% 0 5 0 1000000 true in CML, where spmf_input_ex.txt is:

So is there a closed mining algorithm implementation in your library that can add time constraints but will treat the time interval between adjacent itemsets the same as long as they pass the time constraint? (e.g <0> 1 2 <1> 3 4 and <0> 1 2 <2> 3 4 are the same and they both looks like 1 2 -1 3 4 -1)

Since I want to add time gap between two itemsets. But whether the gap is 1 or 2 doesn't matter too much. And if we treat them as different patterns, it dilutes supports of both patterns.

I previously used a c-spade algorithm implemented in R (https://www.rdocumentation.org/packages/arulesSequences/versions/0.2-19/topics/cspade), which can add time constraints and treat both <0> 1 2 <1> 3 4
and <0> 1 2 <2> 3 4 as 1 2 -1 3 4 -1. Well.. But it can't do for closed pattern mining so I am seeking for other resources.

I see. There is no such implementation in SPMF that does exactly that. It could be done, I think, but it would require some programming to modify the algorithm and it can be more or less complicated. If one modifies it, then it would need to check to make sure that the algorithm remains correct, and sometimes combining two ideas results in an algorithm that cannot find all the patterns. This is the case for example when combining closed sequential pattern mining with BIDE and the max-whole-interval constraint.