An Introduction to Sequential Pattern Mining

In this blog post, I will give an introduction to sequential pattern mining, an important data mining task with a wide range of applications from text analysis to market basket analysis. This blog post is aimed to be a short introductino. If you want to read a more detailed introduction to sequential pattern mining, you can read a survey paper that I recently wrote on this topic.

What is sequential pattern mining?

Data mining consists of extracting information from data stored in databases to understand the data and/or take decisions. Some of the most fundamental data mining tasks are clustering, classification, outlier analysis, and pattern mining. Pattern mining consists of discovering interesting, useful, and unexpected patterns in databases Various types of patterns can be discovered in databases such as frequent itemsets, associations, subgraphs, sequential rules, and periodic patterns.

The task of sequential pattern mining is a data mining task specialized for analyzing sequential data, to discover sequential patterns. More precisely, it consists of discovering interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of various criteria such as its occurrence frequency, length, and profit. Sequential pattern mining has numerous real-life applications due to the fact that data is naturally encoded as sequences of symbols in many fields such as bioinformatics, e-learning, market basket analysis, texts, and webpage click-stream analysis.

I will now explain the task of sequential pattern mining with an example. Consider the following sequence database, representing the purchases made by customers in a retail store.

This database contains four sequences. Each sequence represents the items purchased by a customer at different times. A sequence is an ordered list of itemsets (sets of items bought together). For example, in this database, the first sequence (SID 1) indicates that a customer bought some items a and b together, then purchased an item c, then purchased items f and g together, then purchased an item g, and then finally purchased an item e.

Traditionally, sequential pattern mining is being used to find subsequences that appear often in a sequence database, i.e. that are common to several sequences. Those subsequences are called the frequentsequential patterns. For example, in the context of our example, sequential pattern mining can be used to find the sequences of items frequently bought by customers. This can be useful to understand the behavior of customers to take marketing decisions.

To do sequential pattern mining, a user must provide a sequence database and specify a parameter called the minimum support threshold. This parameter indicates a minimum number of sequences in which a pattern must appear to be considered frequent, and be shown to the user. For example, if a user sets the minimum support threshold to 2 sequences, the task of sequential pattern mining consists of finding all subsequences appearing in at least 2 sequences of the input database. In the example database, 29 subsequences met this requirement. These sequential patterns are shown in the table below, where the number of sequences containing each pattern (called the support) is indicated in the right column of the table.

For example, the patterns <{a}> and <{a}, {g}> are frequent and have a support of 3 and 2 sequences, respectively. In other words, these patterns appears in 3 and 2 sequences of the input database, respectively. The pattern <{a}> appears in the sequences 1, 2 and 3, while the pattern <{a}, {g}> appears in sequences 1 and 3. These patterns are interesting as they represent some behavior common to several customers. Of course, this is a toy example. Sequential pattern mining can actually be applied on database containing hundreds of thousands of sequences.

Another example of application of sequential pattern mining is text analysis. In this context, a set of sentences from a text can be viewed as sequence database, and the goal of sequential pattern mining is then to find subsequences of words frequently used in the text. If such sequences are contiguous, they are called “ngrams” in this context. If you want to know more about this application, you can read this blog post, where sequential patterns are discovered in a Sherlock Holmes novel.

Can sequential pattern mining be applied to time series?

Besides sequences, sequential pattern mining can also be applied to time series (e.g. stock data), when discretization is performed as a pre-processing step. For example, the figure below shows a time series (an ordered list of numbers) on the left. On the right, a sequence (a sequence of symbols) is shown representing the same data, after applying a transformation. Various transformations can be done to transform a time series to a sequence such as the popular SAX transformation. After performing the transformation, any sequential pattern mining algorithm can be applied.

It provides implementations of several algorithms for sequential pattern mining, as well as several variations of the problem such as discovering maximal sequential patterns, closed sequential patterns and sequential rules. Sequential rules are especially useful for the purpose of performing predictions, as they also include the concept of confidence.

What are the current best algorithms for sequential pattern mining?

There exists several sequential pattern mining algorithms. Some of the classic algorithms for this problem are PrefixSpan, Spade, SPAM, and GSP. However, in the recent decade, several novel and more efficient algorithms have been proposed such as CM-SPADE and CM-SPAM (2014), FCloSM and FGenSM (2017), to name a few. Besides, numerous algorithms have been proposed for extensions of the problem of sequential pattern mining such as finding the sequential patterns that generate the most profit (high utility sequential pattern mining).

Conclusion

In this blog post, I have given a brief overview of sequential pattern mining, a very useful set of techniques for analyzing sequential data. If you want to know more about this topic, you may read the following recent survey paper that I wrote, which gives an easy-to-read overview of this topic, including the algorithms forf sequential pattern mining, extensions, research challenges and opportunities.

Periodic patterns are discovered in a single sequence. We want to find a pattern that periodically appears. For example, it must appear approximately every week.

Sequential patterns are discovered in a sequence database (multiple sequences). We want to find a pattern that is common to several sequences. If a pattern appears multiple times in the SAME sequence, it will still be counted as just once for sequential pattern mining algorithms.

I am working on an interesting problem now. I’d like to pattern mine stack traces like the following and teach a parser how to process them.
“Error: Something unexpected has occurred.
at main (c:\Users\Me\Documents\MyApp\app.js:9:15)
at Object. (c:\Users\Me\Documents\MyApp\app.js:17:1)
at Module._compile (module.js:460:26)”
The interesting part, that according to your definition the “\nat * (*:*:*)” is both a sequential and a periodic pattern, because it is present in many different sequences and in a single sequence multiple times. Is there a way to distinguish it from the “^*Error: *\n**$” pattern, which is present only once per sequence beside that it is more frequent? Are these periodic pattern somehow special? Any thoughts, suggestions, recommended algorithms?

Hello,
Yes, you can see a stack trace as a sequence perhaps. Then the type of error could be seen as a “class” that annotates the sequence. For example, if you use my SPMF software, you can discover some class sequential rules using the TopSeqClassRules. That means a type of sequential pattern that is specific to some class such as Error:Somethime unexpected happened” and that has the form of a rule such as X –> Y. I think you could check various algorithms and see what makes the most sense for your problems. Best,

Internally, all the algorithms use a number based representations of items/events because it is much more efficient when searching for patterns than using strings, in terms of space and memory. For example, if you have two strings “abcdefghij” and “abcdefghik”, then you can represent them as strings or as integers. If internally, we represent as strings, then we need memory to store about 10 characters, while an integer is only 32 bits or 64 bits. Besides, comparing two integers is much faster than comparing two strings. If you compare two integers, you just use maybe one instruction of the CPU (>= <>) while comparing two strings, requires to compare each character one by one. This is the reason why these algorithms uses numbers as internal representation.

Having said that, this is also the reason why the default input format for most algorithms in SPMF is based on integers. But you can also use String. A simple way is to convert your input file from strings to integers, apply the algorithm, and then convert the integers to strings in the result. I have actually, implemented something like that in SPMF. If you are using the user interface of SPMF, you can use string as input for some algorithms like CM-SPAM. Here is a description of how to do that from the documentation of SPMF for that algorithm:

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database “contextPrefixSpan.txt”. Here we have modified the file to give names to the items:

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called “apple”. The third line indicates that the item 2 is called “orange”. The 9th line indicates that the symbol “-1” must be replaced by “|”. Then the following lines define four sequences in the SPMF format.

Then, if we apply a sequential pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns having this format:

apple | orange | #SUP: 4

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided.

If you are not using the graphical interface of SPMF, but want to use strings, you can check example, this example of the documentation:

In my case it does not help much, but I could use ASCII codes to convert a trace to a seq of numbers. Most of the traces contain ASCII characters only I think, so for testing purposes this will be just fine. As far as I can tell the rules of your input format is the following: 1 sequence in 1 line and the items are separated by space. I guess there is no way to use itemsets instead of items, but in my case itemsets are not needed.

Yes, the format of the input file for the algorithms is explained in the documentation on the webpage for each algorithm. For example, to mine sequential patterns using a normal algorithms like CM-SPAM or sequential rules with RuleGrowth, the format is:

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a positive integer and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value “-1” indicates the end of an itemset. The value “-2” indicates the end of a sequence (it appears at the end of each line). For example, the input file “contextPrefixSpan.txt” contains the following four lines (four sequences).

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.

Note that it is also possible to use a text file containing a text (several sentences) if the text file has the “.text” extension, as an alternative to the default input format. If the algorithm is applied on a text file from the graphical interface or command line interface, the text file will be automatically converted to the SPMF format, by dividing the text into sentences separated by “.”, “?” and “!”, where each word is considered as an item. Note that when a text file is used as input of a data mining algorithm, the performance will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

As far as I understand these pattern mining algorithms find subsequences or substrings. I’d like to convert my patterns to regex patterns, so parsing would be a lot easier. For example if I have the following sequences: “abbc”, “abxc”,”aybc”,”abzc”, then the “abc” will be the common subsequence with 4/4 support. But if I want to parse the sequences with “abc”, then I will get the following results: “ab(b)c”,”a(b)bc”,”ab(x)c”,”a(y)bc”,”ab(z)c”. As you can see the content of the wildcard can be either before or after the “b” item. So if I want to translate this to regex I would have 2 patterns: /ab(.*)c/”, “a(.*)bc” or we can merge them into one: /a(.*)b(.*)c/. If I expect a single capturing group always on the same place, then I have to choose between the first two regex patterns. Is there any algorithm which interprets patterns like this (so checks where the gap position is by matching), or do I have to invent some kind of converter between the subsequence patterns and the regex patterns? I could write a parser based on subsequences too, but I assume the C based regex of the actual js engine is a lot faster than writing something in pure javascript.

I started reading your survey paper ( A Survey of Sequential Pattern Mining). I had a query. You stated that an item set X is a subset of items I.
Thus an item, say A, can appear at most ‘k’ number of times in an itemset if it appears ‘k’ number of items in set I.
However, while explaining BFS approach you took I = {a,b,c} and while enumerating 2-sequences you generated as well.
I guess there is an error in this enumeration.

Also while explaining definition of support of subsequence there is a mistake in its formulated definition.
Given is:
sup(sa) = | { s | s contained in sa ^ s belongs to SDB} |
I guess it should be:
sup(sa) = | { s | sa contained in s ^ s belongs to SDB} |

For the other question, I think that something is missing in your question. Do you want to ask why the same item can appear multiple times in the same pattern? If this is your question, then yes, an item is allowed to appear multiple in a same sequence. Let me explain this more clearly. In sequential pattern mining, an item can only appear at most one time in each itemset. But a sequence can contain multiple itemsets. Thus, it is possible that the same item appear multiple times in the same sequence by appearing in several itemsets (because a sequence is a list of itemsets). For example, the sequence (a)(a) is correct because it contains two times “a” but not in the same itemset. But the sequence (a a) would be incorrect according to the definition because it contains many times the same item in the same itemset. So, to explain again, an itemset X in a sequence must be a subset of I. Thus, an item cannot appear more than once in an itemset. However, a sequence contains many itemsets. Thus, the same item is allowed to appear multiple time in the same sequence, but it must be in different itemsets.

I have a question about the example in this article. Is correct that the pattern has a support of 2? Because for what I understood this pattern appears in sequences 1, 3 and 4. And the same for the pattern alone. Are there some errors in the example or I am wrong in the comprehension of Sequential Pattern?

Hello, I would like to answer your question but I am not sure about which patterns you are talking? I think that when you submitted your question, WordPress might have removed the patterns from your comment because of the > symbols.
If you can tell me which pattern, I will answer your question.

Hi there is nothing wrong with that. Here is the point, you should look for if the dataset (the first sequence) involves the subsequence respectively. To be more clear, b is a subsequence of (a,b) in the sequence so it satisfies the rule then look for the (f,g) I believe your trouble is with b so I will not talk about the other items.

I’m curios about the difference between ‘high utility sequential pattern’ and ‘high utility itemset’. Because I’m studying on this research area in order to adopt this concept to find sequential high yield pattern from manufacturing processes. Thus, I’m trying to search several articles including ones written by yours.
While searching, I’m confused about the difference of the above two terms.
In my opinion, the main difference is ‘sequential’. Since the result of some algorithms related to ‘high utility itemset’ do not consider the order of transactions, right?
If I were right, I will able to narrow down my searching scope not ‘high utility itemset’ but ‘high utility sequential pattern’.

Hi,
Yes, that is correct. In high utility itemset mining, the time or sequential order is not considered.
In high utility itemset mining, the input is a transaction database ( a set of transactions performed by many customers).
The goal is then to find the group of items that yield a high profit when purchased together. But there is no order.
For example, you could find that {apple,bread} is a high utility itemset. It means that when bought together, apple and bread makes a lot of money. But it does not tell anything about the order.
In high utility sequential pattern mining, the order is considered.
Here the input is a sequence database. It means a set of sequences of transactions.
For example, you may have a database for 100 customers, and for each customer, you have a sequence of transactions that is sequentiall ordered.
Then you can find some patterns such as < (apple),(bread)> which means that buying apple followed by buying bread yield a high profit.

But a problem of high utility sequential pattern mining is that there is no concept of confidence.
If you have the pattern < (apple),(bread)> , you still don’t know likely it is that if someone buy Apple, he will then buy bread.
To solve this problem, high utility sequential rule mining was proposed.
It is also applied on a sequence database, and will find rules such as:
Apple –> Bread confidence : 70 % which means that this rules yield a lot of money and when someone buy Apple, 70 % of the time he will buy bread after.

Besides that, if you have a single sequence instead of many sequences, you can look at episode mining. Episode mining is similar to sequential pattern mining except that patterns are mine from a single sequence instead of many sequences.

Thanks for your reply.
This comment is quite helpful to understand those concepts.
I’ve already read an article considering “USpan Algorithm”.
I think high utility sequential pattern mining is appropriate to my research problem.
Can you recommend some article about high utility sequential pattern mining except for “USpan Algorithm”?
I’m trying to find a kind of ‘Top K’ mining algorithm for high utility sequential pattern mining, but that’s not easy.

Hello,
In general, in computer science, an algorithm is said to be robust if it can tolerate some invalid or erroneous input. How would it apply to sequential pattern mining? Just like any computer program, you want a sequential pattern mining algorithm to work well with different kind of data, and to show some errors if the input data or parameter value(s) are not what the algorithm should expect. So this is software engineering 101. When you design a program, you should test it with different kind of data and parameters to make sure that it works in all situations, and make sure that it shows error message to the user when an error message should be shown.

In simple words, in frequent itemset mining, we want to find some patterns (sets of items) that appear in many transactions. But we don’t care about the time or order between these items. For example, we could find that many customers buy bread with apple. But there is no order between bread and apple. We just know that many people buy it together

In sequential pattern mining, we also consider the order or time. Instead of finding patterns in transactions, we find patterns in some sequences. Each sequence is an ordered list of transactions. For example, we can have some sequence such that today i buy bread, tomorrow i buy milk, and after tomorrow i buy bread and milk. Thus, we have many sequences like that and we want to find the subsequences that appear in many sequences. For example, we could find the sequential pattern that many people buy milk and then after that buy bread, and then later buy coffee. Here there is some order.

In the case where all sequences only contain one transaction, then sequential pattern mining is the same as frequent itemset mining. Thus sequential pattern mining is more general.

I tried to give you the main idea in a simple way. If you want some more formal definition you could look at my survey papers on these topics which are easy to understand, and read the first few pages:

Hi Philippe,
Thank you very much for this useful introduction. I have the following questions:
1. I know a sequence is an ordered list of itemsets. Do items in an itemset have order? If not, the itemset should be a sequential pattern since it appears in SID1 and SID2.
2. I think the support of should be 3 since it appears in SID1, SID3, and SID4 and the support of should be 3 since it appears in SID1, SID3, and SID4.

I am now reading your survey paper on frequent sequential pattern mining to gain more knowledge on this field.
Happy New Year.
Dang Nguyen

Since some errors happened, I re-write my questions:
1. I know a sequence is an ordered list of itemsets. Do items in an itemset have order? If not, the itemset ({b, a}) should be a sequential pattern since it appears in SID1 and SID2.
2. I think the support of ({b}, {f, g}) should be 3 since it appears in SID1, SID3, and SID4 and the support of ({f, g}) should be 3 since it appears in SID1, SID3, and SID4.

1. Yes, ({a,b}) has a support of 2 in the above table. As you said, there is no order in the itemsets. Thus ({a,b}) is the same as ({b,a)}.
2. Yes, that is an error in the example! I have updated the example so that ({b}, {f, g}) and ( {f, g}) have a support of 3. I will update the survey paper, which contains the same error probably tonight. Thanks for reporting this issue.

Is it possible to use contiguous pattern mining for mining breast cancer data. I’m trying to mine the particular gene BRCA1 or BRCA2 to find the interesting patterns in it. Research direction is correct or not? Kindly give your valuable suggestios

It depends on what is your goal and what is the result. You can try it. If you can find some interesting patterns that are useful to understand cancer, then it will be interesting. Ideally you could find some patterns that you would validate with a domain expert (e.g. doctor) to show that these patterns are interesting.

Otherwise, you can work on a more technical level and work on improving the algorithms to add more features that are needed for this type of data.

And so on.
I am searching for any possible pattern in this data.
More precisely I want to know “When Julie sends a message?”.
And if there is any relation between Julie’s and others’ behavior.

There are also the assumptions below:
1. I do not know if there is any relation or not.
2. Always there might be new people joining the chat room.
3. There are several “usual guests” who are always taking part in conversations and are very talkative.
4. We do not know anything about these people (and their behavior). They are just a “name” for us.

Now my question is: Is there any possibility to use SPMF to extract patterns from this time series, which could potentially be used to predict Julie’s behavior according to the others’ behavior?

Yes, this is a time-series but if you ignore the timestamps, you can also see it as a sequence and apply sequential pattern mining algorithm. The key is that if you can convert your data to the same format as some existing problem, then you can use the existing algorithms.

In your case, you could view each day of the chat room as a sequence: a sequence for 2018-01-01, a sequence for 2018-01-02, a sequence for 2018-01-03. Then each sequence would be like (mike_join, mike_post_message, julie_join, julie_post, mike_quit, julie_quit…) Thus, you would have a sequence database with a sequence for each day, and then you could apply sequential pattern mining or sequential rule mining to find patterns common to severa days… Or maybe that a day is too long. Actually, if you use days, then the sequences will be very long and the algorithm may not work well, so perhaps that dividing by hours may work better.

But I am not sure that pattern mining is the best technique for this problem.

But if you convert your data to sequences, you could also try to the sequence prediction algorithm that can be trained with sequence to predice what will happen next for a new sequence.

This is just some idea. Actually, your problem may be a bit more complex, especially if you also want to consider timestamps!

I will give the “sequential pattern mining” a try and see if it provides me useful results. However, I still need to know learn what to do with the noises. For example when the internet connection of some users once in a while disconnects and users logout/login randomly.

For the learning purpose, I still would like to know if there is any possibility to detect temporal patterns in this type of time-series.
For example, can we somehow come up with the below (imaginary) model:

Glad my reply is useful. In my opinion, if you consider the time, it becomes much more complicated because you then need to decide whether (James AND [+60 seconds] Mike) is the same as (James AND [+61 seconds] Mike) or (James AND [+65 seconds] Mike). If you remove the time, then they become the same pattern. But if you consider time… then either all of these patterns are considered as different (but it is probably not what you want…) or you need to find a way to define what is similar such as 60 with more or less 5 minutes is considered as similar… Most paper that I have read in sequential pattern mining cannot do that. But it does not mean that it cannot be done. You could always design a new algorithm and you add some news of handling the time with more flexibility and write a paper about it.

Hi Philippe,
I have been reading your survey paper that you attached in the blog post. In the page 60 of the paper, I’ve read something like ” For example, the item-positions
of the pattern hfagi in the sequence s2 of Table 1 is f(s2; 1); (s2; 4)g, indicating that the pattern hfagi
appears in the first and fourth itemsts of sequence s2. “. And follow the logic of the previous sentence in the paper, I would think the statement would be ” For example, the item-positions
of the pattern hfagi in the sequence s2 of Table 1 is f(s1; 1); (s1; 4)g, indicating that the pattern hfagi
appears in the first and fourth itemsts of sequence s2. “. Hope you can consider them and show me if I’m wrong or not. Thanks.
Best,

Hi Nam,
Thanks for reading the blog and the survey, and for your feedback! I have checked it and there is indeed an error in the definition.
It should be:The item-positions of sa in sb, denoted as ip(sa, sb), is the set of pairs of the form (sb, ik),.
Then, after that the example is correct. Thanks for letting me know. I have fixed the PDF on the website.
Best regards,
Philippe