What are the steps to implement a data mining algorithm?

In this post, I will discuss the steps that I follow to implement a data mining algorithm.

The subject of this post comes from a question that I received by e-mail recently, and I think that it is interesting to share the answer. The question was:

When we do the programming, I’ve observed a good programmer always makes a flow chart, data flow diagram, context diagram, etc. to make the programming error free. (…) In your implementation of XXXX algorithm, have you performed any procedures as discussed above? (…) Can you explain to me what are the basic procedures that you have applied to implement XXXX algorithm?

When I implement a data mining algorithm, I don’t do any flowchart or UML design because I consider a single data mining algorithm to be a small project. If I were programming a larger software program, I would think about object-oriented design and UML. But a single algorithm does not need an object-oriented design. Actually, what matters most for a data mining algorithm is that it produces the correct result, is fast, and preferably that the code is well-documented and clean.

Step 1: understanding the algorithm. The first step that I perform to implement a data mining algorithm is to read the research paper describing it and make sure that I understand it well. Usually, I need to read the paper a few times to understand it. If there is an example in the paper, I will try to understand it, and later I will use this example to test my implementation to see if it is correct. After reading the paper a few times, I may still not understand some details about the algorithm. But this is OK. There are often some tricky details that you may only understand when doing the implementation, because the authors do not always give all the details in the paper, sometimes due to lack of space. In particular, it is common that authors do not describe all the optimizations and data structures that they have used in a research paper.

Step 2: implementing a first draft of the algorithm step by step. Then, I start to implement the algorithm. To do so, I print the pseudocode and try to implement it. For an algorithm that takes a file as input, I will first work on the code for reading the input file. I will test this code extensively to make sure that I read the input file correctly. Then, I will add the code to perform the first operations of the algorithm. I will check if the intermediate result is correct by comparing it with the example in the paper. If it is not correct, I will debug, and maybe read the paper again to make sure that I did not make a mistake because I misunderstood something. Then I will continue until the algorithm is fully implemented.
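For instance, testing the file-reading code on a tiny hardcoded example before touching the rest of the algorithm could look like this (a minimal Java sketch; the space-separated-integer line format is an assumption, inspired by the format of many SPMF transaction datasets):

```java
import java.util.ArrayList;
import java.util.List;

public class InputReader {
    // Parse one line of a transaction file where items are
    // space-separated integers (one transaction per line).
    static List<Integer> parseLine(String line) {
        List<Integer> transaction = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) transaction.add(Integer.parseInt(token));
        }
        return transaction;
    }

    public static void main(String[] args) {
        // Simulated file content: a toy example like the one in a paper,
        // used to check the reading code before implementing the algorithm.
        String[] lines = {"1 3 4", "2 3 5", "1 2 3 5"};
        List<List<Integer>> database = new ArrayList<>();
        for (String line : lines) database.add(parseLine(line));
        System.out.println(database);
    }
}
```

Once the parsing is verified against the example, the same database object can be fed to the first operations of the algorithm.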

Step 3: testing with other input files. When my implementation becomes correct, I will try a few more examples, to make sure that it is not only correct for a single example but that it can also provide the correct result for other input files.

Step 4: cleaning the code. After that, I will clean my code because the first draft is likely not pretty.

Step 5: optimizing the code. Then I will try to optimize the algorithm in terms of (1) using better data structures and (2) simplifying the code by removing unnecessary operations, etc. For example, for my implementation of PrefixSpan in my SPMF data mining software, I first made a very simple implementation without an optimization called pseudo-projection that is described in the paper. It was very slow. After my implementation was correct, I took a few days to optimize it. For example, I added the pseudo-projection, I added code for another optimization, which is to remove infrequent items after the first scan of the input file, I removed some unnecessary code that I had left, I reorganized some code, I added some comments, etc.
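The infrequent-item removal mentioned above can be sketched as follows (a simplified Java illustration, not the actual SPMF code; it assumes each transaction is a list of integer item identifiers):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class InfrequentItemFilter {
    // After the first database scan, count the support of each item
    // (the number of transactions containing it), then remove from every
    // transaction the items whose support is below minsup. This shrinks
    // the database before the expensive part of the algorithm runs.
    static List<List<Integer>> removeInfrequent(List<List<Integer>> db, int minsup) {
        Map<Integer, Integer> support = new HashMap<>();
        for (List<Integer> t : db)
            for (int item : new HashSet<>(t))  // count each item once per transaction
                support.merge(item, 1, Integer::sum);
        List<List<Integer>> filtered = new ArrayList<>();
        for (List<Integer> t : db) {
            List<Integer> kept = new ArrayList<>();
            for (int item : t)
                if (support.get(item) >= minsup) kept.add(item);
            filtered.add(kept);
        }
        return filtered;
    }
}
```

The idea works because an infrequent item can never appear in a frequent pattern, so dropping it early cannot change the final result.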

Step 6: Comparison of the performance with other implementations of the same algorithm / peer review. After my code is optimized, as an optional sixth step, I may compare the performance of my implementation with other implementations of the same algorithm, if some are available on the Internet in the same programming language. If my implementation is slower, I may look at the source code of the other implementation to see if there are some ideas that I have not thought of that could be used to further optimize my code. I may also ask some of my friends or colleagues to review my code. Another good way is to not show the code to your colleagues but just to explain the main idea to them to get their feedback. Discussing with other people is a good way to learn.

It takes time… Note that becoming good at programming data mining algorithms takes time. For example, let me tell you my story. The first time that I implemented data mining algorithms was in December 2007. I implemented the Apriori algorithm for a course project at university. My implementation was terrible and slow… But it generated the correct result. I then implemented PrefixSpan in 2008. At that time, my code was better because I was gaining experience with implementing this kind of algorithm. Then in 2010, I read my Apriori and PrefixSpan code again, still found some problems, and optimized them again. What I want to say here is that it is normal that the first implementation(s) of data mining algorithms that a person makes may not be very good. But after implementing a few algorithms, it becomes much easier. Now, in 2013, I have implemented more than 45 algorithms in my open-source SPMF Java data mining software!

This is what I wanted to write for today. I hope that you enjoyed this post. If you want to read more on this topic, you may be interested in my post on How to be a good data mining programmer. Lastly, if you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!

57 Responses to What are the steps to implement a data mining algorithm?

Hello Sir,
It is very useful information indeed.
Right now I am pursuing a master’s in Comp. Sci.
My topic for dissertation is Finding Outlier Patterns in Time-Series Sequences,
so my input data will be time series.
I want to know which tools I should use, and
whether the SPMF you mentioned here can be used for a time series database or not.
It would be great if you could guide me with this implementation.
Waiting for your reply.

Hello,
Time series usually refers to sequences of numeric data.
SPMF is for symbolic data (sequences of symbols).
It would be possible to apply SPMF on time series data but you would need to convert from numeric data to symbolic data, and this may not be the best choice.

I don’t know any software for time series data, but some may exist.

(see the download section to get the source code and instructions on how to install and use it)

Where to store the input data? It depends. If it is just for homework, then you could just use a text file. It would be easier to deal with text files than with a database. But you could certainly use a table from an SQL database instead of text files if you want. In my implementation of ID3, I just use text files because it is simpler, and one can always export a table from a database to a text file.

Where to get information about ID3? You could have a look at classic books about artificial intelligence. Most AI books have some chapters explaining ID3 since it is a very popular and simple algorithm. For example, you could look at the book “Machine Learning” by Tom Mitchell. There are also many PowerPoint slides on the internet explaining ID3.

Hello sir,
Thank you for the information, it motivated me a lot.
I want to implement a data mining project for the graduation.
Can you suggest some good real-time data mining apps which I can choose as a project…???

I have implemented a data mining algorithm in Java. I want to compare the results with other algorithms, based on some attributes. Is there any tool available for plotting the results based on attributes on a graph?

I want to do data mining on my database to predict the behavior of some fields. My data is of float type, and the unique key which I’m using is of datetime type. I have a huge database.
My whole scenario is something like this:

I have four different databases, namely INDUS2_BDS, INDUS_VACUUM, MSIS and RF. All these databases have tables which have logtime as the unique key. From the INDUS2_BDS database I consider a table from which I retrieve the beam_current and its corresponding logtime.
Now, from the remaining three databases, different column values or parameters corresponding to the above logtime of INDUS2_BDS will be considered. I will then apply mining to these retrieved values and beam_current to check what the values of all parameters were against beam_current, what pattern they follow, and how the values of different parameters change when beam_current changes.

Moreover, I want to predict the beam_current value by looking at all the other parameters on which beam_current depends. I also want to display this dependency through a mining technique.
Can you suggest whether it is good to use SPMF for this, and which algorithm I should use for mining?

First, you can apply some basic statistics to get to know your data better. For example, you could calculate the correlation between an attribute and your target attribute to see if there is some correlation. You could also use visualization techniques such as scatter plots with two attributes as the X and Y axes to see if there are relationships between some pairs of attributes.
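As an illustration of the first suggestion, the correlation between two numeric attributes can be computed in a few lines of Java (a generic Pearson correlation sketch, not tied to any particular dataset):

```java
public class Correlation {
    // Pearson correlation coefficient between two attributes:
    // a quick first check before applying heavier techniques.
    // Returns a value in [-1, 1]; values near 0 mean no linear relationship.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }
}
```

For example, the correlation between one of the parameters and beam_current, computed over matching logtime values, would indicate whether a linear model is worth trying.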

Besides, you could apply some more or less complicated data mining techniques. If you think that the relationship between your attributes is linear, you could try something like multiple linear regression with multiple attributes.

Other techniques that could be applied are classifiers such as neural networks, SVMs, etc. A neural net can take numeric data as input, predict numeric data as output, and learn non-linear relationships between attributes.

I think that you may need to try various techniques; depending on how your data really is, different things may work better or worse.

Another idea is to preprocess your data. Data may need to be normalized to achieve better results. Maybe it also makes sense to discretize the data? I am just trying to give you some general ideas.

Sir, actually I’m a beginner in the data mining topic and I want to do my thesis on it. I’ve read 2-3 papers on security-sensitive data mining and I wish to know the subfields of this topic and the tools that I should use. It would be great if you could tell me. Waiting for the reply.

I am doing a master’s in Computer Science and have been trying to implement the flowchart below for 2 years, but nobody is able to help me with it, not even my project guide.

It would be a great help if you could give me an implementation or code for the same.

A New Multiobjective Evolutionary Algorithm for Mining a Reduced Set of Interesting Positive and Negative Quantitative Association Rules

Flowchart of the algorithm: According to the above description, the proposed algorithm for mining PNQARs can be summarized in the following flowchart.
Input: 1) N: population size;
2) nTrials: number of evaluations;
3) m: number of objectives;
4) Pmut: probability of mutation;
5) λ1, …, λN: a set of N weight vectors;
6) T: the number of weight vectors in the neighborhood of each weight vector;
7) δ: the probability that parent solutions are selected from the neighborhood;
8) ηr: the maximal number of solutions replaced by each child solution;
9) γ: factor of amplitude for each attribute of the dataset;
10) α: difference threshold.
Output: EP
Step 1: Initialize.
a) Compute the Euclidean distances between any two weight vectors and then work out the T closest weight vectors to each weight vector. For each i = 1, …, N, set B(i) = {i1, …, iT}, where λi1, …, λiT are the T closest weight vectors to λi.
b) Generate the initial population with N chromosomes.
c) Evaluate the initial population.
d) Initialize z = (z1, …, zm) by setting zj = max1≤i≤N fj(xi), j = 1, …, m.
e) Initialize the EP.
Step 2: Update. For each i = 1, …, N, do the following.
a) Uniformly randomly generate a number rand from [0, 1]. Then set P = B(i) if rand < δ, and P = {1, …, N} otherwise.
b) Set r1 = i and randomly select r2 from P. The solutions xr1 and xr2 are crossed, generating two offspring y1 and y2. Next, the mutation and repairing operators are applied to the two offspring.
c) Evaluate the new individuals. For each yk, k ∈ {1, 2}:
i) Update of z: for each j = 1, …, m, if zj < fj(yk), then set zj = fj(yk).
ii) Update solutions: set c = 0 and then do the following.
A) If c == ηr or P is empty, go to Step 3. Otherwise, randomly pick an index l from P.
B) If g(yk|λl, z) ≤ g(xl|λl, z), then set xl = yk and c = c + 1.
C) Remove l from P and go to A).
Step 3: Update of EP: remove from EP all the vectors dominated by yi, i = 1, …, N, then add yi to EP if no vectors in EP dominate it.
Step 4: Remove redundancy from the EP.
Step 5: If the difference between the current population and the previous population is less than α%, restart the population.
Step 6: If the maximum number of evaluations is not reached, go to Step 2.
Step 7: Remove redundancy from the EP.
Step 8: The EP is returned.

Hi
I am also doing my master’s thesis, using Artificial Bee Colony for association rule optimization. I need help on how to initialize the association rules in the artificial bee colony algorithm. If you can, please guide me.

My name appears on some papers related to this topic, but I was not the main author, so I cannot give many details about how it works. I think you can read the papers to find more information, or contact the corresponding authors of the papers.

Hello sir,
I am doing my final year project alone, with no team, so I badly need your help.
My topic is Predicting Student Performance by Using a Data Mining Method for Classification.
I am using the KNN algorithm for the implementation.
I know how it works from the examples I have studied,
and I know how to calculate the distance with the Euclidean distance formula.
But I need a detailed description of how to implement it in Java and which platform to run the program on. Please reply as soon as possible, or send me your e-mail address and I will mail you the details, so that I can stay in touch until my project finishes, and afterwards too.

Hello, I don’t have time to implement or give details about this algorithm. I am very busy, and since it is a kind of homework/project, it is better that you do it by yourself. There are many places where the KNN algorithm is described (e.g. Wikipedia), so you can read those. Besides, if you know how to use Java, it should not be too hard to write a simple version of KNN. It could probably be written in less than 100 lines of code.

KNN is a very popular algorithm. You can certainly even find some source code in Java by searching a little bit on Google if you need inspiration about how to implement it. Besides, it is described in many books.
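To give an idea of the size of such a program, here is a sketch of the core of a KNN classifier in Java (a generic illustration under the assumption that instances are numeric attribute vectors with integer class labels; it is not tuned for large datasets):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class KNN {
    // Predict the label of a query point as the majority label
    // among its k nearest training points (Euclidean distance).
    static int predict(double[][] train, int[] labels, double[] query, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training indices by distance to the query point.
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> dist(train[i], query)));
        // Count the votes of the k nearest neighbors.
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(labels[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```

Sorting all points is the simplest correct approach; for large training sets a partial selection of the k smallest distances would be faster, but the logic stays the same.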

For the data, you need to find some dataset to test your algorithm. You may search for some data, and then test it to see what works best. Normally, you will not know beforehand which attributes will work best; you need to try. Also, which attributes are available depends on your dataset, so it is impossible to know which attributes you should take before you find your dataset. Besides, which attributes you should use depends on your goal. For example, if you want to predict which students will fail, or if you want to predict which students play video games, then you will certainly not use the same attributes to make the predictions.

I’m pretty certain that some papers have already been published on this topic. You could check these papers and see which attributes they have used. The topic of predicting pass or fail has certainly been studied before. I remember reading a paper about that, but it was a long time ago. You can find some by searching on Google Scholar.

Besides that, as I said, you need to try. You cannot know beforehand which method or which attributes will work. You can use common-sense knowledge to determine whether some attributes are relevant or not. But ultimately you will still need to try with real data to see which attributes really work.

Dear Prof. Philippe Fournier-Viger,
I am a PhD student and I would like to get your advice on the following suggested topics:
1- Internet of Things data mining (RFID)
2- social network data mining
These are general ideas. I don’t know which one is the best, and what data mining technologies can be used in this case, since the data is massive, dynamic, real-time data (big data). Can I use the same data mining algorithms, or should I adapt them?
Best regards

Those are good research areas, as they are popular now. Since these research areas are quite broad, I think that almost any data mining algorithm could be applied in either of them, depending on what you will do specifically. But for sure, some methods are more popular. For example, in social networks, if you like to work with graphs, then you can work on algorithms for data mining on graphs, although in IoT you can also work on graphs. I suggest reading papers on these topics to see what you really like and what people are doing. You will get a better idea of the field.

In social network mining, if you want to consider the time dimension, yes. But not all researchers consider the time dimension in social network mining. I think that most public datasets do not consider time. You may find some datasets for example here: http://snap.stanford.edu/data/

You could apply some basic algorithm like regression, or do something more complicated like clustering similar students into groups and then trying to predict each group separately. I think it depends on what you have. The more data you have, the more interesting the results will probably be. Also, you may need to try various approaches to see which one works better. But I’m certain that many people have worked on this problem before. You could search for papers on Google Scholar and read them to see how they have done it.

Hello sir, I need your help with the coding part; please support me with this, at least by providing a link.
It is regarding a CG project; the topic chosen is an aerial tram. So please do the needful.

I need a basic idea of how to use the datasets, and what the steps are to pre-process, clean and post-process them. If explained with an example, it would give us the basic understanding to work on any topic. There is not much information available anywhere about how to work with raw datasets and the steps to follow.

Hi, many different techniques can be used for pre-processing and cleaning the data. One of the reasons why people pre-process a dataset is that real-life data is often incomplete (some values are missing), noisy, may contain duplicate data, may contain some errors, and there might also be some inconsistencies in the data (for example, sometimes only the family name of a person is indicated in some database record, while sometimes the full name is used). In other words, a real-world database may contain many data-related problems. If you apply data mining techniques on bad data, then there is a chance that the knowledge extracted by a data mining algorithm will also be bad. So that is one of the reasons for pre-processing and cleaning the data.

The other reason is that in a database, not all the data is relevant for a data mining task. For example, if you have the customer records of a bank, they may contain a lot of information about people, but not all this information is relevant for every data mining task. Thus, people often pre-process the data to remove the irrelevant information before doing data mining, or to transform the data in a way that is appropriate for the data mining algorithms that will be applied. For example, to cluster bank customers, one may decide to ignore data such as the phone number and the address, since they may not be relevant for calculating the similarity between customers. Moreover, one may decide to transform the data so that it is suitable for an algorithm. For example, if you decide to apply ID3 for classification on some data, the ID3 algorithm cannot deal with numeric data. Thus, you have to do some pre-processing to transform the data from numeric to nominal values to apply ID3.

Now, how to do it? It depends on (1) what kind of problems you have in your database (inconsistencies, missing data, errors, etc.), (2) the assumptions of the data mining algorithms that you want to apply, and (3) your goal for mining the data.
You may check the book by Han & Kamber about data mining to see the particular techniques. Chapters 1 and 2, if I remember correctly, talk about data cleaning, data transformation and pre-processing, and cover various techniques. I will just give you a brief example. For the problem of missing values, there are at least six approaches to deal with it, each having advantages and disadvantages:
(1) You could delete the records having missing values
(2) You could ignore the attributes having missing values
(3) You could replace the missing values by a constant such as “unknown”
(4) You could manually edit each record to add the missing values
(5) You could use a measure such as average, or median to fill the missing values
(6) You could try to predict the missing values using regression or machine learning techniques such as neural networks
(7) …

So actually, there are many different ways to do pre-processing and cleaning. This is why I will not explain all of them here, but I recommend that you read the first chapter(s) of the book by Han & Kamber.
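As a small illustration of approach (5), mean imputation for one numeric attribute could look like this in Java (a sketch that assumes missing values are encoded as NaN, which is just one possible convention):

```java
public class MeanImputation {
    // Replace missing values (encoded here as NaN) in one numeric
    // attribute by the mean of the observed values of that attribute.
    static double[] imputeMean(double[] values) {
        double sum = 0;
        int count = 0;
        for (double v : values)
            if (!Double.isNaN(v)) { sum += v; count++; }
        double mean = sum / count;
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++)
            result[i] = Double.isNaN(values[i]) ? mean : values[i];
        return result;
    }
}
```

Using the median instead of the mean is a one-line change and is more robust when the attribute has outliers.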

Hello sir, I’m a final year BE comp sci student, doing my project on real-time heart rate monitoring, an alarm system and an app.
I want to use the ID3 algorithm for my project. The main idea is that the real-time heart rate which we get has to be taken as input to the ID3 algorithm. It then has to go through some conditions which tell the certainty and uncertainty of the heart health condition (meaning good or bad), and this information is to be taken from the database where the dataset on which the output depends is stored (I need help here; can you send me some source code if you have something similar?). And after the algorithm is finally ready with the result of a person’s health condition, this output is to be sent to the (Android) app (sir, again, how to do this? if you have any source code, please forward it to me).
Sir, if you have any good research papers on the ID3 algorithm or any source code, please help me with it.
Thank you

Since ID3 does not take numeric values as input, to apply ID3, you would need to discretize the heart rate. For example, you could use some bins such as: heart rate between 0 to 50, heart rate between 50 and 70, etc.
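Such a discretization step could be as simple as this (a Java sketch; the bin boundaries are purely illustrative, not medical guidance):

```java
public class HeartRateDiscretizer {
    // Map a numeric heart rate (beats per minute) to a nominal bin
    // so that it can be fed to ID3, which requires nominal attributes.
    // The boundaries below are illustrative placeholders.
    static String toBin(double bpm) {
        if (bpm < 50) return "low";
        else if (bpm < 70) return "normal-low";
        else if (bpm < 100) return "normal";
        else return "high";
    }
}
```

The nominal bin names then become the attribute values that ID3 branches on when building the decision tree.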

I don’t have any code related to Android or other things. Actually, all my source code is in the SPMF library.

Hello Sir,
It is very useful information indeed.
Right now I am pursuing a master’s in Comp. Sci.
My topic for dissertation is Predicting the User’s Future Location in a Mobile Environment.
As my input data, I am planning to use the chess dataset (http://www.philippe-fournier-viger.com/spmf/datasets/ndatasets/chess_negative.txt).
I want to know whether it is appropriate for my project. If you could suggest any other dataset, then please provide me the link; also, can you please tell me which algorithm will be good for my project?
It would be great if you could guide me with this implementation.
Waiting for your reply.

Hello, predicting future locations is an interesting topic. But Chess is not an appropriate dataset because it is about moves in the chess game rather than sequences of locations. It would be best to find some datasets of GPS locations from mobile phones, or even to collect your own from a group of students, for example. A dataset like chess can be OK to evaluate the speed of your algorithm, but it will not demonstrate that you can make good predictions for mobile locations, since it is data about the chess game. So I recommend finding some real datasets. If you cannot find any, you can always contact the authors of some papers that have such datasets and ask them if they can share them with you.

For the algorithms, various approaches can be used, such as sequential pattern mining algorithms, sequential rule mining algorithms, Markov models, and models like CPT+. You can check my SPMF library. It contains several sequence prediction models, and you can get the source code.
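As a baseline illustration of the Markov model idea (much simpler than CPT+ or the other models in SPMF), a first-order next-symbol predictor could look like this in Java:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MarkovPredictor {
    // First-order Markov model: count transitions between consecutive
    // symbols (e.g. locations) in training sequences, then predict the
    // most frequent successor of the current symbol. This is only a
    // baseline sketch, not the CPT+ model.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    void train(List<String> sequence) {
        for (int i = 0; i + 1 < sequence.size(); i++)
            counts.computeIfAbsent(sequence.get(i), k -> new HashMap<>())
                  .merge(sequence.get(i + 1), 1, Integer::sum);
    }

    // Returns null when the current symbol was never seen in training.
    String predict(String current) {
        Map<String, Integer> successors = counts.get(current);
        if (successors == null) return null;
        return Collections.max(successors.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }
}
```

Comparing such a simple baseline against a more elaborate model is also a good way to show that the elaborate model is actually worth it.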

I did not use real data about mobile locations in my research, so I am not sure where you can find this. But there is a lot of data available on different websites. Sometimes researchers put their datasets on their websites. Or, if you e-mail the authors of a paper, you may ask them to share the data that they used in their studies. You need to search a little bit. It may not be easy to find data, but it is an important step to show that your algorithm or work is useful.

Hello sir,
I recently studied your paper where you discussed CPT+. I have the following questions related to it:
1] In your previous paper (CPT) you used a prefix length concept to traverse the prediction tree, whereas in this paper (CPT+) you mention a suffix length; which should be considered, then?
2] How do we calculate TB and MBR? Can you please help me understand the concept behind the formulas used for them?
3] On what basis was the threshold (1.002) decided?
I need your expertise on the above.

Hello sir, I designed an algorithm that performs an analysis based on the budget of an individual coming to my country, using a shortest path algorithm, but I don’t know where to deploy it so that it accesses the database and displays the result on a web page.

Hello, you could use a free database system such as MySQL and create a database. Then, your program could connect to the database, run some queries on it, and generate some webpages as HTML files by writing text files (an HTML page is just a text file). If your program is in C++, Java or a similar language, then you need to understand how to use that language to connect to the database, do the SQL queries, and write a text file. Otherwise, maybe it would be simpler to write your program directly in PHP if your goal is to design a dynamic website rather than a program. But PHP is slower than C++ and Java, so if your algorithm is very slow, maybe it is better to write it in C++, Java or something else. And if you want to make a dynamic website, you could use something like PHP (as PHP is quite easy and specifically designed for building websites). It is up to you.
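As a small illustration of the HTML-generation part, here is a Java sketch (the rows and the output file name are hardcoded placeholders to keep it self-contained; in a real program the rows would come from a JDBC query against the database):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;

public class HtmlReport {
    // Build a small HTML page from query results. In a real program the
    // rows would come from a JDBC ResultSet; here they are hardcoded
    // so the sketch stays self-contained.
    static String toHtml(List<String[]> rows) {
        StringBuilder sb = new StringBuilder("<html><body><table>\n");
        for (String[] row : rows) {
            sb.append("<tr>");
            for (String cell : row) sb.append("<td>").append(cell).append("</td>");
            sb.append("</tr>\n");
        }
        return sb.append("</table></body></html>").toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder rows standing in for the result of a database query.
        List<String[]> rows = Arrays.asList(
                new String[]{"Paris", "450"},
                new String[]{"Rome", "380"});
        try (PrintWriter out = new PrintWriter("result.html")) {
            out.print(toHtml(rows));
        }
    }
}
```

A web server can then serve the generated file, or the same string could be returned directly by a servlet or a PHP script.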

I truly appreciate that you reply to people. I’m curious how to apply big data/algorithms/combing the internet for the benefit of public citizens. I don’t know how to code at all, but I do know that I’d like to help my city. How can we, for instance, find all the STEM programs in the city short of only using Google? How can we find better matches and a seamless real-time experience from job training programs to jobs to reduce unemployment, say, the way that an air traffic controller at an airport can see inbound and outbound flights on a screen? There have got to be ways to do civic innovation; I think the things I’m describing are certainly easier than flying to the moon.

And if I don’t have time to learn how to code, who can I talk to? What are the next steps? Where do I start?

Those are some interesting ideas, and they could certainly be done. But yes, it would require some programmer(s) to design a database for storing the information and some website or application for the users. Besides, for the project about jobs, it would also require convincing companies to advertise their jobs on that system, or partnering with existing job websites to access their data. Such a system could generate some profit. It would be like starting a small business, so in my opinion this is not a small project. If you have funding and time, you could always hire some programmers and start a business with some of these ideas. Or if you can get government funding, maybe that could support developing something.

Hi,
For a CSV file, if you have some data mining tool that handles CSV files, then there is no problem. Otherwise, you need to use software or write some code to convert this format to the format needed by your data mining tool. Or write your own data mining tool 😉 For noisy data, it depends on your data and what kind of noise it contains. If there is no noise, then you don’t need to do anything. So it depends on what you think is noise.