
Today, I will discuss how to choose conferences for publishing papers. It is important to choose well because these choices can have a huge impact on a researcher’s career.

There are several things to consider when choosing a conference.

First, does the conference have a good reputation? Obviously, it is better to publish papers in conferences that have a very good reputation. Publishing in good conferences gives more visibility to a researcher’s work, and thus makes it more likely that other researchers will cite their papers. It also looks better on a CV and gives a better chance of getting hired or obtaining grants.

However, the best conferences are sometimes very selective. For example, the top conferences in some fields like data mining can have acceptance rates below 10%, which means that fewer than 1 paper in 10 is accepted. Therefore, it sometimes makes sense to submit papers to conferences with a lesser reputation. However, one should avoid at all costs the conferences that have a bad reputation. For instance, I know that when some universities hire professors, having published in some conferences that I will not name negatively affects a candidate’s chance of getting hired.

Another thing to consider is how difficult it is to get a paper accepted at a given conference. For top-level conferences, one needs to have very good research results and also to write the paper very well. So it is important to ask this question: does my paper have a good chance of getting accepted? To answer this question, one may read papers that were published in the conference proceedings in previous years. This gives an idea of how hard it is to get accepted. One thing that every researcher should know is that the “acceptance rate” that a conference sometimes advertises does not always reflect very well the difficulty of having a paper accepted. For example, some top conferences could have a 10% acceptance rate, while some others may have a 20% acceptance rate. But this does not mean that it is twice as hard to get accepted at the former. In fact, the conference with a 10% acceptance rate could be much harder than that if it has a good reputation, because it accepts the best 10% of a pool of strong submissions rather than 20% of a pool of average papers.

Another important aspect to consider when choosing a conference is its location. The location is important because it does not cost the same amount of money to travel to every country. Moreover, the registration fee of some conferences is cheaper than that of others. Besides, a researcher should also think about which conference will provide the best visibility and the best opportunity to meet researchers who could be interested in their research, in order to build collaborations.

The deadline of a conference and the review time are also important. I personally recommend writing down the submission deadlines, notification dates and conference dates of several conferences, and then using this information to make a plan. Where should I submit first? If the paper is not accepted at conference A, where could I submit it after that?

One should also consider the format. This is very important because the format of papers and the maximum number of pages can vary widely from one conference to another. Moreover, one should check carefully whether the pages are single-column or double-column. This can also make a huge difference in the overall length of the paper.

One should also check who publishes the conference proceedings. Are the proceedings published by a serious publisher, or are they printed by the conference organizers at a local store? I recommend publishing papers only in conferences whose proceedings are published by serious publishers and/or indexed in the publication databases of your field. This is important because if someone publishes papers in conferences that are not indexed, it is possible that ten years from now nobody will know that these papers ever existed.

Another aspect to consider is the topic of the conference. Let’s say that someone is working on developing data mining algorithms that are applied to educational data. This research could be published in several different conferences depending on their topics. For example, it could be published in an educational data mining conference. It could be submitted to a data mining conference (educational data mining is a subfield of data mining). Alternatively, it could be published in an artificial intelligence conference (data mining is a subfield of artificial intelligence). Or it could be published in a very general computer science conference (artificial intelligence is a subfield of computer science). My advice is not to choose a conference that is too general.

That is my advice for choosing a conference. I hope that it helps you! If you have some additional thoughts, please share them in the comment section. By the way, if you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!

I have seen many people asking for help in data mining forums and on other websites about how to choose a good thesis topic in data mining. Therefore, in this post, I will address this question.

The first thing to consider is whether you want to design/improve data mining techniques, apply data mining techniques, or do both. Personally, I think that designing or improving data mining techniques is more challenging than using existing techniques. Moreover, you can make a more fundamental contribution if you work on improving data mining techniques instead of applying them. However, you need to be aware that improving data mining techniques may require stronger algorithmic and/or mathematical skills.

The second thing to consider is what kind of techniques you want to apply or design/improve. Data mining is a broad field consisting of many techniques such as neural networks, association rule mining algorithms, clustering and outlier detection. You should try to get an overview of the different techniques to see which ones interest you most. To get a rough overview of the field, you could read an introductory book on data mining such as the book by Tan, Steinbach & Kumar (Introduction to Data Mining), or read websites and articles related to data mining. If your goal is just to apply data mining techniques to achieve some other purpose (e.g. analysing cancer data) but you don’t know which techniques yet, you could skip this question.

The third thing to consider is which problems you want to solve or what you want to improve. This requires more thought. A good way is to look at recent issues of good data mining conferences (KDD, ICDM, PKDD, PAKDD, ADMA, DAWAK, etc.) and journals (TKDE, TKDD, KAIS, etc.), or to attend conferences, if possible, and talk with other researchers. This helps you see what the currently popular topics are and what kind of problems researchers are trying to solve. It does not mean that you need to work on the most popular topic. Working on a popular topic (e.g. social network mining) has several advantages: it is easier to get grants or, in some cases, to get your papers accepted in special issues, workshops, etc. However, there are also some “older” topics that are interesting even if they are not the current flavor of the day. Actually, the most important thing is that you find a topic that you like and will enjoy working on for perhaps a few years of your life. Finding a good problem to work on can require reading several articles to understand the limitations of current techniques and decide what can be improved. So don’t worry. It is normal that it takes time to find a more specific topic.

Fourth, one should not forget that helping to choose a thesis topic is also the job of the professor who supervises the Master’s or Ph.D. student. Therefore, if you are looking for a thesis topic, it is good to talk with your supervisor and ask for suggestions. Your supervisor should help you. If you don’t have a supervisor yet, then try to get a rough idea of what you like, and try to meet professors who could become your supervisor. Some of them will perhaps have research projects and ideas that they could give you if you work with them. Choosing a supervisor is a very important and strategic decision that every graduate student has to make. For more information about choosing a supervisor, you can read this post: How to choose a research advisor for M.Sc. / Ph.D?

Lastly, I would like to discuss the common question “please give me a Ph.D. topic in data mining”, which I read on websites and sometimes receive by e-mail. There are two problems with this question. The first problem is that it is too general. As mentioned, data mining is a very broad field. For example, I could suggest some very specific topics such as detecting outliers in imbalanced stock market data, or optimizing the memory efficiency of subgraph mining algorithms for community detection in social networks. But will you like them? It is best to choose by yourself something that you like. The second problem with the above question is that choosing a topic is work that a researcher should do or learn to do. In fact, in research, being able to find a good research problem is as important as being able to find a good solution. Therefore, I highly recommend trying to find a research topic by yourself, as developing this skill is important for becoming a successful researcher. If you are a student, you can ask your research advisor to guide you when searching for a topic.

In this post, I will discuss what it takes to be a good data mining programmer and how to become one.

Data mining is a broad field that can be approached from several angles. Some people with a mathematical background will take a statistical approach to data mining and use statistical tools to study data. Others will use ready-made commercial or open-source data mining software to analyse their data. In this post, we will discuss the computer science view of data mining. It is aimed at programmers who would like to become good at implementing and designing data mining algorithms.

There are some great benefits to being not just a user, but a data mining programmer. First, you can implement algorithms that are not offered in existing data mining tools. This is important because several data mining tools are restricted to a small set of algorithms. For example, if you consider a data mining task such as clustering, hundreds of algorithms have been proposed to handle many different scenarios. However, general-purpose data mining tools often offer just a few of them. Second, you can download open-source algorithms and adapt them to your needs. Third, you could eventually design your own data mining algorithms and implement them efficiently.

Now that we have talked about the advantages, let’s talk about how to become a good data mining programmer. We can break this down into two aspects: being good at programming and knowledgeable about computer science in general, and being good at programming data mining algorithms.

To be good at programming, you should have a good knowledge of at least one programming language that you will use. Choosing a programming language is important because performance generally matters in data mining. So you may go for a language like C++ that compiles to machine code, or a language like Java or C# that is reasonably fast and can be more convenient to use. You should avoid web languages such as PHP and JavaScript, which are less efficient, unless you have some good reasons to use them.

After that, you should try to get a good knowledge of the data structures that are offered in your programming language. A good programmer should know when to use each data structure. This is important because you will eventually optimize your algorithms. In data mining, optimizations can make the difference between an algorithm that runs for hours and one that runs for just a few minutes, or between using gigabytes and using megabytes of memory! So you should get to know the main data structures that are offered, such as array lists, linked lists, binary trees, hash tables, hash sets, bitsets and priority queues (heaps). But more importantly, you should know that there are many data structures that are not offered with your programming language. You should know how to look for other data structures in books or on websites.
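To illustrate why the choice of data structure matters, here is a minimal sketch in Python (used here only for brevity; the same idea applies to an array list versus a hash set in Java or C++). The function names are hypothetical and chosen for this example. Both functions count how many query values appear in a collection, but one scans a list for every query while the other uses a hash set.

```python
import time

def count_hits_list(items, queries):
    """Count the queries found in items by scanning a list for each query.

    Each lookup is linear in the size of the list.
    """
    item_list = list(items)
    return sum(1 for q in queries if q in item_list)

def count_hits_set(items, queries):
    """Same count, but with a hash set.

    Each lookup takes constant time on average.
    """
    item_set = set(items)
    return sum(1 for q in queries if q in item_set)

if __name__ == "__main__":
    # Hypothetical toy data; real datasets are usually much larger.
    items = range(2000)
    queries = range(0, 4000, 2)

    start = time.perf_counter()
    hits_list = count_hits_list(items, queries)
    list_seconds = time.perf_counter() - start

    start = time.perf_counter()
    hits_set = count_hits_set(items, queries)
    set_seconds = time.perf_counter() - start

    # Both versions must agree on the result; only the running time differs.
    print("hits:", hits_list, hits_set)
    print("list: %.4f s, set: %.4f s" % (list_seconds, set_seconds))
```

The exact timings depend on your machine, so none are quoted here. The gap between the two versions should widen as the collection grows, which is why this kind of choice matters so much on large datasets.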

Besides, you should try to get better at algorithmics (designing efficient algorithms) and at computer science in general. There are many different ways to do that, such as taking courses on this topic or reading some books. But most importantly, you need to put the theory into practice and do some programming, which leads me to the key part of this post.

To become good at programming data mining algorithms, you need to write data mining algorithms. To get started, you should read some data mining books such as the book by Tan, Steinbach & Kumar, or the book by Han & Kamber. I recommend starting by implementing some simple algorithms without optimizations. For example, K-means and Apriori are relatively easy to implement. After you have debugged your implementation and checked that it generates the correct result, you should spend time thinking about how to optimize it. First, think about optimizations by yourself. Then look at how other people did it by reading websites and articles, or by looking at other people’s code. Most likely, many optimizations have already been proposed. After that, you could implement the optimizations, and then move on to more complex algorithms. Finally, remember that Rome was not built in a day. Give yourself some time to learn!
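As an example of what a first, unoptimized implementation could look like, here is a minimal sketch of Apriori in Python. It is only an illustration written for this post, not a reference implementation: the function name and the toy transactions are hypothetical, and I assume that the minimum support is given as an absolute number of transactions. Every candidate is counted with a full scan of the database, which is exactly the kind of step that you would later try to optimize.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its support count.

    Unoptimized sketch: each candidate is counted by scanning all transactions.
    min_support is assumed to be an absolute count, not a percentage.
    """
    transactions = [frozenset(t) for t in transactions]
    # Level 1: every single item is a candidate itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate with a full database scan.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all of its subsets
        # of size k-1 were found to be frequent at the previous level.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

if __name__ == "__main__":
    # Hypothetical toy database of five transactions.
    database = [
        ["a", "b", "c"],
        ["a", "b"],
        ["a", "c"],
        ["b", "c"],
        ["a", "b", "c"],
    ]
    for itemset, support in sorted(
        apriori(database, 3).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
    ):
        print(sorted(itemset), support)
```

On this toy database with a minimum support of 3, the single items and the pairs are frequent, while {a, b, c} appears in only two transactions and is discarded. Once a version like this gives the correct result, you can start thinking about optimizations, for example how to avoid scanning the whole database for every candidate.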

I have obviously not mentioned everything. In particular, being good at mathematics is also important. If you have some additional thoughts, you can share them in the comment section.

Today, I will discuss why it is important that researchers share their source code and data.

As some of you know, I’m working on the design of data mining algorithms. More specifically, I’m working on algorithms for discovering patterns in databases. It is a problem that dates back to the 1990s. Hundreds of papers have been published on this topic. However, when searching on the Web, I found that very little source code, or even binary files, is available. On some specialized topics like uncertain itemset mining, for example, about 20 algorithms have been published, but only about two papers provide the source code and datasets.

This is a serious problem for research.

First, some of these algorithms are hard to implement. For people who are not familiar with the subject or who are average programmers, implementing the algorithms again is a huge waste of time, and this could deter them from using the algorithms. As some people say: why reinvent the wheel?

Second, the algorithm descriptions provided in research papers are often incomplete. Some researchers leave out the details of their optimizations due to the lack of space. Others intentionally do not provide enough details in their paper, so that other people cannot implement their algorithm properly and beat its performance.

Third, let’s say that someone develops a new algorithm and wants to compare its performance with an already published algorithm. If this person cannot find the source code or binary files of the published algorithm, they have to implement it themselves. However, this version will be different from the original and, depending on how it is implemented, the comparison could be unfair.

Now, let’s talk about the advantages of sharing your source code and data.

First, as a researcher, if you publish your source code, it is much more likely that someone will use your algorithm or application. People who use your algorithm/application will cite you, which benefits you.

Second, other researchers can save time if they don’t have to implement the same algorithms again. They can use this time to do more research, which benefits the whole research community.

Third, if you are the author of an algorithm, other people can compare their work against your own implementation of it. By sharing your source code, you make it much more likely that the comparison will be fair.

Fourth, other people are more likely to integrate your algorithm/software into other software or to modify it to develop new algorithms/software. Again, this will benefit you because these people will cite you. And the more people cite you, the more people will read your papers and cite you in turn.

Update in 2018: To conclude, I will talk about the benefits that I have received from sharing my work as open-source software over the last few years. I’m the author of the SPMF data mining software. This software offers more than 100 algorithms, most of them implemented by me, including a dozen that are my own algorithms. In about 8 years, the website has received more than 500,000 visitors and the software has been cited in more than 500 research papers and journal articles. Some people have applied the algorithms in biology, website clickstream analysis and even chemistry. This has also greatly contributed to increasing the citations of my research papers.

I hope that this blog post will convince you that it is important to share the source code and the data of your work with other researchers.

Also, please leave a comment below if you have some additional thoughts or a story about this.

This is the first post. This blog is going to be updated weekly (or more often, if I have time). It is going to talk about data mining news and other topics related to data mining, or just research and algorithms in general. I will write text, discuss code and just share some thoughts. Hope you will enjoy it!

By the way, I’m a computer science professor at a university in Canada. I have done research on various topics including data mining, intelligent tutoring systems, cognitive modeling, etc. I’m the author of an open-source data mining software that you can download here: http://www.philippe-fournier-viger/spmf/