CRISP-DM: The methodology to put some order into Data Science projects

08Aug 2016

Data Science and Data Analytics techniques, which today attract such strong interest, emerged in the 1990s, when the term KDD (Knowledge Discovery in Databases) started to be used to refer to the broad concept of finding knowledge in data. In an attempt to establish standards in the area, much as software engineering has done for software development, two methodologies appeared in the late nineties: CRISP-DM (Cross Industry Standard Process for Data Mining) and SEMMA (Sample, Explore, Modify, Model, and Assess). Both define a set of sequential steps to guide the process, assigning specific tasks and defining the results expected from each stage.

Azevedo and Santos (2008) compare both methodologies and draw a parallel between them. CRISP-DM is very complete and is applied from a business perspective, which is why it was widely adopted: KDnuggets surveys in 2002, 2004, 2007 and 2014 found that CRISP-DM was the main standard used, four times more often than SEMMA. We use the CRISP-DM methodology in our projects. In this post, we describe the CRISP-DM methodology, with its objectives, phases and tasks, as defined by the consortium of companies that proposed it (Chapman et al., 2000).

Introduction

CRISP-DM (Cross Industry Standard Process for Data Mining) provides an overview of the life cycle of a data mining project, much as software engineering does for the life cycle of software development. The CRISP-DM process covers the phases of a project, their respective tasks, and the relationships between them. At this level of description it is not possible to identify all relationships; relationships may exist between any data mining tasks depending on the goals, the context and the user's interest in the data.

The CRISP-DM methodology treats the data analysis process as a professional project, creating a much wider context that has a bearing on the creation of the model. This context assumes the existence of a customer who is not part of the development team, and also that the project does not end once the right model has been found, since it then requires adequate deployment and maintenance. Moreover, the project is linked to other projects and needs to be precisely documented so that other development teams can use and build on the knowledge gained.

The life cycle of the data mining project comprises six stages shown in the figure below.

The sequence of the phases is not rigid: moving back and forth between different phases is always required. The outcome of each phase determines which phase, or which specific task of a phase, needs to be performed next. The arrows indicate the most important and frequent dependencies between phases.

The outer circle in the figure symbolizes the cyclical nature of data mining itself. A data mining project does not end once a solution is deployed: the lessons learned during the process and from the deployed solution can trigger new questions, and subsequent data mining processes will benefit from previous experience.

Hereafter, we briefly outline each phase.

Phase I. Business Understanding (defining customer’s needs)

This initial phase focuses on understanding the project objectives. Then, this knowledge is converted into a data mining problem definition and into a preliminary plan designed to achieve the objectives.

Phase II. Data Understanding (data study)

The data understanding phase starts with an initial data collection and proceeds with activities aimed at getting familiar with the data, identifying data quality problems, discovering first insights into the data, or detecting interesting subsets to form hypotheses about hidden information.
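To make this concrete, a first data-quality check in this phase might simply count missing values per attribute. The sketch below uses plain Python and hypothetical customer records; the field names are illustrative only, not from any real project:

```python
from collections import Counter

# Hypothetical raw records from the initial data collection.
records = [
    {"age": 34, "income": 52000, "churned": False},
    {"age": None, "income": 48000, "churned": True},
    {"age": 29, "income": None, "churned": False},
    {"age": 41, "income": 61000, "churned": True},
]

def profile(records):
    """Count missing values per attribute -- a first data-quality check."""
    missing = Counter()
    for rec in records:
        for key, value in rec.items():
            if value is None:
                missing[key] += 1
    return dict(missing)

print(profile(records))  # prints {'age': 1, 'income': 1}
```

In a real project this kind of profiling is the raw material for the data quality report that this phase is expected to produce.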

Phase III. Data Preparation (data analysis and feature selection)

The data preparation phase covers all activities needed to construct the final dataset (the data that will be fed into the modeling tools) from the initial raw data. Data preparation tasks include table, record, and attribute selection, as well as data transformation and cleaning for the modeling tools.
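A minimal sketch of two such tasks in plain Python: record selection (dropping incomplete rows) and data transformation (scaling a numeric attribute for the modeling tools). The records and field names are hypothetical:

```python
# Hypothetical raw records; names and values are illustrative only.
raw = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},  # incomplete record -> dropped
    {"age": 29, "income": 31000},
]

def prepare(raw):
    """Select complete records and derive a scaled feature."""
    # Record selection: keep only records with no missing values.
    complete = [r for r in raw if all(v is not None for v in r.values())]
    max_income = max(r["income"] for r in complete)
    # Transformation: scale income to [0, 1] before modeling.
    return [
        {"age": r["age"], "income_scaled": r["income"] / max_income}
        for r in complete
    ]

dataset = prepare(raw)
print(dataset)
```

The output of this step is precisely the "final dataset" the phase description refers to: the structure the modeling tools will consume.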

Phase IV. Modeling

In this phase, the modeling techniques relevant to the problem are selected and applied, and their parameters are calibrated to optimal values. Typically, several techniques exist for the same data mining problem type, and some of them have specific requirements on the form of the data, so going back to the data preparation phase is often necessary.
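As an illustration of one technique and its parameter calibration, the sketch below implements a tiny k-nearest-neighbours classifier in plain Python; k is the parameter one would calibrate, and the training points are made up for the example:

```python
import math

def knn_predict(train, query, k=1):
    """Minimal k-nearest-neighbours classifier; k is the parameter to calibrate."""
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda point: math.dist(point["x"], query))
    # Majority vote among the k nearest neighbours.
    votes = [point["label"] for point in by_distance[:k]]
    return max(set(votes), key=votes.count)

# Made-up two-dimensional training data.
train = [
    {"x": (0.0, 0.0), "label": "no"},
    {"x": (0.1, 0.2), "label": "no"},
    {"x": (1.0, 1.0), "label": "yes"},
    {"x": (0.9, 1.1), "label": "yes"},
]
print(knn_predict(train, (0.95, 1.0), k=3))  # prints "yes"
```

Here the calibration task would be choosing k (and the distance function) that performs best on held-out data; other techniques for the same problem type would be compared in the same way.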

Phase V. Evaluation (outcome collection)

At this stage in the project, you have already built one or several models that appear to have high quality from a data analysis perspective.

Before proceeding to the final deployment of the model, it is important to evaluate it thoroughly and review the steps executed to construct it, comparing the model obtained against the business objectives. A key goal is to determine whether any important business issue has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
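The evaluation from a data analysis perspective typically starts from quantitative measures that are then confronted with the business objectives. A minimal sketch, with made-up labels, computing confusion counts and accuracy:

```python
def evaluate(y_true, y_pred):
    """Confusion counts and accuracy -- raw material for the business review."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    accuracy = (tp + tn) / len(y_true)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn, "accuracy": accuracy}

# Hypothetical ground truth and model predictions.
y_true = [True, False, True, True, False]
y_pred = [True, False, False, True, False]
print(evaluate(y_true, y_pred))
```

Which of these numbers matters most (for instance, whether a false negative is costlier than a false positive) is exactly the kind of question that must be answered against the business objectives, not the data alone.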

Phase VI. Deployment (put into production)

Normally, the creation of the model is not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable, and possibly automated, data mining process across the enterprise.

Conclusions

The following figure provides a visual guide to the phases, listing the tasks of each phase as well as the connections and interactions between them that may occur during the process. Figure from “A visual guide to CRISP-DM methodology”.

The consortium that proposed the CRISP-DM methodology was dissolved a few years ago. Nevertheless, CRISP-DM remains the de facto methodology for data mining projects that aim to guarantee the quality of their outcomes. Recently, a new initiative called “crisp-dm.eu” has emerged, but without great impact so far.

In 2015, IBM Corporation, a traditional promoter of CRISP-DM, proposed a new methodology named Analytics Solutions Unified Method for Data Mining/Predictive Analytics (ASUM-DM), broader than CRISP-DM and part of the general ASUM methodology (Analytics Solutions Unified Method) incorporated into IBM products and analytic solutions. Whether it succeeds in the Data Science community remains to be seen.

Our professional team can effectively address Data Analytics projects in complex scenarios with the maximum guarantee of success, applying the CRISP-DM methodology in a professional and consistent way, while also staying practical by combining it with agile methodologies. Of course, if you would like more information about this area, please do not hesitate to contact us. We will be glad to help you!