Abstract

We consider the problem of finding association rules that make nearly optimal binary segmentations of huge categorical databases. The optimality of a segmentation is defined by an objective function chosen to suit the user's objective, and is usually expressed in terms of the distribution of a given target attribute. Our goal is to find association rules that split a database into two subsets so as to optimize the value of the objective function.

The problem is intractable for general objective functions because, letting N be the number of records in a given database, there are 2^N possible binary segmentations, and we may have to examine all of them exhaustively. However, when the objective function is convex, there are feasible algorithms for finding nearly optimal binary segmentations, and we prove that typical criteria, such as "entropy (mutual information)," "χ² (correlation)," and "gini index (mean squared error)," are actually convex.
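To make the convexity claim concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a binary target attribute and describes a segmentation by the pair (x, y), where x is the number of records in one subset and y is the number of those records with a positive target. The function names, this parameterization, and the sample values of n and m are illustrative assumptions. The sketch evaluates the entropy criterion as a gain and spot-checks midpoint convexity numerically, which is an illustration rather than a proof.

```python
import math
import random

def binary_entropy(p):
    """Entropy (in bits) of a binary target whose positive fraction is p."""
    if p <= 0.0 or p >= 1.0:   # also absorbs tiny floating-point overshoot
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def entropy_gain(x, y, n, m):
    """Entropy reduction from splitting n records (m positive) into a
    subset with x records (y positive) and its complement."""
    gain = binary_entropy(m / n)
    if x > 0:
        gain -= (x / n) * binary_entropy(y / x)
    if n - x > 0:
        gain -= ((n - x) / n) * binary_entropy((m - y) / (n - x))
    return gain

def random_split(n, m, rng):
    """Draw a feasible (x, y): 0 <= y <= m and 0 <= x - y <= n - m."""
    x = rng.uniform(0.0, n)
    y = rng.uniform(max(0.0, x - (n - m)), min(x, m))
    return x, y

def midpoint_convexity_holds(n, m, trials=10_000, seed=0, tol=1e-9):
    """Spot-check f((a+b)/2) <= (f(a)+f(b))/2 on random feasible pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        x1, y1 = random_split(n, m, rng)
        x2, y2 = random_split(n, m, rng)
        mid = entropy_gain((x1 + x2) / 2, (y1 + y2) / 2, n, m)
        avg = (entropy_gain(x1, y1, n, m) + entropy_gain(x2, y2, n, m)) / 2
        if mid > avg + tol:
            return False
    return True

if __name__ == "__main__":
    # Hypothetical database: 10 records, 4 with a positive target.
    print(entropy_gain(5, 2, 10, 4))   # subset mirrors the overall distribution
    print(entropy_gain(4, 4, 10, 4))   # subset isolates every positive record
    print(midpoint_convexity_holds(10, 4))
```

In this sketch a subset that mirrors the overall target distribution yields zero gain, while a subset that isolates all positive records yields the largest gain, so the best segmentations lie at the extremes of the feasible (x, y) region, as one would expect when maximizing a convex function.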

We propose practical algorithms that use computational geometry techniques to handle cases where the target attribute is not binary, in which conventional approaches cannot be used directly.