The Heterogeneous P-Median Problem for Categorization Based Clustering

Abstract

The p-median problem offers an alternative to centroid-based clustering algorithms for identifying unobserved categories. However, existing p-median formulations typically require aggregating the data into a single proximity matrix, which masks respondent heterogeneity. A proposed three-way formulation of the p-median problem explicitly accounts for heterogeneity by identifying groups of individual respondents who perceive similar category structures. Three heuristics for the heterogeneous p-median (HPM) problem are developed and then illustrated in a consumer psychology context using a sample of undergraduate students who performed a sorting task of major U.S. retailers, as well as through a Monte Carlo analysis.

Acknowledgements

We wish to thank the Editor, the Associate Editor, and three anonymous reviewers for their constructive comments, which have helped improve the contribution and quality of this manuscript.

Appendix: Automatic Search Procedure for Penalty Term δ

Optimizing Z while leaving δ to be estimated always leads to the trivial solution δ=0. We therefore propose to identify a value of δ by optimizing an auxiliary objective function h(δ). This function compares the category structures obtained by solving the HPM problem Z with the piles provided by each individual participant in the sorting task, using the Adjusted Rand Index (Hubert & Arabie, 1985). The functional form of h(δ) is not known exactly, so its derivatives are unavailable and direct optimization is not possible. We therefore estimate δ with Derivative-Free Optimization (DFO) methods (Conn, Scheinberg, & Vicente, 2009), which treat h(δ) as a black-box function. To evaluate h(δ) for a given δ, we proceed as follows:

1. Solve HPM in (10) using δ, for all g=1,…,G, with one application of the constructive heuristic followed by the improvement heuristic (C+IH). Denote by Cg, g=1,…,G, the solutions obtained.

2. For each g=1,…,G, compute the Adjusted Rand Index between each individual’s estimated category structure and the piles that individual created. Let \(\mathit{ARI}_{C_{g}}\) be the Adjusted Rand Index obtained using solution Cg.

3. Let \(h(\delta)=\mathit{average}(\mathit{ARI}_{C_{g}})\).
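The three evaluation steps above can be sketched in Python. The `solve_hpm` callable and the `piles` mapping are hypothetical stand-ins for one application of the C+IH heuristic and the observed sorting data, respectively; the ARI itself follows the standard contingency-table formula of Hubert and Arabie (1985).

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    # Contingency-table form of the ARI (Hubert & Arabie, 1985).
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))
    index = sum(comb(c, 2) for c in pair_counts.values())
    a = sum(comb(c, 2) for c in Counter(labels_a).values())
    b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = a * b / comb(n, 2)
    max_index = (a + b) / 2
    if max_index == expected:  # degenerate case, e.g. both one cluster
        return 1.0
    return (index - expected) / (max_index - expected)

def evaluate_h(delta, solve_hpm, piles):
    # Step 1: solve_hpm(delta) stands in for one C+IH application and is
    # assumed to return an estimated labeling for each individual.
    estimated = solve_hpm(delta)
    # Step 2: ARI between each individual's estimate and their piles.
    aris = [adjusted_rand_index(estimated[i], piles[i]) for i in piles]
    # Step 3: h(delta) is the average ARI across individuals.
    return sum(aris) / len(aris)
```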

A.1 Optimizing h(δ)

The proposed DFO procedure is a univariate pattern-search method that, at each iteration, considers a pair of search directions, γ and −γ, for h(δ). It first evaluates the function at a unit step length along each direction. The candidate solutions obtained form a frame around the current iterate (i.e., the current best solution for h(δ)). If either h(δ−γ) or h(δ+γ) is greater than h(δ) (recall that the goal is to maximize the ARI), it becomes the new best solution, the center of the frame shifts to this new value of δ, and the frame is expanded. If neither h(δ−γ) nor h(δ+γ) improves on h(δ), the frame shrinks. These iterations repeat until a stopping criterion is met (e.g., a maximum number of h(δ) evaluations, a minimum size of γ, or a maximum allowed running time).
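A minimal sketch of such a univariate pattern search, maximizing an arbitrary black-box function h; the expansion and shrink factors and the stopping thresholds below are illustrative assumptions, not values from the paper.

```python
def pattern_search_maximize(h, delta0, step=1.0, expand=2.0, shrink=0.5,
                            min_step=1e-3, max_evals=100):
    """Frame-based univariate pattern search maximizing a black-box h.

    At each iteration the frame {delta - step, delta + step} is evaluated;
    the frame expands on improvement and shrinks otherwise, stopping when
    the step falls below min_step or the evaluation budget is exhausted.
    """
    best_delta, best_h = delta0, h(delta0)
    evals = 1
    while step >= min_step and evals < max_evals:
        improved = False
        for candidate in (best_delta - step, best_delta + step):
            value = h(candidate)
            evals += 1
            if value > best_h:          # new center of the frame
                best_delta, best_h = candidate, value
                improved = True
        step *= expand if improved else shrink
    return best_delta, best_h
```

Applied to the penalty search, h would be the `evaluate_h` routine from the appendix steps, with each call solving the HPM instance for the trial value of δ.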

For certain DFO methods, it is possible to prove global convergence to a stationary point of the function being optimized. Noise and other forms of inexactness (e.g., the use of an inexact solution for Cg) may affect the performance and convergence of the pattern-search algorithm. Nonetheless, the procedure should suffice to provide a good value of δ for use with the heuristics presented herein. Figure A1 presents the pseudo-code of the proposed pattern-search method for this automated penalty selection.