Abstract

Plain vanilla K-means clustering has proven to be successful in practice, yet it suffers from outlier sensitivity and may produce highly unbalanced clusters. To mitigate both shortcomings, we formulate a joint outlier detection and clustering problem, which assigns a prescribed number of datapoints to an auxiliary outlier cluster and performs cardinality-constrained K-means clustering on the residual dataset, treating the cluster cardinalities as a given input. We cast this problem as a mixed-integer linear program (MILP) that admits tractable semidefinite and linear programming relaxations. We propose deterministic rounding schemes that transform the relaxed solutions to feasible solutions for the MILP. We also prove that these solutions are optimal in the MILP if a cluster separation condition holds.
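
To make the joint problem concrete, the following minimal sketch evaluates the cost of one candidate assignment. It assumes the standard K-means objective (sum of squared Euclidean distances to the cluster centroid), which the abstract does not spell out. The function name `outlier_kmeans_cost` and the convention that label 0 marks the outlier cluster are hypothetical choices made for illustration. The sketch only scores a given assignment; it does not solve the MILP or its relaxations.

```python
def outlier_kmeans_cost(points, labels, cardinalities, n_outliers):
    """Score an assignment for cardinality-constrained K-means with outliers.

    Assumptions (not taken from the paper): label 0 is the outlier cluster,
    labels 1..K are regular clusters, and the cost is the sum of squared
    Euclidean distances from each non-outlier point to its cluster centroid.
    """
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    num_clusters = len(cardinalities)
    if any(label < 0 or label > num_clusters for label in labels):
        raise ValueError("labels must lie in 0..K")
    # The outlier cluster must hold exactly the prescribed number of points.
    if sum(1 for label in labels if label == 0) != n_outliers:
        raise ValueError("wrong number of outliers")

    cost = 0.0
    for k, n_k in enumerate(cardinalities, start=1):
        members = [p for p, label in zip(points, labels) if label == k]
        # Each regular cluster must match its prescribed cardinality.
        if n_k <= 0 or len(members) != n_k:
            raise ValueError(f"cluster {k} violates its cardinality")
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / n_k for d in range(dim)]
        cost += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    # Outliers contribute nothing to the cost.
    return cost


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0), (100.0, 100.0)]
labels = [1, 1, 2, 2, 0]  # the last point is treated as the single outlier
cost = outlier_kmeans_cost(points, labels, cardinalities=[2, 2], n_outliers=1)
```

Under these assumptions, the joint problem amounts to searching over all label vectors that pass the checks above for one with minimal cost.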