Abstract

Clustering offers significant insights in data analysis. Density-based algorithms have emerged as flexible and efficient techniques, able to discover high-quality and potentially irregularly shaped clusters. Here, we present scalable density-based clustering algorithms using random projections. Our clustering methodology achieves a speedup of two orders of magnitude compared with equivalent state-of-the-art density-based techniques, while offering analytical guarantees on clustering quality in Euclidean space. Moreover, it does not introduce difficult-to-set parameters. We provide a comprehensive analysis of our algorithms and a comparison with existing density-based algorithms.

Acknowledgements

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007–2013)/ERC Grant Agreement No. 259569.

Appendix

Proof of Theorem 8

Assume for now that the random line used for partitioning is chosen uniformly at random. By definition and using linearity of expectation, the expectation of \(X^{short}_A\) is \(\mathbb{E}[X^{short}_A]:= \sum_{C\in \mathcal{S}_A\setminus \mathcal{N}(A,c_dr)} p(E^{short}_A(C))\). Using Theorem 7 to bound \(p(E^{short}_A(C))\),

By Markov’s inequality, the probability that the random variable \(X^{short}_A\) exceeds its expectation \(\mathbb{E}[X^{short}_A]\) by a factor of \(c_d/\log (c_d)^2\) or more is at most \(\log (c_d)^2/c_d\). Thus, for the probability of the event \(E_0\) defined below we have
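The Markov step can be written out explicitly; the following display merely restates the bound just used, in the notation above:

```latex
% Markov's inequality for the nonnegative variable X^{short}_A,
% applied with threshold a = c_d / log(c_d)^2 times the expectation:
\[
p\left(X^{short}_A \ge \frac{c_d}{\log(c_d)^2}\,\mathbb{E}\big[X^{short}_A\big]\right)
  \;\le\; \frac{\mathbb{E}[X^{short}_A]}{\frac{c_d}{\log(c_d)^2}\,\mathbb{E}[X^{short}_A]}
  \;=\; \frac{\log(c_d)^2}{c_d}.
\]
```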

Let us deal with dependencies among the chosen lines. We choose \(c_L\cdot \log N\) random lines independently of each other. Define \(c_{E'}:=1-p(E')=2\log (c_d)^2/{c_d}\). Then, using a Chernoff bound (see Theorem 1), the probability of the event \(E_f(\mathcal{S}_A)\) that the number of lines for which \(E'\) does not hold for \(\mathcal {S}_A\) exceeds the expectation \(c_L\cdot \log N\cdot c_{E'}\) by at most a factor \(1+\sqrt{3}/(c_L)^{1/4}\) is at least \(1-1/N^{\sqrt{c_L} c_{E'}}\). Now assume that we “reuse” the projections for a total of \(N^{c_2}\) sets of points across all partitionings, with \(c_2<2\). Given that the projections have been reused \(N^{c_2}\) times, the probability \(p(E_f(\mathcal{S}_C))\) for a set \(\mathcal{S}_C\) can be bounded, using the bound for dependent events from Theorem 2, by \(1-1/N^{\sqrt{c_L} c_{E'}-c_2}\). Thus, the probability of a bad event increases by at most a factor of \(1+\sqrt{3}/(c_L)^{1/4}\), which yields the claim for \(c_L\) sufficiently large.

Proof of Theorem 9

The idea of the proof is to look at a point A and remove “very” far away points until only relatively few of them are left. Then, we consider somewhat closer points (but still quite far away) and remove them until we are left with only some very close points and some potentially further away points. Consider a partitioning of the set \(\mathcal{S}_A\) into two sets \(\mathcal{S}_0\) and \(\mathcal{S}_{1,A}\) with \(A \in \mathcal{S}_{1,A}\), using algorithm Partition and a random projection line L. Assume that the following condition holds for the set \(\mathcal{S}_A\): there are many more points “very far away” from A than not-so-distant points, using some factor \(f_d\ge c_d\):
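For intuition, a single split as performed by algorithm Partition can be sketched in code. This is a minimal illustration only, not the paper's implementation; the function names and the Gaussian sampling of the projection line are assumptions.

```python
import math
import random

def random_unit_vector(dim, rng):
    # Sample a direction uniformly at random by normalizing a Gaussian vector.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def split_once(points, rng):
    """One split: project all points onto a random line L and cut at a
    splitting point chosen uniformly at random among the projections."""
    line = random_unit_vector(len(points[0]), rng)
    proj = [sum(p_i * l_i for p_i, l_i in zip(p, line)) for p in points]
    pivot = rng.choice(proj)           # splitting point, uniform over points
    left = [p for p, t in zip(points, proj) if t <= pivot]
    right = [p for p, t in zip(points, proj) if t > pivot]
    return left, right
```

Recursing on each resulting half until a set falls below a minimum size yields the kind of partitioning whose properties the proof analyzes.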

The value \(c_r\) is defined later; we require \(c_r\ge f_d\). We prove that even in this case, after a sequence of splittings of the point set, only a few very far away points end up in the set \(\mathcal {S}_{1,A}\). (If there are fewer faraway points than somewhat close points, the probability that many of them end up in the same set is even smaller.) Define event \(E_1\) as follows: a splitting point is picked such that for the subset \(\mathcal {S}_{1,A}\) most very close points from \(\mathcal {N}(A,r) \cap \mathcal {S}_A\) remain, i.e.,

The probability of event \(E_1\) can be bounded as follows. Assume that \(E'\) as defined in Theorem 8 occurs (using \(f_d>c_d\) instead of \(c_d\)), i.e., most distances from a point A to other points are scaled by roughly the same factor. To minimize the probability \(p(E_1|E')\), we assume that all projected distances from faraway points to A are minimized and those of close points are maximized. This means that at most a fraction \(1/\log {f_d}\) of all very far away points \(\mathcal {S}_{A} \setminus \mathcal {N}(A,f_d\cdot r)\) fall below a factor \(3\log (f_d)/f_d\) of their expected length, and that the distances to all other points in \(\mathcal {S}_{A} \setminus \mathcal {N}(A,f_d\cdot r)\) are shortened by exactly that factor. We assume the worst possible scenario, i.e., those faraway points are split such that they end up in the same set as A, i.e., they become part of \(\mathcal{S}_{1,A}\). At most a fraction \(1/(f_d)^{\log (f_d)/3}\) of all very close points \(\mathcal {S}_{A} \cap \mathcal {N}(A,r)\) are above a factor \(\log (f_d)\) of the expectation. We assume that those points behave in the worst possible manner, i.e., the close points exceeding the expectation are split such that they end up in a different set than A, i.e., in \(\mathcal{S}_{0}\) rather than \(\mathcal{S}_{1,A}\). Next, we bound the probability that no other points from \(\mathcal {S}_{A} \cap \mathcal {N}(A,r)\) are split off. If we pick a splitting point among the fraction \(1-1/\log f_d\) of points from \(\mathcal {S}_{A} \setminus \mathcal {N}(A,f_dr)\) that are not shortened by more than a factor \(3\log (f_d)/f_d\), then \(E_1\) occurs given \(E'\).
By the initial assumption we have \((1-1/f_d^{\log (f_d)/3})|\mathcal {S}_{A} \cap \mathcal {N}(A,f_d\cdot r)|\le (1-1/\log {f_d})\cdot c_r |\mathcal {S}_A \setminus \mathcal {N}(A,f_dr)|\) and thus \(|\mathcal {S}_{A} \setminus \mathcal {N}(A,f_dr)|/|\mathcal {S}_{A}|\le 2/c_r\) for \(1-1/\log f_d > 1/2\), i.e., \(f_d\) sufficiently large, and because \(|\mathcal {S}_{A}| \ge |\mathcal {S}_A \setminus \mathcal {N}(A,f_dr)|\). Put differently, the probability of picking a bad splitting point is at most \(2/c_r\). Conditioning on event \(E'\) reduces the probability of \(E_1\) by at most \(1-p(E')\), i.e., \(p(E_1|E')\ge p(E_1)-(1-p(E'))\).

The probability that the size of the set \(\mathcal {S}_{1,A}\) resulting from the split is at most 2/3 of the original set \(\mathcal {S}_{A}\) is 1/3, because the splitting point is chosen uniformly at random; call this event \(E_2\). When restricting our choice to the far away points \(\mathcal {S}_A\setminus \mathcal {N}(A,f_dr)\), we can use the fact that, owing to Condition (2), at most a fraction \(1/c_r\) of all points are not far away. The probability of \(E_2\) given \(E_1\) can be bounded by assuming that all events, i.e., choices of random lines and splitting points, that are excluded owing to the occurrence of \(E_1\) would actually have caused \(E_2\). More precisely, we can subtract the probability of the complementary event of \(E_1\), i.e., \(p(E_2|E_1) \ge 2/3-1/c_r - (1-p(E_1)) \ge 2/3 - 1/c_r - (1-4\log (c_d)^2/{f_d})^3 \ge 1/4\) for a sufficiently large constant \(f_d\). The initial set \(\mathcal {S}:=\mathcal {P}\) has to be split at most \(c_L\log N\) times until the final set \(\mathcal {S}_A\) containing A (which is not split any further) is computed (see proof of Theorem 3). We define a trial T as up to \(\log f_d\) splits of a set \(\mathcal {S}\) into two sets. A trial T is successful if after at most \(\log f_d\) splits of a set \(\mathcal {S}_A\) the final set \(\mathcal {S}'_A\subset \mathcal {S}_A\) is of size at most \(|\mathcal {S}_A|/2\) and \(E_1\) occurred for every split. The probability p(T) of a successful trial equals the probability that \(E_1\) always occurs and \(E_2\) occurs at least once. This gives:

Starting from the entire point set, we need \(\log (N/\textit{minSize})+1\) consecutive successful trials until a point A is in a set of size less than \(\textit{minSize}\) and the splitting stops. Next we prove that the probability of having that many successful trials is constant, given that the required upper bound on the neighborhood growth (1) holds. Assume there are \(n_i\) points within the distance range \([i^{3/2+c_s} \cdot c_d\cdot r, (i+1)^{3/2+c_s}\cdot c_d\cdot r]\) from A for a positive integer i. In particular, note that the statement holds for arbitrarily positioned points. We do not even require them to be fixed across several trials.

The upper bound on the neighborhood growth (1) yields \(n_i\le 2^{i^{1/2}}\cdot |\mathcal {N}(A,c_dr)|\). Furthermore, we have \(\sum _{i=1}^{\infty } n_i \le N\). Next, we analyze how many trials we need to remove the \(n_i\) points until only the close points \(\mathcal {N}(A,c_dr)\) remain. We go from large i to small i, i.e., remove distant points first. For each \(n_i\) we need at most \(\log n_i - \log |\mathcal {N}(A,c_dr)| \le i^{1/2}\) successes. Let \(E_{n_{i}}\) be the event that this happens, i.e., that we have that many consecutive successes.
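The bound of \(i^{1/2}\) successes follows in one line from the growth bound; spelling out the step:

```latex
% Each successful trial halves the set, so reducing the n_i points down to
% |N(A, c_d r)| needs log n_i - log |N(A, c_d r)| successes; by the bound
% n_i <= 2^{i^{1/2}} |N(A, c_d r)|:
\[
\log n_i - \log |\mathcal{N}(A,c_d r)|
  \;\le\; \log\!\left(2^{i^{1/2}}\,|\mathcal{N}(A,c_d r)|\right) - \log |\mathcal{N}(A,c_d r)|
  \;=\; i^{1/2}.
\]
```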

As the number of points N is finite, the number of indices i with \(n_i>0\) is also finite. Let \(m_A\) be the largest value such that \(n_{m_A}>0\), and let \(p_A:=p(\wedge _{i\in [1,m_A]} {E_{n_{i}}})\) be the probability that all trials for all \(n_i>0\) with \(i \in [1,m_A]\) are successful. Note that the events \(E_{n_i}\) are not independent for a fixed point set \(\mathcal {P}\). However, the bound (4) on \(p(E_{n_{i}})\) holds as long as Condition (2) is fulfilled, i.e., for an arbitrary point set. Put differently, the bound (4) holds even for the “worst” distribution of points. Therefore, we have \(p_A:=p(\wedge _{i\in [1,m_A]} {E_{n_{i}}})\ge \prod _{i\in [1,m_A]} 2^{-1/(i\cdot c_d)}\) using stochastic domination. Note that our choice of maximizing \(n_i\), i.e., the number of required successful trials for \(E_{n_i}\), minimizes the probability \(p(E_{n_i})\). This is quite intuitive: it says that we should maximize the number of points closest to A that should not be placed in the same set as A (i.e., points that are just a bit too far away to yield the claimed approximation guarantee). We must also consider which distribution of the \(n_i\) under the constraint \(\sum _{i=1}^{m_A} n_i \le N\) minimizes the bound for \(p_A\). It is apparent from the derivation of (4) that this happens when we maximize \(n_i\); the probability \(p_A\) decreases more if we maximize \(n_i\) for small i. Essentially, this follows from line 2 in (4), because the number of trials \(n_T\) is less than \(\sqrt{i}\) and each trial is successful with probability \((1-1/i^{3/2})\) (focusing on dominating terms), yielding an overall success probability of \((1-1/i^{3/2})^{n_T}\) for a trial. Thus, \((1-1/i^{3/2})^{\sqrt{i}}> (1-1/l^{3/2})^{\sqrt{l}}\) for \(1<i<l\). Put differently, choosing \(n_i\) large for a large i is not a problem for our algorithm, because it is unlikely that these points will be projected in between the nearest points to A.

Therefore, when maximizing the number of points close to A, we have \(m_A=(\log N)^2\), i.e., all \(n_i\) for \(i>(\log N)^2\) are 0, because \(2^{\sqrt{(\log N)^2}}=n_{(\log N)^2}=N\). Additionally, note that we need at most \(c_8 \log N\) trials in total. As each trial halves the number of points, we only need to take into account the subset of indices in \([1,m_A]\) at which the number of points doubles, i.e., \(n_j=2\cdot n_{i}\), for \(n_i=2^{i^{1/2}}\). This happens whenever \(i^{1/2}\) is an integer, i.e., for \(i=1,4,9,16,\ldots \) we get \(n_i=2,4,8,16,\ldots \). Thus, we only need to look at the squares \(i^2 \in [1,m_A]\).

Thus, when performing \(c_p (\log N)\) partitionings, we have at least \((c_p/16) \log N\) successes for point A whp, using Theorem 1 and \(c_d\ge 1\). This also holds for all points whp using Theorem 2.

Finally, let us bound the number of nearby points that remain. We need at most \(c_L \log N\) projections (see Theorem 3) until a point set is not split further. Each projection reduces the number of points \(|\mathcal {N}(A,r)|\) by at most a factor \(1-1/c_r\). We give a bound in two steps, i.e., for \(c_r\ge \log ^3 N\) and for \(c_r \in [ c_d, \log ^3 N]\).

Reducing the number of points by a factor of \(\log ^3 N\) requires \(3\cdot \log \log N \) trials, each reducing the set by a factor of 1/2. Thus, trial i is conducted using a factor \(c_r = \log ^3 N / 2^i\) of the original points or, equivalently, trial \(3\cdot \log \log N-i\) is conducted with \(c_r= 2^i\). In total, the fraction of remaining points in \(\mathcal {N}(A,r)\) is

Proof of Theorem 10

First we bound the number of neighbors. Using Theorem 9, we obtain \((c_p/16) \log N\) sets \(\mathcal {S}_A\) containing A. Define \(\mathfrak {S}_A\) to be the union of all sets \(\mathcal {S}_A \in \mathfrak {S}\) containing A. Before the last split of a set \(\mathcal {S}_A\) into the sets \(\mathcal {S}_{1,A}\) and \(\mathcal {S}_0\), the set \(\mathcal {S}_A\) must be of size at least \(c_m\cdot \textit{minPts}\); the probability that splitting it at a random point results in a set \(\mathcal {S}_{1,A}\) with \(|\mathcal {S}_{1,A}|<c_m/2\cdot \textit{minPts}\) is at most 1/2. Thus, using a Chernoff bound (Theorem 1), at least \((c_p/128)\log N\) sets \(\mathcal {S}_A \in \mathfrak {S}_A\) are of size at least \(c_m/2\cdot \textit{minPts}\) whp.

Let \(\mathcal {S}_A\) be a set of size at least \(c_m/2\cdot \textit{minPts}\). Consider the process in which the neighborhood \(\mathcal {N}(A)\) is built by inspecting one set \(\mathcal {S}_A\) after the other. Assume that the number of neighbors satisfies \(|\mathcal {N}(A)|< (c_m/2)\, \textit{minPts}/(2{c_d})\). Then the probability \(p(\text {Choose new close neighbor } B)=p(B \not \in \mathcal {N}(A) \wedge B \in \mathcal {N}(A,r))\) that a point \(B \in \mathcal {S}_A\) not already in \(\mathcal {N}(A)\) is chosen from \(\mathcal {N}(A,r) \cap \mathcal {S}_A\) is at least \(c_m/(4{c_d})\).

As by assumption \(\textit{minPts} < c_m \log N\), there are at least \((c_p/128)\log N\) sets \(\mathcal {S}_A\) with \(|\mathcal {S}_A|\ge c_m/2\cdot \textit{minPts}\), and \(c_p \ge c_m\cdot 128\), using the Chernoff bound in Theorem 1 we get that there are at least \(c_m/(4{c_d})\cdot \textit{minPts}\) points within distance \(D_{c_m \textit{minPts}}(A)\) in \(\mathcal {N}(A)\) whp for every point A. Setting \(c_m\ge 8 c_d\) completes the proof. \(\square \)

Proof of Theorem 12

To compute \(Davg(A)\) with \(f=1\), we consider the \((1+f)\cdot \textit{minPts}= 2\cdot \textit{minPts}\) closest points to A from \(\mathcal {N}(A)\). By Theorem 10, \(2\cdot \textit{minPts}\) points with distance at most \(D_{c_m \textit{minPts}}(A)\) are contained in \(\mathcal {N}(A)\). This yields \(Davg(A)\le D_{c_m \textit{minPts}}(A)\), and the upper bound follows. To compute \(Davg(A)\), we average the distances to the \(2\cdot \textit{minPts}\) closest points to A. The smallest value of \(Davg(A)\) is attained when \(\mathcal {N}(A)\) contains all \(2\cdot \textit{minPts}\) closest points to A, which implies \(Davg(A)\ge D_{2\cdot \textit{minPts}}(A)\ge D_{ \textit{minPts}}(A)\) for any set of neighbors \(\mathcal {N}(A)\). \(\square \)
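The two bounds derived above combine into a single chain of inequalities:

```latex
% Lower bound from averaging over the 2*minPts closest points,
% upper bound from Theorem 10:
\[
D_{\textit{minPts}}(A) \;\le\; D_{2\cdot \textit{minPts}}(A)
  \;\le\; Davg(A) \;\le\; D_{c_m\,\textit{minPts}}(A).
\]
```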