There are various ways to cluster data. Some methods require the data to be scaled first to have a mean of $0$ and standard deviation of $1$, while others say nothing about scaling at all. This led me to think that some of these methods require dimensionality reduction first, or that I need to work from a distance matrix. Honestly, I am a bit lost given how many different clustering algorithms there are.

I have noisy, unlabeled data and am trying to determine how many clusters exist. I have 415 observations and 46 variables, so plenty of dimensions, and they are not normally distributed. Also, I am using R.

The first step I did was to scale the data to have a mean of $0$ and standard deviation of $1$. Then the NbClust package suggested 4 clusters by the majority rule, while the mclustBIC() function from the mclust package suggested 3.
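For reference, a minimal sketch of those two steps. Since the actual data are not posted here, a random matrix of the same shape stands in as a placeholder:

```r
library(NbClust)
library(mclust)

# Placeholder for the 415 x 46 dataset, which is not posted in the question
set.seed(42)
df <- matrix(rnorm(415 * 46), nrow = 415)

x <- scale(df)  # center each column to mean 0 and rescale to sd 1

# Majority rule across NbClust's validity indices
nb <- NbClust(x, distance = "euclidean", min.nc = 2, max.nc = 8,
              method = "kmeans")

# BIC over the Gaussian mixture models fitted by mclust
bic <- mclustBIC(x)
summary(bic)
```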

Then I tried fuzzy clustering with 3 and 4 clusters, which showed some interesting things. Specifically, I used squared Euclidean distances rather than Euclidean or Manhattan, because with those each point came out with a membership probability of exactly $1/k$, where $k$ is the number of clusters I passed to the function.
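For concreteness, fanny() in the cluster package accepts metric = "SqEuclidean" directly; a sketch on placeholder data (the real data are not posted):

```r
library(cluster)

set.seed(42)
x <- scale(matrix(rnorm(415 * 46), nrow = 415))  # placeholder data

fc3 <- fanny(x, k = 3, metric = "SqEuclidean")
fc4 <- fanny(x, k = 4, metric = "SqEuclidean")

head(fc3$membership)  # per-observation membership probabilities
fc3$coeff             # Dunn's partition coefficient (crispness of the fit)
```

As an aside, the collapse of memberships toward $1/k$ with the Euclidean metric on high-dimensional data can also often be countered by lowering the fuzziness exponent (the memb.exp argument, default 2).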

My next step was to try dbscan() from the dbscan package rather than fpc. Using a k-nearest-neighbor distance plot with $k = 4$, I found the eps value to be roughly 7 (a way to determine the optimal value without eyeballing the plot would be ideal, but otherwise I guessed). Then I ran dbscan() with eps = 7 and minPts = 47, and it found 2 clusters.
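The usual heuristic here, sketched on placeholder data since the real data are not posted: plot the sorted k-nearest-neighbor distances and read eps off the knee of the curve (a common convention is $k = \text{minPts} - 1$, so the $k = 4$ plot paired with minPts = 47 mixes two scales):

```r
library(dbscan)

set.seed(42)
x <- scale(matrix(rnorm(415 * 46), nrow = 415))  # placeholder data

# Sorted distance from each point to its k-th nearest neighbour;
# eps is read off the "knee" of this curve
kNNdistplot(x, k = 4)
abline(h = 7, lty = 2)  # candidate eps

res <- dbscan(x, eps = 7, minPts = 47)
table(res$cluster)  # 0 = noise; positive labels are clusters
```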

Overall, I like the idea of fuzzy clustering with squared Euclidean distance, as it gives me a usable output, but I have no justification for that choice of distance. Where might I find justification? Does fuzzy clustering carry assumptions about the data, such as normality?

Secondly, for dbscan: I know OPTICS and hierarchical DBSCAN exist. I tried running those as well, but again found only 2 clusters. I thought I could pass in a distance matrix, but whenever I compute it from the scaled data I get a distance vector instead. Is computing the distance matrix the correct way to go? If so, how do I compute it correctly?
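On the distance-matrix point, for reference: dist() returns a compact "dist" object (the lower triangle stored as a vector), and a full square matrix is one conversion away. A sketch on placeholder data:

```r
set.seed(42)
x <- scale(matrix(rnorm(415 * 46), nrow = 415))  # placeholder data

d <- dist(x, method = "euclidean")  # "dist" object: n*(n-1)/2 values
class(d)

m <- as.matrix(d)  # full symmetric 415 x 415 matrix, if one is needed
dim(m)
```

If I read the dbscan package documentation correctly, dbscan(), optics(), and hdbscan() all accept a dist object directly, so the compact vector form is usually fine as-is.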

Third, thank you for reading this far. I understand this is a broad question given that I have not posted any data, but I am looking for guidance.



$\begingroup$1. Can you please plot your data? 2. Clustering is ultimately concerned with structure discovery. It is not a matter of having 3, 4, or 16 clusters in our dataset but rather of having 3, 4, or 16 comprehensible and reproducible segmentations of it. 3. You are moving in the correct direction. You have discovered the often omitted truism that a clustering is only as good as the distance metric used. Keep labouring on it! 4. It is unclear/too broad what you ask at the moment... (+1 for efforts though!)$\endgroup$
– usεr11852, Nov 11 '18 at 23:29

$\begingroup$Okay. I'll be honest. When I plot the data, you cannot really discern any different clusters. Secondly, I can perform cluster validation, but it keeps indicating that the optimal number of clusters is 1. I feel like I should reduce the number of variables. How can I do that through PCA? Don't most algorithms do that automatically? Or should I create some sort of distance matrix?$\endgroup$
– Jack Armstrong, Nov 12 '18 at 19:50


$\begingroup$What if there really is just 1 cluster? Did DBSCAN find two clusters, or 1 cluster plus unclustered noise?$\endgroup$
– Anony-Mousse, Dec 3 '18 at 0:40