I am working on a clustering project where we have collected protein data from over 100 patient samples. The data are normalized and log-transformed to achieve a uniform distribution. The goal is to cluster samples based on their similarities. I am using hierarchical clustering and trying out combinations of distance metrics and clustering algorithms (we haven't yet decided on a distance metric or clustering algorithm). My question is about centering and scaling: is it absolutely necessary to both scale and center the data, even when all the data come from the same platform and share the same units of measurement?

Scaling is only necessary when you are combining data of different types or units, such as height and weight.
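To illustrate the height and weight case, here is a minimal sketch (the three samples and the helper names `pairwise_euclidean`, `zscore` and `nearest_to_first` are made up for illustration). It shows that merely changing the units of one feature can change which sample is the nearest neighbour under Euclidean distance, while z-scoring each feature removes that dependence.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the matrix of Euclidean distances between rows (samples) of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def zscore(X):
    """Center each column (feature) and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def nearest_to_first(X):
    """Index of the sample closest to sample 0."""
    d = pairwise_euclidean(X)[0]
    d[0] = np.inf  # ignore the distance from sample 0 to itself
    return int(np.argmin(d))

# Hypothetical samples: columns are height (metres) and weight (kg).
X_m = np.array([[1.60, 60.0],
                [1.90, 61.0],
                [1.61, 66.0]])
# The same samples with height expressed in millimetres.
X_mm = X_m * np.array([1000.0, 1.0])

nn_m = nearest_to_first(X_m)             # weight dominates the distance
nn_mm = nearest_to_first(X_mm)           # height dominates the distance
nn_z_m = nearest_to_first(zscore(X_m))   # unit-free after scaling
nn_z_mm = nearest_to_first(zscore(X_mm))

print(nn_m, nn_mm, nn_z_m == nn_z_mm)
```

When every feature is measured on the same platform in the same units, as in the question, this unit problem does not arise.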

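Centering is done in principal component analysis, for instance. It is not needed for clustering with a Euclidean distance, because subtracting the same per-feature mean from every sample leaves the pairwise distances unchanged, so it will not affect the results.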
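A quick way to check the centering claim is to compare pairwise sample distances before and after subtracting the per-feature mean. The sketch below uses random numbers as a stand-in for real data (the 6 x 4 matrix is purely hypothetical). It also shows a caveat worth knowing if you are trying several distance metrics: a correlation-based distance between samples is generally not invariant to per-feature centering.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the matrix of Euclidean distances between rows (samples) of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))        # hypothetical: 6 samples x 4 proteins
X_centered = X - X.mean(axis=0)    # subtract each feature's mean

# Euclidean distances between samples are unchanged by centering.
euclid_same = np.allclose(pairwise_euclidean(X),
                          pairwise_euclidean(X_centered))

# Pearson correlation between samples (rows) does change.
corr_same = np.allclose(np.corrcoef(X), np.corrcoef(X_centered))

print(euclid_same, corr_same)
```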

You could also consider trying k-means with the silhouette method or the gap statistic to choose the number of clusters. Clustering techniques tend to work better if the clusters are roughly spherical in N-dimensional sample space, but you can also run them on uniformly distributed data. K-means divides two uniformly distributed clusters here very well, and I expect hclust would be fine too. I am talking about how samples are distributed in state space here, not how the feature or sample 'signal' is distributed.
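As a rough sketch of the k-means plus silhouette idea, the following uses scikit-learn on simulated data: two well-separated clusters whose points are uniformly distributed inside unit squares. The data, the range of k tried, and the variable names are assumptions for illustration, not the poster's original example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two hypothetical clusters, each uniformly distributed in a unit square,
# with the second square shifted 3 units along the first axis.
a = rng.uniform(0.0, 1.0, size=(100, 2))
b = rng.uniform(0.0, 1.0, size=(100, 2)) + np.array([3.0, 0.0])
X = np.vstack([a, b])

# Fit k-means for several k and record the mean silhouette width of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette width is the suggested cluster count.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real data the silhouette curve is often much flatter than in this toy case, so it is worth looking at the whole set of scores rather than only the maximum.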

Finally, it is important that your data are homoscedastic, meaning the variance of each feature does not depend on its mean. This isn't a problem with microarray data, but it is for RNA-seq, which is why we have to transform that data appropriately.
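To make the mean-variance point concrete, here is a small simulation. It assumes Poisson-distributed counts as a simplified stand-in for count data (real RNA-seq counts are usually more variable than Poisson), and uses a square-root transform only as one textbook example of a variance-stabilizing transform. The feature means and sample size are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical count features with low, medium and high mean abundance.
means = np.array([10.0, 100.0, 1000.0])
counts = rng.poisson(lam=means, size=(5000, 3))  # 5000 samples x 3 features

# On the raw scale the variance grows with the mean (heteroscedastic).
raw_var = counts.var(axis=0)

# After a square-root transform the variance is roughly constant
# (about 0.25 for Poisson data), whatever the mean.
sqrt_var = np.sqrt(counts).var(axis=0)

print(raw_var)
print(sqrt_var)
```

Plotting per-feature variance against per-feature mean on your own data, before and after transformation, is a simple check of whether the transform has done its job.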