August 28, 2018

Some observations on doing a bit of data analysis with DBSCAN and pandas in a Jupyter notebook

Sorting Classifications for making graphs-VolunteerClassifications
Here is a Jupyter notebook I was using today to parse the classifications from the Steelpan Vibrations project. I'm leaving some of the notes here as a reminder to myself for the future. (I learned how to put the Jupyter notebook into the blog from this page.)

I really want to share this because, in all my reading on using DBSCAN for cluster analysis, I had a hard time finding any page online that described how the coordinates of the points identified in a cluster could be paired with the matching data from the larger (original) data set. When I found the solution (see the link in the comments between cells below), it was really obvious, but it was painful not even knowing how to google for what I was looking for.
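For future reference, here is a minimal sketch of one way this pairing can work. It may or may not match the linked solution. It relies on the fact that scikit-learn returns one label per input row, in the same order as the input. The DataFrame contents and the column names (user, x, y, cluster) are made up for illustration.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical data: each row is one volunteer click plus some metadata.
df = pd.DataFrame({
    "user": ["a", "b", "c", "d", "e", "f", "g"],
    "x": [100, 104, 98, 300, 305, 297, 500],
    "y": [100, 97, 103, 200, 204, 198, 50],
})

# Fit on the coordinate columns only; labels_ has one entry per input row.
db = DBSCAN(eps=18, min_samples=3).fit(df[["x", "y"]].to_numpy())

# Because the row order is preserved, the labels can simply become a new column.
df["cluster"] = db.labels_

# Label -1 marks noise; the remaining rows can be grouped by cluster,
# which keeps every original column alongside the coordinates.
clustered = df[df["cluster"] != -1]
users_per_cluster = clustered.groupby("cluster")["user"].apply(list).to_dict()
print(users_per_cluster)
```

Once the label is a column, any other field from the original data set can be pulled out per cluster with ordinary pandas filtering or groupby.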

Function to do the cluster identification with DBSCAN:

In [31]:

def dbscan(crds):
    bad_xy = []  # might need to change this
    X = np.array(crds)
    db = DBSCAN(eps=18, min_samples=3).fit(X)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    unique_labels = set(labels)
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = 'k'
        class_member_mask = (labels == k)
        # These are the definitely "good" xy values.
        xy = X[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=14)
        #print("\n Good? xy = ",xy)
        #print("X = ",X)
        # These are the "bad" xy values. Note that some maybe-bad and maybe-good are included here.
        xy = X[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='k', markersize=6)
        #print("\n Bad? xy = ",xy)
        bad_xy.append(xy)
    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.xlim(0, 512)
    plt.ylim(0, 384)
    clusters = [X[labels == i] for i in range(n_clusters_)]
    #print(clusters)
    #print(db.labels_)
    return clusters, labels
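To see what the masks in the function are separating, here is a small sketch without any plotting. The six points are invented for illustration; eps and min_samples are the same values used above. Core samples have at least min_samples points (counting themselves) within eps, non-core members of a cluster sit on its fringe, and label -1 is noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical points: a tight group of four, one point on its fringe,
# and one point far away from everything else.
X = np.array([[10, 10], [12, 11], [11, 13], [13, 12], [30, 12], [200, 300]])

db = DBSCAN(eps=18, min_samples=3).fit(X)
labels = db.labels_

# Boolean mask that is True only for core samples.
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True

# The noise label (-1) is not a cluster, so it is not counted.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

core_points = X[(labels == 0) & core_mask]      # large markers in the plot
border_points = X[(labels == 0) & ~core_mask]   # small markers in the plot
noise_points = X[labels == -1]                  # black markers in the plot

print(n_clusters, labels)
print(border_points, noise_points)
```

The fringe point at (30, 12) is within eps of only one other point, so it joins the cluster without being a core sample, which is why the non-core group can contain both maybe-good and maybe-bad points.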

Import the classifications into a pandas DataFrame. I'm using header=None because there were no headings in the CSV file:
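A minimal sketch of what header=None changes, using an in-memory string as a stand-in for the real file. The three columns and their values are invented for illustration and are not the actual classification data.

```python
import io
import pandas as pd

# Hypothetical stand-in for the classifications file: no header row.
csv_text = "101,250.5,120.0\n102,255.0,118.5\n103,40.0,300.0\n"

# With header=None the first line is treated as data, and the columns
# get the default integer labels 0, 1, 2.
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(df.columns.tolist(), df.shape)

# Without it, pandas would consume the first data row as column names.
df_wrong = pd.read_csv(io.StringIO(csv_text))
print(df_wrong.columns.tolist(), df_wrong.shape)
```

The second read silently loses one row of data, which is the reason for passing header=None when the file has no headings.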

/Users/amorriso/anaconda/lib/python3.6/site-packages/matplotlib/lines.py:1206: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
if self._markerfacecolor != fc:

Check the DataFrame once, and then check it again after renaming the columns:
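As a sketch of that check-rename-check step, assuming a small made-up DataFrame with integer column labels (as produced by header=None). The new column names here are hypothetical, not the ones used in the notebook.

```python
import pandas as pd

# Hypothetical DataFrame with default integer column labels.
df = pd.DataFrame([[101, 250.5, 120.0], [102, 255.0, 118.5]])
print(df.head())  # first check: columns are 0, 1, 2

# Map the integer labels to descriptive names (names are assumptions).
df = df.rename(columns={0: "classification_id", 1: "x", 2: "y"})
print(df.head())  # second check: columns now have readable names
```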