Clustering Analysis
Introduction to Cluster Analysis
What is Clustering?
- Clustering is partitioning (dividing up) a set of data points into groups such that points in the same group are more similar to each other than to points in other groups
- Helps identify two qualities of data:
- Meaningfulness
- Usefulness
Why Clustering over Classification?
- Clustering can adapt to changes in the data and helps single out the features that differentiate one group from another
Examples of using Clustering
Data | Attribute Set (x) | Clustering Task |
---|---|---|
Customer | Demographic and Location Information | Group together similar customers |
Word Document | Words | Group Documents based on similar words |
Ads | Number of different Ads | Group together similar people and target them |
Fantasy Football | Performance Data | Group together similar players to compare |
Clustering Methods
- Partitioning Method
  - K-Means and its variants (e.g., fuzzy k-means and similar variants)
- Support Vector Clustering
- Model-Based Method
- Hierarchical Method
K-Means Clustering
- Unsupervised machine learning technique
- One of the more popular clustering algorithms, K-Means stores centroids that it uses to define clusters. A point is considered to belong to a particular cluster if it is closer to that cluster's centroid than to any other centroid.
K-Means Algorithm
Steps: (Pseudocode)
- Specify the number of clusters, k.
- Randomly pick k initial centroids.
- Repeat
  - Expectation: assign each point to its closest centroid.
  - Maximization: compute the new centroid (mean) of each cluster.
- Until the centroid positions no longer change.
Note: The quality of the cluster assignment is determined by computing the sum of squared error (SSE) after the centroids converge. A minimal sketch of this loop is shown below.
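To make the expectation/maximization loop concrete, here is a minimal NumPy sketch of the pseudocode above (an illustrative implementation, not the scikit-learn one used in the example below; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k of the data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Expectation: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Maximization: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroid positions no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()  # sum of squared error (SSE)
    return labels, centroids, sse

# toy usage: three blobs in 2-D
X_toy = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids, sse = kmeans(X_toy, k=3)
```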
Example of K-Means Algorithm in Python
Import Libraries
from sklearn.cluster import KMeans
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
- Load the input data
- Using NBA data with player age and salary! (Note: some salaries are not exact.)
data = pd.read_csv("salary_nba.csv")
data
Output:
Name | Age | Salary | |
---|---|---|---|
0 | Lebron James | 36 | 40000000 |
1 | Steph Curry | 31 | 41000000 |
2 | Kevin Durant | 33 | 39000000 |
3 | Chris Paul | 35 | 41500000 |
4 | Anthony Davis | 28 | 33000000 |
5 | Dennis Schroder | 27 | 15500000 |
6 | KCP | 28 | 12000000 |
7 | Montrezl Harrell | 27 | 16000000 |
8 | Alex Caruso | 25 | 14750000 |
9 | Kyle Kuzma | 27 | 12960000 |
10 | Markieff Morris | 25 | 2300000 |
11 | Alfonzo Mckinnie | 28 | 13000000 |
12 | Talen Horton-Tucker | 20 | 1500000 |
13 | Kostas Antetokounmpo | 23 | 1000000 |
14 | Devontae Cacok | 24 | 1000000 |
15 | Damian Jones | 21 | 785000 |
Let's plot Age against Salary
plt.scatter(data.Age,data['Salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
Output:
km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(data[['Age','Salary']])
y_predicted
Output:
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2], dtype=int32)
data['cluster']=y_predicted
data.head()
Output:
Name | Age | Salary | cluster | |
---|---|---|---|---|
0 | Lebron James | 36 | 40000000 | 1 |
1 | Steph Curry | 31 | 41000000 | 1 |
2 | Kevin Durant | 33 | 39000000 | 1 |
3 | Chris Paul | 35 | 41500000 | 1 |
4 | Anthony Davis | 28 | 33000000 | 1 |
## Location of the cluster centers (x, y)
km.cluster_centers_
Output:
array([[2.7000e+01, 1.4035e+07], [3.2600e+01, 3.8900e+07], [2.2600e+01, 1.3170e+06]])
data1 = data[data.cluster==0]
data2 = data[data.cluster==1]
data3 = data[data.cluster==2]
plt.scatter(data1.Age,data1['Salary'],color='blue')
plt.scatter(data2.Age,data2['Salary'],color='orange')
plt.scatter(data3.Age,data3['Salary'],color='green')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.legend()
Output:
Errors with K-Means
- K-Means clustering was designed to minimize the sum of squared error (SSE)
- SSE measures the sum of the squared distances between every data point x and its cluster centroid c (a quick check of this is sketched below)
- SSE can be used to determine the number of clusters
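As a sanity check, the SSE that scikit-learn exposes as km.inertia_ can be recomputed by hand from the 3-cluster model fitted above (a small sketch reusing the km object and data DataFrame already defined):

```python
import numpy as np

pts = data[['Age', 'Salary']].to_numpy()
assigned_centers = km.cluster_centers_[km.labels_]   # centroid assigned to each point
sse_by_hand = ((pts - assigned_centers) ** 2).sum()  # sum of squared distances to centroids
print(sse_by_hand, km.inertia_)                      # the two values should agree
```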
Let's use the sum of squared error (SSE) to plot an elbow plot
sse = []
k_range = range(1,10)
for k in k_range:
    km = KMeans(n_clusters=k)
    km.fit(data[['Age','Salary']])
    sse.append(km.inertia_)
## label and plot the Elbow Graph
plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.plot(k_range,sse)
Output:
We can use the Elbow Graph to determine the number of clusters we want!
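If you want a rough numeric stand-in for eyeballing the elbow, one simple heuristic (just a sketch, not a scikit-learn feature) is to pick the k where the SSE curve bends most sharply, i.e. where its second difference is largest:

```python
import numpy as np

bends = np.diff(sse, 2)                       # second difference of the SSE curve
elbow_k = k_range[int(np.argmax(bends)) + 1]  # +1 because each diff shortens the array
print('suggested k =', elbow_k)
```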
More Examples using Clustering (another NBA Example)
Let's look at the top 50 NBA scorers!
Let's open our file:
data = pd.read_csv('top_50_NBA.csv')
Let's use clustering to cluster the Points (PTS) and Usage (USG%) categories.
import numpy as np
kmeans = KMeans(n_clusters = 4, random_state = 98)
x = np.column_stack((data['PTS'], data['USG%']))
## Fit our KMeans model and predict a cluster for each player
kmeans.fit(x)
y_kmeans = kmeans.predict(x)
for i, j in zip(data['Player'], y_kmeans):
    print(i, j)
Output:
James Harden 2
Paul George 1
Giannis Antetokounmpo 1
Joel Embiid 1
Stephen Curry 1
Kawhi Leonard 1
Devin Booker 1
Kevin Durant 1
Damian Lillard 1
Bradley Beal 1
Kemba Walker 1
Blake Griffin 1
Karl-Anthony Towns 1
Kyrie Irving 1
Donovan Mitchell 1
Zach LaVine 1
Russell Westbrook 1
Klay Thompson 3
Julius Randle 3
LaMarcus Aldridge 3
Jrue Holiday 3
DeMar DeRozan 3
Luka Doncic 3
Mike Conley 3
DAngelo Russell 3
CJ McCollum 3
Nikola Vucevic 3
Buddy Hield 3
Nikola Jokic 3
Tobias Harris 0
Lou Williams 3
Danilo Gallinari 0
John Collins 0
Trae Young 3
Jimmy Butler 0
Kyle Kuzma 0
Khris Middleton 0
Jamal Murray 0
Andrew Wiggins 0
J.J. Redick 0
Tim Hardaway 0
Bojan Bogdanovic 0
Andre Drummond 0
DeAaron Fox 0
Ben Simmons 0
Pascal Siakam 0
Spencer Dinwiddie 0
Jordan Clarkson 3
Collin Sexton 0
Clint Capela 0
Graph the Clustered Points and Usage
points_usage_clustered, ax = plt.subplots()
# List of the 4 Clusters
cluster_1 = []
cluster_2 = []
cluster_3 = []
cluster_4 = []
# Add each point to its designated cluster list above
for i in range(len(y_kmeans)):
    if(y_kmeans[i] == 0):
        cluster_1.append(x[i])
    elif(y_kmeans[i] == 1):
        cluster_2.append(x[i])
    elif(y_kmeans[i] == 2):
        cluster_3.append(x[i])
    elif(y_kmeans[i] == 3):
        cluster_4.append(x[i])
# np.vstack stacks the list of points into a 2-D array (see https://numpy.org/doc/stable/reference/generated/numpy.vstack.html)
# the result looks like array([[21.5, 25.6], [points, usage(%)], ...])
cluster_1 = np.vstack(cluster_1)
cluster_2 = np.vstack(cluster_2)
cluster_3 = np.vstack(cluster_3)
cluster_4 = np.vstack(cluster_4)
# ax.scatter(x, y, label = 'name of what we are plotting', s = size)
ax.scatter(cluster_1[:, 0], cluster_1[:, 1], label = "Cluster 1 (Bottom Top 50)")
ax.scatter(cluster_2[:, 0], cluster_2[:, 1], label = "Cluster 2 (Top Scorers in NBA)")
ax.scatter(cluster_3[:, 0], cluster_3[:, 1], label = "Cluster 3 (James Freaking Harden)")
ax.scatter(cluster_4[:, 0], cluster_4[:, 1], label = "Cluster 4 (Middle Top 50 Scorers) ")
## Here are the centroids! array([[20.56666667, 27.84666667], [x, y], ...])
centers = kmeans.cluster_centers_
## Now we can plot the centroids (4 of them)
ax.scatter(centers[:, 0], centers[:, 1], c = 'black', s = 200, alpha = .5, label = 'Cluster Center')
# let's plot the legend, which will include the labels for each of our 4 clusters above
ax.legend(loc ='best', prop = {'size': 8})
## Labeling the x-axis and y-axis
ax.set_xlabel('Points Per Game')
ax.set_ylabel('Usage (%)')
## Naming our scatter plot
points_usage_clustered.suptitle("Clustered Points and Usage")
plt.show()
Output:
Limitations of K-Means
When are there problems with K-Means Clustering?
- When clusters have differing sizes, densities, or non-globular shapes (think donut-shaped data; see the sketch below)
- When there are outliers in the data (James Harden in the example above)
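A quick way to see the non-globular problem is to generate donut-shaped data with sklearn.datasets.make_circles: K-Means slices the rings down the middle, while a density-based method such as DBSCAN separates them. (A hedged sketch; the noise and eps values are only illustrative.)

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_circles

# two concentric rings -- "donut-shaped" data
X, _ = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=km_labels)
ax1.set_title('K-Means (splits the rings)')
ax2.scatter(X[:, 0], X[:, 1], c=db_labels)
ax2.set_title('DBSCAN (finds the rings)')
plt.show()
```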
Part 2 of Cluster Analysis
Cluster Validity
- For supervised classification we have different ways to measure how good our model is!
  - Accuracy, true positive rate
How does one evaluate the "goodness" of the resulting clusters?
- There isn't an easy answer because "clusters" are in the eye of the beholder!
- The notion of a cluster can be ambiguous
Issues in Cluster Validation
Typical Questions to answer:
- How many Clusters are there in the data?
- Are the clusters real, or are they nothing more than "accidental" groupings of the data?
We need some sort of statistical framework to interpret a measure of cluster validity
- To measure cluster quality and to statistically test the validity of the clusters
Rand Index
What is the Rand Index? (External Measure)
- The Rand Index (or Rand Measure) in statistics and data clustering is a measure of the similarity between two clusterings, i.e. two different assignments of points to clusters (a toy sketch follows).
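A tiny sketch of what the Rand Index captures: it looks at every pair of points and asks whether the two labelings agree about putting that pair in the same cluster or in different clusters. (sklearn.metrics.rand_score requires scikit-learn 0.24 or newer; the labels below are made up.)

```python
from sklearn.metrics import rand_score, adjusted_rand_score

true_labels    = [0, 0, 0, 1, 1, 1]
cluster_labels = [1, 1, 0, 0, 2, 2]   # cluster numbering is arbitrary

print(rand_score(true_labels, cluster_labels))           # fraction of agreeing pairs
print(adjusted_rand_score(true_labels, cluster_labels))  # same idea, corrected for chance
```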
Python Examples of Cluster Validation (using the Rand Index)
import pandas as pd
data = pd.read_csv('diabetes.csv',header='infer')
data.head()
Output:
preg | plas | pres | skin | insu | mass | pedi | age | class | |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | tested_positive |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | tested_negative |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | tested_positive |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | tested_negative |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | tested_positive |
Y = (data['class'] == 'tested_positive')*1
X = data.drop('class',axis=1)
from sklearn import cluster, metrics
## group into 2 clusters tested positive and tested negative
k_means = cluster.KMeans(n_clusters = 2)
# Fit
k_means.fit(X)
Confusion Matrix
metrics.confusion_matrix(Y,k_means.labels_)
Output:
array([[421, 79],
[182, 86]])
Adjusted Rand Score
ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)
metrics.adjusted_rand_score(Y,k_means.labels_)
Output:
Adjusted Rand Index: 0.07438695547529094
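The chance correction in the formula above is the whole point: a completely random labeling of the same data can still get a fairly high raw Rand Index simply because most pairs agree on being "apart", while its adjusted score sits near zero. A quick sketch reusing Y from above (rand_score needs scikit-learn 0.24+):

```python
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

rng = np.random.default_rng(0)
random_labels = rng.integers(0, 2, size=len(Y))   # a meaningless random 2-cluster labeling

print(rand_score(Y, random_labels))               # noticeably above 0
print(adjusted_rand_score(Y, random_labels))      # close to 0
```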
Hierarchical Clustering
- Hierarchical clustering produces nested clusters organized as a tree!
- It has the advantage of not requiring a pre-defined number of clusters, but it doesn't scale well to huge amounts of data! (A scikit-learn alternative is sketched below.)
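Besides the SciPy functions used below, scikit-learn's AgglomerativeClustering offers the same fit/predict workflow as KMeans if you only need flat labels; a minimal sketch on placeholder random data (the 4-cluster choice is arbitrary):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(20, 3)                                    # placeholder data: 20 points, 3 features
agg = AgglomerativeClustering(n_clusters=4, linkage='complete')
print(agg.fit_predict(X))                                    # a flat cluster label (0-3) for each point
```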
Python Examples of Hierarchical Clustering
import pandas as pd
data = pd.read_csv('animals.csv')
names = data['Name']
Y = data['Class']
## Drop Name Column and Class (predictor) Column
X = data.drop(['Name','Class'],axis=1)
data
Output:
Name | warmBlooded | coldBlooded | GiveBirth | LayEggs | CanSwim | CanFly | HasLegs | HasScales | Class | |
---|---|---|---|---|---|---|---|---|---|---|
0 | human | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | mammal |
1 | python | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | reptile |
2 | salmon | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | fish |
3 | whale | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | mammal |
4 | frog | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | amphibian |
5 | komodo dragon | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | reptile |
6 | bat | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | mammal |
7 | pigeon | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | bird |
8 | cat | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | mammal |
9 | leopard shark | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | fish |
10 | turtle | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | reptile |
11 | penguin | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | bird |
12 | porcupine | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | mammal |
13 | eel | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | fish |
14 | salamander | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | amphibian |
15 | alligator | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | reptile |
Import these Libraries
import numpy as np
# convert X to a NumPy array and take a look at it
X.to_numpy()
Output:
array([[1, 0, 1, 0, 0, 0, 1, 0],
[0, 1, 0, 1, 0, 0, 0, 1],
[0, 1, 0, 1, 1, 0, 0, 1],
[1, 0, 1, 0, 1, 0, 0, 0],
[0, 1, 0, 1, 1, 0, 1, 0],
[0, 1, 0, 1, 0, 0, 1, 1],
[1, 0, 1, 0, 0, 1, 1, 0],
[1, 0, 0, 1, 0, 1, 1, 0],
[1, 0, 1, 0, 0, 0, 1, 0],
[0, 1, 1, 0, 1, 0, 0, 1],
[0, 1, 0, 1, 1, 0, 1, 1],
[1, 0, 0, 1, 1, 0, 1, 0],
[1, 0, 1, 0, 0, 0, 1, 0],
[0, 1, 0, 1, 1, 0, 0, 1],
[0, 1, 0, 1, 1, 0, 1, 0],
[0, 1, 0, 1, 1, 0, 1, 1]])
Intro to SciPy:
scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean', optimal_ordering=False)
y: may be either a 1-D condensed distance matrix or a 2-D array of observation vectors
method: used to compute the distance d(s, t) between two clusters s and t (inter-cluster proximity)
- method = ‘single’ (or Min, shortest distance)
- method = ‘complete’ (or Max, largest distance)
- method = ‘average’
- method = ‘ward’ (Proximity of two clusters is based on the increase in the SSE when two clusters are merged)
- method = ‘weighted’
- method = ‘centroid’
- method = ‘median’
metric: the distance metric to use when y is a collection of observation vectors; ignored otherwise.
optimal_ordering: if True, the linkage matrix will be reordered so that the distance between successive leaves is minimal.
Note that method, metric, and optimal_ordering are all optional.
Returns:
- a hierarchical clustering encoded as a linkage matrix (see the sketch below for how to read its rows)
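Each row of the returned linkage matrix describes one merge: the indices of the two clusters being joined, the distance between them, and the number of original points in the newly formed cluster. A small sketch on made-up 1-D data:

```python
import numpy as np
from scipy.cluster import hierarchy

pts = np.array([[0.0], [0.1], [5.0], [5.2]])   # two obvious pairs of points
Z = hierarchy.linkage(pts, method='single')
print(Z)
# each row: [cluster i, cluster j, distance between them, size of the merged cluster]
```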
How about hierarchy.dendrogram ?
Parameters:
- Z: The linkage matrix encoding the hierarchical clustering to render as a dendrogram
- orientation: the direction in which to plot the dendrogram, which can be any of the following strings:
  - 'top', 'bottom', 'left', 'right'
- labels: text labels for the leaves
Let's start by importing the libraries needed!
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt
%matplotlib inline
from pandas import Series
import numpy as np
Z = hierarchy.linkage(X.to_numpy(), 'complete')
dn = hierarchy.dendrogram(Z,labels=names.tolist(),orientation='left')
Output:
Let's check the number of clusters given the threshold (t = 1.1)
Z = hierarchy.linkage(X.to_numpy(), 'complete')
labels = hierarchy.fcluster(Z, t = 1.1)
labels
Output:
array([1, 4, 5, 2, 3, 4, 1, 1, 1, 2, 6, 1, 1, 5, 3, 6], dtype=int32)
Rand Score
What is Y again?
Y
Output:
0 mammal
1 reptile
2 fish
3 mammal
4 amphibian
5 reptile
6 mammal
7 bird
8 mammal
9 fish
10 reptile
11 bird
12 mammal
13 fish
14 amphibian
15 reptile
Name: Class, dtype: object
Adjusted Rand Score
adjusted_rand_score(labels_true, labels_predicted)
metrics.adjusted_rand_score(Y,labels)
Output:
Adjusted Rand Score: 0.4411764705882353
Adjusting the threshold to t = 1
What does fcluster do?
- Form flat clusters from the hierarchical clustering defined by the given linkage matrix.
Z = hierarchy.linkage(X.to_numpy(), 'complete')
labels = hierarchy.fcluster(Z, t=1)
labels
Output:
array([1, 5, 6, 3, 4, 5, 1, 2, 1, 3, 7, 2, 1, 6, 4, 7], dtype=int32)
Adjusted Rand Score
metrics.adjusted_rand_score(Y,labels)
Output:
Adjusted Rand Score: 0.6180555555555556
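Note that fcluster's behaviour depends on its criterion argument (the default, relied on above, is 'inconsistent'). If you would rather request an exact number of flat clusters instead of tuning the threshold t, criterion='maxclust' does that; a small sketch reusing the linkage matrix Z:

```python
# ask directly for 4 flat clusters instead of choosing a distance threshold
labels_4 = hierarchy.fcluster(Z, t=4, criterion='maxclust')
print(labels_4)
```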
More Examples of Clustering…. (Cluster Analysis)
import pandas as pd
data = pd.read_csv('traffic.csv')
data.head()
date | 0:00 | 0:05 | 0:10 | 0:15 | 0:20 | 0:25 | 0:30 | 0:35 | 0:40 | … | 23:15 | 23:20 | 23:25 | 23:30 | 23:35 | 23:40 | 23:45 | 23:50 | 23:55 | class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4/10/2005 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | none |
1 | 4/11/2005 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | 12.0 | 6.0 | 8.0 | 11.0 | 7.0 | 2.0 | 3.0 | 6.0 | 8.0 | none |
2 | 4/12/2005 | 3.0 | 8.0 | 10.0 | 6.0 | 1.0 | 4.0 | 9.0 | 4.0 | 6.0 | … | 4.0 | 10.0 | 2.0 | 6.0 | 5.0 | 4.0 | 5.0 | 6.0 | 7.0 | afternoon |
3 | 4/13/2005 | 6.0 | 5.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 2.0 | 12.0 | … | 7.0 | 15.0 | 10.0 | 4.0 | 10.0 | 6.0 | 10.0 | 6.0 | 16.0 | evening |
4 | 4/14/2005 | 7.0 | 3.0 | 6.0 | 11.0 | 8.0 | 6.0 | 6.0 | 10.0 | 4.0 | … | 5.0 | 9.0 | 4.0 | 4.0 | 6.0 | 9.0 | 5.0 | 16.0 | 8.0 | none |
5 rows × 290 columns
Let's clean up this data!
X = data.drop('date', axis=1)
y = data['class']
y = y.map({'none': 0, 'afternoon': 1, 'evening': 2})
X = X.drop('class',axis=1)
X.head()
0:00 | 0:05 | 0:10 | 0:15 | 0:20 | 0:25 | 0:30 | 0:35 | 0:40 | 0:45 | … | 23:10 | 23:15 | 23:20 | 23:25 | 23:30 | 23:35 | 23:40 | 23:45 | 23:50 | 23:55 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | … | 12.0 | 12.0 | 6.0 | 8.0 | 11.0 | 7.0 | 2.0 | 3.0 | 6.0 | 8.0 |
2 | 3.0 | 8.0 | 10.0 | 6.0 | 1.0 | 4.0 | 9.0 | 4.0 | 6.0 | 13.0 | … | 10.0 | 4.0 | 10.0 | 2.0 | 6.0 | 5.0 | 4.0 | 5.0 | 6.0 | 7.0 |
3 | 6.0 | 5.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 2.0 | 12.0 | 2.0 | … | 7.0 | 7.0 | 15.0 | 10.0 | 4.0 | 10.0 | 6.0 | 10.0 | 6.0 | 16.0 |
4 | 7.0 | 3.0 | 6.0 | 11.0 | 8.0 | 6.0 | 6.0 | 10.0 | 4.0 | 7.0 | … | 12.0 | 5.0 | 9.0 | 4.0 | 4.0 | 6.0 | 9.0 | 5.0 | 16.0 | 8.0 |
Let's replace any missing values with the column mean
mean_value = X.mean()
X = X.fillna(mean_value)
X.head()
Output:
0:00 | 0:05 | 0:10 | 0:15 | 0:20 | 0:25 | 0:30 | 0:35 | 0:40 | 0:45 | 23:10 | 23:15 | 23:20 | 23:25 | 23:30 | 23:35 | 23:40 | 23:45 | 23:50 | 23:55 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8.472727 | 7.969697 | 7.343558 | 7.208589 | 7.006135 | 6.411043 | 6.349693 | 6.705521 | 5.98773 | 5.460123 | … | 14.763636 | 14.042424 | 13.551515 | 12.151515 | 11.684848 | 10.345455 | 10.575758 | 9.321212 | 9.351515 | 8.884848 |
1 | 8.472727 | 7.969697 | 7.343558 | 7.208589 | 7.006135 | 6.411043 | 6.349693 | 6.705521 | 5.98773 | 5.460123 | … | 12.000000 | 12.000000 | 6.000000 | 8.000000 | 11.000000 | 7.000000 | 2.000000 | 3.000000 | 6.000000 | 8.000000 |
2 | 3.000000 | 8.000000 | 10.000000 | 6.000000 | 1.000000 | 4.000000 | 9.000000 | 4.000000 | 6.00000 | 13.000000 | … | 10.000000 | 4.000000 | 10.000000 | 2.000000 | 6.000000 | 5.000000 | 4.000000 | 5.000000 | 6.000000 | 7.000000 |
3 | 6.000000 | 5.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 2.000000 | 12.00000 | 2.000000 | … | 7.000000 | 7.000000 | 15.000000 | 10.000000 | 4.000000 | 10.000000 | 6.000000 | 10.000000 | 6.000000 | 16.000000 |
4 | 7.000000 | 3.000000 | 6.000000 | 11.000000 | 8.000000 | 6.000000 | 6.000000 | 10.000000 | 4.00000 | 7.000000 | … | 12.000000 | 5.000000 | 9.000000 | 4.000000 | 4.000000 | 6.000000 | 9.000000 | 5.000000 | 16.000000 | 8.000000 |
Clustering now
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X)
values = k_means.cluster_centers_
labels = k_means.labels_
Confusion Matrix
from sklearn import metrics
metrics.confusion_matrix(y, k_means.labels_)
Output:
array([[ 0, 71, 23],
[ 0, 4, 19],
[36, 13, 9]])
Let's find the adjusted Rand score
metrics.adjusted_rand_score(y,k_means.labels_)
Output:
0.3557396125157386
Let's plot the cluster centers
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(k_means.cluster_centers_.T)
plt.legend(['Cluster 0','Cluster 1','Cluster 2'])
Output:
Another Example of K-Means Clustering Algorithm
Here we will write the code for a variation of the k-means clustering algorithm known as bisecting k-means.
This is how the algorithm works:
Suppose you want to create 4 clusters. The algorithm first applies the standard k-means algorithm to bisect the entire data set into 2 clusters, say, C1 and C2.
It then computes the sum of squared error (SSE) of each cluster and chooses the cluster with the higher SSE to be further partitioned into 2 smaller clusters.
For example, if cluster C1 has a larger SSE than C2, the algorithm applies standard k-means to all the data points in cluster C1 and divides it into 2 smaller clusters, say, C1a and C1b. At this point you have 3 clusters: C1a, C1b, and C2.
Next, you compare the SSE of the 3 clusters and choose the one with the highest SSE. You bisect that cluster into 2 smaller clusters, thereby creating the 4 clusters needed.
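(As an aside, scikit-learn 1.1 and later ship a BisectingKMeans estimator whose default bisecting_strategy='biggest_inertia' follows essentially this split-the-largest-SSE-cluster strategy; the hand-rolled version below is still worth working through to see the mechanics. A hedged sketch on placeholder data:)

```python
import numpy as np
from sklearn.cluster import BisectingKMeans   # requires scikit-learn >= 1.1

X = np.random.rand(100, 2)                    # placeholder 2-D data
bkm = BisectingKMeans(n_clusters=4, bisecting_strategy='biggest_inertia', random_state=1)
print(bkm.fit_predict(X)[:10])                # cluster label for the first 10 points
```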
Let's first download the dataset here:
a) Let's load our CSV file into a pandas DataFrame to work with, and assign the names of its two columns as 'x1' and 'x2'.
import pandas as pd
import matplotlib
%matplotlib inline
data = pd.read_csv("2d_data.csv", header= None, names = ['x1','x2'])
#plot scatter graph
data.plot.scatter(x=0,y=1, label = 'data', color = 'green' )
data.head()
Output:
x1 | x2 | |
---|---|---|
0 | 1.1700 | 1.2809 |
1 | 1.5799 | 0.6373 |
2 | 0.2857 | 0.6620 |
3 | 1.2726 | 0.7440 |
4 | 1.1008 | 0.0689 |
b) Next let's write the function bisect(), which takes three input arguments: the data to be clustered, k (the number of clusters), and seed (for the random number generator). The function returns the cluster labels as output.
from sklearn import cluster
import numpy as np
def bisect(data, k, seed = 1):
    labels = pd.Series(np.zeros(data.shape[0]))
    SSE = []
    # Start with a single cluster containing all the data and record its SSE
    k_means = cluster.KMeans(n_clusters=1, random_state=seed)
    k_means.fit(data)
    SSE.append(k_means.inertia_)
    print('Iteration 0 SSE =', SSE)
    clusterID = [np.arange(0, data.shape[0]).tolist()]
    for numClusters in np.arange(1, k):
        # Bisect the cluster with the largest SSE using standard 2-means
        s_data = data.iloc[clusterID[SSE.index(max(SSE))]]
        k_means = cluster.KMeans(n_clusters=2, random_state=seed)
        k_means.fit(s_data)
        label = [[], []]
        for i in range(0, s_data.shape[0]):
            if k_means.labels_[i] == 0:
                label[0].append(clusterID[SSE.index(max(SSE))][i])
            elif k_means.labels_[i] == 1:
                label[1].append(clusterID[SSE.index(max(SSE))][i])
        for j in range(0, 2):
            if j == 0:
                labels[label[j]] = labels[label[j][0]]
            elif j == 1:
                labels[label[j]] = numClusters
            clusterID.append(label[j])
            SSE.append(((s_data[k_means.labels_ == j] - k_means.cluster_centers_[j]) ** 2).sum().sum())
        # Drop the cluster that was just split (it still has the largest SSE in the list)
        clusterID.pop(SSE.index(max(SSE)))
        SSE.pop(SSE.index(max(SSE)))
        print('Iteration', numClusters, ' SSE =', SSE)
    return labels
c) Apply the bisecting k-means algorithm to generate 8 clusters from the data. Assign the clustering result as another column of the DataFrame, named 'labels'. Draw a scatter plot of the data using the cluster labels as the color of the data points.
data['labels'] = bisect(data, 8, 1)
data.plot.scatter(x='x1',y='x2',c='labels',colormap='jet')
Output:
More examples
The goal for this example is to predict whether an office room is occupied (1) or not (0).
import pandas as p
data = p.read_csv('train.csv')
data.head()
Output:
date | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupancy | |
---|---|---|---|---|---|---|---|
0 | 2/4/2015 18:07 | 23.000 | 27.200 | 0.0 | 681.5 | 0.004728 | 0 |
1 | 2/4/2015 18:16 | 22.790 | 27.445 | 0.0 | 689.0 | 0.004710 | 0 |
2 | 2/4/2015 19:05 | 22.245 | 27.290 | 0.0 | 602.5 | 0.004530 | 0 |
3 | 2/4/2015 19:06 | 22.200 | 27.290 | 0.0 | 598.0 | 0.004517 | 0 |
4 | 2/4/2015 19:12 | 22.200 | 27.290 | 0.0 | 593.5 | 0.004517 | 0 |
Drop any unneeded columns and save X as our columns of interest and Y as our target
data = data.drop('date',axis=1)
Y = data['Occupancy']
X = data.drop('Occupancy',axis=1)
Now let's train the following classifiers (using 5-fold cross-validation) on the data above and calculate the average cross-validation accuracy for each method (Decision Tree, K-Nearest Neighbor, Logistic Regression). Vary the hyperparameter of each classifier and draw a plot that shows the average cross-validation accuracy versus the hyperparameter values for each classification method shown below:
1) Decision Tree (MaxDepth = 1, 5, 10, 50, 100)
2) K-Nearest Neighbor (k = 1, 2, 3, 4, 5, 10, 15)
3) Logistic Regression (C = 0.001, 0.01, 0.1, 0.5, 1)
Decision Tree Classifier
import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
%matplotlib inline
maxdepths = [1, 5, 10, 50, 100]
treeAcc = []
for depth in maxdepths:
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    scores = cross_val_score(clf, X, Y, cv=5)
    treeAcc.append(scores.mean())
plt.plot(maxdepths, treeAcc, 'ro-')
plt.xlabel('Maximum depth')
plt.ylabel('Accuracy')
Output:
K-Nearest Neighbor
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
%matplotlib inline
numNeighbors = [1, 5, 10, 15, 20, 25, 30]
knn_acc = []
for nn in numNeighbors:
    clf = KNeighborsClassifier(n_neighbors=nn)
    scores = cross_val_score(clf, X, Y, cv=5)
    knn_acc.append(scores.mean())
plt.plot(numNeighbors, knn_acc, 'bo-')
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
Output:
Logistic Regression
from sklearn import linear_model
regularizers = [0.001, 0.01, 0.1, 0.5, 1]
logistic_acc = []
for C in regularizers:
    clf = linear_model.LogisticRegression(C=C)
    scores = cross_val_score(clf, X, Y, cv=5)
    logistic_acc.append(scores.mean())
print(logistic_acc)
plt.plot(regularizers, logistic_acc, 'ro-')
plt.xlabel('Regularizer')
plt.ylabel('Accuracy')
Output:
Now let's draw a bar chart to compare the test accuracies of the 3 methods above!
data = p.read_csv('test.csv',header='infer')
data = data.drop('date',axis=1)
Y_test = data['Occupancy']
X_test = data.drop('Occupancy',axis=1)
from sklearn import tree
from sklearn.metrics import accuracy_score
bestdepth = 5
clf = tree.DecisionTreeClassifier(max_depth=bestdepth)
clf = clf.fit(X, Y)
Ypred = clf.predict(X_test)
tree_acc = accuracy_score(Y_test, Ypred)
bestNN = 5
clf = KNeighborsClassifier(n_neighbors=bestNN)
clf = clf.fit(X, Y)
Ypred = clf.predict(X_test)
knn_acc = accuracy_score(Y_test, Ypred)
bestC = 1
clf = linear_model.LogisticRegression(C=bestC)
clf = clf.fit(X, Y)
Ypred = clf.predict(X_test)
logistic_acc = accuracy_score(Y_test, Ypred)
methods = ['Dtree', 'KNN', 'Logistic']
plt.bar([1,2,3], [tree_acc, knn_acc, logistic_acc])
plt.xticks([1,2,3], methods)
plt.ylim(0.9,1.0)
Output:
Another One (Predictive Modeling)
import pandas as pd
data = pd.read_csv("PRSA_data.csv")
data.head()
Output:
No | year | month | day | hour | pm2.5 | DEWP | TEMP | PRES | cbwd | Iws | Is | Ir | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2010 | 1 | 1 | 0 | NaN | -21 | -11.0 | 1021.0 | NW | 1.79 | 0 | 0 |
1 | 2 | 2010 | 1 | 1 | 1 | NaN | -21 | -12.0 | 1020.0 | NW | 4.92 | 0 | 0 |
2 | 3 | 2010 | 1 | 1 | 2 | NaN | -21 | -11.0 | 1019.0 | NW | 6.71 | 0 | 0 |
3 | 4 | 2010 | 1 | 1 | 3 | NaN | -21 | -14.0 | 1019.0 | NW | 9.84 | 0 | 0 |
4 | 5 | 2010 | 1 | 1 | 4 | NaN | -20 | -12.0 | 1018.0 | NW | 12.97 | 0 | 0 |
Let's look at the missing values! Let's remove any rows that contain missing values and also discard the columns named "No", "year", and "cbwd".
data = data.drop('No', axis=1)
data = data.drop('cbwd', axis=1)
data = data.drop('year', axis=1)
# Drop any rows with missing values
data = data.dropna()
data.head()
Output:
month | day | hour | pm2.5 | DEWP | TEMP | PRES | Iws | Is | Ir | |
---|---|---|---|---|---|---|---|---|---|---|
24 | 1 | 2 | 0 | 129.0 | -16 | -4.0 | 1020.0 | 1.79 | 0 | 0 |
25 | 1 | 2 | 1 | 148.0 | -15 | -4.0 | 1020.0 | 2.68 | 0 | 0 |
26 | 1 | 2 | 2 | 159.0 | -11 | -5.0 | 1021.0 | 3.57 | 0 | 0 |
27 | 1 | 2 | 3 | 181.0 | -7 | -5.0 | 1022.0 | 5.36 | 1 | 0 |
28 | 1 | 2 | 4 | 138.0 | -7 | -5.0 | 1022.0 | 6.25 | 2 | 0 |
Now let's separate the target column (pm2.5) from the predictor columns:
Y = pd.DataFrame(data['pm2.5'])
X = pd.DataFrame(data.drop('pm2.5', axis=1))
X.head()
Output:
month | day | hour | DEWP | TEMP | PRES | Iws | Is | Ir | |
---|---|---|---|---|---|---|---|---|---|
24 | 1 | 2 | 0 | -16 | -4.0 | 1020.0 | 1.79 | 0 | 0 |
25 | 1 | 2 | 1 | -15 | -4.0 | 1020.0 | 2.68 | 0 | 0 |
26 | 1 | 2 | 2 | -11 | -5.0 | 1021.0 | 3.57 | 0 | 0 |
27 | 1 | 2 | 3 | -7 | -5.0 | 1022.0 | 5.36 | 1 | 0 |
28 | 1 | 2 | 4 | -7 | -5.0 | 1022.0 | 6.25 | 2 | 0 |
Let's divide the data into 70% training and 30% test sets using train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X,Y,test_size =0.3, random_state=1)
Now let's fit a linear regression model to the training data.
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np
# Create linear regression object
regr = linear_model.LinearRegression()
# Fit regression model to the training set
regr.fit(X_train, y_train)
# Apply model to the test set
y_pred_test = regr.predict(X_test)
# Evaluate the results
print("Root mean squared error = %.4f" % np.sqrt(mean_squared_error(y_test, y_pred_test)))
print("R-square = %.4f" % r2_score(y_test, y_pred_test))
print('Slope Coefficients:', regr.coef_ )
print('Intercept:', regr.intercept_)
plt.scatter(y_test, y_pred_test)
Output:
Root mean squared error = 78.1049
R-square = 0.2560
Slope Coefficients: [[-1.56802833 0.75050291 1.63264151 4.80497766 -6.6414188 -1.47624944 -0.25073801 -2.70947104 -7.33484175]]
Intercept: [1660.6910474]