# Clustering Analysis

## Introduction of Cluster Analysis

What is Clustering?

• Clustering is partitioning (dividing up) the population of data points into groups that are more similar to other data points in the same group

• Helps identify two qualities of data:

1. Meaningfulness
2. Usefulness

Why Clustering over Classification?

• Clustering can adapt to changes made and helps single out useful features that differentiate different groups

### Examples of using Clustering

Data Attribute Set (x) Clustering Task
Customer Demographic and Location Information Group together similar customers
Word Document Words Group Documents based on similar words
Ads Number of different Ads Group together similar people and target them
Fantasy Football Performance Data Group together similar players to compare

### Clustering Methods

• Partitioning Method

• K-Means all all its variants (Fuzzy k-means aka similar words)
• Support Vector Clustering

• Model-Based Method

• Heirarchial Method

## K-Means Clustering

• Unsuprervised Machine Learning Technique

• One of the more popular “clustering” algorithms, K-Means stores centroids that it uses to define clusters. A point is considered to be in a particular cluster if its close to that cluster’s centroid than any other centroid.

## K-Means Algorithm

Steps: (Pseudocode)

1. Specify the number of k clusters to assign.
2. Randomly pick number of k centroids.
3. Repeat
4. Expectation: Assign each point to its closet centroids.
5. Maximization: Compute the new centroid (mean) of each cluster.
6. Until The centroid positions do not change.

Note: The quality of the cluster assignment is determined by computing the sum of the squared error (SSE) after the centroids converge.       ### Example of K-Means Algorithm in Python

sklearn

Import Libraries

``````from sklearn.cluster import KMeans
import pandas as pd

from matplotlib import pyplot as plt
%matplotlib inline
``````
• Using NBA data with player age and salary! (note some salaries are not correct )
``````data = pd.read_csv("salary_nba.csv")
data
``````

Output:

Name Age Salary
0 Lebron James 36 40000000
1 Steph Curry 31 41000000
2 Kevin Durant 33 39000000
3 Chris Paul 35 41500000
4 Anthony Davis 28 33000000
5 Dennis Schroder 27 15500000
6 KCP 28 12000000
7 Montrezl Harrell 27 16000000
8 Alex Caruso 25 14750000
9 Kyle Kuzma 27 12960000
10 Markieff Morris 25 2300000
11 Alfonzo Mckinnie 28 13000000
12 Talen Horton-Tucker 20 1500000
13 Kostas Antetokounmpo 23 1000000
14 Devontae Cacok 24 1000000
15 Damian Jones 21 785000

Lets plot the Age and Salary

``````plt.scatter(data.Age,data['Salary'])

plt.xlabel('Age')
plt.ylabel('Salary')
``````

Output: ``````km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(data[['Age','Salary']])
y_predicted
``````

Output:

``````array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2], dtype=int32)
``````
``````data['cluster']=y_predicted
``````

Output:

Name Age Salary cluster
0 Lebron James 36 40000000 1
1 Steph Curry 31 41000000 1
2 Kevin Durant 33 39000000 1
3 Chris Paul 35 41500000 1
4 Anthony Davis 28 33000000 1
``````## Location of "Cluster Centers (x,y)"
kmeans.cluster_centers_
``````

Output:

array([[2.7000e+01, 1.4035e+07], [3.2600e+01, 3.8900e+07], [2.2600e+01, 1.3170e+06]])

``````data1 = data[data.cluster==0]
data2 = data[data.cluster==1]
data3 = data[data.cluster==2]
plt.scatter(data1.Age,data1['Salary'],color='blue')
plt.scatter(data2.Age,data2['Salary'],color='orange')
plt.scatter(data3.Age,data3['Salary'],color='green')
plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.legend()
``````

Output: ### Errors with K-Means

• K-Means Clustering was designed to minimize the sum of square error (SSE)
• SSE measures the sum-of-squared distance between every data point x to their cluster centroid c

• SSE can be used to determine the number of clusters

Lets use the Sum of Squared Error (SEE) to plot an Elbow Plot

``````sse = []
k_range = range(1,10)
for k in k_range:
km = KMeans(n_clusters=k)
km.fit(data[['Age','Salary']])
sse.append(km.inertia_)

## label and plot the Elbow Graph
plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.plot(k_range,sse)
``````

Output: We can use the Elbow Graph to determine the number of clusters we want!

## More Examples using Clustering (another NBA Example)

Lets look at the Top 50 NBA Scorers!!

Lets open our File:

``````data = pd.read_csv('top_50_NBA.csv')
``````

Lets use Clustering to cluster Points and Usage categories.

``````kmeans = KMeans(n_clusters = 4, random_state = 98)

x = np.column_stack((data['PTS'], data['USG%']))

## Fit to our Kmeans
kmeans.fit(x)

y_kmeans = kmeans.predict(x)
``````
``````for i, j in zip(df_counting['Player'], y_kmeans):
print(i, j)
``````

Output:

``````James Harden 2
Paul George 1
Giannis Antetokounmpo 1
Joel Embiid 1
Stephen Curry 1
Kawhi Leonard 1
Devin Booker 1
Kevin Durant 1
Damian Lillard 1
Kemba Walker 1
Blake Griffin 1
Karl-Anthony Towns 1
Kyrie Irving 1
Donovan Mitchell 1
Zach LaVine 1
Russell Westbrook 1
Klay Thompson 3
Julius Randle 3
LaMarcus Aldridge 3
Jrue Holiday 3
DeMar DeRozan 3
Luka Doncic 3
Mike Conley 3
DAngelo Russell 3
CJ McCollum 3
Nikola Vucevic 3
Buddy Hield 3
Nikola Jokic 3
Tobias Harris 0
Lou Williams 3
Danilo Gallinari 0
John Collins 0
Trae Young 3
Jimmy Butler 0
Kyle Kuzma 0
Khris Middleton 0
Jamal Murray 0
Andrew Wiggins 0
J.J. Redick 0
Tim Hardaway 0
Bojan Bogdanovic 0
Andre Drummond 0
DeAaron Fox 0
Ben Simmons 0
Pascal Siakam 0
Spencer Dinwiddie 0
Jordan Clarkson 3
Collin Sexton 0
Clint Capela 0
``````

Graph the Clustered Points and Usage

``````points_usage_clustered, ax = plt.subplots()

# List of the 4 Clusters

cluster_1 = []
cluster_2 = []
cluster_3 = []
cluster_4 = []

# Add Cluster to each desinated cluster list above
for i in range(len(y_kmeans)):
if(y_kmeans[i] == 0):
cluster_1.append(x[i])
elif(y_kmeans[i] == 1):
cluster_2.append(x[i])
elif(y_kmeans[i] == 2):
cluster_3.append(x[i])
elif(y_kmeans[i] == 3):
cluster_4.append(x[i])

# vstack stacks things refer to numpy doc --> https://numpy.org/doc/stable/reference/generated/numpy.vstack.html

#vstacks is an array so array([[21.5, 25.6],[points, usage(%)],..])

cluster_1 = np.vstack(cluster_1)
cluster_2 = np.vstack(cluster_2)
cluster_3 = np.vstack(cluster_3)
cluster_4 = np.vstack(cluster_4)

# ax.scatter(x,y, label = 'name of what were plotting', s = size )
ax.scatter(cluster_1[:, 0], cluster_1[:, 1], label = "Cluster 1 (Bottom Top 50)")
ax.scatter(cluster_2[:, 0], cluster_2[:, 1], label = "Cluster 2 (Top Scorers in NBA)")
ax.scatter(cluster_3[:, 0], cluster_3[:, 1], label = "Cluster 3 (James Freaking Harden)")
ax.scatter(cluster_4[:, 0], cluster_4[:, 1], label = "Cluster 4 (Middle Top 50 Scorers) ")

## Heres the centroids! array([[20.56666667, 27.84666667], [x,y],..])
centers = kmeans.cluster_centers_

## Now we can plot the centroids (4 of them)
ax.scatter(centers[:, 0], centers[:, 1], c = 'black', s = 200, alpha = .5, label = 'Cluster Center')

# lets plot the Legend, which will include the labels for each of our 4 clusters above
ax.legend(loc ='best', prop = {'size': 8})

## Labeling the x-axis and y-axis
ax.set_xlabel('Points Per Game')
ax.set_ylabel('Usage (%)')

## Naming our Scattered Plot!!!!
points_usage_clustered.suptitle("Clustered Points and Usage")

plt.show()
``````

Output: ## Limitations of K-Means

When are there problems with K-Means Clustering?

• When clusters are of differing Sizes, Densities, and Non-Globular Shapes (think dount shaped data)
• When there are outliers in the Data (James Harden in the example above)

# Part 2 of Cluster Analysis

## Cluster Validity

• For supervised classification have different ways to measure how good our model is!

• Accuracy, True Positive Rate

How does one evalute the “goodness” of the resulting Clusters?

• Isn’t a easy answer because “clusters” are in the eye of the beholder!

• Notion of a cluster can be Ambiguous

### Issues in Cluster Validation

• How many Clusters are there in the data?
• Are the clusters real or are they nothing more than “accidental” groupings of the data

We need some sort of statistical framework to interpet a measure for Cluster Validity

• To measure cluster quality and a statistical approach for testing vality of the clusters

### Rand Index

What is the Rand Index? (External Measure)

• Rand Index or Rand Measure in statistics (data clustering), is a measure of the similarity between two two clustering.

### Python Examples of Cluster Validation (using the Rand Index)

``````import pandas as pd

``````

Output:

preg plas pres skin insu mass pedi age class
0 6 148 72 35 0 33.6 0.627 50 tested_positive
1 1 85 66 29 0 26.6 0.351 31 tested_negative
2 8 183 64 0 0 23.3 0.672 32 tested_positive
3 1 89 66 23 94 28.1 0.167 21 tested_negative
4 0 137 40 35 168 43.1 2.288 33 tested_positive
``````Y = (data['class'] == 'tested_positive')*1
X = data.drop('class',axis=1)
``````
``````from sklearn import cluster, metrics

## group into 2 clusters tested positive and tested negative
k_means = cluster.KMeans(n_clusters = 2)
# Fit
k_means.fit(X)
``````

Confusion Matrix

``````metrics.confusion_matrix(Y,k_means.labels_)

``````

Output:

``````array([[421,  79],
[182,  86]])
``````

ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)

``````metrics.adjusted_rand_score(Y,k_means.labels_)
``````

Output:

## Hierarchical Clustering

• Hierachical Clustering produces nested clusters organized as a tree!
• It has the advantage of not having a pre-defined number of clusters, but it doesn’t work well when we have huge amounts of data!

### Python Examples of Hierarchical Clustering

``````import pandas as pd

names = data['Name']
Y = data['Class']
## Drop Name Column and Class (predictor) Column
X = data.drop(['Name','Class'],axis=1)
data
``````

Output:

Name warmBlooded coldBlooded GiveBirth LayEggs CanSwim CanFly HasLegs HasScales Class
0 human 1 0 1 0 0 0 1 0 mammal
1 python 0 1 0 1 0 0 0 1 reptile
2 salmon 0 1 0 1 1 0 0 1 fish
3 whale 1 0 1 0 1 0 0 0 mammal
4 frog 0 1 0 1 1 0 1 0 amphibian
5 komodo dragon 0 1 0 1 0 0 1 1 reptile
6 bat 1 0 1 0 0 1 1 0 mammal
7 pigeon 1 0 0 1 0 1 1 0 bird
8 cat 1 0 1 0 0 0 1 0 mammal
9 leopard shark 0 1 1 0 1 0 0 1 fish
10 turtle 0 1 0 1 1 0 1 1 reptile
11 penguin 1 0 0 1 1 0 1 0 bird
12 porcupine 1 0 1 0 0 0 1 0 mammal
13 eel 0 1 0 1 1 0 0 1 fish
14 salamander 0 1 0 1 1 0 1 0 amphibian
15 alligator 0 1 0 1 1 0 1 1 reptile

Import these Libraries

``````import numpy as np
# convert X to a matrix using nunpy
X.to_numpy()
X
``````

Output:

``````array([[1, 0, 1, 0, 0, 0, 1, 0],
[0, 1, 0, 1, 0, 0, 0, 1],
[0, 1, 0, 1, 1, 0, 0, 1],
[1, 0, 1, 0, 1, 0, 0, 0],
[0, 1, 0, 1, 1, 0, 1, 0],
[0, 1, 0, 1, 0, 0, 1, 1],
[1, 0, 1, 0, 0, 1, 1, 0],
[1, 0, 0, 1, 0, 1, 1, 0],
[1, 0, 1, 0, 0, 0, 1, 0],
[0, 1, 1, 0, 1, 0, 0, 1],
[0, 1, 0, 1, 1, 0, 1, 1],
[1, 0, 0, 1, 1, 0, 1, 0],
[1, 0, 1, 0, 0, 0, 1, 0],
[0, 1, 0, 1, 1, 0, 0, 1],
[0, 1, 0, 1, 1, 0, 1, 0],
[0, 1, 0, 1, 1, 0, 1, 1]])

``````

Intro to Scipy:

scipy.cluster.hierarchy.linkage( y, Method , Metric, optimal_ordering)

y: May be either a 1-D condensed distance matrix or a 2-D array of observation vectors

Method(s): Used to compute the distance d(s,t) between the 2 clusters s and t (Inter-Cluster Proximity)

• method = ‘single’ (or Min, shortest distance)
• method = ‘complete’ (or Max, largest distance)
• method = ‘average’
• method = ‘ward’ (Proximity of two clusters is based on the increase in the SSE when two clusters are merged)
• method = ‘weighted’
• method = ‘centroid’
• method = ‘median’

Metric: The distance metric to use in the case that y is a collection of observation vectors; ignored otherwise. optimal_ordering: If True, the linkage matrix will be reordered so that the distance between successive leaves is minimal

Note that method, metric and optimal_ordering are all optional

Returns:

• a hierarchical clustering encoded as a linkage matrix.

Parameters:

• Z: The linkage matrix encoding the hierarchical clustering to render as a dendrogram
• Orientation: The direction to plot the dendrogram, which can be any of the following strings:

• top, bottom, left, right
• labels: text

Lets start by importing the Libraries needed!

``````from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

%matplotlib inline
from pandas import Series
import numpy as np

dn = hierarchy.dendrogram(Z,labels=names.tolist(),orientation='left')

``````

Output: Lets check to see the number of Clusters given the threshold (t = 1.1)

``````Z = hierarchy.linkage(X.to_numpy(), 'complete')
labels = hierarchy.fcluster(Z, t = 1.1)
labels
``````

Output:

``````array([1, 4, 5, 2, 3, 4, 1, 1, 1, 2, 6, 1, 1, 5, 3, 6], dtype=int32)
``````

Rand Score

What is Y again?

``````Y
``````

Output:

``````0        mammal
1       reptile
2          fish
3        mammal
4     amphibian
5       reptile
6        mammal
7          bird
8        mammal
9          fish
10      reptile
11         bird
12       mammal
13         fish
14    amphibian
15      reptile
Name: Class, dtype: object
``````

``````metrics.adjusted_rand_score(Y,labels)
``````

Output:

Rand Score: 0.4411764705882353

Scipy Doc

What does the fcluster to?

• Form flat clusters from the hierarchical clustering defined by the given linkage matrix.
``````Z = hierarchy.linkage(X.to_numpy(), 'complete')
labels = hierarchy.fcluster(Z, t=1)
labels
``````

Output:

``````array([1, 5, 6, 3, 4, 5, 1, 2, 1, 3, 7, 2, 1, 6, 4, 7], dtype=int32)
``````

``````metrics.adjusted_rand_score(Y,labels)
``````

Output:

Rand Score: 0.6180555555555556

# More Examples of Clustering…. (Cluster Analysis)

``````import pandas as pd

``````
date 0:00 0:05 0:10 0:15 0:20 0:25 0:30 0:35 0:40 23:15 23:20 23:25 23:30 23:35 23:40 23:45 23:50 23:55 class
0 4/10/2005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN none
1 4/11/2005 NaN NaN NaN NaN NaN NaN NaN NaN NaN 12.0 6.0 8.0 11.0 7.0 2.0 3.0 6.0 8.0 none
2 4/12/2005 3.0 8.0 10.0 6.0 1.0 4.0 9.0 4.0 6.0 4.0 10.0 2.0 6.0 5.0 4.0 5.0 6.0 7.0 afternoon
3 4/13/2005 6.0 5.0 4.0 4.0 4.0 4.0 4.0 2.0 12.0 7.0 15.0 10.0 4.0 10.0 6.0 10.0 6.0 16.0 evening
4 4/14/2005 7.0 3.0 6.0 11.0 8.0 6.0 6.0 10.0 4.0 5.0 9.0 4.0 4.0 6.0 9.0 5.0 16.0 8.0 none

5 rows × 290 columns

Lets clean up this data!!

``````
X = data.dropt('data', axis = 1 )
y = data['class']
y = y.map({'none': 0, 'afternoon': 1, 'evening': 2})

X = X.drop('class',axis=1)
``````
0:00 0:05 0:10 0:15 0:20 0:25 0:30 0:35 0:40 0:45 23:10 23:15 23:20 23:25 23:30 23:35 23:40 23:45 23:50 23:55
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 12.0 12.0 6.0 8.0 11.0 7.0 2.0 3.0 6.0 8.0
2 3.0 8.0 10.0 6.0 1.0 4.0 9.0 4.0 6.0 13.0 10.0 4.0 10.0 2.0 6.0 5.0 4.0 5.0 6.0 7.0
3 6.0 5.0 4.0 4.0 4.0 4.0 4.0 2.0 12.0 2.0 7.0 7.0 15.0 10.0 4.0 10.0 6.0 10.0 6.0 16.0
4 7.0 3.0 6.0 11.0 8.0 6.0 6.0 10.0 4.0 7.0 12.0 5.0 9.0 4.0 4.0 6.0 9.0 5.0 16.0 8.0

Lets replace any empty (missing values) with the mean value

``````mean_value = X.mean()
X = X.fillna(mean_value)
``````

Output:

0:00 0:05 0:10 0:15 0:20 0:25 0:30 0:35 0:40 0:45 23:10 23:15   23:20 23:25 23:30 23:35 23:40 23:45 23:50 23:55
0 8.472727 7.969697 7.343558 7.208589 7.006135 6.411043 6.349693 6.705521 5.98773 5.460123 14.763636 14.042424 13.551515 12.151515 11.684848 10.345455 10.575758 9.321212 9.351515 8.884848
1 8.472727 7.969697 7.343558 7.208589 7.006135 6.411043 6.349693 6.705521 5.98773 5.460123 12.000000 12.000000 6.000000 8.000000 11.000000 7.000000 2.000000 3.000000 6.000000 8.000000
2 3.000000 8.000000 10.000000 6.000000 1.000000 4.000000 9.000000 4.000000 6.00000 13.000000 10.000000 4.000000 10.000000 2.000000 6.000000 5.000000 4.000000 5.000000 6.000000 7.000000
3 6.000000 5.000000 4.000000 4.000000 4.000000 4.000000 4.000000 2.000000 12.00000 2.000000 7.000000 7.000000 15.000000 10.000000 4.000000 10.000000 6.000000 10.000000 6.000000 16.000000
4 7.000000 3.000000 6.000000 11.000000 8.000000 6.000000 6.000000 10.000000 4.00000 7.000000 12.000000 5.000000 9.000000 4.000000 4.000000 6.000000 9.000000 5.000000 16.000000 8.000000

Clustering now

``````from sklearn import cluster

k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X)
values = k_means.cluster_centers_
labels = k_means.labels_

``````

Confusion Matrix

``````from sklearn import metrics

metrics.confusion_matrix(y, k_means.labels_)
``````

Output:

``````array([[ 0, 71, 23],
[ 0,  4, 19],
[36, 13,  9]])
``````

Lets find the Rand Score

``````metrics.adjusted_rand_score(y,k_means.labels_)
``````

Output:

``````0.3557396125157386
``````

Lets plot

``````import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(k_means.cluster_centers_.T)
plt.legend(['Cluster 0','Cluster 1','Cluster 2'])
``````

Output: ### Another Example of K-Means Clustering Algorithm

Here we will write the code for a variation of k-means clustering algorithm known as bisecting k-means.

This is how the algorithm works:

Suppose you want to create 4 clusters. The algorithm will first apply standard k-means algorithm to bisect the entire data into 2 clusters, say, C1 and C2.

It then computes the sum-of-squared error (SSE) of each cluster and choose the cluster with higher SSE to be further partitioned into 2 smaller clusters.

For example, if cluster C1 has larger SSE than C2, the algorithm will apply standard k-means to all the data points in cluster C1 and divide the cluster into 2 smaller clusters, say, C1a and C1b. At this time, you have 3 clusters: C1a, C1b, and C2.

Next, you will compare the SSE of the 3 clusters and choose the one with highest SSE. You will bisect the cluster with highest SSE into 2 smaller clusters, thereby creating the 4 clusters needed.

a) Lets load our csv file into a Pandas DataFrame Object to work with. Lets assign the names of its two columns as ‘x1’ and ‘x2’.

``````import pandas as pd
import matplotlib

%matplotlib inline

#plot scatter graph
data.plot.scatter(x=0,y=1, label = 'data', color = 'green' )
``````

Output:

x1 x2
0 1.1700 1.2809
1 1.5799 0.6373
2 0.2857 0.6620
3 1.2726 0.7440
4 1.1008 0.0689 b) Next lets write the function bisect() that will take in three input arguments: data to be clustered, k (number of clusters), and seed (for random number generator). The function should return the cluster labels as output.

``````from sklearn import cluster
import numpy as np

def bisect(data, k, seed = 1):
labels = pd.Series(np.zeros(data.shape))

SSE = []
k_means = cluster.KMeans(n_clusters=1, random_state=seed)
k_means.fit(data)
SSE.append(k_means.inertia_)
print('Iteration 0  SSE =', SSE)

clusterID = [np.arange(0,data.shape).tolist()]

for numClusters in np.arange(1,k):
s_data = data.iloc[clusterID[SSE.index(max(SSE))]]
k_means = cluster.KMeans(n_clusters=2, random_state=seed)
k_means.fit(s_data)

label = [[],[]]

for i in range(0, s_data.shape):
if k_means.labels_[i]==0:
label.append(clusterID[SSE.index(max(SSE))][i])
elif k_means.labels_[i]==1:
label.append(clusterID[SSE.index(max(SSE))][i])

for j in range(0, 2):
if j == 0:
labels[label[j]] = labels[label[j]]
elif j == 1:
labels[label[j]] = numClusters
clusterID.append(label[j])
SSE.append(((s_data[k_means.labels_==j]-k_means.cluster_centers_[j])**2).sum().sum())

clusterID.pop(SSE.index(max(SSE)))
SSE.pop(SSE.index(max(SSE)))

print('Iteration', numClusters, ' SSE =', SSE)

return labels
``````

c) Apply the bisecting k-means algorithm to generate 8 clusters from the data. Assign the clustering result as another column, named ‘labels’ of the dataframe. Draw a scatter plot of the data using the cluster labels as color of the data points in the scatter plot.

``````data['labels'] = bisect(data, 8, 1)
data.plot.scatter(x='x1',y='x2',c='labels',colormap='jet')
``````

Output: ### More examples

The goal for this example is to predict wheather the office room is occupied (1) or not (0).

``````import pandas as p

``````

Output:

date Temperature Humidity Light CO2 HumidityRatio Occupancy
0 2/4/2015 18:07 23.000 27.200 0.0 681.5 0.004728 0
1 2/4/2015 18:16 22.790 27.445 0.0 689.0 0.004710 0
2 2/4/2015 19:05 22.245 27.290 0.0 602.5 0.004530 0
3 2/4/2015 19:06 22.200 27.290 0.0 598.0 0.004517 0
4 2/4/2015 19:12 22.200 27.290 0.0 593.5 0.004517 0

Drop any useless columns (data) and save X as our columns of intrest and Y as our predictor

``````data = data.drop('date',axis=1)
Y = data['Occupancy']
X = data.drop('Occupancy',axis=1)
``````

Now lets train the following classifiers (using 5-fold cross validation) on the data above and calculate the average accuracy of the cross-validation for each method (Decision Tree, K-Nearest Neighbor, Logistic Regression). Vary the hyperparameter of the classifier and draw a plot that shows the average cross-validation accuracy versus hyperparamter values for each classification method shown below:

1) Decision Tree (MaxDepth = 1,5,10,50,100)

2) K-Nearest Neigbor (k = 1,2,3,4,5,10,15)

3) Logestic Regression (C = 0.001, 0.01, 0.1, 0.5, 1)

Decision Tree Classifier

``````import numpy as np
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
%matplotlib inline

maxdepths = [1, 5, 10, 50, 100]
treeAcc = []

for depth in maxdepths:
clf = tree.DecisionTreeClassifier(max_depth=depth)
scores = cross_val_score(clf, X, Y, cv=5)
treeAcc.append(scores.mean())

plt.plot(maxdepths, treeAcc, 'ro-')
plt.xlabel('Maximum depth')
plt.ylabel('Accuracy')

``````

Output: K-Nearest Neigbor

``````from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
%matplotlib inline

numNeighbors = [1, 5, 10, 15, 20, 25, 30]
knn_acc = []

for nn in numNeighbors:
clf = KNeighborsClassifier(n_neighbors=nn)
scores = cross_val_score(clf, X, Y, cv=5)
knn_acc.append(scores.mean())

plt.plot(numNeighbors, knn_acc, 'ro-', color = 'blue')
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
``````

Output: Logestic Regression

``````from sklearn import linear_model

regularizers = [0.001, 0.01, 0.1, 0.5, 1]
logistic_acc = []
for C in regularizers:
clf = linear_model.LogisticRegression(C=C)
scores = cross_val_score(clf, X, Y, cv=5)
logistic_acc.append(scores.mean())
print(logistic_acc)

plt.plot(regularizers, logistic_acc, 'ro-')
plt.xlabel('Regularizer')
plt.ylabel('Accuracy')
``````

Output: Now lets draw a Bar Chart to compare the test accuracies using the 3 methods above!

``````data = p.read_csv('test.csv',header='infer')
data = data.drop('date',axis=1)
Y_test = data['Occupancy']
X_test = data.drop('Occupancy',axis=1)
``````
``````from sklearn import tree
from sklearn.metrics import accuracy_score

bestdepth = 5
clf = tree.DecisionTreeClassifier(max_depth=bestdepth)
clf = clf.fit(X, Y)
Ypred = clf.predict(X_test)
tree_acc = accuracy_score(Y_test, Ypred)

bestNN = 5
clf = KNeighborsClassifier(n_neighbors=nn)
clf = clf.fit(X, Y)
Ypred = clf.predict(X_test)
knn_acc = accuracy_score(Y_test, Ypred)

bestC = 1
clf = linear_model.LogisticRegression(C=bestC)
clf = clf.fit(X, Y)
Ypred = clf.predict(X_test)
logistic_acc = accuracy_score(Y_test, Ypred)

methods = ['Dtree', 'KNN', 'Logistic']
plt.bar([1,2,3], [tree_acc, knn_acc, logistic_acc])
plt.xticks([1.5,2.5,3.5], methods)
plt.ylim(0.9,1.0)
``````

Output: ### Another One (Predictive Modeling)

``````import pandas as pd

``````

Output:

No year month day hour pm2.5 DEWP TEMP PRES cbwd Iws Is Ir
0 1 2010 1 1 0 NaN -21 -11.0 1021.0 NW 1.79 0 0
1 2 2010 1 1 1 NaN -21 -12.0 1020.0 NW 4.92 0 0
2 3 2010 1 1 2 NaN -21 -11.0 1019.0 NW 6.71 0 0
3 4 2010 1 1 3 NaN -21 -14.0 1019.0 NW 9.84 0 0
4 5 2010 1 1 4 NaN -20 -12.0 1018.0 NW 12.97 0 0

Lets look the missing values! Lets remove any rows that contain missing values. Lets also discard the columsn named “No”, “year”, “cbwd”.

``````data = data.drop('No', 1)
data = data.drop('cbwd', 1)
data = data.drop('year', 1)

# Drop any columns with missing values
data = data.dropna()
``````

Output:

month day hour pm2.5 DEWP TEMP PRES Iws Is Ir
24 1 2 0 129.0 -16 -4.0 1020.0 1.79 0 0
25 1 2 1 148.0 -15 -4.0 1020.0 2.68 0 0
26 1 2 2 159.0 -11 -5.0 1021.0 3.57 0 0
27 1 2 3 181.0 -7 -5.0 1022.0 5.36 1 0
28 1 2 4 138.0 -7 -5.0 1022.0 6.25 2 0

say stuff here!!

``````Y = pd.DataFrame(data['pm2.5'])
X = pd.DataFrame(data.drop('pm2.5',1))
``````

Output:

month day hour DEWP TEMP PRES Iws Is Ir
24 1 2 0 -16 -4.0 1020.0 1.79 0 0
25 1 2 1 -15 -4.0 1020.0 2.68 0 0
26 1 2 2 -11 -5.0 1021.0 3.57 0 0
27 1 2 3 -7 -5.0 1022.0 5.36 1 0
28 1 2 4 -7 -5.0 1022.0 6.25 2 0

Lets divide the data into 70% training and 30% test sets using train_test_split

``````from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X,Y,test_size =0.3, random_state=1)
``````

Now Lets fit a Linear Regression model to the training data.

``````from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np

# Create linear regression object
regr = linear_model.LinearRegression()

# Fit regression model to the training set
regr.fit(X_train, y_train)

# Apply model to the test set
y_pred_test = regr.predict(X_test)

# Evaluate the results

print("Root mean squared error = %.4f" % np.sqrt(mean_squared_error(y_test, y_pred_test)))
print("R-square = %.4f" % r2_score(y_test, y_pred_test))
print('Slope Coefficients:', regr.coef_ )
print('Intercept:', regr.intercept_)

plt.scatter(y_test, y_pred_test)
``````

Output:

Root mean squared error = 78.1049

R-square = 0.2560

Slope Coefficients: [[-1.56802833 0.75050291 1.63264151 4.80497766 -6.6414188 -1.47624944-0.25073801 -2.70947104 -7.33484175]]

Intercept: [1660.6910474] 