Big Data Analysis Day 8: Predictive Modeling
What is Predictive Modeling/Analytics?
Predictive Modeling/Analytics involves manipulating data from existing data sets with the goal of identifying new trends or patterns, which are then used to predict future outcomes.
What’s the difference between Predictive Modeling/Analytics and Machine Learning?
 Predictive Modeling is the mathematical approach in which we use statistics and past trends to make “future predictions”, while Machine Learning uses various ML models to solve complex problems.
Why is Predictive Analysis Important?
Because we use historical data to predict future outcomes, it improves decision making and can help improve the key performance indicators (KPIs) of a business or engineering project, such as profit.
Predictive Analysis can be used in many fields (domains) including:
 Online Retail
 Engineering
 Education
 Weather Forecasting
 Social Media Analysis
What are the Steps to Perform Predictive Analysis?
 Define Problem Statement
 Collect Data
 Clean Data
 Data Analysis
 Build a Predictive Model
 Validate the Model
 Deployment of the Model
 Monitor the Model
Predictive Modeling

The task of predicting the value of a target variable (y) as a function of the predictor variables (X)

Problem of learning a mapping function from inputs to outputs called function approximation
y = f(X)
Task  Target Variable  Example Applications 

Regression  Quantitative (ratio/interval)  Stock Market Prediction, revenue/sales Forecast, Temperature Prediction, Calorie Prediction 
Classification  Qualitative (nominal)  Disease Prediction, Image Classification, Twitter Mood Prediction 
Classification vs Regression
Regression
Regression is the process of finding a model or function that maps the data to continuous real values instead of classes
Method of Calculation: Measurement of root mean square error (RMSE)
Examples of Regression: Regression Tree (e.g. random forest), Linear Regression
Let’s see how Linear Regression works
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
seed = 1 # seed for random number generation
numInstances = 200 # number of data instances
np.random.seed(seed)
X = np.random.rand(numInstances,1).reshape(-1,1) # column vector of inputs in [0, 1)
y_true = 2*X + 1
y = y_true + np.random.normal(size=numInstances).reshape(-1,1) # add Gaussian noise
plt.scatter(X, y, color='black')
plt.plot(X, y_true, color='blue', linewidth=3)
plt.title('True function: y = 2X + 1')
plt.xlabel('X')
plt.ylabel('y')
Output:
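The code above only generates and plots the noisy data. As a minimal sketch of the fitting step, the block below recreates the same kind of data, fits a straight line by ordinary least squares, and measures the root mean square error. Using np.polyfit here is my own choice for illustration, not something the notes prescribe, and the variable names slope, intercept and rmse are assumptions.

```python
import numpy as np

# Recreate synthetic data of the same form: y = 2X + 1 plus Gaussian noise
np.random.seed(1)
num_instances = 200
X = np.random.rand(num_instances, 1)
y = 2 * X + 1 + np.random.normal(size=num_instances).reshape(-1, 1)

# Fit a straight line by ordinary least squares (degree-1 polynomial)
slope, intercept = np.polyfit(X.ravel(), y.ravel(), deg=1)

# Root mean square error of the fitted line on the same data
y_fit = slope * X + intercept
rmse = np.sqrt(np.mean((y - y_fit) ** 2))

print("slope: %.2f, intercept: %.2f, RMSE: %.2f" % (slope, intercept, rmse))
```

The fitted slope and intercept should land near the true values 2 and 1, and the RMSE near the noise standard deviation of 1, though the exact numbers depend on the random draw.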
Classification
Classification is the task of classifying things into subcategories. Definition: Classification is the task of learning a Target function f that maps each attribute set x to one of the predefined class labels y.
It has two types:

Binary Classification: Categorizes given data into two distinct classes

Multiclass Classification: The number of classes is more than 2
Involves prediction of Discrete Values (individually separate and distinct)
Method of Calculation: Measuring Accuracy
Examples of Classification: Decision Tree, Logistic Regression
Predictive Modeling
Terminology for Predictive Modeling

Training Set: the set used for model building; it contains labeled examples, i.e. examples whose target variable values are known

Test Set: used to predict the target values of unknown data, or to evaluate the performance of the model (if it’s for evaluation, the target values of the test examples must be known)

Predictive Model: An abstract representation of the relationship between the predictor and the target variables
Predictive Modeling Techniques
Single Models
 Decision Tree Methods (examples below!!!)
 Support Vector Machine/Regression
 Artificial Neural Network
Ensemble Models
 Boosting, Random Forest, Bagging
Let’s work with a Decision Tree Classifier
A Classification Technique is a systematic approach to building classification models from input data.
A decision tree classifier organizes a series of test questions and conditions in a tree structure.
In a decision tree, the root and internal nodes contain attribute test conditions that “separate” records with different characteristics, and every terminal (leaf) node is assigned a class label, such as yes or no.
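To make the structure concrete, here is a small sketch of a decision tree written by hand as nested dictionaries. The attributes, thresholds and labels are hypothetical values chosen for illustration; they are not learned from any data in these notes.

```python
# A hand-written toy tree (hypothetical attributes and thresholds, not learned from data).
# Internal nodes hold an attribute test condition; leaves hold a class label.
toy_tree = {
    "attribute": "Wingspan",
    "threshold": 80.0,
    "left": {"label": "No"},            # Wingspan <= 80.0
    "right": {                          # Wingspan > 80.0
        "attribute": "Sprint",
        "threshold": 3.3,
        "left": {"label": "Yes"},       # Sprint <= 3.3
        "right": {"label": "No"},       # Sprint > 3.3
    },
}

def classify(node, record):
    """Walk from the root to a leaf, answering one test question per node."""
    while "label" not in node:
        branch = "left" if record[node["attribute"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(classify(toy_tree, {"Wingspan": 83.25, "Sprint": 3.28}))  # prints Yes
```

A record is classified by starting at the root and following the branch that matches each test until a leaf is reached.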
How do we Build a Decision Tree?
 It’s called/referred to as decision tree induction
Building an optimal decision tree is the key goal of decision tree classifiers. Because searching through every possible tree is computationally expensive, there are many efficient algorithms that use a Greedy approach/strategy. A Greedy Algorithm is an algorithmic paradigm that builds up a solution piece by piece; here it makes locally optimal decisions about which attribute to use for dividing the data.
Examples of different Greedy Algorithms:
 Hunt’s Algorithm
 ID3
 C4.5
 CART
 SPRINT
All 5 of those algorithms are used in Greedy decision tree induction.
How do we Determine the Best Attribute test Condition?
We have to split the “Nodes”. We can do a two-way split or a multi-way split. A two-way split is natural for binary attributes (Yes or No). For a Nominal attribute, which could have many values, a test condition can be expressed as a multi-way split. For continuous attributes (for example, money from $0 to $1,000,000), we could split using comparison tests (< and >) or ranges ($0-100,000, $100,001-500,000, $500,001-1,000,000). Since there are many ways to specify the test conditions for a given training set, we need a measurement to determine the best way to split the records.
The best test condition is the one that leads to the most homogeneous class distribution in the child nodes; we judge this by comparing the purity of the node before splitting with the purity of the child nodes after splitting. The larger the degree of purity, the better the class distribution.
Measurement of impurity can be determined by:
 Gini Index (used in our example below)
 Entropy
 Misclassification Error
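The three impurity measures listed above can be sketched in a few lines of plain Python. The helper names (class_fractions, weighted_impurity) and the toy label lists are my own assumptions for illustration. A candidate split is scored by the impurity of its child nodes, weighted by their sizes; the lower the weighted impurity, the better the split.

```python
from collections import Counter
from math import log2

def class_fractions(labels):
    """Fraction of records belonging to each class present at a node."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def gini(labels):
    return 1.0 - sum(p * p for p in class_fractions(labels))

def entropy(labels):
    return -sum(p * log2(p) for p in class_fractions(labels))

def misclassification_error(labels):
    return 1.0 - max(class_fractions(labels))

def weighted_impurity(children, measure=gini):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * measure(child) for child in children)

# Toy node (made-up labels): 5 "Yes" and 5 "No"
parent = ["Yes"] * 5 + ["No"] * 5
split_a = [["Yes"] * 4 + ["No"], ["Yes"] + ["No"] * 4]          # fairly pure children
split_b = [["Yes"] * 3 + ["No"] * 2, ["Yes"] * 2 + ["No"] * 3]  # mixed children

print(gini(parent), entropy(parent), misclassification_error(parent))
print(weighted_impurity(split_a), weighted_impurity(split_b))
```

Here split_a gives a lower weighted Gini index than split_b, so a greedy algorithm would prefer it.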
Decision Tree Example Notes:
Example of a Decision Tree Classification

using sklearn in Python

We use an NBA Draft Combine data file to predict whether a player will be drafted, based on a few selected measurements.
Here’s the csv file > file
Steps
Step 1: Load the data
import pandas as p
data = p.read_csv('nba_draft_combine_all_years.csv', header = 0)
data
Output:
Player  Year  Drafted  Height (No Shoes)  Height (With Shoes)  Wingspan  Standing reach  Vertical (Max)  Vertical (Max Reach)  Vertical (No Step)  Vertical (No Step Reach)  Weight  Body Fat  Hand (Length)  Hand (Width)  Bench  Agility  Sprint  

0  Blake Griffin  2009  Yes  80.50  82.00  83.25  105.0  35.5  140.5  32.0  137.0  248.0  8.2  NaN  NaN  22.0  10.95  3.28 
1  Terrence Williams  2009  Yes  77.00  78.25  81.00  103.5  37.0  140.5  30.5  134.0  213.0  5.1  NaN  NaN  9.0  11.15  3.18 
2  Gerald Henderson  2009  Yes  76.00  77.00  82.25  102.5  35.0  137.5  31.5  134.0  215.0  4.4  NaN  NaN  8.0  11.17  3.14 
3  Tyler Hansbrough  2009  Yes  80.25  81.50  83.50  106.0  34.0  140.0  27.5  133.5  234.0  8.5  NaN  NaN  18.0  11.12  3.27 
4  Earl Clark  2009  Yes  80.50  82.25  86.50  109.5  33.0  142.5  28.5  138.0  228.0  5.2  NaN  NaN  5.0  11.17  3.35 
…  …  …  …  …  …  …  …  …  …  …  …  …  …  …  …  …  …  … 
512  Peter Jok  2017  No  76.25  77.75  80.00  102.0  31.0  133.0  26.5  128.5  202.0  11.0  8.25  9.50  NaN  11.34  3.41 
513  Rawle Alkins  2017  No  74.50  75.75  80.75  99.0  40.5  139.5  31.5  130.5  223.0  11.0  8.75  10.00  NaN  11.99  3.30 
514  Sviatoslav Mykhailiuk  2017  No  78.50  79.50  77.00  100.0  33.0  133.0  27.0  127.0  220.0  11.4  8.00  9.25  NaN  12.40  3.53 
515  Thomas Welsh  2017  No  83.50  84.50  84.00  109.5  NaN  NaN  NaN  NaN  254.0  10.9  9.00  10.50  NaN  NaN  NaN 
516  V.J. Beachem  2017  No  78.25  80.00  82.25  104.5  37.0  141.5  30.0  134.5  193.0  6.8  8.50  9.00  NaN  11.18  3.26 
Step 2: Let’s clean this data because there are missing values. Here are the columns we’re interested in using!
feature_columns = ["Height (No Shoes)", "Wingspan", "Standing reach","Vertical (Max)", "Weight", "Agility", "Sprint"]
update_data = data[feature_columns].copy() # copy, so later fillna assignments do not act on a view of data
update_data
Output:
Height (No Shoes)  Wingspan  Standing reach  Vertical (Max)  Weight  Agility  Sprint  

0  80.50  83.25  105.0  35.5  248.0  10.95  3.28 
1  77.00  81.00  103.5  37.0  213.0  11.15  3.18 
2  76.00  82.25  102.5  35.0  215.0  11.17  3.14 
3  80.25  83.50  106.0  34.0  234.0  11.12  3.27 
4  80.50  86.50  109.5  33.0  228.0  11.17  3.35 
…  …  …  …  …  …  …  … 
512  76.25  80.00  102.0  31.0  202.0  11.34  3.41 
513  74.50  80.75  99.0  40.5  223.0  11.99  3.30 
514  78.50  77.00  100.0  33.0  220.0  12.40  3.53 
515  83.50  84.00  109.5  NaN  254.0  NaN  NaN 
516  78.25  82.25  104.5  37.0  193.0  11.18  3.26 
Now let’s check for missing values!!!
update_data.describe()
Output:
Height (No Shoes)  Wingspan  Standing reach  Vertical (Max)  Weight  Agility  Sprint  

count  517.000000  517.000000  517.000000  450.000000  516.000000  444.000000  446.000000 
mean  77.609284  82.497292  103.275629  35.136667  214.833333  11.330248  3.299664 
std  3.287633  3.943068  4.897515  3.561688  24.683537  0.563144  0.128422 
min  68.250000  70.000000  88.500000  25.000000  149.000000  10.070000  3.010000 
25%  75.250000  79.750000  100.000000  32.500000  196.000000  10.940000  3.200000 
50%  77.750000  82.500000  103.500000  35.000000  213.500000  11.255000  3.280000 
75%  80.000000  85.500000  107.000000  37.500000  232.000000  11.660000  3.380000 
max  85.250000  92.500000  115.000000  44.500000  303.000000  13.440000  3.810000 
We can see the count for the total players is 517, and in columns like Vertical (Max), Weight, Agility, and Sprint the count is lower, so there is missing data. I’ll just insert the Median value for the missing values.
## Vertical Max
update_data['Vertical (Max)'] = update_data['Vertical (Max)'].fillna(update_data['Vertical (Max)'].median())
## Weight
update_data['Weight'] = update_data['Weight'].fillna(update_data['Weight'].median())
## Agility
update_data['Agility'] = update_data['Agility'].fillna(update_data['Agility'].median())
## Sprint
update_data['Sprint'] = update_data['Sprint'].fillna(update_data['Sprint'].median())
Now all the Columns are filled in, no more missing data!
update_data
Output:
Height (No Shoes)  Wingspan  Standing reach  Vertical (Max)  Weight  Agility  Sprint  

0  80.50  83.25  105.0  35.5  248.0  10.950  3.28 
1  77.00  81.00  103.5  37.0  213.0  11.150  3.18 
2  76.00  82.25  102.5  35.0  215.0  11.170  3.14 
3  80.25  83.50  106.0  34.0  234.0  11.120  3.27 
4  80.50  86.50  109.5  33.0  228.0  11.170  3.35 
…  …  …  …  …  …  …  … 
512  76.25  80.00  102.0  31.0  202.0  11.340  3.41 
513  74.50  80.75  99.0  40.5  223.0  11.990  3.30 
514  78.50  77.00  100.0  33.0  220.0  12.400  3.53 
515  83.50  84.00  109.5  35.0  254.0  11.255  3.28 
516  78.25  82.25  104.5  37.0  193.0  11.180  3.26 
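The four separate fillna calls used above can also be written as a single call. The sketch below shows the idea on a toy DataFrame with made-up values standing in for update_data, and uses isna().sum() as a more direct missing-value check than reading the count row of describe().

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values standing in for update_data
toy = pd.DataFrame({
    "Weight": [200.0, np.nan, 220.0, 240.0],
    "Sprint": [3.2, 3.3, np.nan, 3.4],
})

# DataFrame.median() returns one median per column (NaNs are skipped);
# fillna() with that Series fills each column with its own median.
filled = toy.fillna(toy.median())

print(filled)
print(filled.isna().sum())  # missing values remaining per column
```

Filling with the median keeps every row, at the cost of making the filled values look more typical than they may really be.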
Step 3: Extract the predictor (X) and the Target (y) variables
X = update_data
y = data["Drafted"]
Now that we have the predictor and target variables separated, we can split them.
Step 3 (continued): Divide the data into training and test sets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
Note: test_size=0.3 means that 70% of the data is used for training and 30% for testing
Let’s look at X_train, X_test, y_train, and y_test
X_train
Output:
Height (No Shoes)  Wingspan  Standing reach  Vertical (Max)  Weight  Agility  Sprint  

13  79.75  81.25  106.5  32.5  211.0  11.150  3.28 
61  76.75  80.50  104.0  35.5  208.0  11.860  3.19 
453  75.25  81.75  97.5  35.5  212.0  10.770  3.45 
39  71.25  74.00  95.0  33.0  175.0  10.870  3.10 
373  76.50  80.25  100.5  34.5  209.0  11.255  3.27 
…  …  …  …  …  …  …  … 
129  76.00  81.75  101.5  35.5  198.0  11.200  3.09 
144  78.50  85.25  106.5  30.5  242.0  12.430  3.46 
72  75.75  81.00  101.0  40.0  203.0  11.380  3.15 
235  79.50  86.50  106.5  38.0  236.0  11.820  3.52 
37  72.50  75.75  97.0  31.0  193.0  10.990  3.22 
There are 361 rows in X_train!
X_test
Output:
Height (No Shoes)  Wingspan  Standing reach  Vertical (Max)  Weight  Agility  Sprint  

270  73.00  76.00  97.50  35.0  179.0  11.255  3.28 
90  78.75  79.75  103.00  34.5  211.0  11.730  3.22 
133  82.00  87.75  109.50  36.0  217.0  11.880  3.35 
221  81.75  87.50  111.50  35.0  230.0  11.255  3.28 
224  76.50  79.00  101.00  37.5  199.0  10.680  3.25 
…  …  …  …  …  …  …  … 
494  75.00  78.50  99.50  35.0  185.0  11.380  3.35 
95  76.00  78.75  101.75  32.0  207.0  11.430  3.17 
122  79.75  84.25  106.50  30.5  247.0  12.740  3.45 
165  75.25  80.50  98.00  36.5  212.0  11.360  3.37 
23  80.75  85.00  107.00  35.0  240.0  11.980  3.14 
There are 156 rows in the X_test!
y_train
Output:
13  Yes 
61  Yes 
453  No 
39  Yes 
373  No 
…  
129  Yes 
144  No 
72  Yes 
235  Yes 
37  Yes 
Name: Drafted, Length: 361, dtype: object 
There are 361 rows in y_train.
y_test
Output:
270  No 
90  Yes 
133  Yes 
221  Yes 
224  Yes 
…  
494  No 
95  No 
122  Yes 
165  Yes 
23  Yes 
Name: Drafted, Length: 156, dtype: object 
There are 156 rows in y_test.
Step 4: Build the decision tree model and apply it to the test data
from sklearn import tree
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
Step 5: Evaluate the accuracy of the model on the test data
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))
Output:
Accuracy: 0.6794871794871795
We can visualize the Decision Tree using scikit-learn’s export_graphviz function:
from sklearn import tree
import pydotplus
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("NBA_Draft_Prediction.pdf")
from IPython.display import Image
Image(graph.create_png())
Output:
Model Evaluation
 Accuracy
Alternative Measures for Classification
ROC Curve
What is an ROC Curve? An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters: True Positive Rate (TPR) and False Positive Rate (FPR).
Python Example: Confusion Matrix, using the NBA Draft Prediction that we’ve been working with above!!
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred, labels=['No','Yes'])
cm
Output:
array([[13, 26],
[27, 90]])
Python Example: F-Score
F-Measure (or F-Score): a measure of a test’s accuracy, computed as the harmonic mean of precision and recall
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred, labels=['No','Yes'], pos_label = "Yes")
f1
Output:
0.7725321888412018
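The F-score printed above can be reproduced by hand from the confusion matrix shown earlier. The sketch below uses only the four counts from that matrix; the variable names are my own. It also computes a simple baseline (always predicting the more common class) for comparison.

```python
# Confusion matrix from above, with labels=['No', 'Yes']:
# rows are true classes, columns are predicted classes.
cm = [[13, 26],
      [27, 90]]

tn, fp = cm[0]   # true "No":  predicted No, predicted Yes
fn, tp = cm[1]   # true "Yes": predicted No, predicted Yes
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Baseline: always predict the more common true class
baseline = max(tn + fp, fn + tp) / total

print("accuracy=%.4f precision=%.4f recall=%.4f f1=%.4f baseline=%.4f"
      % (accuracy, precision, recall, f1, baseline))
```

The F1 value matches the 0.7725 reported by f1_score. The accuracy computed from this matrix is about 0.66, slightly different from the 0.679 printed in Step 5; one possible explanation is that the tree was refit between the two cells, since DecisionTreeClassifier() was created without a random_state. Note also that the baseline of 0.75 is higher than either accuracy figure, which is one reason to look at measures beyond accuracy.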
Y_prob = clf.predict_proba(X_test)
Y_prob
Output:
array([[0.33333333, 0.66666667],
[1. , 0. ],
[0.33333333, 0.66666667],
...,
[0.88888889, 0.11111111],
[0.33333333, 0.66666667],
[0.33333333, 0.66666667]])
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
%matplotlib inline
true_labels = (y_test == 'Yes') # y_test comes from the train/test split above
fpr, tpr, thresholds = roc_curve(true_labels, Y_prob[:,1])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr)
Output:
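To see what roc_curve and auc are doing, here is a minimal pure-Python sketch that builds the ROC points one threshold at a time and integrates them with the trapezoidal rule. The labels and scores are made up for illustration, and the function names are my own; this is not how scikit-learn implements it internally.

```python
def roc_points(true_labels, scores):
    """Return (fpr, tpr) pairs, starting at (0, 0), one per distinct threshold."""
    positives = sum(true_labels)
    negatives = len(true_labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; predict positive when score >= threshold
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(true_labels, scores) if s >= threshold and t)
        fp = sum(1 for t, s in zip(true_labels, scores) if s >= threshold and not t)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal rule over consecutive (fpr, tpr) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up example: True means the positive class ("Yes")
labels = [True, True, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
print(points)
print(area_under_curve(points))
```

For this toy data the area is 8/9: of the nine positive/negative pairs, eight have the positive example scored higher. An area of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.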
Holdout Method
 add later
5-Fold Cross-Validation Method
from sklearn.model_selection import cross_val_score
clf = tree.DecisionTreeClassifier(max_depth=3)
scores = cross_val_score(clf, X, Y, cv=5)
scores
Output:
array([0.72727273, 0.72077922, 0.74025974, 0.7124183 , 0.74509804])
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Output:
Accuracy: 0.73 (+/- 0.02)
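As a rough sketch of what cross_val_score does with cv=5, the loop below performs the five train/evaluate rounds explicitly. For a classifier with an integer cv, scikit-learn uses stratified folds, so StratifiedKFold is used here. The data is a synthetic stand-in from make_classification, because the X and Y behind the scores above are not shown in these notes; X_demo and Y_demo are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the data used in the notes)
X_demo, Y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

folds = StratifiedKFold(n_splits=5)
scores = []
for train_idx, test_idx in folds.split(X_demo, Y_demo):
    # Train on four folds, evaluate on the held-out fifth
    clf = DecisionTreeClassifier(max_depth=3, random_state=1)
    clf.fit(X_demo[train_idx], Y_demo[train_idx])
    scores.append(clf.score(X_demo[test_idx], Y_demo[test_idx]))

scores = np.array(scores)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

Every record is used for testing exactly once, so the mean score is a steadier estimate than a single holdout split.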
Extras
import pandas as p
data = p.read_csv('vehicle.csv',header=0)
data.head()
Output:
compactness  circularity  distance_circularity  radius_ratio  pr_axis_aspect_ratio  max_length_aspect_ratio  scatter_ratio  elongatedness  pr_axisrectangular  lengthrectangular  majorvariance  minorvariance  gyrationradius  majorskewness  minorskewness  minorkurtosis  majorkurtosis  hollows_ratio  class  

0  95  43  96  202  65  10  189  35  22  143  217  534  166  71  6  27  190  197  opel 
1  96  52  104  222  67  9  198  33  23  163  217  589  226  67  12  20  192  201  opel 
2  107  52  101  218  64  11  202  33  23  164  219  610  192  65  17  2  197  206  opel 
3  97  37  78  181  62  8  161  41  20  131  182  389  117  62  2  28  203  211  opel 
4  96  54  104  175  58  10  215  31  24  175  221  682  222  75  13  23  186  194  opel 
data[['compactness','circularity','distance_circularity','radius_ratio']].corr()
Output:
compactness  circularity  distance_circularity  radius_ratio  

compactness  1.000000  0.692869  0.792444  0.691659 
circularity  0.692869  1.000000  0.798492  0.622778 
distance_circularity  0.792444  0.798492  1.000000  0.771644 
radius_ratio  0.691659  0.622778  0.771644  1.000000 
Histogram
%matplotlib inline
data['compactness'].hist()
Box Plot
data[['compactness','circularity','distance_circularity','radius_ratio']].boxplot()
from sklearn.model_selection import train_test_split
Y = data['class']
X = data.drop('class',axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.23, random_state=1)
Y
Output:
0  opel 
1  opel 
2  opel 
3  opel 
4  opel 
…  
841  van 
842  van 
843  van 
844  van 
845  van 
Model Selection and Overfitting (More on Predictive Modeling)
Model Selection
What is Model Overfitting?

Model Overfitting is an error that occurs when a function is fit too closely to a limited set of data points

The model picks up too many details and too much noise

The goal of Predictive Modeling is to build a model with low training and test errors

This is challenging because a model with low training error is not guaranteed to have low test error (due to the model overfitting problem)
An example of Model Overfitting:
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import random
%matplotlib inline
N = 1500
mean1 = [6, 14]
mean2 = [10, 6]
mean3 = [14, 14]
cov = [[3.5, 0], [0, 3.5]] # diagonal covariance
np.random.seed(50)
x1, y1 = np.random.multivariate_normal(mean1, cov, N//6).T
x2, y2 = np.random.multivariate_normal(mean2, cov, N//6).T
x3, y3 = np.random.multivariate_normal(mean3, cov, N//6).T
x4 = 20*np.random.random(N//2)
y4 = 20*np.random.random(N//2)
plt.plot(x1,y1,'ro',x2,y2,'ro',x3,y3,'ro',x4,y4,'g+',ms=4)
Output:
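The split that follows uses X and Y, which these notes do not define for the synthetic points. One plausible way to assemble them is sketched below, under the assumption that the three Gaussian clusters form one class (labeled 0) and the uniform points form the other (labeled 1). The random draws are generated in a slightly different order than in the plotting code, so the individual points will not be identical, and the accuracies printed further down need not match numbers obtained from this sketch.

```python
import numpy as np

N = 1500
cov = [[3.5, 0], [0, 3.5]]  # diagonal covariance, as above
np.random.seed(50)

# Class 0 (assumed label): three Gaussian clusters, N//6 points each
gauss = np.vstack([
    np.random.multivariate_normal(mean, cov, N // 6)
    for mean in ([6, 14], [10, 6], [14, 14])
])

# Class 1 (assumed label): uniform noise over the 20 x 20 square
uniform = 20 * np.random.random((N // 2, 2))

X = np.vstack([gauss, uniform])                          # shape (1500, 2)
Y = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])  # 0 = Gaussian, 1 = uniform

print(X.shape, Y.shape)
```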
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.8, random_state=1)
from sklearn import tree
from sklearn.metrics import accuracy_score
maxdepths = [2,3,4,5,6,7,8,9,10,15,20,25,30,35,40,45,50]
trainAcc = np.zeros(len(maxdepths))
testAcc = np.zeros(len(maxdepths))
index = 0
for depth in maxdepths:
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    clf = clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    trainAcc[index] = accuracy_score(Y_train, Y_predTrain)
    testAcc[index] = accuracy_score(Y_test, Y_predTest)
    index += 1
plt.plot(maxdepths,trainAcc,'ro',maxdepths,testAcc,'bv')
plt.legend(['Training Accuracy','Test Accuracy'])
plt.xlabel('Max depth')
plt.ylabel('Accuracy')
Output:
You can see from the plot above that the training accuracy continues to improve as the maximum depth increases (i.e. as the model becomes more complex). However, the test accuracy initially improves up to a maximum depth of 5, before it sharply decreases due to model overfitting.
Building Trees of Different Sizes
from sklearn import tree
from sklearn.metrics import accuracy_score
maxdepths = [2,3,4,5,6,7,8,9,10,15,20,25,30,35,40,45,50]
trainAcc = np.zeros(len(maxdepths))
testAcc = np.zeros(len(maxdepths))
index = 0
for depth in maxdepths:
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    clf = clf.fit(X_train, Y_train)
    Y_predTrain = clf.predict(X_train)
    Y_predTest = clf.predict(X_test)
    trainAcc[index] = accuracy_score(Y_train, Y_predTrain)
    testAcc[index] = accuracy_score(Y_test, Y_predTest)
    index += 1
Let’s plot a Tree with MaxDepth = 4
depth = 4
clf = tree.DecisionTreeClassifier(max_depth=depth)
clf = clf.fit(X_train, Y_train)
Y_predTrain = clf.predict(X_train)
Y_predTest = clf.predict(X_test)
AccTrain = accuracy_score(Y_train, Y_predTrain)
AccTest = accuracy_score(Y_test, Y_predTest)
[AccTrain, AccTest]
Output:
[0.8888888888888888, 0.6959349593495935]
What do these numbers tell us?
 With a tree of depth 4, our training accuracy was 88.89%, which is pretty solid. When we evaluated the model on our test data, the accuracy was 69.59%, which is still reasonable but noticeably lower than on the training data.
Let’s Plot the Tree with Depth = 4
import pydotplus
from IPython.display import Image
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
Output:
Let’s plot a Tree with MaxDepth = 20
A tree with a MaxDepth of 20 should be very accurate on the training data set. But how about when we test our model on the test data??
depth = 20
clf = tree.DecisionTreeClassifier(max_depth=depth)
clf = clf.fit(X_train, Y_train)
Y_predTrain = clf.predict(X_train)
Y_predTest = clf.predict(X_test)
AccTrain = accuracy_score(Y_train, Y_predTrain)
AccTest = accuracy_score(Y_test, Y_predTest)
[AccTrain, AccTest]
Output:
[1.0, 0.640650406504065]
Accuracy Training = 100%
Accuracy Test = 64.07%
As you can see, the model was 100% accurate on the training data, but it was so overfitted (i.e. too complex) that the test accuracy was lower than for the tree with a max depth of 4!
 The model may fit exceptional cases (rather than normal ones) in the training data!!
Let’s Plot the Tree with Depth = 20
import pydotplus
from IPython.display import Image
dot_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
Let’s Plot Model Overfitting and Underfitting
plt.plot(maxdepths,trainAcc,'ro',maxdepths,testAcc,'bv')
plt.legend(['Training Accuracy','Test Accuracy'])
Output:
Model Overfitting and Underfitting
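Definitions:

Underfitting: When the model is too simple to capture the pattern in the data, so both training and test errors are high

Overfitting: When the model is too complex, so the training error is low but the test error is high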
Model Selection Using Validation Set

A general model selection approach that does not depend on the predictive modeling technique used

It divides the training data into two parts:
 Training Set: for model building
 Validation set: for estimating generalization error
A drawback is that less data would be available for training
Let’s do an example of Model Selection Using Validation Set!
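Here is one way such an example might look, as a sketch rather than the author’s own version. It uses synthetic stand-in data from make_classification (an assumption; X_demo and Y_demo are hypothetical names), holds out a test set, splits the remaining training data into training and validation parts, and picks the max_depth with the best validation accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; replace with your own X and Y)
X_demo, Y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=4, random_state=1)

# Hold out a test set first, then carve a validation set out of the training data
X_trainval, X_test, Y_trainval, Y_test = train_test_split(
    X_demo, Y_demo, test_size=0.3, random_state=1)
X_train, X_val, Y_train, Y_val = train_test_split(
    X_trainval, Y_trainval, test_size=0.3, random_state=1)

# Pick the max_depth with the best validation accuracy
candidate_depths = [2, 3, 4, 5, 6, 8, 10, 15, 20]
best_depth, best_val_acc = None, -1.0
for depth in candidate_depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1)
    clf.fit(X_train, Y_train)
    val_acc = clf.score(X_val, Y_val)
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# Refit on all training data with the chosen depth; use the test set only once
final_clf = DecisionTreeClassifier(max_depth=best_depth, random_state=1)
final_clf.fit(X_trainval, Y_trainval)
test_acc = final_clf.score(X_test, Y_test)

print(best_depth, best_val_acc, test_acc)
```

The test set plays no part in choosing the depth, so the final test accuracy remains an honest estimate of generalization error.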
Other Predictive Modeling Techniques with Classification in Sklearn (L10)
Nearest-Neighbor Methods
What is Nearest Neighbors?
 One of the simple/basic classification algorithms in Machine Learning.
 Supervised Learning

Most People use K Nearest Neighbors
 Linear Method
 Nonlinear Method
 Ensemble Methods
K Nearest Neighbors
 Common algorithm used to solve classification model problems
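The idea behind K Nearest Neighbors fits in a few lines: find the k training points closest to the query and take a majority vote over their labels. The sketch below is a from-scratch illustration on made-up 2D points, with a hypothetical function name knn_predict; it is not how scikit-learn implements the algorithm.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2D points: class 2 clustered near the origin, class 4 near (8, 8)
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
labels = [2, 2, 2, 4, 4, 4]

print(knn_predict(points, labels, (2, 2)))  # prints 2
print(knn_predict(points, labels, (7, 9)))  # prints 4
```

Because every prediction compares the query against all training points, there is no real training phase, but prediction gets slower as the training set grows.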
K Nearest Neighbors in Sklearn
 First opened the .data file and added a “header row” with the columns:
id,clump_thickness,unif_cell_size,unif_cell_shape,marg_adhesion,single_epith_cell_size,bare_nuclei,bland_chrom,norm_nucleoli,mitoses,class
import numpy as np
from sklearn import preprocessing, neighbors
from sklearn.model_selection import train_test_split
import pandas as pd
df = pd.read_csv('breast-cancer-wisconsin.data')
# Replace missing data ('?') with a large outlier value
df.replace('?', -99999, inplace=True)
## Drop ID, completely useless
df.drop(['id'], axis=1, inplace=True)
df
Output:
clump_thickness  uniform_cell_size  uniform_cell_shape  marginal_adhesion  single_epi_cell_size  bare_nuclei  bland_chromation  normal_nucleoli  mitoses  class  

0  5  1  1  1  2  1  3  1  1  2 
1  5  4  4  5  7  10  3  2  1  2 
2  3  1  1  1  2  2  3  1  1  2 
3  6  8  8  1  3  4  3  7  1  2 
4  4  1  1  3  2  1  3  1  1  2 
…  …  …  …  …  …  …  …  …  …  … 
694  3  1  1  1  3  2  1  1  1  2 
695  2  1  1  1  2  1  1  1  1  2 
696  5  10  10  3  7  3  8  10  2  4 
697  4  8  6  4  3  4  10  6  1  4 
698  4  8  8  5  4  5 
X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
accuracy
Output:
0.9785714285714285
Let’s add a new example and predict its class
example_measures = np.array([4,2,1,1,1,2,3,2,1])
example_measures = example_measures.reshape(1, -1) # one row, nine features
prediction = clf.predict(example_measures)
prediction
Output:
array([2])
Ensemble Methods
 Ensemble Methods take the original training data, then:
 Create Multiple Data Sets
 Build Multiple Base Classifiers
 Combine the Base Classifiers
Why use Ensemble models?
 Better accuracy
 Higher Consistency
 Reduces bias and variance errors
When and Where to use Ensemble Models?
 When a single model overfits (decision trees usually overfit)
What are the different Ensemble Methods?
 Bagging: Build multiple classifiers by resampling the data with replacement
 Boosting: Build multiple classifiers sequentially by resampling the (weighted) data with replacement, giving more weight to records that earlier classifiers misclassified
 Random Forests: Train multiple decision tree classifiers and combine their predictions
Python Example
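The notes stop before the example itself, so the following is only a sketch of what it might look like. It compares a single decision tree with the three ensemble methods listed above, using scikit-learn’s BaggingClassifier, AdaBoostClassifier (one common boosting algorithm) and RandomForestClassifier. The data is a synthetic stand-in from make_classification, and the choice of 50 estimators is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; replace with your own X and Y)
X_demo, Y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=4, random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X_demo, Y_demo, test_size=0.3, random_state=1)

models = {
    "Single tree": DecisionTreeClassifier(random_state=1),
    # Bagging: decision trees (the default base model) fit on bootstrap resamples
    "Bagging": BaggingClassifier(n_estimators=50, random_state=1),
    # Boosting: shallow trees fit one after another on reweighted data
    "Boosting (AdaBoost)": AdaBoostClassifier(n_estimators=50, random_state=1),
    # Random forest: bagged trees that also sample features at each split
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    results[name] = model.score(X_test, Y_test)
    print("%-20s test accuracy: %.3f" % (name, results[name]))
```

On many data sets the ensembles score higher on the test set than the single unpruned tree, but this is not guaranteed, so it is worth comparing them on your own data.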