Machine Learning Example Using Housing Prediction

Python Jupyter Notebook

Housing Project

About Project

House price prediction on a Kaggle dataset of housing data from Ames, Iowa.

The dataset can be downloaded from Kaggle.

The purpose of this project is to use different machine learning algorithms to predict the sale price of a house.

Questions we could ask:

Are there any features that we could add that would increase the predicted sale price of the house? (Square Footage, Number of Bedrooms, Overall Condition of the house, etc.)

About the Data

The data includes 79 variables describing nearly every aspect of a residential home, with the end goal of predicting the final sale price of each home.

The dataset consists of real data from 2006 to 2010 and includes 2930 observations (individual houses).

Kaggle has split the dataset into training and test sets of roughly equal size. The training set consists of 79 predictors, an Id column, and a column for the sale price of the house; the test set omits the sale price, since that is the value to predict.

Potential Predicting Variables include:

  • Square Footage
  • Location of the Neighborhood
  • Overall Condition of the house (1-10), from 1 being Very Poor to 10 being Very Excellent
  • Year Built
  • Type of Heating (Floor Furnace, Gas, etc.)
  • Number of Full and Half baths

Pre-Processing and Data Exploration

  • Let's read in the data, check for missing values, and clean up the data

Import Libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
train_ = pd.read_csv('train.csv')
train_.head()

Check the size of the data

train_.shape

Output:

(1460, 81)

Let's look at the columns and check which ones have missing data so we can decide which to remove

train_.columns

Output:

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

We can compute the fraction of missing (NULL) values in each column by counting the nulls in that column and dividing by the number of rows (train_.shape[0]). Then we can sort the result in descending order.

Example:

For the PoolQC column, which stands for the Pool Quality:

   Ex	Excellent
   Gd	Good
   TA	Average/Typical
   Fa	Fair
   NA	No Pool
train_["PoolQC"].shape

Output:

(1460,)

This means there are 1460 rows.

train_["PoolQC"].isnull().sum()

Output:

1453

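This means the column currently has 1453 missing (NULL) values, so the missing fraction is 1453/1460 = 0.995205. Note that in the data description above, NA stands for "No Pool"; pandas reads the string NA as a missing value by default, so most of these "missing" entries are really houses without a pool.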

We can perform this check on all the columns in the dataframe below:

(train_.isnull().sum()/train_.shape[0]).sort_values(ascending=False)

Output:

PoolQC           0.995205
MiscFeature      0.963014
Alley            0.937671
Fence            0.807534
FireplaceQu      0.472603
LotFrontage      0.177397
GarageCond       0.055479
GarageType       0.055479
GarageYrBlt      0.055479
GarageFinish     0.055479
GarageQual       0.055479
BsmtExposure     0.026027
BsmtFinType2     0.026027
BsmtFinType1     0.025342
BsmtCond         0.025342
BsmtQual         0.025342
MasVnrArea       0.005479
MasVnrType       0.005479
Electrical       0.000685
Utilities        0.000000
YearRemodAdd     0.000000
MSSubClass       0.000000
Foundation       0.000000
ExterCond        0.000000
ExterQual        0.000000
Exterior2nd      0.000000
Exterior1st      0.000000
RoofMatl         0.000000
RoofStyle        0.000000
YearBuilt        0.000000
LotConfig        0.000000
OverallCond      0.000000
OverallQual      0.000000
HouseStyle       0.000000
BldgType         0.000000
Condition2       0.000000
BsmtFinSF1       0.000000
MSZoning         0.000000
LotArea          0.000000
Street           0.000000
Condition1       0.000000
Neighborhood     0.000000
LotShape         0.000000
LandContour      0.000000
LandSlope        0.000000
SalePrice        0.000000
HeatingQC        0.000000
BsmtFinSF2       0.000000
EnclosedPorch    0.000000
Fireplaces       0.000000
GarageCars       0.000000
GarageArea       0.000000
PavedDrive       0.000000
WoodDeckSF       0.000000
OpenPorchSF      0.000000
3SsnPorch        0.000000
BsmtUnfSF        0.000000
ScreenPorch      0.000000
PoolArea         0.000000
MiscVal          0.000000
MoSold           0.000000
YrSold           0.000000
SaleType         0.000000
Functional       0.000000
TotRmsAbvGrd     0.000000
KitchenQual      0.000000
KitchenAbvGr     0.000000
BedroomAbvGr     0.000000
HalfBath         0.000000
FullBath         0.000000
BsmtHalfBath     0.000000
BsmtFullBath     0.000000
GrLivArea        0.000000
LowQualFinSF     0.000000
2ndFlrSF         0.000000
1stFlrSF         0.000000
CentralAir       0.000000
SaleCondition    0.000000
Heating          0.000000
TotalBsmtSF      0.000000
Id               0.000000
dtype: float64
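
As a small aside, the same per-column missing fraction can also be obtained with isnull().mean(), because the mean of a boolean column is the share of True values. The sketch below uses a made-up four-row frame (the name toy and its values are assumptions for illustration, not the Kaggle data) to show that both forms agree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_ (values are made up)
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Fraction of missing values per column, computed two equivalent ways
by_division = toy.isnull().sum() / toy.shape[0]  # nulls divided by row count
by_mean = toy.isnull().mean()                    # mean of the boolean null mask

# Sort so the columns with the most missing data come first
print(by_mean.sort_values(ascending=False))
```

Here PoolQC is missing in 3 of 4 rows, so both expressions give 0.75 for it and 0.0 for LotArea.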

Now we can decide which missing values we want to fill in (for numerical columns) and which columns we want to drop from our dataframe.

Let's drop the columns that are missing more than 20% of their data

update_train_df = train_.dropna(axis='columns', thresh=train_.shape[0]*0.80)
(update_train_df.isnull().sum().sort_values(ascending=False)/update_train_df.shape[0])[:15]

Output:

GarageType      0.055479
GarageYrBlt     0.055479
GarageFinish    0.055479
GarageCond      0.055479
GarageQual      0.055479
BsmtExposure    0.026027
BsmtFinType2    0.026027
BsmtFinType1    0.025342
BsmtCond        0.025342
BsmtQual        0.025342
MasVnrType      0.005479
MasVnrArea      0.005479
Electrical      0.000685
RoofMatl        0.000000
RoofStyle       0.000000
dtype: float64
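
It may help to spell out what thresh does in dropna: it is the minimum number of non-null values a column must have in order to be kept, not the number of missing values allowed. The following is a minimal sketch on a made-up ten-row frame (toy and its column names are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 10 rows (values are made up)
toy = pd.DataFrame({
    "mostly_missing": [1.0] + [np.nan] * 9,    # 90% missing
    "some_missing": [1.0] * 8 + [np.nan] * 2,  # 20% missing
    "complete": range(10),                     # 0% missing
})

# thresh is the minimum number of NON-null values a column needs to be kept
min_non_null = int(toy.shape[0] * 0.80)  # 8 of 10 rows
kept = toy.dropna(axis="columns", thresh=min_non_null)

print(list(kept.columns))  # ['some_missing', 'complete']
```

A column with exactly 20% missing values still meets the threshold and is kept, so only columns missing more than 20% are dropped.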

Let's look at the target variable, SalePrice, and its summary statistics (mean, median, min, max, etc.)

update_train_df['SalePrice'].describe()

Output:

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

We can see that the lowest sale price in this dataset is $34,900 and the highest is $755,000.

Let's plot the distribution of SalePrice

sns.histplot(update_train_df['SalePrice'], kde=True, color='g')

"Insert Image"

Let's check which columns are numerical and which are categorical

update_train_df.dtypes

Output:

Id                 int64
MSSubClass         int64
MSZoning          object
LotArea            int64
Street            object
LotShape          object
LandContour       object
Utilities         object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
BsmtQual          object
BsmtCond          object
BsmtExposure      object
BsmtFinType1      object
BsmtFinSF1         int64
BsmtFinType2      object
BsmtFinSF2         int64
BsmtUnfSF          int64
TotalBsmtSF        int64
Heating           object
HeatingQC         object
CentralAir        object
Electrical        object
1stFlrSF           int64
2ndFlrSF           int64
LowQualFinSF       int64
GrLivArea          int64
BsmtFullBath       int64
BsmtHalfBath       int64
FullBath           int64
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
KitchenQual       object
TotRmsAbvGrd       int64
Functional        object
Fireplaces         int64
GarageType        object
GarageYrBlt      float64
GarageFinish      object
GarageCars         int64
GarageArea         int64
GarageQual        object
GarageCond        object
PavedDrive        object
WoodDeckSF         int64
OpenPorchSF        int64
EnclosedPorch      int64
3SsnPorch          int64
ScreenPorch        int64
PoolArea           int64
MiscVal            int64
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
dtype: object

Columns with dtype object are categorical, while int64 and float64 columns are numerical.

Let's plot the relationship between the sale price and each categorical column
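
One possible way to split the column names by type in a single step is select_dtypes. This is only a sketch on a made-up three-row frame (toy and its values are illustrative assumptions); it treats every non-numeric column as categorical, which matches the dtypes listed above.

```python
import pandas as pd

# Hypothetical toy frame (values are made up for illustration)
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "OldTown"],
    "LotArea": [8450, 9600, 11250],
    "MasVnrArea": [196.0, 0.0, 162.0],
})

# Numerical columns: any integer or float dtype
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

# Categorical columns: everything that is not numeric
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

The same two calls could be applied to update_train_df to get the lists of columns to plot.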

from pandas.api.types import is_object_dtype

# Print the categories of every categorical (object) column
for column_name, column_data in update_train_df.items():
    if is_object_dtype(column_data):
        print(column_name, ' categories: ', column_data.astype("category").cat.categories)

# Draw and save one box plot of SalePrice per categorical column
for column_name, column_data in update_train_df.items():
    if is_object_dtype(column_data):
        data = pd.concat([update_train_df['SalePrice'], column_data], axis=1)
        f, ax = plt.subplots(figsize=(16, 8))
        sns.boxplot(x=column_name, y="SalePrice", data=data, ax=ax)
        ax.set_ylim(0, 800000)
        f.savefig('SalePrice ' + column_name + '.png')

This code saves a .png file of every categorical column's relationship with the sale price

There are many graph outputs so I’ll only show a few examples!

Output:

SalePrice vs Neighborhood

"insert"

SalePrice vs HouseStyle

"insert"

We can also plot the relationship between the sale price and each numerical column

from pandas.api.types import is_numeric_dtype

for column_name, column_data in update_train_df.items():
    # Skip the target itself; SalePrice against SalePrice is not informative
    if is_numeric_dtype(column_data) and column_name != 'SalePrice':
        data = pd.concat([update_train_df['SalePrice'], column_data], axis=1)
        # Keep the Axes returned by the plot so the correct figure is saved
        ax = data.plot.scatter(x=column_name, y='SalePrice', ylim=(0, 800000))
        ax.figure.savefig('SalePrice' + column_name + '.png')

Again there are many graph outputs so I’ll only show a few examples!

Output:

SalePrice vs Paved Driveway

"insert"

SalePrice vs Year Built

"insert"

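Let's create a heat map to show the correlation between each numerical feature and SalePrice

# Correlation is only defined for the numerical columns
correlation_matrix = update_train_df.select_dtypes(include='number').corr()
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(correlation_matrix, square=True, ax=ax)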

Output:

"insert"
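
A heat map can be hard to read with this many columns, so one option is to pull out only the SalePrice column of the correlation matrix and sort it. The sketch below does this on a made-up five-row frame (toy and its values are assumptions for illustration, not results from the Kaggle data).

```python
import pandas as pd

# Hypothetical toy frame (values are made up for illustration)
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500, 3000],
    "YrSold": [2008, 2006, 2010, 2007, 2009],
    "Street": ["Pave"] * 5,
    "SalePrice": [100000, 150000, 210000, 240000, 310000],
})

corr_with_price = (
    toy.select_dtypes(include="number")  # drop non-numeric columns such as Street
    .corr()["SalePrice"]                 # correlation of each feature with the target
    .drop("SalePrice")                   # the target's correlation with itself is always 1
    .sort_values(ascending=False)        # strongest positive correlation first
)

print(corr_with_price)
```

Applied to update_train_df, the same chain would rank the real numerical features by their linear correlation with SalePrice.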

  • This section is still a work in progress.

Analysis

Different Algorithms Performed

Linear Regression

Ridge Regression

LASSO Regression

Random Forest Regression
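
Since this section is not filled in yet, here is only a rough sketch of how the four algorithms listed above could be compared with scikit-learn. It is not the analysis for this project: the data is synthetic (generated with NumPy, not the Kaggle housing data), and the hyperparameters (alpha, n_estimators) and the choice of 5-fold cross-validated R^2 are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing features; NOT the Kaggle data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coefs = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y = X @ true_coefs + rng.normal(scale=0.5, size=200)

# The four algorithms named above, with arbitrary illustrative hyperparameters
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "LASSO Regression": Lasso(alpha=0.1),
    "Random Forest Regression": RandomForestRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean R^2 over 5 cross-validation folds
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    scores[name] = r2.mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

On the real data, X would be the cleaned and encoded predictors from update_train_df and y would be SalePrice; the categorical columns would need to be encoded numerically first.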

Summary and Conclusion