Machine Learning Example: Housing Price Prediction
Housing Project
About Project
Housing price prediction on a Kaggle dataset of homes in Ames, Iowa.
Here is the link to download the dataset on Kaggle.
The purpose of this project is to use different machine learning algorithms to predict the sale price of a house.
Questions we could ask:
Are there any features that could be added to a house that would increase its predicted sale price? (square footage, number of bedrooms, overall condition of the house, etc.)
About the Data
The data includes 79 explanatory variables describing nearly every aspect of each residential home, with the end goal of predicting the final sale price of each home.
The dataset consists of real sales data from 2006 to 2010 and includes 2,930 observations (individual houses).
The dataset has been split into roughly equal-sized training and test sets by Kaggle. Each set contains the 79 predictors and an Id column; the training set also includes the SalePrice column we are trying to predict.
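A quick sketch (assuming both train.csv and test.csv have been downloaded from the Kaggle competition page; the variable names here are just illustrative) to confirm the split:
import pandas as pd

# Load both halves of the Kaggle split
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Training set: Id + 79 predictors + SalePrice; test set: Id + 79 predictors
print(train_df.shape, test_df.shape)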
Potential Predicting Variables include:
- Square Footage
- Location of the Neighborhood
- Overall Condition of the house (1-10), from 10 (Very Excellent) down to 1 (Very Poor)
- Year Built
- Type of Heating (Floor Furnace, Gas, etc.)
- Number of Full and Half baths
Pre-Processing and Data Exploration
- Let's read in the data, check for missing values, and update the data accordingly
Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Load the training data and preview the first few rows
train_ = pd.read_csv('train.csv')
train_.head()
Check the size of the data
train_.shape
Output:
(1460, 81)
Recall that the 81 columns are the Id, the 79 predictors, and SalePrice. Let's list the columns, then check which ones have missing data so we can decide whether to remove them.
train_.columns
Output:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
'SaleCondition', 'SalePrice'],
dtype='object')
We can compute the fraction of missing (null) values in each column by summing the nulls per column and dividing by the number of rows (DataFrame.shape[0]), then sorting the results.
Example:
For the PoolQC column, which stands for Pool Quality:
Ex: Excellent
Gd: Good
TA: Average/Typical
Fa: Fair
NA: No Pool
train_["PoolQC"].shape
Output:
(1460,)
This means there are 1460 rows.
train_["PoolQC"].isnull().sum()
Output:
1453
This means there are currently 1453 missing (null) values in this column, so 1453/1460 = 0.995205, i.e. about 99.5% missing.
We can perform this check on all the columns in the dataframe below:
(train_.isnull().sum()/train_.shape[0]).sort_values(ascending=False)
Output:
PoolQC 0.995205
MiscFeature 0.963014
Alley 0.937671
Fence 0.807534
FireplaceQu 0.472603
LotFrontage 0.177397
GarageCond 0.055479
GarageType 0.055479
GarageYrBlt 0.055479
GarageFinish 0.055479
GarageQual 0.055479
BsmtExposure 0.026027
BsmtFinType2 0.026027
BsmtFinType1 0.025342
BsmtCond 0.025342
BsmtQual 0.025342
MasVnrArea 0.005479
MasVnrType 0.005479
Electrical 0.000685
Utilities 0.000000
YearRemodAdd 0.000000
MSSubClass 0.000000
Foundation 0.000000
ExterCond 0.000000
ExterQual 0.000000
Exterior2nd 0.000000
Exterior1st 0.000000
RoofMatl 0.000000
RoofStyle 0.000000
YearBuilt 0.000000
LotConfig 0.000000
OverallCond 0.000000
OverallQual 0.000000
HouseStyle 0.000000
BldgType 0.000000
Condition2 0.000000
BsmtFinSF1 0.000000
MSZoning 0.000000
LotArea 0.000000
Street 0.000000
Condition1 0.000000
Neighborhood 0.000000
LotShape 0.000000
LandContour 0.000000
LandSlope 0.000000
SalePrice 0.000000
HeatingQC 0.000000
BsmtFinSF2 0.000000
EnclosedPorch 0.000000
Fireplaces 0.000000
GarageCars 0.000000
GarageArea 0.000000
PavedDrive 0.000000
WoodDeckSF 0.000000
OpenPorchSF 0.000000
3SsnPorch 0.000000
BsmtUnfSF 0.000000
ScreenPorch 0.000000
PoolArea 0.000000
MiscVal 0.000000
MoSold 0.000000
YrSold 0.000000
SaleType 0.000000
Functional 0.000000
TotRmsAbvGrd 0.000000
KitchenQual 0.000000
KitchenAbvGr 0.000000
BedroomAbvGr 0.000000
HalfBath 0.000000
FullBath 0.000000
BsmtHalfBath 0.000000
BsmtFullBath 0.000000
GrLivArea 0.000000
LowQualFinSF 0.000000
2ndFlrSF 0.000000
1stFlrSF 0.000000
CentralAir 0.000000
SaleCondition 0.000000
Heating 0.000000
TotalBsmtSF 0.000000
Id 0.000000
dtype: float64
Now we can decide which missing values we want to fill in (for the numerical columns) and which columns we want to drop from our dataframe.
Let's drop the columns that are missing more than 20% of their data.
# Keep only columns that have at least 80% non-null values
update_train_df = train_.dropna(axis='columns', thresh=int(train_.shape[0]*0.80))
# Recheck the missing fractions for the 15 worst remaining columns
(update_train_df.isnull().sum().sort_values(ascending=False)/update_train_df.shape[0])[:15]
Output:
GarageType 0.055479
GarageYrBlt 0.055479
GarageFinish 0.055479
GarageCond 0.055479
GarageQual 0.055479
BsmtExposure 0.026027
BsmtFinType2 0.026027
BsmtFinType1 0.025342
BsmtCond 0.025342
BsmtQual 0.025342
MasVnrType 0.005479
MasVnrArea 0.005479
Electrical 0.000685
RoofMatl 0.000000
RoofStyle 0.000000
dtype: float64
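For the remaining columns with only a small fraction of missing values, one possible approach (a minimal sketch, not used in the rest of this walkthrough; imputed_df is a hypothetical name) is to fill numeric columns with their median and categorical columns with their most frequent value:
# Hypothetical imputation step: median for numeric columns, mode for categorical columns
imputed_df = update_train_df.copy()
for col in imputed_df.columns:
    if imputed_df[col].isnull().any():
        if imputed_df[col].dtype == 'object':
            imputed_df[col] = imputed_df[col].fillna(imputed_df[col].mode()[0])
        else:
            imputed_df[col] = imputed_df[col].fillna(imputed_df[col].median())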
Let's look at the SalePrice target variable and see its summary statistics (median, min, max, etc.).
update_train_df['SalePrice'].describe()
Output:
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
We can see the minimum sale price in this dataset is $34,900 and the maximum is $755,000.
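The mean ($180,921) sits well above the median ($163,000), which suggests the sale prices are right-skewed; a quick check (just a sketch):
# Skewness of the SalePrice distribution (positive -> right-skewed)
print(update_train_df['SalePrice'].skew())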
Let's plot the distribution of SalePrice.
sns.distplot(update_train_df['SalePrice'], color = 'g')
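Note that sns.distplot is deprecated and has been removed in recent seaborn releases; an equivalent sketch using the newer histplot API:
# Equivalent plot with the newer seaborn API
sns.histplot(update_train_df['SalePrice'], kde=True, color='g')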
Let's check which columns are numerical and which are categorical.
update_train_df.dtypes
Output:
Id int64
MSSubClass int64
MSZoning object
LotArea int64
Street object
LotShape object
LandContour object
Utilities object
LotConfig object
LandSlope object
Neighborhood object
Condition1 object
Condition2 object
BldgType object
HouseStyle object
OverallQual int64
OverallCond int64
YearBuilt int64
YearRemodAdd int64
RoofStyle object
RoofMatl object
Exterior1st object
Exterior2nd object
MasVnrType object
MasVnrArea float64
ExterQual object
ExterCond object
Foundation object
BsmtQual object
BsmtCond object
BsmtExposure object
BsmtFinType1 object
BsmtFinSF1 int64
BsmtFinType2 object
BsmtFinSF2 int64
BsmtUnfSF int64
TotalBsmtSF int64
Heating object
HeatingQC object
CentralAir object
Electrical object
1stFlrSF int64
2ndFlrSF int64
LowQualFinSF int64
GrLivArea int64
BsmtFullBath int64
BsmtHalfBath int64
FullBath int64
HalfBath int64
BedroomAbvGr int64
KitchenAbvGr int64
KitchenQual object
TotRmsAbvGrd int64
Functional object
Fireplaces int64
GarageType object
GarageYrBlt float64
GarageFinish object
GarageCars int64
GarageArea int64
GarageQual object
GarageCond object
PavedDrive object
WoodDeckSF int64
OpenPorchSF int64
EnclosedPorch int64
3SsnPorch int64
ScreenPorch int64
PoolArea int64
MiscVal int64
MoSold int64
YrSold int64
SaleType object
SaleCondition object
SalePrice int64
dtype: object
Columns of dtype object represent categorical features, while int64 (and float64) columns represent numerical features.
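A quick sketch of splitting the column names by dtype (the variable names here are just illustrative):
# Separate column names by dtype: object -> categorical, numeric -> numerical
categorical_cols = update_train_df.select_dtypes(include='object').columns.tolist()
numerical_cols = update_train_df.select_dtypes(include='number').columns.tolist()
print(len(categorical_cols), 'categorical columns')
print(len(numerical_cols), 'numerical columns')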
Let's plot the relationship of SalePrice with each of the categorical features.
from pandas.api.types import is_object_dtype

# Print the category levels of every categorical (object) column
for column_name, column_data in update_train_df.items():
    if is_object_dtype(column_data):
        print(column_name, ' categories: ', column_data.astype("category").cat.categories)

# Box plot of SalePrice against every categorical column, saving each figure as a .png
for column_name, column_data in update_train_df.items():
    if is_object_dtype(column_data):
        data = pd.concat([update_train_df['SalePrice'], column_data], axis=1)
        f, ax = plt.subplots(figsize=(16, 8))
        fig = sns.boxplot(x=column_name, y="SalePrice", data=data)
        fig.axis(ymin=0, ymax=800000)
        ax.figure.savefig('SalePrice ' + column_name + '.png')
This code saves a .png file of every categorical feature's relationship with SalePrice.
There are many graph outputs so I’ll only show a few examples!
Output:
SalePrice vs Neighborhood
SalePrice vs HouseStyle
We can also plot the relationship of SalePrice with each of the numerical features.
from pandas.api.types import is_numeric_dtype

# Scatter plot of SalePrice against every numerical column, saving each figure as a .png
for column_name, column_data in update_train_df.items():
    if is_numeric_dtype(column_data):
        data = pd.concat([update_train_df['SalePrice'], column_data], axis=1)
        ax = data.plot.scatter(x=column_name, y='SalePrice', ylim=(0, 800000))
        ax.figure.savefig('SalePrice' + column_name + '.png')
Again there are many graph outputs so I’ll only show a few examples!
Output:
SalePrice vs Paved Driveway
SalePrice vs Year Built
Let's create a heat map to show the correlation between each numerical feature and SalePrice.
# Correlation matrix of the numerical features (numeric_only avoids errors on the object columns)
correlation_matrix = update_train_df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(15, 12))
sns.heatmap(correlation_matrix, square=True)
Output:
- (Heatmap figure still to be added; this section is a work in progress.)
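Once the heatmap is in place, a quick follow-up (a minimal sketch) is to rank the numerical features by their correlation with SalePrice to pick out the strongest candidates:
# Rank numerical features by correlation with SalePrice (strongest first)
print(correlation_matrix['SalePrice'].sort_values(ascending=False).head(10))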