100 Days of Machine Learning
Day One!
Introduction to Regression
What is Regression and how is it used in Machine Learning?
- The goal of “regression” is to take continuous data and find the equation that “best” fits it, so that we're able to forecast a specific value
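To make "best fit" concrete, here is a minimal sketch of simple linear regression using ordinary least squares in plain Python. The `fit_line` helper and the numbers in `days` and `prices` are hypothetical and made up for illustration; they are not real Apple prices.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: the value grows by roughly 2 per day
days = [0, 1, 2, 3, 4]
prices = [1.1, 2.9, 5.2, 7.1, 8.8]

slope, intercept = fit_line(days, prices)

# "Forecast out" one step beyond the data we have
forecast = slope * 5 + intercept
print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```

The fitted line is the equation; plugging in a day we have not seen yet gives the forecast.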
import pandas as pd
import quandl
Let's use the Quandl API to get stock information for Apple
df = quandl.get("WIKI/AAPL")
machine learning…..
Day Two!
Let's use the pandas read_csv function to load our file into a DataFrame object and display the DataFrame!
import pandas as pd
data = pd.read_csv('Baltimore_crime_data.csv')
data
Output:
   CrimeDate   CrimeTime  CrimeCode  Location            Description     Inside/Outside  Weapon  Post   District      Neighborhood           Longitude  Latitude  Location 1                       Premise     Total Incidents
0  11/04/2017  23:39:00   4E         5700 HAZELWOOD CIR  COMMON ASSAULT  I               HANDS   444.0  NORTHEASTERN  Frankford              -76.53114  39.33952  (39.3395200000, -76.5311400000)  APT/CONDO   1
1  11/04/2017  23:16:00   4E         200 N MOUNT ST      COMMON ASSAULT  I               HANDS   711.0  WESTERN       Franklin Square        -76.64393  39.29141  (39.2914100000, -76.6439300000)  APT/CONDO   1
2  11/04/2017  23:15:00   6C         1100 E NORTH AVE    LARCENY         I               NaN     342.0  EASTERN       East Baltimore Midway  -76.60333  39.31177  (39.3117700000, -76.6033300000)  GROCERY/CO  1
3  11/04/2017  23:15:00   7A         4800 ERDMAN AVE     AUTO THEFT      O               NaN     433.0  NORTHEASTERN  Armistead Gardens      -76.55972  39.30727  (39.3072700000, -76.5597200000)  STREET      1
4  11/04/2017  23:00:00   4E         6400 ELRAY DR       COMMON ASSAULT  NaN             HANDS   632.0  NORTHWESTERN  Cheswolde              -76.69162  39.36942  (39.3694200000, -76.6916200000)  NaN
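After loading a file it usually helps to check its size, column types, and missing values before doing anything else. Here is a small sketch of that check. The CSV text is a hypothetical three-row excerpt held in memory so the example runs without Baltimore_crime_data.csv; with the real file you would pass the filename to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for Baltimore_crime_data.csv
csv_text = """CrimeDate,CrimeTime,Description,District
11/04/2017,23:39:00,COMMON ASSAULT,NORTHEASTERN
11/04/2017,23:15:00,AUTO THEFT,NORTHEASTERN
11/04/2017,23:15:00,LARCENY,
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)         # (number of rows, number of columns)
print(sample.dtypes)        # column types inferred by pandas
print(sample.isna().sum())  # missing values per column
```

Empty fields, like the last District value here, come back as NaN, which is also what the NaN entries in the output above represent.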
Now let's create a pandas DataFrame object named counts that contains the number of auto thefts, assaults, robberies, and homicides for each district.
from pandas import DataFrame
auto_theft = data[data["Description"]=="AUTO THEFT"]
auto_counts = DataFrame(auto_theft["District"].value_counts().sort_index())
assault = data[data["Description"]=="COMMON ASSAULT"]
assault_counts = DataFrame(assault["District"].value_counts().sort_index())
robbery = data[data["Description"]=="ROBBERY - STREET"]
robbery_counts = DataFrame(robbery["District"].value_counts().sort_index())
homicide = data[data["Description"]=="HOMICIDE"]
homicide_counts = DataFrame(homicide["District"].value_counts().sort_index())
counts = pd.concat([auto_counts,assault_counts,robbery_counts,homicide_counts], axis=1)
# Label each column with the crime it counts
counts.columns = ['Auto Theft', 'Assault', 'Robbery', 'Homicide']
Let's check out the counts object!
counts
Output:
              Auto Theft  Assault  Robbery  Homicide
CENTRAL             1797     5387     2696       112
EASTERN             1914     5303     1408       242
NORTHEASTERN        5155     7077     2881       217
NORTHERN            2826     4229     2154       119
NORTHWESTERN        3635     4234     1809       223
SOUTHEASTERN        2737     6376     3002        99
SOUTHERN            3140     5388     1916       136
SOUTHWESTERN        3476     4483     1373       212
WESTERN             2912     4520     1227       264
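The same kind of district-by-crime table can also be sketched in fewer steps with pd.crosstab. This is an alternative to the filter-and-concat approach above, not a replacement for it. The `sample` frame and the `wanted` mapping are hypothetical stand-ins that only mimic the District and Description columns.

```python
import pandas as pd

# Hypothetical miniature sample with the two columns we need
sample = pd.DataFrame({
    "District": ["CENTRAL", "CENTRAL", "EASTERN", "EASTERN", "EASTERN", "CENTRAL"],
    "Description": ["AUTO THEFT", "COMMON ASSAULT", "HOMICIDE",
                    "AUTO THEFT", "LARCENY", "AUTO THEFT"],
})

# Raw Description value -> column label used in the counts table
wanted = {
    "AUTO THEFT": "Auto Theft",
    "COMMON ASSAULT": "Assault",
    "ROBBERY - STREET": "Robbery",
    "HOMICIDE": "Homicide",
}

# Keep only the crimes of interest, then count them per district
subset = sample[sample["Description"].isin(list(wanted))]
table = pd.crosstab(subset["District"], subset["Description"])

# Fix the column order, fill crimes that never occurred with 0, and relabel
table = table.reindex(columns=list(wanted), fill_value=0).rename(columns=wanted)
print(table)
```

One difference worth noting: a crime that never occurs in a district shows up as 0 here, while the concat approach would leave NaN in that cell.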
Now let's plot the number of incidents for each crime description
counts = DataFrame(data["Description"].value_counts())
import matplotlib
%matplotlib inline
counts.plot(kind='bar')
Output: a bar chart with one bar per crime description.
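The %matplotlib inline line only works inside a Jupyter notebook. Here is a rough sketch of the same bar chart as a plain Python script, assuming matplotlib is installed. The `sample` data and the description_counts.png filename are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no notebook or display is needed

import pandas as pd

# Hypothetical sample standing in for the Description column
sample = pd.DataFrame({
    "Description": ["AUTO THEFT"] * 3 + ["COMMON ASSAULT"] * 2 + ["HOMICIDE"],
})

# value_counts sorts from most to least frequent
description_counts = sample["Description"].value_counts()

# Series.plot returns the matplotlib Axes, which we can label and save
ax = description_counts.plot(kind="bar")
ax.set_xlabel("Description")
ax.set_ylabel("Number of incidents")
ax.figure.tight_layout()
ax.figure.savefig("description_counts.png")
```

Labeling the axes is optional, but it makes the chart readable once it leaves the notebook.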
Let's create a function that uses a regular expression to parse the CrimeDate column in the DataFrame and return the year of the crime!
import re
def getYear(CrimeDate):
    # Split the MM/DD/YYYY date on "/" and return the third field, the year
    regex = r'/'
    fields = re.split(regex, CrimeDate)
    return fields[2]
# Apply the getYear function to the CrimeDate column
data['Year'] = data['CrimeDate'].apply(getYear)
data[['CrimeDate','Year']].head()
Output:
    CrimeDate  Year
0  11/04/2017  2017
1  11/04/2017  2017
2  11/04/2017  2017
3  11/04/2017  2017
4  11/04/2017  2017
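Splitting on "/" returns the year as a string. As a sketch of two alternatives, assuming every date follows the MM/DD/YYYY pattern seen above: a regular expression that captures the four trailing digits, and pandas' own date parser, which gives an integer year. The `sample` frame, the `get_year` helper, and the new column names are hypothetical.

```python
import re

import pandas as pd

# Hypothetical sample in the same MM/DD/YYYY format as CrimeDate
sample = pd.DataFrame({"CrimeDate": ["11/04/2017", "01/15/2016", "07/30/2015"]})


def get_year(crime_date):
    """Return the four digits at the end of the date string, or None if absent."""
    match = re.search(r"(\d{4})$", crime_date)
    return match.group(1) if match else None


# Option 1: regular expression with a capture group (year stays a string)
sample["YearFromRegex"] = sample["CrimeDate"].apply(get_year)

# Option 2: parse the whole date, then read the year off it (year is an integer)
parsed = pd.to_datetime(sample["CrimeDate"], format="%m/%d/%Y")
sample["YearFromDatetime"] = parsed.dt.year

print(sample)
```

The parsed dates are handy if you later want months or weekdays as well, while the regular expression version tolerates rows you only want the year from.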