Data Quality and Pandas Library
Working with Raw data
What are some Data Quality Issues that we might encounter?
- Noise
- Outliers
- Duplicate Data
- Missing Values
Lets dive into each one…
Noise
Noise refers to the incorrect/modified values of data, some data is very stable and possesses little variabilty, while other data swings wildly and unpredictably from one value to another
How do we deal with Noise?
Dealing with noise depends on the type of data, for time series low pass filters are often used, for document data we can use software or lookup dictionary
Outliers
Are data objects with charactersitics that are considerbly different than most of the other data objects in the data set!
Duplicate Data
- Where data set may include data objects that are duplicates (common problem when merging data from multiple sources)
How do we handle Duplicate Data?
Missing Values
Why is there missing values?
- Information/data wasn’t collected
- Attributes may not apply to all cases
How do we handle missing values?
- We could remove data objects with missing values
- Ignore the missing values during analysis
One of the best Python libraries with Data Analysis is Pandas as we already know
The Pandas Library has two main data structures
- Series: a one-dimensional array-like object
- DataFrame: spreadsheet-like data structure containing a collection of ordered columns
Series
Lets use an example of Series
from pandas import Series
points = Series([40, 24, 25, 29, 32, 33, 45, 54])
points
Output:
0 40
1 24
2 25
3 29
4 32
5 33
6 45
7 54
dtype: int64
Where on the Left represents the Index and the right represents the Value and shows the dtype (datatype) which is an integer in this case
points.array
Output:
<PandasArray>
[40, 24, 25, 29, 32, 33, 45, 54]
Length: 8, dtype: int64
points.values
Output:
array([40, 24, 25, 29, 32, 33, 45, 54])
points.index
Output:
RangeIndex(start=0, stop=8, step=1)
We can change the Index from numbers to anything we would like!
Let use an example of the NBA, with timestamp being the index and points being the value and put them together in a Series
timestamp = ['1/5/2019', '1/6/2019', '1/8/2019','1/4/2019','1/19/2019', '1/21/2019','1/23/2019','1/30/2019']
points = [40, 24, 25, 29, 32, 33, 45, 54]
NBA_ = Series(points, index = timestamp)
NBA_
Output:
1/5/2019 40
1/6/2019 24
1/8/2019 25
1/4/2019 29
1/19/2019 32
1/21/2019 33
1/23/2019 45
1/30/2019 54
dtype: int64
Lets check a random date to see the number of points scored!
NBA_['1/8/2019']
Ouput:
25
Check the dates when Lebron scored more than 25 Points
NBA_[NBA_ > 25]
Ouput:
1/5/2019 40
1/4/2019 29
1/19/2019 32
1/21/2019 33
1/23/2019 45
1/30/2019 54
dtype: int64
Lets plot
%matplotlib inline
plt.figure(figsize=(10,5))
NBA_.plot(kind = "line", color = 'green')
Output:
Dealing with Missing Values
NBA_['1/6/2019'] = None
NBA_.count()
Ouput:
7
NBA_.isnull()
Output:
1/5/2019 False
1/6/2019 True
1/8/2019 False
1/4/2019 False
1/19/2019 False
1/21/2019 False
1/23/2019 False
1/30/2019 False
dtype: bool
How would we fill that Null value (missing value) in ?
Well we could use the median value of all the points scored and then fill it in.
Well there is a Pandas function for that (.fillna(value to fill)) and we can find the median value of our points scored and fill it in our missing value during the time
NBA_.fillna(NBA_.median())
Output:
1/5/2019 40.0
1/6/2019 33.0
1/8/2019 25.0
1/4/2019 29.0
1/19/2019 32.0
1/21/2019 33.0
1/23/2019 45.0
1/30/2019 54.0
dtype: float64
Note: The data 1/6/2019 was filled in with 33 which is the median value of points scored
Another thing that we could do, if we had a “missing” value for example
NBA_['1/6/2019'] = None
NBA_
Output:
1/5/2019 40.0
1/6/2019 NaN
1/8/2019 25.0
1/4/2019 29.0
1/19/2019 32.0
1/21/2019 33.0
1/23/2019 45.0
1/30/2019 54.0
dtype: float64
We could just remove the date
NBA_ = NBA_[NBA_.notnull()]
NBA_
Output:
1/5/2019 40.0
1/8/2019 25.0
1/4/2019 29.0
1/19/2019 32.0
1/21/2019 33.0
1/23/2019 45.0
1/30/2019 54.0
dtype: float64
or this way:
NBA_ = NBA_.dropna()
NBA_
Output:
1/5/2019 40
1/8/2019 25
1/4/2019 29
1/19/2019 32
1/21/2019 33
1/23/2019 45
1/30/2019 54
dtype: object
Either or drops the timestamp (index)!
Lets look at Outliers
NBA_['1/6/2019'] = 100
NBA_.hist()
Output:
std_values = (NBA_ - NBA_.mean())/NBA_.std()
std_points = Series(std_values, index = NBA_.index)
std_points.hist()
Output:
std_points[std_points > 2]
Output:
1/6/2019 2.28305
dtype: object
DataFrame
Lets work more with DataFrames
import pandas as pd
from pandas import DataFrame
data = {'Name': ['Lebron James', 'Steph Curry', 'Kevin Durant','Chris Paul'],
'Age': [36, 31, 34, 36],
'Height': [6.8, 6.2, 6.9, 6.0]}
users = DataFrame(data)
users
Output:
Name Age Height
0 Lebron James 36 6.8
1 Steph Curry 31 6.2
2 Kevin Durant 34 6.9
3 Chris Paul 36 6.0
users["Age"]
Output:
0 36
1 31
2 34
3 36
Name: Age, dtype: int64
users.columns
Output:
Index(['Name', 'Age', 'Height'], dtype='object')
users.shape
(4, 3)
users.index = [1,2,3,4]
users.transpose()
Output:
1 2 3 4
Name Lebron James Steph Curry Kevin Durant Chris Paul
Age 36 31 34 36
Height 6.8 6.2 6.9 6
users[users.Age>33]
Output:
Name Age Height
1 Lebron James 36 6.8
3 Kevin Durant 34 6.9
4 Chris Paul 36 6.0
users.describe().transpose()
Output:
count mean std min 25% 50% 75% max
Age 4.0 34.250 2.362908 31.0 33.25 35.0 36.000 36.0
Height 4.0 6.475 0.442531 6.0 6.15 6.5 6.825 6.9