Big Data Analysis Day 5: Data Quality and Pandas Library

Working with Raw data

What are some Data Quality Issues that we might encounter?

• Noise
• Outliers
• Duplicate Data
• Missing Values

Lets dive into each one…

Noise

Noise refers to the incorrect/modified values of data, some data is very stable and possesses little variabilty, while other data swings wildly and unpredictably from one value to another

How do we deal with Noise?

Dealing with noise depends on the type of data, for time series low pass filters are often used, for document data we can use software or lookup dictionary

Outliers

Are data objects with charactersitics that are considerbly different than most of the other data objects in the data set!

Duplicate Data

• Where data set may include data objects that are duplicates (common problem when merging data from multiple sources)

How do we handle Duplicate Data?

Missing Values

Why is there missing values?

• Information/data wasn’t collected
• Attributes may not apply to all cases

How do we handle missing values?

• We could remove data objects with missing values
• Ignore the missing values during analysis

One of the best Python libraries with Data Analysis is Pandas as we already know

The Pandas Library has two main data structures

1. Series: a one-dimensional array-like object
2. DataFrame: spreadsheet-like data structure containing a collection of ordered columns

Series

Pandas Documentation

Lets use an example of Series

``````from pandas import Series
``````
``````points = Series([40, 24, 25, 29, 32, 33, 45, 54])
points
``````

Output:

``````0    40
1    24
2    25
3    29
4    32
5    33
6    45
7    54
dtype: int64
``````

Where on the Left represents the Index and the right represents the Value and shows the dtype (datatype) which is an integer in this case

``````points.array
``````

Output:

``````<PandasArray>
[40, 24, 25, 29, 32, 33, 45, 54]
Length: 8, dtype: int64
``````
``````points.values
``````

Output:

``````array([40, 24, 25, 29, 32, 33, 45, 54])
``````
``````points.index
``````

Output:

``````RangeIndex(start=0, stop=8, step=1)
``````

We can change the Index from numbers to anything we would like!

Let use an example of the NBA, with timestamp being the index and points being the value and put them together in a Series

``````timestamp = ['1/5/2019', '1/6/2019', '1/8/2019','1/4/2019','1/19/2019', '1/21/2019','1/23/2019','1/30/2019']
points  = [40, 24, 25, 29, 32, 33, 45, 54]

NBA_ = Series(points, index = timestamp)

``````
``````NBA_
``````

Output:

``````1/5/2019     40
1/6/2019     24
1/8/2019     25
1/4/2019     29
1/19/2019    32
1/21/2019    33
1/23/2019    45
1/30/2019    54
dtype: int64
``````

Lets check a random date to see the number of points scored!

``````NBA_['1/8/2019']
``````

Ouput:

``````25
``````

Check the dates when Lebron scored more than 25 Points

``````NBA_[NBA_ > 25]
``````

Ouput:

``````1/5/2019     40
1/4/2019     29
1/19/2019    32
1/21/2019    33
1/23/2019    45
1/30/2019    54
dtype: int64
``````

Lets plot

``````%matplotlib inline
plt.figure(figsize=(10,5))

NBA_.plot(kind = "line", color = 'green')
``````

Output:

Dealing with Missing Values

``````NBA_['1/6/2019'] = None
``````
``````NBA_.count()
``````

Ouput:

``````7
``````
``````NBA_.isnull()
``````

Output:

``````1/5/2019     False
1/6/2019      True
1/8/2019     False
1/4/2019     False
1/19/2019    False
1/21/2019    False
1/23/2019    False
1/30/2019    False
dtype: bool
``````

How would we fill that Null value (missing value) in ?

Well we could use the median value of all the points scored and then fill it in.

Well there is a Pandas function for that (.fillna(value to fill)) and we can find the median value of our points scored and fill it in our missing value during the time

``````NBA_.fillna(NBA_.median())
``````

Output:

``````1/5/2019     40.0
1/6/2019     33.0
1/8/2019     25.0
1/4/2019     29.0
1/19/2019    32.0
1/21/2019    33.0
1/23/2019    45.0
1/30/2019    54.0
dtype: float64
``````

Note: The data 1/6/2019 was filled in with 33 which is the median value of points scored

Another thing that we could do, if we had a “missing” value for example

``````NBA_['1/6/2019'] = None
NBA_
``````

Output:

``````1/5/2019     40.0
1/6/2019      NaN
1/8/2019     25.0
1/4/2019     29.0
1/19/2019    32.0
1/21/2019    33.0
1/23/2019    45.0
1/30/2019    54.0
dtype: float64
``````

We could just remove the date

``````NBA_ = NBA_[NBA_.notnull()]
NBA_
``````

Output:

``````1/5/2019     40.0
1/8/2019     25.0
1/4/2019     29.0
1/19/2019    32.0
1/21/2019    33.0
1/23/2019    45.0
1/30/2019    54.0
dtype: float64
``````

or this way:

``````NBA_ = NBA_.dropna()
NBA_
``````

Output:

``````1/5/2019     40
1/8/2019     25
1/4/2019     29
1/19/2019    32
1/21/2019    33
1/23/2019    45
1/30/2019    54
dtype: object
``````

Either or drops the timestamp (index)!

Lets look at Outliers

``````NBA_['1/6/2019'] = 100
``````
``````NBA_.hist()
``````

Output:

``````std_values = (NBA_ - NBA_.mean())/NBA_.std()
std_points = Series(std_values, index = NBA_.index)
std_points.hist()
``````

Output:

``````std_points[std_points > 2]
``````

Output:

``````1/6/2019    2.28305
dtype: object
``````

DataFrame

Lets work more with DataFrames

``````import pandas as pd
from pandas import DataFrame
``````
``````
data = {'Name': ['Lebron James', 'Steph Curry', 'Kevin Durant','Chris Paul'],
'Age': [36, 31, 34, 36],
'Height': [6.8, 6.2, 6.9, 6.0]}
users = DataFrame(data)

``````
``````users
``````

Output:

``````
Name	            Age	Height
0	Lebron James	36	6.8
1	Steph Curry	    31	6.2
2	Kevin Durant	34	6.9
3	Chris Paul  	36	6.0
``````
``````users["Age"]
``````

Output:

``````0    36
1    31
2    34
3    36
Name: Age, dtype: int64
``````
``````users.columns
``````

Output:

``````Index(['Name', 'Age', 'Height'], dtype='object')
``````
``````users.shape
``````
``````(4, 3)
``````
``````users.index = [1,2,3,4]
users.transpose()
``````

Output:

``````            	1	            2	        3	        4
Name	Lebron James	Steph Curry 	Kevin Durant	Chris Paul
Age	        36          	31	            34	            36
Height  	6.8	            6.2	            6.9         	6
``````
``````users[users.Age>33]
``````

Output:

``````	Name	        Age	    Height
1	Lebron James	36	    6.8
3	Kevin Durant	34	    6.9
4	Chris Paul      36	    6.0
``````
``````users.describe().transpose()
``````

Output:

``````	    count	mean	std	        min	    25%	    50%	    75%	    max
Age     4.0	    34.250	2.362908	31.0	33.25	35.0	36.000	36.0
Height	4.0 	6.475	0.442531	6.0	    6.15	6.5 	6.825	6.9
``````

Updated: