SQL Project (ETL)
Project Scope
- Given a bunch of Ebay data about Sellers and other information in JSON files, we shall create a E/R Diagram. From the E/R diagram to parse the JSON files into .dat files which we can create a Database. Using the database we shall execute different queries.
What We Will Learn
-
Creating an E/R Diagram Design
-
Turn an E/R Diagram in to a Database
-
How to Parse JSON files in Python
-
Writing Shell Scipts to automate
Files to download:
Create E/R Diagram
- There is a ton of helpful online tutorials on building E/R Diagrams! (Enity/ Relationships)
Tables:
Users(UserID, Rating, Country, Location)
Bids(ItemID, UserID, Time, Amount)
Items(ItemID, SellerID, Buy_Price, First_Bid, Currently, Number_of_Bids, Started, Ends, Description)
Categories(ItemID, Category_Name)
Reviewing the JSON Layout
-
There are a ton of decent videos on youtube explaining how to parse JSON files using python
-
JSON Files are Object
-
Review the .txt file given in the describing the JSON layout
Parser Script
Layout of JSON Files looks like this in JSON Object Form, we have to iterate through the JSON files and extract the needed/required data into .dat files.
Required .dat files:
* Users.dat
* Bids.dat
* Items.dat
* Categories.dat
- Write a Python program to iterate through the JSON files based on E/R Diagram from above and place data into .dat files.
import sys
from json import loads
from re import sub
columnSeparator = "|"
# Dictionary of months used for date transformation
MONTHS = {'Jan':'01','Feb':'02','Mar':'03','Apr':'04','May':'05','Jun':'06',\
'Jul':'07','Aug':'08','Sep':'09','Oct':'10','Nov':'11','Dec':'12'}
"""
Returns true if a file ends in .json
"""
def isJson(f):
return len(f) > 5 and f[-5:] == '.json'
"""
Converts month to a number, e.g. 'Dec' to '12'
"""
def transformMonth(mon):
if mon in MONTHS:
return MONTHS[mon]
else:
return mon
"""
Transforms a timestamp from Mon-DD-YY HH:MM:SS to YYYY-MM-DD HH:MM:SS
"""
def transformDttm(dttm):
dttm = dttm.strip().split(' ')
dt = dttm[0].split('-')
date = '20' + dt[2] + '-'
date += transformMonth(dt[0]) + '-' + dt[1]
return date + ' ' + dttm[1]
"""
Transform a dollar value amount from a string like $3,453.23 to XXXXX.xx
"""
def transformDollar(money):
if money == None or len(money) == 0:
return money
return sub(r'[^\d.]', '', money)
"""
Parses a single json file. Currently, there's a loop that iterates over each
item in the data set. Your job is to extend this functionality to create all
of the necessary SQL tables for your database.
"""
def parseJson(json_file):
with open(json_file, 'r') as f:
items = loads(f.read())['Items'] # creates a Python dictionary of Items for the supplied json file
f1 = open("EbayUsers.dat", "a")
f2 = open("Bids.dat", "a")
f3 = open("Categories.dat", "a")
f4 = open("Items.dat", "a")
for item in items:
"""
traverse the items dictionary to extract information from the
given `json_file' and generate the necessary .dat files to generate
the SQL tables based on your relation design
"""
# Get EbayUser data:
# Firstly get the bidder user info:
Bids = item["Bids"]
if not Bids == None:
for i in range(len(Bids)):
bidder = Bids[i]["Bid"]["Bidder"]
if not bidder["UserID"] == None:
info = "" + '\"' + sub(r'\"','\"\"',bidder["UserID"]) + '\"'
else:
info = "" + "NULL"
if not bidder["Rating"] == None:
info += columnSeparator + bidder["Rating"]
else:
info += columnSeparator + "NULL"
if not "Location" in bidder.keys() or bidder["Location"] == None:
info += columnSeparator + "NULL"
else:
info += columnSeparator + '\"' + sub(r'\"','\"\"',bidder["Location"]) + '\"'
if not "Country" in bidder.keys() or bidder["Country"] == None:
info += columnSeparator + "NULL"
else:
info += columnSeparator + '\"' + sub(r'\"','\"\"',bidder["Country"]) + '\"'
info += "\n"
f1.write(info)
# Get the seller user info:
Seller = item["Seller"]
SellerID = Seller["UserID"]
if SellerID == None:
SellerID = "NULL"
else:
SellerID = '\"' + sub(r'\"','\"\"',SellerID) + '\"'
info = "" + SellerID
if not Seller["Rating"] == None:
info += columnSeparator + Seller["Rating"]
else:
info += columnSeparator + "NULL"
if not item["Location"] == None:
info += columnSeparator + '\"' + sub(r'\"','\"\"',item["Location"]) + '\"'
else:
info += columnSeparator + "NULL"
if not item["Country"] == None:
info += columnSeparator + '\"' + sub(r'\"','\"\"',item["Country"]) + '\"' + "\n"
else:
info += columnSeparator +"NULL" + "\n"
f1.write(info)
# Get Bids data:
Bids = item["Bids"]
if not Bids == None:
for i in range(len(Bids)):
if not item["ItemID"] == None:
info = "" + item["ItemID"]
else:
info = "" + "NULL"
bid = Bids[i]["Bid"]
if not bid["Bidder"]["UserID"] == None:
info += columnSeparator + '\"' + sub(r'\"','\"\"',bid["Bidder"]["UserID"]) + '\"'
else:
info += columnSeparator + "NULL"
if not bid["Time"] == None:
info += columnSeparator + transformDttm(bid["Time"])
else:
info += columnSeparator + "NULL"
if not bid["Amount"] == None:
info += columnSeparator + transformDollar(bid["Amount"])
else:
info += columnSeparator + "NULL"
info += "\n"
f2.write(info)
# Get Categories data:
Categories = item["Category"]
ItemID = item["ItemID"]
if ItemID == None:
ItemID = "NULL"
for Category in Categories:
info = "" + ItemID + columnSeparator + '\"' + sub(r'\"','\"\"',Category) + '\"' + "\n"
f3.write(info)
# Get Items data:
info = "" + ItemID + "|" + SellerID
if not item["Name"] == None:
info += columnSeparator + '\"' + sub(r'\"','\"\"',item["Name"]) + '\"'
else:
info += columnSeparator + "NULL"
if "Buy_Price" in item.keys():
info += columnSeparator + transformDollar(item["Buy_Price"])
else:
info += columnSeparator + "NULL"
if not item["First_Bid"] == None:
First_Bid = transformDollar(item["First_Bid"])
else:
First_Bid = "NULL"
if not item["Currently"] == None:
Currently = transformDollar(item["Currently"])
else:
Currently = "NULL"
if not item["Number_of_Bids"] == None:
Number_of_Bids = item["Number_of_Bids"]
else:
Number_of_Bids = "NULL"
if not item["Started"] == None:
Started = transformDttm(item["Started"])
else:
Started = "NULL"
if not item["Ends"] == None:
Ends = transformDttm(item["Ends"])
else:
Ends = "NULL"
info += columnSeparator + First_Bid + columnSeparator + Currently
info += columnSeparator + Number_of_Bids + "|" + Started
info += columnSeparator + Ends
if item["Description"] == None:
info += columnSeparator + "NULL" +"\n"
else:
info += columnSeparator + '\"' + sub(r'\"','\"\"',item["Description"]) + '\"' +"\n"
f4.write(info)
f1.close()
f2.close()
f3.close()
f4.close()
"""
Loops through each json files provided on the command line and passes each file
to the parser
"""
def main(argv):
if len(argv) < 2:
print(sys.stderr, 'Usage: python skeleton_json_parser.py <path to json files>')
sys.exit(1)
# loops over all .json files in the argument
for f in argv[1:]:
if isJson(f):
parseJson(f)
print("Success parsing " + f)
if __name__ == '__main__':
main(sys.argv)
Creating Shell Script to Automate Process
What are Shell Scripts?
- They’re scripts used to automate tasks
We will write a shell script in order to automate parsing through all the json files in our ebay data files
JSON_DIR="./ebay_data"
SCRIPT_PATH="./TransformationJson.py"
CATEGORY_TB="./Categories.dat"
ITEMS_TB="./Items.dat"
BIDS_TB="./Bids.dat"
USERS_TB="./EbayUsers.dat"
rm -f $CATEGORY_TB
rm -f $ITEMS_TB
rm -f $BIDS_TB
rm -f $USERS_TB
for i in $(ls $JSON_DIR); do
python $SCRIPT_PATH $JSON_DIR/$i
done
Create Database
We should use our E/R Diagram from above to create a SQL queries to created the needed tables
drop table if exists Items;
drop table if exists EbayUsers;
drop table if exists Bids;
drop table if exists Categories;
create table EbayUsers(
UserID varchar(255),
Rating int,
Location varchar(255),
Country varchar(255),
PRIMARY KEY(UserID)
);
create table Items(
ItemID int,
SellerID varchar(255),
Name varchar(255),
Buy_Price varchar(255),
First_Bid DECIMAL NOT NULL,
Currently int,
Number_of_Bids int,
Started varchar(255),
Ends varchar(255),
Description varchar(255),
PRIMARY KEY(ItemID),
FOREIGN KEY (SellerID) REFERENCES EbayUsers(UserID)
);
create table Bids(
ItemID int,
UserID varchar(255),
Time varchar(255),
Amount int,
PRIMARY KEY(ItemID, UserID, Amount),
FOREIGN KEY (ItemID) REFERENCES Items(ItemID),
FOREIGN KEY (UserID) REFERENCES EbayUsers(UserID)
);
create table Categories(
ItemID int,
Category_Name varchar(255),
PRIMARY KEY(ItemID, Category_Name),
FOREIGN KEY (ItemID) REFERENCES Items(ItemID)
);
We have to adjust some tables in our SQL database to adjust for NULL Values, lets create a .txt file and then we will use it with our shell script to automate later
load.txt
.separator |
.import EbayUsers.dat EbayUsers
update EbayUsers set Location = null where Location = "NULL";
update EbayUsers set Country = null where Country = "NULL";
.import Items.dat Items
update Items set Buy_Price = null where Buy_Price = "NULL";
.import Bids.dat Bids
.import Categories.dat Categories
Automate Importing the SQL
echo "Start======================================================="
sh testParser.sh
sqlite3 AuctionDataBase.db < create.sql
sqlite3 AuctionDataBase.db < load.txt
echo "Testing queries============================================="
answers=(13422 80 8365 1046871451 3130 6717 150)
for i in {1..7}
do
if [ $(sqlite3 ebay.db < query${i}.sql) != ${answers[$(( $i - 1 ))]} ]
then
echo "query $i: failure!"
else
echo "query $i: success!"
fi
done
rm *.dat
Testing our Database
- After following all the steps, we should successfully create a Database named AuctionDataBase.db
To run SQLite3 in bash, all you do is type SQLite3 “insert name of database.db”
Test some scripts below
Count the Numebr of Users in our Database
SELECT COUNT(*)
FROM EbayUsers;
Output:
13422
Select Something from Something
SELECT COUNT(*)
FROM(SELECT COUNT(*) as NUM
FROM Categories
GROUP BY Categories.ItemID)
WHERE NUM = 4;
Output:
8365
Find Number of Users from Grand Rapids
SELECT COUNT(*)
FROM EbayUsers
WHERE Location = "Grand Rapids";
Output:
3
Find all of the Users from Chicago!!!
SELECT UserID, Country, Rating
FROM EbayUsers
WHERE Location = "Chicago";
Output:
2oneals|USA|778
aimtastic@aol.com|USA|978
alexshafran@yahoo.com|USA|255
aurthur|USA|1633
ca_i77@hotmail.com|USA|17
cajual_sounds|USA|174
chicagos-best|USA|10
cjaqtt@core.com|USA|75
dandylan|USA|9
dcrandall|USA|708
dial-1500|USA|45
diver1895|USA|1258
doubled.sr|USA|1481
drpsmom|USA|358
famousbrandnames|USA|410
floridasun23|USA|560
great_chang|USA|171
idiotequi|USA|38
inchicago2000|USA|307
inisor|USA|52
jaygatsby23|USA|390
jenigirl27@hotmail.com|USA|331
joeromine|USA|27
joyce5921|USA|188
maricarol|USA|491
micksandpeps|USA|702
num1inmem|USA|199
pianokeys65|USA|3
picsforall|USA|3777
plasticpuppy@aol.com|USA|1108
puzzle01@hotmail.com|USA|110
rican_jock|USA|0
sell-it|USA|6816
thirty7bolts@mindspring.com|USA|37
trendycollectibles|USA|13674
tuneman|USA|352
xyzxyz|USA|2591
Another Querie to Test
SELECT ItemID
FROM Items
WHERE CURRENTLY =
(SELECT MAX(CURRENTLY)
FROM Items);
Output:
1046871451
View Function
CREATE VIEW [Boston Customers] AS
SELECT UserID, Rating, Location
FROM EbayUsers
WHERE Location = "Boston";
Lets view the View Function
SELECT * FROM [Boston Customers];
Output:
fashionwarehouse2002|138|Boston
foreigner1999|63|Boston
hinckg|246|Boston
jgautographs|3139|Boston
kellimac97|153|Boston
sellstuffsell|36|Boston
sjferzoco@bics.bwh.harvard.edu|368|Boston
tasylvia@mediaone.net|388|Boston
thenewandoverlooked|2713|Boston
webfootma|180|Boston