SQL Project (ETL)

7 minute read

Project Scope

  • Given a bunch of Ebay data about Sellers and other information in JSON files, we shall create a E/R Diagram. From the E/R diagram to parse the JSON files into .dat files which we can create a Database. Using the database we shall execute different queries.

What We Will Learn

  • Creating an E/R Diagram Design

  • Turn an E/R Diagram in to a Database

  • How to Parse JSON files in Python

  • Writing Shell Scipts to automate

Files to download:

Ebay Zip Files

Create E/R Diagram

  • There is a ton of helpful online tutorials on building E/R Diagrams! (Enity/ Relationships)

"Insert Image"

Tables:

Users(UserID, Rating, Country, Location)

Bids(ItemID, UserID, Time, Amount)

Items(ItemID, SellerID, Buy_Price, First_Bid, Currently, Number_of_Bids, Started, Ends, Description)

Categories(ItemID, Category_Name)

Reviewing the JSON Layout

  • There are a ton of decent videos on youtube explaining how to parse JSON files using python

  • JSON Files are Object

  • Review the .txt file given in the describing the JSON layout

Parser Script

Json File

Layout of JSON Files looks like this in JSON Object Form, we have to iterate through the JSON files and extract the needed/required data into .dat files.

Required .dat files:

* Users.dat
* Bids.dat
* Items.dat
* Categories.dat

"Insert Image"

  • Write a Python program to iterate through the JSON files based on E/R Diagram from above and place data into .dat files.
import sys
from json import loads
from re import sub

columnSeparator = "|"

# Dictionary of months used for date transformation
MONTHS = {'Jan':'01','Feb':'02','Mar':'03','Apr':'04','May':'05','Jun':'06',\
        'Jul':'07','Aug':'08','Sep':'09','Oct':'10','Nov':'11','Dec':'12'}

"""
Returns true if a file ends in .json
"""
def isJson(f):
    return len(f) > 5 and f[-5:] == '.json'

"""
Converts month to a number, e.g. 'Dec' to '12'
"""
def transformMonth(mon):
    if mon in MONTHS:
        return MONTHS[mon]
    else:
        return mon

"""
Transforms a timestamp from Mon-DD-YY HH:MM:SS to YYYY-MM-DD HH:MM:SS
"""
def transformDttm(dttm):
    dttm = dttm.strip().split(' ')
    dt = dttm[0].split('-')
    date = '20' + dt[2] + '-'
    date += transformMonth(dt[0]) + '-' + dt[1]
    return date + ' ' + dttm[1]

"""
Transform a dollar value amount from a string like $3,453.23 to XXXXX.xx
"""

def transformDollar(money):
    if money == None or len(money) == 0:
        return money
    return sub(r'[^\d.]', '', money)
    
"""
Parses a single json file. Currently, there's a loop that iterates over each
item in the data set. Your job is to extend this functionality to create all
of the necessary SQL tables for your database.
"""
def parseJson(json_file):
    with open(json_file, 'r') as f:
        items = loads(f.read())['Items'] # creates a Python dictionary of Items for the supplied json file
        f1 = open("EbayUsers.dat", "a")
        f2 = open("Bids.dat", "a")
        f3 = open("Categories.dat", "a")
        f4 = open("Items.dat", "a")
        for item in items:
            """
            traverse the items dictionary to extract information from the
            given `json_file' and generate the necessary .dat files to generate
            the SQL tables based on your relation design
            """
            # Get EbayUser data:
            # Firstly get the bidder user info:
            Bids = item["Bids"]
            if not Bids == None: 
                for i in range(len(Bids)):
                    bidder = Bids[i]["Bid"]["Bidder"]
                    if not bidder["UserID"] == None:
                        info = "" + '\"' + sub(r'\"','\"\"',bidder["UserID"]) + '\"'
                    else:
                        info = "" + "NULL"
                    if not bidder["Rating"] == None:
                        info += columnSeparator + bidder["Rating"]
                    else:
                        info += columnSeparator + "NULL"
                    if not "Location" in bidder.keys() or bidder["Location"] == None:
                        info += columnSeparator + "NULL"
                    else:
                        info += columnSeparator + '\"' + sub(r'\"','\"\"',bidder["Location"]) + '\"'
                    if not "Country" in bidder.keys() or bidder["Country"] == None:
                        info += columnSeparator + "NULL"
                    else:
                        info += columnSeparator + '\"' + sub(r'\"','\"\"',bidder["Country"]) + '\"'
                    info += "\n"
                    f1.write(info)
                
            # Get the seller user info:
            Seller = item["Seller"]
            SellerID = Seller["UserID"]
            if SellerID == None:
                SellerID = "NULL"
            else:
                SellerID = '\"' + sub(r'\"','\"\"',SellerID) + '\"'
            info = "" + SellerID
            if not Seller["Rating"] == None:
                info += columnSeparator + Seller["Rating"]
            else:
                info += columnSeparator + "NULL"
            if not item["Location"] == None:
                info += columnSeparator + '\"' + sub(r'\"','\"\"',item["Location"]) + '\"'
            else:
                info += columnSeparator + "NULL"
            if not item["Country"] == None:
                info += columnSeparator + '\"' + sub(r'\"','\"\"',item["Country"]) + '\"' + "\n"
            else:
                info += columnSeparator +"NULL" + "\n"
            f1.write(info)
            
            # Get Bids data:
            Bids = item["Bids"]
            if not Bids == None:
                for i in range(len(Bids)):
                    if not item["ItemID"] == None:
                        info = "" + item["ItemID"]
                    else:
                        info = "" + "NULL"
                    bid = Bids[i]["Bid"]
                    if not bid["Bidder"]["UserID"] == None:
                        info += columnSeparator + '\"' + sub(r'\"','\"\"',bid["Bidder"]["UserID"]) + '\"'
                    else:
                        info += columnSeparator + "NULL"
                    if not bid["Time"] == None:
                        info += columnSeparator + transformDttm(bid["Time"])
                    else:
                        info += columnSeparator + "NULL"
                    if not bid["Amount"] == None:
                        info += columnSeparator + transformDollar(bid["Amount"])
                    else:
                        info += columnSeparator + "NULL"
                    info += "\n"
                    f2.write(info)
                
            # Get Categories data:
            Categories = item["Category"]
            ItemID = item["ItemID"]
            if ItemID == None:
                ItemID = "NULL"
            for Category in Categories:
                info = "" + ItemID + columnSeparator + '\"' + sub(r'\"','\"\"',Category) + '\"' + "\n"
                f3.write(info)
            
            # Get Items data:
            info = "" + ItemID + "|" + SellerID 
            if not item["Name"] == None:
                info += columnSeparator + '\"' + sub(r'\"','\"\"',item["Name"]) + '\"' 
            else:
                info += columnSeparator + "NULL"
            if "Buy_Price" in item.keys():
                info += columnSeparator + transformDollar(item["Buy_Price"])
            else:
                info += columnSeparator + "NULL"
            if not item["First_Bid"] == None:
                First_Bid = transformDollar(item["First_Bid"])
            else:
                First_Bid = "NULL"
            if not item["Currently"] == None:
                Currently =  transformDollar(item["Currently"]) 
            else:
                Currently = "NULL"
            if not item["Number_of_Bids"] == None:
                Number_of_Bids = item["Number_of_Bids"]
            else:
                Number_of_Bids = "NULL"
            if not item["Started"] == None:
                Started = transformDttm(item["Started"])
            else:
                Started = "NULL"
            if not item["Ends"] == None:
                Ends = transformDttm(item["Ends"])
            else:
                Ends = "NULL"
            info += columnSeparator + First_Bid + columnSeparator + Currently
            info += columnSeparator + Number_of_Bids + "|" + Started
            info += columnSeparator + Ends
            if item["Description"] == None:
                info += columnSeparator + "NULL" +"\n"
            else:
                info += columnSeparator + '\"' + sub(r'\"','\"\"',item["Description"]) + '\"' +"\n"
            f4.write(info)
            
        f1.close()
        f2.close()
        f3.close()
        f4.close()

"""
Loops through each json files provided on the command line and passes each file
to the parser

"""
def main(argv):
    if len(argv) < 2:
        print(sys.stderr, 'Usage: python skeleton_json_parser.py <path to json files>')
        sys.exit(1)
    # loops over all .json files in the argument
    for f in argv[1:]:
        if isJson(f):
            parseJson(f)
            print("Success parsing " + f)

if __name__ == '__main__':
    main(sys.argv)

Creating Shell Script to Automate Process

What are Shell Scripts?

  • They’re scripts used to automate tasks

We will write a shell script in order to automate parsing through all the json files in our ebay data files

JSON_DIR="./ebay_data"
SCRIPT_PATH="./TransformationJson.py"
CATEGORY_TB="./Categories.dat"
ITEMS_TB="./Items.dat"
BIDS_TB="./Bids.dat"
USERS_TB="./EbayUsers.dat"
rm -f $CATEGORY_TB
rm -f $ITEMS_TB
rm -f $BIDS_TB
rm -f $USERS_TB
for i in $(ls $JSON_DIR); do
  python $SCRIPT_PATH $JSON_DIR/$i
done

Create Database

We should use our E/R Diagram from above to create a SQL queries to created the needed tables

drop table if exists Items;
drop table if exists EbayUsers;
drop table if exists Bids;
drop table if exists Categories;

create table EbayUsers(
    UserID varchar(255),
    Rating int,
    Location varchar(255),
    Country varchar(255),
    PRIMARY KEY(UserID)
);

create table Items(
    
    ItemID int,
    SellerID varchar(255),
    Name varchar(255),
    Buy_Price varchar(255),
    First_Bid DECIMAL NOT NULL,
    Currently int,
    Number_of_Bids int,
    Started varchar(255),
    Ends varchar(255),
    Description varchar(255),
    PRIMARY KEY(ItemID),
    FOREIGN KEY (SellerID) REFERENCES EbayUsers(UserID)
);

create table Bids(
    ItemID int,
    UserID varchar(255),
    Time varchar(255),
    Amount int,
    PRIMARY KEY(ItemID, UserID, Amount),
    FOREIGN KEY (ItemID) REFERENCES Items(ItemID),
    FOREIGN KEY (UserID) REFERENCES EbayUsers(UserID)
);

create table Categories(
    ItemID int,
    Category_Name varchar(255),
    PRIMARY KEY(ItemID, Category_Name),
    FOREIGN KEY (ItemID) REFERENCES Items(ItemID)
);

We have to adjust some tables in our SQL database to adjust for NULL Values, lets create a .txt file and then we will use it with our shell script to automate later

load.txt

.separator |
.import EbayUsers.dat EbayUsers
update EbayUsers set Location = null where Location = "NULL";
update EbayUsers set Country = null where Country = "NULL";

.import Items.dat Items
update Items set Buy_Price = null where Buy_Price = "NULL";

.import Bids.dat Bids

.import Categories.dat Categories
Automate Importing the SQL
echo "Start======================================================="
sh testParser.sh
sqlite3 AuctionDataBase.db < create.sql
sqlite3 AuctionDataBase.db < load.txt
echo "Testing queries============================================="
answers=(13422 80 8365 1046871451 3130 6717 150)
for i in {1..7}
do
  if [ $(sqlite3 ebay.db < query${i}.sql) != ${answers[$(( $i - 1 ))]} ]
  then
    echo "query $i: failure!"
  else
    echo "query $i: success!"
  fi
done
rm *.dat

Testing our Database

  • After following all the steps, we should successfully create a Database named AuctionDataBase.db

To run SQLite3 in bash, all you do is type SQLite3 “insert name of database.db”

Test some scripts below

Count the Numebr of Users in our Database

SELECT COUNT(*)
FROM EbayUsers;

Output:

13422

Select Something from Something

SELECT COUNT(*)
FROM(SELECT COUNT(*) as NUM
FROM Categories
GROUP BY Categories.ItemID)
WHERE NUM = 4;

Output:

8365

Find Number of Users from Grand Rapids

SELECT COUNT(*)
FROM EbayUsers
WHERE Location = "Grand Rapids";

Output:

3

Find all of the Users from Chicago!!!

SELECT UserID, Country, Rating
FROM EbayUsers
WHERE Location = "Chicago";

Output:

2oneals|USA|778
aimtastic@aol.com|USA|978
alexshafran@yahoo.com|USA|255
aurthur|USA|1633
ca_i77@hotmail.com|USA|17
cajual_sounds|USA|174
chicagos-best|USA|10
cjaqtt@core.com|USA|75
dandylan|USA|9
dcrandall|USA|708
dial-1500|USA|45
diver1895|USA|1258
doubled.sr|USA|1481
drpsmom|USA|358
famousbrandnames|USA|410
floridasun23|USA|560
great_chang|USA|171
idiotequi|USA|38
inchicago2000|USA|307
inisor|USA|52
jaygatsby23|USA|390
jenigirl27@hotmail.com|USA|331
joeromine|USA|27
joyce5921|USA|188
maricarol|USA|491
micksandpeps|USA|702
num1inmem|USA|199
pianokeys65|USA|3
picsforall|USA|3777
plasticpuppy@aol.com|USA|1108
puzzle01@hotmail.com|USA|110
rican_jock|USA|0
sell-it|USA|6816
thirty7bolts@mindspring.com|USA|37
trendycollectibles|USA|13674
tuneman|USA|352
xyzxyz|USA|2591

Another Querie to Test

SELECT ItemID
FROM Items
WHERE CURRENTLY = 
(SELECT MAX(CURRENTLY)
FROM Items);

Output:

1046871451

View Function

CREATE VIEW [Boston Customers] AS
SELECT UserID, Rating, Location
FROM EbayUsers
WHERE Location = "Boston";

Lets view the View Function

SELECT * FROM [Boston Customers];

Output:

fashionwarehouse2002|138|Boston
foreigner1999|63|Boston
hinckg|246|Boston
jgautographs|3139|Boston
kellimac97|153|Boston
sellstuffsell|36|Boston
sjferzoco@bics.bwh.harvard.edu|368|Boston
tasylvia@mediaone.net|388|Boston
thenewandoverlooked|2713|Boston
webfootma|180|Boston