Practice Data Page

So you want to create a DB using an xlxs file, Python, and SQLite??

SQL CitiBike Notes

I took a csv file from here!

  1. Lets First Create the DataBase

We’re going to use the columns in the csv file and name the Table, CitiBikes

import sqlite3, csv

conn = sqlite3.connect('citi.db')
cur = conn.cursor()


sql = """
    CREATE TABLE CitiBikes (
        Trip_Duration INTEGER,
        StartTime TEXT,
        StopTime TEXT,
        StartStationID INTEGER,
        StartStationName TEXT,
        StartStationLat DECIMAL,
        StartStationLong DECIMAL,
        EndStationID INTEGER,
        EndStationName TEXT,
        EndStationLat DECIMAL,
        EndStationlog DECIMAL,
        t INTEGAR,
        Usertype TEXT,
        Birthyear TEXT,
        Gender INTEGER
        
        )
    """
    
cur.execute(sql)
print('Database has been created')

conn.commit()
conn.close()
    

Now we have sucessfully created a Database named citi.db

Now let import our csv file and the columns into our newly created database.

import sqlite3, csv, pandas

connection  = sqlite3.connect('citi.db')
cursor = connection.cursor()

with open('citibike.csv', 'r') as file:
    no_records = 0
    for row in file:
        cursor.execute("INSERT INTO CitiBikes VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", row.split(","))
        
        connection.commit()
        no_records +=1
        
connection.close()
print('\n{} Records Transferred'.format(no_records))

print("DONE!")


Next time we can do some simple data analysis!

Lets import some libraries

import sqlite3, pandas

import matplotlib.pyplot as plt

Lets connect to our Database:

db = sqlite3.connect('citi.db')

Lets Execute some Simple Queries SQLite using Pandas


pandas.read_sql_query("SELECT COUNT (*) FROM CitiBikes", db)

-> this will return the number of rows from our table CitiBikes

Example of output:

	COUNT (*)
0	1155406

Lets SELECT one column from our table:

pandas.read_sql_query("SELECT Trip_Duration FROM CitiBikes", db)

Example of potential output:

	Trip_Duration
0	695
1	693
2	2059
3	123
4	1521
...	...
1155401	925
1155402	279
1155403	161
1155404	909
1155405	634

1155406 rows × 1 columns


Lets Calculate the average Trip Duration (in seconds)

pandas.read_sql_query("SELECT avg(Trip_Duration) from CitiBikes", db)

Example Output:

	AVG(Trip_Duration)
    
0	1372.570279

The average Trip Duration for this dataset is 22.86 minutes!

Lets add a helper function to make it easier on us when executing Queries

def Q(sql):
    res = pandas.read_sql_query(sql, db, chunksize = 100_000)
    return next(res)

The chunksize is the number of rows we exectute at a time, the next(res) calls for the next chunksize, so 100,000 rows at a time!

res = Q("SELECT * FROM CitiBikes")
res

Awesome, now heres our output, were working with 15 Columns and 100,000 rows at a time!

	Trip_Duration	StartTime	StopTime	StartStationID	StartStationName	StartStationLat	StartStationLong	EndStationID	EndStationName	EndStationLat	EndStationlog	t	Usertype	Birthyear	Gender
0	695	"2013-06-01 00:00:01"	"2013-06-01 00:11:36"	444	"Broadway & W 24 St"	40.742354	-73.989151	434	"9 Ave & W 18 St"	40.7432	-74.0037	19678	"Subscriber"	1983	1
1	693	"2013-06-01 00:00:08"	"2013-06-01 00:11:41"	444	"Broadway & W 24 St"	40.742354	-73.989151	434	"9 Ave & W 18 St"	40.7432	-74.0037	16649	"Subscriber"	1984	1
2	2059	"2013-06-01 00:00:44"	"2013-06-01 00:35:03"	406	"Hicks St & Montague St"	40.695128	-73.995951	406	"Hicks St & Montague St"	40.6951	-73.996	19599	"Customer"	NULL	0
3	123	"2013-06-01 00:01:04"	"2013-06-01 00:03:07"	475	"E 15 St & Irving Pl"	40.735243	-73.987586	262	"Washington Park"	40.6918	-73.9737	16352	"Subscriber"	1960	1
4	1521	"2013-06-01 00:01:22"	"2013-06-01 00:26:43"	2008	"Little West St & 1 Pl"	40.705693	-74.016777	310	"State St & Smith St"	40.6893	-73.9891	15567	"Subscriber"	1983	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
99995	312	"2013-06-09 12:12:18"	"2013-06-09 12:17:30"	476	"E 31 St & 3 Ave"	40.743943	-73.979661	509	"9 Ave & W 22 St"	40.7455	-74.002	15509	"Subscriber"	1984	1
99996	1177	"2013-06-09 12:12:24"	"2013-06-09 12:32:01"	445	"E 10 St & Avenue A"	40.727408	-73.981420	477	"W 41 St & 8 Ave"	40.7564	-73.99	20242	"Subscriber"	1964	1
99997	2110	"2013-06-09 12:12:24"	"2013-06-09 12:47:34"	426	"West St & Chambers St"	40.717548	-74.013221	404	"9 Ave & W 14 St"	40.7406	-74.0055	17437	"Subscriber"	1996	1
99998	1058	"2013-06-09 12:12:25"	"2013-06-09 12:30:03"	521	"8 Ave & W 31 St N"	40.750967	-73.994442	428	"E 3 St & 1 Ave"	40.7247	-73.9878	15407	"Subscriber"	1978	1
99999	1870	"2013-06-09 12:12:29"	"2013-06-09 12:43:39"	419	"Carlton Ave & Park Ave"	40.695807	-73.973556	502	"Henry St & Grand St"	40.7142	-73.9813	17989	"Customer"	NULL	0
100000 rows × 15 columns

More Simple SQL Queries

Lets check what Columns are in our Table (CitiBikes)

res.columns

Output:

Index(['Trip_Duration', 'StartTime', 'StopTime', 'StartStationID',
       'StartStationName', 'StartStationLat', 'StartStationLong',
       'EndStationID', 'EndStationName', 'EndStationLat', 'EndStationlog', 't',
       'Usertype', 'Birthyear', 'Gender'],
      dtype='object')

Ok, lets see what Gender is riding the Bikes

  • 0: Unknown Gender
  • 1: Male
  • 2: Female
res = Q("SELECT Gender, COUNT (*) FROM CitiBikes GROUP BY Gender")
res

Output:

	Gender	COUNT(*)
0	0	481496
1	1	526984
2	2	146926

Lets look at the User Type, which is seperated into:

Customer = 24-hour pass or 3-day pass user (tourist)

Subscriber = Annual Member

Lets Group them by UserType and Gender!

res = Q("SELECT gender, UserType, COUNT(*) from CitiBikes GROUP BY Gender, UserType ")

Output

	Gender	UserType	COUNT(*)
0	0	"Customer"	    480638
1	0	"Subscriber"	858
2	1	"Subscriber"	526984
3	2	"Subscriber"	146926

Ok now that we have the hang of it, lets do some more complex Queries!

Lets find the Ages of the Bike riders!

res = Q("""
 SELECT
(2018 - Birthyear) as Age,
COUNT(*) 
FROM CitiBikes
GROUP BY Age

""")
res

Output

	Age	COUNT(*)
0	21	396
1	22	536
2	23	852
3	24	638
4	25	1504
...	...	...
72	105	26
73	117	28
74	118	166
75	119	66
76	2018	480642
77 rows × 2 columns

Lets see how many 40-year olds are riding bikes!

res = Q("""
 SELECT
(2018 - Birthyear) as Age,
COUNT(*)
FROM CitiBikes
GROUP BY Age
having Age = 40

""")
res

Output:

	Age	COUNT(*)
0	40	24706

Lets see if we can add Gender and see number of 40-year olds are riding Bikes

res = Q("""
SELECT
Gender,(2018 - Birthyear) as Age, 
COUNT(*)
FROM CitiBikes
GROUP BY Gender, Age
having Age = 40

""")
res

Output:

	Gender	Age	COUNT(*)
0	1	40	20126
1	2	40	4580

Lets look at some other Keywords (Order By)

Lets look at the Bike_Id (t) and the Trip_duration (in hours) to see the sum of each Bike used! GROUP BY -> The same Bike_Id (t)

ORDER BY is used to sort the result-set in ascending (ASC) or descending (DESC) order.

Descending Order: High -> Low Ascending Order: Low -> High

res = Q("""

SELECT t, sum(Trip_Duration)/3600
FROM CitiBikes
GROUP BY t
ORDER BY t DESC;

""")
res

Ouput:

	    t	sum(Trip_Duration)/3600
0	    20625	21
1	    20624	18
2	    20623	23
3	    20622	27
4	    20621	8
...	...	...
5789	14534	115
5790	14533	64
5791	14532	73
5792	14531	95
5793	14529	59
5794 rows × 2 columns

Lets look at Ascending Bike_Id order example:

res = Q("""

SELECT t, sum(Trip_Duration) /3600
FROM CitiBikes
GROUP BY t
ORDER BY t ASC;

""")
res

Output:

	    t	sum(Trip_Duration) /3600
0	    14529	59
1	    14531	95
2   	14532	73
3	    14533	64
4	    14534	115
...	...	...
5789	20621	8
5790	20622	27
5791	20623	23
5792	20624	18
5793	20625	21
5794 rows × 2 columns

Lets work on some more Queries so we can prepare to Graph! (histogram)

res = Q (
"""
SELECT t, sum(Trip_Duration) / 3600 as hours, round(sum(Trip_Duration) / 3600 / 100) * 100 as hour_bucket
FROM CitiBikes
GROUP BY t
ORDER By hours DESC

LIMIT 12
"""
)

res

Output:

	        t	hours	hour_bucket

    0	15259	2165	2100.0
    1	17918	1524	1500.0
    2	19866	1058	1000.0
    3	17215	782	    700.0
    4	17806	741	    700.0
    5	17917	619	    600.0
    6	18152	569	    500.0
    7	15043	492	    400.0
    8	19755	416	    400.0
    9	19010	382	    300.0
    10	17282	381	    300.0
    11	15948	380	    300.0


even more complex… (preparing for histogram)

res = Q ("""

SELECT hour_bucket, COUNT(*) from
(
SELECT t, sum(Trip_Duration) / 3600 as hours, round(sum(Trip_Duration) / 3600 / 100) * 100 as hour_bucket
FROM CitiBikes
GROUP BY t
)

GROUP BY hour_bucket

""")
res

Output:

	hour_bucket	COUNT(*)
0	    0.0	        4727
1	    100.0	    1006
2	    200.0	    40
3	    300.0	    12
4	    400.0	    2
5	    500.0	    1
6	    600.0	    1
7	    700.0	    2
8	    1000.0	    1
9	    1500.0	    1
10	    2100.0	    1

Lets Work on Cleaning our Table a little Bit!

Lets create a new table from our existing table and use some Columns to make “cleaner Columns” and such

New table name: Citi_Bike_Clean

New Column: Age and Sex (X,M, or F)

From our Citibikes table and when Age is > 0 and Age < 80 and Trip_Duration < 6000

db.execute("""
CREATE TABLE
    Citi_Bike_Clean AS
SELECT
    (2018 - Birthyear) AS age,
    CASE WHEN gender = 0 THEN "X"
         WHEN gender = 1 THEN "M"
         WHEN gender = 2 THEN "F"
    END as sex,

*
FROM CitiBikes
WHERE age > 0
  AND age <80
  AND Trip_Duration < 6000
  
    
""")

Output

<sqlite3.Cursor at 0x7f9294c12f80>

Lets check out our new table that we made!

res = Q("SELECT * FROM Citi_Bike_Clean")
res

Output:

	age	sex	Trip_Duration	StartTime	StopTime	StartStationID	StartStationName	StartStationLat	StartStationLong	EndStationID	EndStationName	EndStationLat	EndStationlog	BikeId	Usertype	Birthyear	Gender
0	23	M	1404	"2020-02-01 00:00:05.9460"	"2020-02-01 00:23:30.7240"	316	"Fulton St & William St"	40.709560	-74.006536	481	"S 3 St & Bedford Ave"	40.712605	-73.962644	28874	"Customer"	1995	1
1	27	M	1301	"2020-02-01 00:00:06.2230"	"2020-02-01 00:21:48.0580"	237	"E 11 St & 2 Ave"	40.730473	-73.986724	539	"Metropolitan Ave & Bedford Ave"	40.715348	-73.960241	32588	"Subscriber"	1991	1
2	24	M	474	"2020-02-01 00:00:15.7210"	"2020-02-01 00:08:10.3440"	528	"2 Ave & E 31 St"	40.742909	-73.977061	3785	"W 42 St & 6 Ave"	40.754920	-73.984550	41013	"Subscriber"	1994	1
3	45	F	487	"2020-02-01 00:00:21.0520"	"2020-02-01 00:08:28.7520"	380	"W 4 St & 7 Ave S"	40.734011	-74.002939	3263	"Cooper Square & Astor Pl"	40.729515	-73.990753	27581	"Subscriber"	1973	2
4	24	M	619	"2020-02-01 00:00:27.4000"	"2020-02-01 00:10:47.0640"	472	"E 32 St & Park Ave"	40.745712	-73.981948	237	"E 11 St & 2 Ave"	40.730473	-73.986724	29062	"Subscriber"	1994	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
99995	48	M	215	"2020-02-03 17:40:13.0600"	"2020-02-03 17:43:48.1790"	3783	"Cliff St & Fulton St"	40.708380	-74.004950	319	"Fulton St & Broadway"	40.711066	-74.009447	34144	"Subscriber"	1970	1
99996	43	M	459	"2020-02-03 17:40:13.1930"	"2020-02-03 17:47:52.2710"	513	"W 56 St & 10 Ave"	40.768254	-73.988639	3809	"W 55 St & 6 Ave"	40.763189	-73.978434	30136	"Subscriber"	1975	1
99997	19	F	244	"2020-02-03 17:40:13.4040"	"2020-02-03 17:44:18.2250"	426	"West St & Chambers St"	40.717548	-74.013221	328	"Watts St & Greenwich St"	40.724055	-74.009660	21171	"Subscriber"	1999	2
99998	25	F	238	"2020-02-03 17:40:13.6950"	"2020-02-03 17:44:11.9610"	3126	"44 Dr & Jackson Ave"	40.747182	-73.943264	3123	"31 St & Thomson Ave"	40.744697	-73.935404	35312	"Subscriber"	1993	2
99999	49	F	148	"2020-02-03 17:40:13.9880"	"2020-02-03 17:42:42.9180"	3263	"Cooper Square & Astor Pl"	40.729515	-73.990753	403	"E 2 St & 2 Ave"	40.725029	-73.990697	41610	"Subscriber"	1969	2
100000 rows × 17 columns

And there it is, out new Table: Citi_Bike_Clean with the new Columns: age and sex!

Now lets take our new Table and do some stuff with it!

%pylab inline
Populating the interactive namespace from numpy and matplotlib
import seaborn

Lets Group the Age of the riders:

res = Q(
"""
SELECT age, COUNT(*) AS c
FROM Citi_Bike_Clean
GROUP BY age
""")
res

Output:

	age	c
0	14	9
1	15	1102
2	16	1908
3	17	3428
4	18	7969
...	...	...
61	75	669
62	76	588
63	77	416
64	78	553
65	79	105
66 rows × 2 columns

Doing the above example with a limit of 25 ( first 25 ages) just to show “Cleaner” table

res = Q(
"""
SELECT age, COUNT(*) AS c
FROM Citi_Bike_Clean
GROUP BY age
LIMIT 25
""")

res

Output

	age	c
0	14	9
1	15	1102
2	16	1908
3	17	3428
4	18	7969
5	19	11081
6	20	11124
7	21	15293
8	22	23756
9	23	33952
10	24	37243
11	25	39164
12	26	43190
13	27	44856
14	28	45525
15	29	45658
16	30	42212
17	31	37426
18	32	35366
19	33	35985
20	34	32470
21	35	32266
22	36	28700
23	37	26938
24	38	26511

Look its from Age 14 to Age 38!

Lets make a Histogram !

hist(res.age, weights=res.c, bins=len(res))

Output:

inserting an Image

res = Q("""
SELECT
    age,
    sum( CASE WHEN sex="F" then 1 end) as "F",
    sum( CASE WHEN sex="M" then 1 end) as "M"
from 
 Citi_Bike_Clean
GROUP BY 1
""")
res.head()

Output:


	age	F	    M
0	14	3	    6
1	15	151	    951
2	16	284 	1618
3	17	570	    2817
4	18	1406	6404

Lets Plot!!!!

figure(figsize=(5,12))
barh(res.age, res.F/res.F.sum(), color='r', label='female')
barh(res.age, -res.M/res.M.sum(), color ='b', label = 'male')
title('Age Distribution')
ylabel("Age [Years]")
grid()
legend()

Output:

inserting an Image

Lets see how long the trips are (in minutes ) by Gender

more…

res = Q("""
SELECT
    Trip_Duration / 60 AS 'd',
    sum (case when sex = "F" then 1 end) AS "F",
    sum (case when sex = "M" then 1 end) AS "M"
FROM
    Citi_Bike_Clean
GROUP BY 1

    
""")

res

Output:

    Duration	    F	M
0	1	    3720	20979
1	2	    10005	48018
2	3	    15488	64807
3	4	    18645	71395
4	5	    19424	71060
5	6	    19087	66893
6	7	    18071	59863
7	8	    16155	52887
8	9	    14453	45587
9	10  	12956	39497
10	11	    11346	33745
11	12  	9970	29035
12	13  	8920	25128
13	14	    7893	22126
14	15  	6920	19160
15	16  	6208	16653
16	17  	5451	14469
17	18  	4931	12704
18	19	    4325	11520
19	20  	3982	10436
20	21  	3579	9258
21	22  	3244	8451
22	23  	2857	7648
23	24  	2645	6924
24	25  	2469	6451

Lets Graph!

figure(figsize=(5,12))
barh(res.Duration, res.F/res.F.sum(), color='r', label='Female')
barh(res.Duration, -res.M/res.M.sum(), color ='b', label = 'Male')
title('Duration distribution')
ylabel("Duration [Minutes]")
grid()
legend()

Output: inserting an Image

Lets Look at the Stations

res = Q("""
SELECT
    StartStationName,
    StartStationLat AS Lat,
    StartStationLong AS Long,
    COUNT (*) as C
FROM
    Citi_Bike_Clean

GROUP by 1,2,3

res.head()
	StartStationName	Lat	Long	C
0	"1 Ave & E 110 St"	40.792327	-73.938300	619
1	"1 Ave & E 16 St"	40.732219	-73.981656	4382
2	"1 Ave & E 18 St"	40.733812	-73.980544	4097
3	"1 Ave & E 30 St"	40.740000	-73.974000	1
4	"1 Ave & E 30 St"	40.741444	-73.975361	2577


hist(res.Lat, range=(40.6, 40.9), bins = 100)

Output: inserting an Image

hist(res.Long, range = (-74.1, -73.8), bins = 100)

Output: inserting an Image

Lets plot Long vs Lat

plot(res.Long, res.Lat, '.')
title('Long vs Lat')
ylabel("Lat")
xlabel("Long")
grid()

Output:

inserting an Image

What does this Graph above look like? Oooo NYC!!!

inserting an Image


figure(figsize = (10,10))
xstart = -74.04
ystart = 40.645
extent = .18
scatter(res.Long, res.Lat, c = 35*res.C/res.C.max(), cmap ='rainbow')
xlim(xstart, xstart+extent)
ylim(ystart, ystart+extent)
title('Long vs Lat')
ylabel("Lat")
xlabel("Long")
grid()
colorbar()

Output: inserting an Image

Lets make this a function:

def my_plot(x, y, s, c, max_size = 50):

    figure(figsize = (10,10))
    xstart = -74.04
    ystart = 40.645
    extent = .18
    scatter(x, y, s = max_size*s/max(s), c=c, cmap ='rainbow')
    xlim(xstart, xstart+extent)

    ylim(ystart, ystart+extent)
    title('Long vs Lat')
    ylabel("Lat")
    xlabel("Long")
    grid()
    colorbar()


Lets use this function

my_plot(res.Long, res.Lat, res.C, res.C)

inserting an Image

Lets look at Gender and Location

res = Q(
"""
SELECT

    StartStationName,
    StartStationLong as Long,
    StartStationLat as Lat,
    
    AVG(Trip_Duration) as D,
    
    sum (case when sex ="F" then 1 end) as "F",
    sum (case when sex ="M" then 1 end) as "M"
FROM
    Citi_Bike_Clean

GROUP BY 1,2,3

"""
)
res.head()

Output:

	StartStationName	Long	Lat	D	F	M
0	"1 Ave & E 110 St"	-73.938300	40.792327	665.381260	131.0	474.0
1	"1 Ave & E 16 St"	-73.981656	40.732219	592.057052	997.0	3198.0
2	"1 Ave & E 18 St"	-73.980544	40.733812	587.435685	1072.0	2898.0
3	"1 Ave & E 30 St"	-73.975361	40.741444	696.627862	540.0	1944.0
4	"1 Ave & E 30 St"	-73.974000	40.740000	237.000000	1.0	NaN

Lets Plot!

my_plot (res.Long, res.Lat, res.D, 100*res.M/(res.M+res.F))

Ouput: inserting an Image

Looking at the image, the more red the color means there is more Males while the bluer the color means more Females

Lets look at Females, their age, and Location

res = Q(
"""
SELECT

    StartStationName,
    StartStationLong as Long,
    StartStationLat as Lat,
        
    AVG (case when sex ="F" then age end) as "F",
    AVG (case when sex ="M" then age end) as "M",
    COUNT (*) as C
FROM
    Citi_Bike_Clean

GROUP BY 1,2,3

"""
)
my_plot(res.Long, res.Lat, res.F*1,res.F )

inserting an Image

Red is older, while blue is younger!