Spotify ETL Project
Extract Transform Load Project using Spotify API!
- Work in Progress
Goals:
- Learn about ETL Process using Python
- Creating Databases, Star schema for designing Warehouse/Tables
- Writing more complex Queries in SQL (Joins, Functions, Subqueries)
- Work with APIs (Spotify in this problem)
- Work with Apache Airflow to automate and schedule tasks
- Learn about Machine Learning and recommending songs (not finished yet)
We are working with the Spotify API, so we're going to use the Spotipy library.
Project Scope:
- We're going to build an ETL pipeline that extracts the most recently played music on Spotify into a PostgreSQL database using Python.
Tools used:
- Python (Spotipy, pandas, psycopg2, SQLAlchemy, Tweepy, smtplib)
- PostgreSQL
- Apache Airflow
Create an App in the Spotify Developer Dashboard
The first step to working with the Spotify API is to create a new app. Go to the Spotify Developer Dashboard and follow the steps below to create a new app:
- Copy the Spotify Client Id down
- Copy the Spotify Client Secret down
- Go to Edit Settings and under Redirect URIs put http://localhost and then save
Create Database and populate with our Tables in Postgres
First we created a database named Spotify (or whatever you'd like to call it) in Postgres to store our tables in.
CREATE DATABASE Spotify;
You can spend time looking at the responses from different API calls; for this project we will make 4 tables to store the data.
- We're using the star schema for this database since it is the most widely used design for data warehouses. We will create a fact table named spotify_track and three dimension tables named spotify_album, spotify_artists, and spotify_audio_analysis. The fact table references the spotify_album table on album_id, the spotify_artists table on artist_id, and the spotify_audio_analysis table on song_id.
CREATE TABLE spotify_track(
unique_id TEXT PRIMARY KEY NOT NULL,
song_id TEXT NOT NULL,
song_name TEXT,
duration_ms INTEGER,
url TEXT,
date_time_played TIMESTAMP,
album_id TEXT,
artist_id TEXT,
date_time_inserted TIMESTAMP DEFAULT CURRENT_TIMESTAMP -- time we insert it into the table
);
CREATE TABLE spotify_album(
album_id TEXT NOT NULL PRIMARY KEY,
album_name TEXT,
release_date TEXT,
url TEXT,
image_url TEXT
);
CREATE TABLE spotify_artists(
artist_id TEXT PRIMARY KEY,
artist_name TEXT,
url TEXT);
CREATE TABLE spotify_audio_analysis(
song_id TEXT NOT NULL PRIMARY KEY,
acousticness decimal(5,4),
danceability decimal(5,3),
energy decimal(5,3),
liveness decimal(5,3),
loudness decimal(5,2),
tempo decimal(5,2),
valence decimal(5,3),
speechiness decimal(5,5),
instrumentalness decimal(5,5) );
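The CREATE TABLE statements above describe the star-schema relationships but don't enforce them. If you want Postgres to enforce the references, you can add foreign keys; note that the load script later in this post inserts the fact table before the dimension tables, so you would need to reorder the loads (or skip the constraints). A minimal sketch:
-- Optional: enforce the fact -> dimension references.
-- Only works if album/artist rows are always loaded before track rows.
ALTER TABLE spotify_track
    ADD FOREIGN KEY (album_id) REFERENCES spotify_album (album_id),
    ADD FOREIGN KEY (artist_id) REFERENCES spotify_artists (artist_id);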
Now that we’ve created our Database and the tables we can start writing a Python Script to work with Spotify API!
Add the Libraries Needed!
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import pandas as pd
import sys
import psycopg2
from sqlalchemy import create_engine
We are going to be using several Python libraries, including spotipy, which works with the Spotify API!
For our database we're going to be using PostgreSQL, so we will use psycopg2 and sqlalchemy to work with the database inside Python! The last library we will use is pandas, which lets us reshape and clean our data!
Python Code to extract Spotify Data
Connect to Spotify via API
Here we connect to the Spotify API using our client id and client secret. For our project we call current_user_recently_played, which returns the 50 most recently played songs.
# Creating Instance to work with Spotify Api
spotify_client_id = "" ## ADD HERE
spotify_client_secret = "" ## ADD HERE
spotify_redirect_url = "http://localhost"
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=spotify_client_id,
client_secret=spotify_client_secret,
redirect_uri=spotify_redirect_url,
scope="user-read-recently-played"))
# Will return 50 most recent played songs
recently_played = sp.current_user_recently_played(limit=50)
The response returned to us is a JSON object.
Note: using a JSON formatter to organize the JSON response is amazing! Otherwise it's super confusing to read.
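For reference, here is the abridged shape of one element of recently_played['items']; only the fields this project reads are shown, and all the values below are placeholders:
# Abridged shape of one element of recently_played['items'];
# values are placeholders, not real data.
example_item = {
    "played_at": "2022-04-25T17:53:17.481Z",
    "track": {
        "id": "...", "name": "...", "duration_ms": 215000,
        "external_urls": {"spotify": "https://open.spotify.com/track/..."},
        "album": {
            "id": "...", "name": "...", "release_date": "2000-10-31",
            "external_urls": {"spotify": "..."},
            # images[0] is the largest; the code below uses images[1] (300px)
            "images": [{"url": "..."}, {"url": "..."}, {"url": "..."}],
        },
        "artists": [{"id": "...", "name": "...", "external_urls": {"spotify": "..."}}],
    },
}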
Iterate through JSON, Start Creating our Tables
After looking at the JSON, we can parse through the data. Here I use a list for each column, appending to each list as we iterate through the JSON response. At the end we combine all the lists into a dictionary of our columns.
# Creating the Album Data Structure:
album_dict = {}
album_id = []
album_name = []
album_release_date = []
album_url = []
album_image = []
for row in recently_played['items']:
    album_id.append(row['track']['album']['id'])
    album_name.append(row['track']['album']['name'])
    album_release_date.append(row['track']['album']['release_date'])
    album_url.append(row['track']['album']['external_urls']['spotify'])
    album_image.append(row['track']['album']['images'][1]['url'])
album_dict = {'album_id':album_id,'album_name':album_name,'release_date':album_release_date,'url':album_url, 'image_url': album_image }
# Creating the Artist Data Structure !
artist_dict = {}
id_list = []
name_list = []
url_list = []
for row in recently_played['items']:
    id_list.append(row['track']['artists'][0]['id'])
    name_list.append(row['track']['artists'][0]['name'])
    url_list.append(row['track']['artists'][0]['external_urls']['spotify'])
artist_dict = {'artist_id':id_list,'artist_name':name_list,'url':url_list} # Combine
# For Track Data Structure
song_dict = {}
song_id = []
song_name = []
song_duration = []
song_url = []
song_time_played = []
album_id = []
artist_id = []
for row in recently_played['items']:
    song_id.append(row['track']['id'])
    song_name.append(row['track']['name'])
    song_duration.append(row['track']['duration_ms'])
    song_url.append(row['track']['external_urls']['spotify'])
    song_time_played.append(row['played_at'])
    album_id.append(row['track']['album']['id'])
    artist_id.append(row['track']['artists'][0]['id'])  # use the track's artist id so it matches the spotify_artists table
song_dict = {'song_id':song_id,'song_name':song_name,'duration_ms':song_duration,'url':song_url,
'date_time_played':song_time_played,'album_id':album_id,
'artist_id':artist_id
}
Convert Dictionaries to Dataframes
Now we have extracted data from Spotify into 3 dictionaries.
(We will build the audio analysis table later.)
Now we can convert the dictionaries to DataFrames using the pandas library and drop any duplicates using the drop_duplicates function.
album_df = pd.DataFrame.from_dict(album_dict)
album_df = album_df.drop_duplicates(subset=['album_id'])
artist_df = pd.DataFrame.from_dict(artist_dict)
artist_df = artist_df.drop_duplicates(subset=['artist_id'])
Song_df = pd.DataFrame.from_dict(song_dict)
Adjust Song Dataframes
For the Song_df, after converting it to a DataFrame we have to adjust a few things:
- The date/time a song was played at is returned as "2022-04-25T17:53:17.481Z"; in order to convert this we must:
a. Use to_datetime, a function that converts a scalar, array-like, Series or DataFrame/dict-like to a pandas datetime object, which is a Timestamp.
From 2022-04-28T20:46:55.013Z to -> 2022-04-28 20:46:55.013000+00:00
b. Convert to the Eastern time zone.
From 2022-04-28 20:46:55.013000+00:00 to -> 2022-04-28 16:46:55.013000-04:00
c. Remove the timezone offset from the date/time.
From 2022-04-28 16:46:55.013000-04:00 to -> 2022-04-28 16:46:55.01300
d. Convert back to the datetime datatype.
e. Create a Unix timestamp from the time played, which will help us build a unique identifier; this adds another column to our dataframe.
f. Now we can create a value that is unique for each time a song is played! We name the new column unique_id; the first half is our song_id and the second half is the UNIX value we just created above.
Example: 7lQ8MOhq6IN2w8EYcFNSUk + 1651165753 = 7lQ8MOhq6IN2w8EYcFNSUk-1651165753
g. Finally, select the columns for our dataframe with unique_id first (this also drops the helper UNIX_Time_Stamp column).
#a
Song_df['date_time_played'] = pd.to_datetime(Song_df['date_time_played'])
#b
Song_df['date_time_played'] = Song_df['date_time_played'].dt.tz_convert('US/Eastern')
#c
Song_df['date_time_played'] = Song_df['date_time_played'].astype(str).str[:-7]
#d
Song_df['date_time_played'] = pd.to_datetime(Song_df['date_time_played'])
#e
Song_df['UNIX_Time_Stamp'] = (Song_df['date_time_played'] - pd.Timestamp("1970-01-01"))//pd.Timedelta('1s')
#f
Song_df['unique_id'] = Song_df['song_id'] + "-" + Song_df['UNIX_Time_Stamp'].astype(str)
#g
Song_df = Song_df[['unique_id','song_id','song_name','duration_ms','url','date_time_played','album_id','artist_id']]
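As an aside, steps c and d can be collapsed into a single call with dt.tz_localize(None), which strips the UTC offset while keeping the column as a datetime (and keeps full microsecond precision):
# Equivalent to steps c + d: drop the -04:00 offset, keep the datetime64 dtype.
Song_df['date_time_played'] = Song_df['date_time_played'].dt.tz_localize(None)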
Connect to Postgres Database
Now with our Dataframes cleaned and ready, we can connect to our Spotify Database that we made earlier and populate it with the data from our Dataframes.
Let's first connect to our Postgres database using psycopg2, where:
- host = localhost (on my laptop)
- dbname = name of the database we made above (Spotify)
- port = whatever port your Postgres server is set to
- user = whatever user you used in your Postgres database for the DB you made above
In engine = create_engine('postgresql+psycopg2://@/') we fill in the blanks to get postgresql+psycopg2://user@host:port/dbname, using the same user, host, port, and dbname as above.
More on the psycopg2 library can be found in its documentation.
conn = psycopg2.connect(host = "",user = "", port="", dbname = "")
cur = conn.cursor()
engine = create_engine('postgresql+psycopg2://@/')
conn_eng = engine.raw_connection()
cur_eng = conn_eng.cursor()
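Filled in with hypothetical values (a local server on port 5432 and a user named postgres; substitute your own), that looks like:
# Hypothetical values; substitute your own host, user, port, and dbname.
conn = psycopg2.connect(host="localhost", user="postgres", port="5432", dbname="spotify")
cur = conn.cursor()
engine = create_engine('postgresql+psycopg2://postgres@localhost:5432/spotify')
conn_eng = engine.raw_connection()
cur_eng = conn_eng.cursor()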
Add Data to SQL Tables
Now that we are connected to our Postgres database we can use the engine to execute some SQL queries.
A strategy we can use to import data into the SQL tables is to create a temp table (which mirrors its target table), load the dataframe into the temp table, and then insert from the temp table into the actual table, skipping any rows that already exist. We will use this method below for inserting data into our tables!
# Importing the Song_df into the SQL table
cur_eng.execute(
"""
CREATE TEMP TABLE temp_track AS SELECT * FROM spotify_track LIMIT 0
""")
Song_df.to_sql("temp_track", con = engine, if_exists='append', index = False)
# Now we can move the data from the temp table into our actual table
cur.execute(
"""
INSERT INTO spotify_track
SELECT temp_track.*
FROM temp_track
LEFT JOIN spotify_track USING (unique_id)
WHERE spotify_track.unique_id IS NULL;
DROP TABLE temp_track
""")
conn.commit()
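One caveat: TEMP tables in Postgres are session-scoped, and the snippet above spreads the work across three separate connections (cur_eng, the pooled connection to_sql checks out from the engine, and cur), so a temp table created in one session isn't visible to the others. A more robust variant keeps every step on a single connection; here is a minimal sketch, assuming SQLAlchemy 1.4+ (note SELECT temp_track.*, so the join doesn't duplicate the columns):
# Single-connection variant: the TEMP table, the to_sql load, and the
# INSERT all run on one connection; engine.begin() commits at the end.
with engine.begin() as connection:
    connection.exec_driver_sql(
        "CREATE TEMP TABLE temp_track AS SELECT * FROM spotify_track LIMIT 0"
    )
    Song_df.to_sql("temp_track", con=connection, if_exists="append", index=False)
    connection.exec_driver_sql(
        """
        INSERT INTO spotify_track
        SELECT temp_track.*
        FROM temp_track
        LEFT JOIN spotify_track USING (unique_id)
        WHERE spotify_track.unique_id IS NULL
        """
    )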
We can do the same for the other two tables.
cur_eng.execute(
"""
CREATE TEMP TABLE temp_album AS SELECT * FROM spotify_album LIMIT 0
""")
album_df.to_sql("temp_album", con = engine, if_exists='append', index = False)
conn_eng.commit() # commit the temp table creation
#Moving from Temp Table to Production Table
cur.execute(
"""
INSERT INTO spotify_album
SELECT temp_album.*
FROM temp_album
LEFT JOIN spotify_album USING (album_id)
WHERE spotify_album.album_id IS NULL;
DROP TABLE temp_album""")
conn.commit()
# Artists Table
cur_eng.execute(
"""
CREATE TEMP TABLE temp_artist AS SELECT * FROM spotify_artists LIMIT 0
""")
artist_df.to_sql("temp_artist", con = engine, if_exists='append', index = False)
conn_eng.commit()
#Moving data from temp table to production table
cur.execute(
"""
INSERT INTO spotify_artists
SELECT temp_artist.*
FROM temp_artist
LEFT JOIN spotify_artists USING (artist_id)
WHERE spotify_artists.artist_id IS NULL;
DROP TABLE temp_artist""")
conn.commit()
Now we can move onto the other table in our database.
Spotify Audio Analysis
In a similar process we will connect to the Spotify API again and use the recently played tracks to grab each song_id and song_name and place them into lists. We will loop through the song ids, look up each song with the .audio_features function from the Spotipy library, and extract audio features including acousticness, danceability, liveness, and energy for each song.
# Connect to Spotify Again
spotify_client_id = "" ## ADD HERE
spotify_client_secret = "" ## ADD HERE
spotify_redirect_url = "http://localhost"
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=spotify_client_id,
client_secret=spotify_client_secret,
redirect_uri=spotify_redirect_url,
scope="user-read-recently-played"))
recently_played = sp.current_user_recently_played(limit=50)
# Collect each song id and name into lists
song_id = []
song_name = []
for row in recently_played['items']:
    song_id.append(row['track']['id'])
    song_name.append(row['track']['name'])
track_analysis = {}
song_id_list = []
acousticness = []
danceability = []
energy = []
liveness = []
loudness = []
tempo = []
valence = []
instrumentalness = []
speechiness = []
for song in song_id:
    analysis_ = sp.audio_features(song)
    song_id_list.append(song)
    # Append each audio feature to its list
    acousticness.append(analysis_[0]['acousticness'])
    danceability.append(analysis_[0]['danceability'])
    energy.append(analysis_[0]['energy'])
    liveness.append(analysis_[0]['liveness'])
    loudness.append(analysis_[0]['loudness'])
    tempo.append(analysis_[0]['tempo'])
    valence.append(analysis_[0]['valence'])
    speechiness.append(analysis_[0]['speechiness'])
    instrumentalness.append(analysis_[0]['instrumentalness'])
track_analysis = {'song_id':song_id_list,'acousticness':acousticness,'danceability':danceability, 'energy': energy,'liveness': liveness, 'loudness' : loudness, 'tempo': tempo,'valence': valence, 'instrumentalness': instrumentalness, 'speechiness': speechiness }
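Side note: audio_features also accepts a list of track ids (up to 100), so the per-song loop above could be replaced with a single batched call, roughly:
# One request instead of 50; each element of analysis is a feature dict.
analysis = sp.audio_features(song_id)
acousticness = [a['acousticness'] for a in analysis]
danceability = [a['danceability'] for a in analysis]
# ...and so on for the other features.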
# Convert from dictionary to dataframe using pandas, then drop duplicate song_ids
# (the default keep='first' keeps one row per song; keep=False would drop repeated songs entirely)
track_analysis_df = pd.DataFrame.from_dict(track_analysis)
track_analysis_df = track_analysis_df.drop_duplicates(subset=['song_id'])
Add Audio Features to Postgres Database
Just like what we did for the other 3 tables, we will connect to the Postgres database and add the audio features to the table.
'''Connect to our Postgres Database'''
conn = psycopg2.connect(host = "localhost", user = "devinpowers",port="5433", dbname = "spotify")
cur = conn.cursor()
engine = create_engine('postgresql+psycopg2://devinpowers@localhost:5433/spotify') # this is
conn_eng = engine.raw_connection()
cur_eng = conn_eng.cursor()
cur_eng.execute(
"""
CREATE TEMP TABLE IF NOT EXISTS tmp_spotify_audio_new AS SELECT * FROM spotify_audio_analysis LIMIT 0
""")
track_analysis_df.to_sql("tmp_spotify_audio_new", con = engine, if_exists='append', index = False)
#Moving data from temp table to production table
cur.execute(
"""
INSERT INTO spotify_audio_analysis
SELECT * FROM tmp_spotify_audio_new
WHERE tmp_spotify_audio_new.song_id NOT IN (SELECT song_id FROM spotify_audio_analysis );
DROP TABLE tmp_spotify_audio_new
""")
conn.commit()
Now we have extracted and transformed all our data from Spotify into our Postgres database!
SQL Queries to Extract information from our Spotify Database
With our database full of information, we can run queries (and turn them into functions) to show statistics from our most recently played music (e.g., our top 10 artists played over the past 7 days).
Let's look at some sample queries that we can run against our database.
Top 10 Albums Played this week
Return the top 10 Albums that I listened to this week and a count of the number of times I listened to the album!
SELECT artist.artist_name AS Artist_name, album.album_name AS Album_Name, COUNT(track) AS Number_of_Plays
FROM spotify_track AS track
INNER JOIN spotify_album AS album ON track.album_id = album.album_id
INNER JOIN spotify_artists AS artist ON track.artist_id = artist.artist_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY album.album_name, Artist_name
ORDER BY Number_of_Plays DESC
LIMIT 10;
Output if we ran query above
artist_name | album_name | number_of_plays |
---|---|---|
Outkast | Stankonia | 34 |
Outkast | Speakerboxxx/The Love Below | 29 |
The Rolling Stones | Aftermath | 16 |
The Rolling Stones | Some Girls | 14 |
The Rolling Stones | Exile On Main Street (2010 Re-Mastered) | 11 |
Metallica | Garage, Inc. | 10 |
Outkast | ATLiens | 10 |
Janis Joplin | Pearl (Legacy Edition) | 9 |
Red Hot Chili Peppers | Stadium Arcadium | 8 |
Lil Durk | Broadway Girls (feat. Morgan Wallen) | 7 |
Top 10 Artists Played this week
Returns the Top 10 Artists I listened to this week and a count of the number of times I listened to them!
SELECT artist.artist_name, COUNT(track) AS Number_of_Plays
FROM spotify_track AS track
INNER JOIN spotify_artists AS artist ON track.artist_id=artist.artist_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY artist.artist_name
ORDER BY Number_of_Plays DESC
LIMIT 10;
Top 5 Most Danceable Songs listened to
Returns the Top 5 songs that had the highest danceability score from the past week.
SELECT st.song_name, sa.artist_name, aas.danceability
FROM spotify_track AS st
INNER JOIN spotify_artists AS sa ON st.artist_id = sa.artist_id
JOIN spotify_audio_analysis AS aas ON st.song_id = aas.song_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY st.song_name, sa.artist_name, danceability
ORDER BY danceability DESC
LIMIT 5;
Top 4 Albums Played with Links
Return the top 4 albums of the week along with the album URL and the album cover image URL.
SELECT album.album_name, album.url AS album_url, album.image_url AS album_cover, COUNT(track.*)::INT AS number_plays
FROM spotify_track AS track
INNER JOIN spotify_album AS album ON track.album_id = album.album_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY album.album_name, album.url, album.image_url
ORDER BY number_plays DESC
LIMIT 4;
Total Time
Returns the total time spent listening to Spotify in the past week, in hours (duration_ms / 3,600,000 converts milliseconds to hours).
SELECT ROUND(SUM(CAST (duration_ms AS decimal)/3600000),2) AS total_hours_listened
FROM spotify_track
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days';
Most Popular Songs and Artists by Number of Plays
Returns the 5 most popular songs listened to this week along with the artist name and the count of times played.
SELECT st.song_name, sa.artist_name, COUNT(st.*) AS times_played
FROM spotify_track AS st
INNER JOIN spotify_artists AS sa
ON st.artist_id = sa.artist_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY st.song_name, sa.artist_name
ORDER BY times_played DESC
LIMIT 5;
5 Loudest Songs of the Week
Return the top 5 loudest songs; loudness is measured in dB (typically between -60 and 0), so values closer to 0 are louder.
SELECT st.song_name, sa.artist_name, aas.loudness
FROM spotify_track AS st
INNER JOIN spotify_artists AS sa ON st.artist_id = sa.artist_id
JOIN spotify_audio_analysis AS aas ON st.song_id = aas.song_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY st.song_name, sa.artist_name, loudness
ORDER BY loudness DESC
LIMIT 5;
Top 5 Good Feeling Songs listened to the last 7 days (High Valence)
Return the top 5 feel-good songs of the week!
SELECT st.song_name, sa.artist_name, aas.valence
FROM spotify_track AS st
INNER JOIN spotify_artists AS sa ON st.artist_id = sa.artist_id
JOIN spotify_audio_analysis AS aas ON st.song_id = aas.song_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY st.song_name, sa.artist_name, valence
ORDER BY valence DESC
LIMIT 5;
Top Decades
Let's create a view in SQL that uses CASE statements to find which decade each of our songs was released in.
We can derive the release year from our release_date column:
CAST(SPLIT_PART(release_date,'-',1) AS INT) will split dates from YYYY-MM-DD down to YYYY and save the year as an integer.
Where:
PostgreSQL CAST is used to convert from one data type into another.
PostgreSQL SPLIT_PART is used to split a string: SPLIT_PART(string, delimiter, position), where string is the string to be split, delimiter is the string used as the delimiter for splitting, and position is the part to return (see the example below).
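For example, the expression turns '1976-04-23' into the integer 1976:
-- '1976-04-23' -> 1976
SELECT CAST(SPLIT_PART('1976-04-23', '-', 1) AS INT) AS release_year;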
Then we build the view on top of a subquery that computes release_year.
CREATE OR REPLACE VIEW track_decades AS
SELECT *,
CASE
WHEN subquery.release_year >= 1940 AND subquery.release_year <= 1949 THEN '1940''s'
WHEN subquery.release_year >= 1950 AND subquery.release_year <= 1959 THEN '1950''s'
WHEN subquery.release_year >= 1960 AND subquery.release_year <= 1969 THEN '1960''s'
WHEN subquery.release_year >= 1970 AND subquery.release_year <= 1979 THEN '1970''s'
WHEN subquery.release_year >= 1980 AND subquery.release_year <= 1989 THEN '1980''s'
WHEN subquery.release_year >= 1990 AND subquery.release_year <= 1999 THEN '1990''s'
WHEN subquery.release_year >= 2000 AND subquery.release_year <= 2009 THEN '2000''s'
WHEN subquery.release_year >= 2010 AND subquery.release_year <= 2019 THEN '2010''s'
WHEN subquery.release_year >= 2020 AND subquery.release_year <= 2029 THEN '2020''s'
ELSE 'Other'
END AS decade
FROM
(SELECT album.album_id, album.album_name, album.release_date, track.unique_id, track.date_time_played, track.song_name, CAST(SPLIT_PART(release_date,'-',1) AS INT) AS release_year
FROM spotify_album AS album
INNER JOIN spotify_track AS track ON track.album_id = album.album_id) AS subquery;
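With the view in place, a query like the one below returns plays per decade (roughly what a function such as the function_last_7_days_top_decades used later would wrap):
SELECT decade, COUNT(*)::INT AS times_played
FROM track_decades
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY decade
ORDER BY times_played DESC;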
Turning Queries into Functions
We can turn any query into a function (a stored procedure) that we can reuse later when we're sending out our email.
We use ::INT in Postgres to cast the count from one data type to another, here to INT.
CREATE FUNCTION function_last_7_days_artist_played()
RETURNS TABLE (name TEXT, number_plays INT) LANGUAGE plpgsql AS $$
BEGIN
RETURN query
SELECT art.artist_name, COUNT(track.*)::INT AS number_plays
FROM spotify_track AS track
INNER JOIN spotify_artists AS art ON track.artist_id=art.artist_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY art.artist_name
ORDER BY number_plays DESC
LIMIT 5;
END;$$
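Once created, the function can be queried just like a table:
SELECT * FROM function_last_7_days_artist_played();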
Creating an Email to Send
In the next step of the ETL project we can write a Python script that sends us an email with our stats from the week. You can learn about sending emails from the Automate the Boring Stuff page.
To send an email via smtplib through Gmail, you have to create a Google app password, which you use to log in.
Connecting to Postgres Database
First we must connect to our database and then we can use the functions and run them on our database.
import psycopg2
conn = psycopg2.connect(host = "localhost", user = "devinpowers",port="5433", dbname = "spotify")
cur = conn.cursor()
Create Lists to Store Information
Once connected to our database we can use the callproc method to run our functions (saved queries) against the database and store the returned values in lists.
We use the fetchall() method to pull the rows out of each query's output and put the data into lists. Later we take our lists and print them out using tabulate.
from datetime import datetime, timedelta
today = datetime.today().date()
six_days_ago = today - timedelta(days=6)
#Top 5 Songs by Time Listened (MIN)
top_5_songs_min = []
cur.callproc('function_last_7_days_top_5_songs_duration')
for row in cur.fetchall():
    song_name = row[0]
    min_listened = float(row[1])
    element = [song_name, min_listened]
    top_5_songs_min.append(element)
# Finding the time in hours spent listening the past 7 days
cur.callproc('function_last_7_days_hrs_listened_')
total_time_listened_hrs = float(cur.fetchone()[0]) # Convert to float
#Top 5 Songs and Artists by Times Played
top_songs_art_played = []
cur.callproc('function_last_7_days_songs_artist_played')
for row in cur.fetchall():
    song_name = row[0]
    artist_name = row[1]
    times_played = int(row[2])
    element = [song_name, artist_name, times_played]
    top_songs_art_played.append(element)
#Top Decades:
top_decade_played = []
cur.callproc('function_last_7_days_top_decades')
for row in cur.fetchall():
    decade = row[0]
    times_played = int(row[1])
    element = [decade, times_played]
    top_decade_played.append(element)
#Top Artists Played
top_art_played = []
cur.callproc('function_last_7_days_artist_played')
for row in cur.fetchall():
    artist_name = row[0]
    times_played = int(row[1])
    element = [artist_name, times_played]
    top_art_played.append(element)
#Collecting Top Albums of the Week
album_url = []
track_url= []
album_name = []
cur.callproc('function_albums')
for row in cur.fetchall():
    album_name.append(row[0])
    album_url.append(row[1])
    track_url.append(row[2])
Creating the Email
The email body is formatted using HTML.
import smtplib,ssl
import json
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from tabulate import tabulate
#Sending the Email:
port = 465  # standard port for SMTP over SSL
password = ""
sender_email = "devinjpowers@gmail.com"
receiver_email = "powers88@msu.edu"
message = MIMEMultipart("alternative")
message["Subject"] = f"Spotify - Weekly Roundup for - {today}"
message["From"] = sender_email
message["To"] = receiver_email
text = f"""\
Here are your stats for your weekly round up for Spotify.
Dates included: {six_days_ago} - {today}:
"""
html = f"""\
<html>
<body>
<h4>
Hello Mr.Powers, here are your stats for your weekly round up for Spotify.
</h4>
<h4>
Dates included: {six_days_ago} - {today}
</h4>
<h4>
Total Time Listened: {total_time_listened_hrs} hours.
</h4>
<h4>
Here are your Top Albums of the Week:
</h4>
<br>
1. {album_name[0]}
<br>
2. {album_name[1]}
<br>
3. {album_name[2]}
<br>
4. {album_name[3]}
</br>
<a href = {album_url[0]}>
<img src= {track_url[0]} alt="idk" width="300" height="300" class = "center">
</a>
<a href = {album_url[1]}>
<img src= {track_url[1]} alt="idk" width="300" height="300" class = "center">
</a>
<a href = {album_url[2]}>
<img src= {track_url[2]} alt="idk" width="300" height="300" class = "center">
</a>
<a href = {album_url[3]}>
<img src= {track_url[3]} alt="idk" width="300" height="300" class = "center">
</a>
<p>
<h4>
You listened to these songs and artists a lot here are your top 5!
</h4>
{tabulate(top_songs_art_played, headers = ["Song Name", "Artist Name", "Times Played"], tablefmt='html')}
<h4>
You spend a lot of time listening to these songs!
</h4>
{tabulate(top_5_songs_min, headers = ['Song Name', 'Time (Min)'], tablefmt='html')}
<h4>
Here are your Top 10 played artists!
</h4>
{tabulate(top_art_played, headers = ['Artist Name','Times Played'], tablefmt='html')}
<h4>
Here are your Top 10 most played albums!
</h4>
{tabulate(top_albums_played, headers = ['Album Name', 'Times Played'], tablefmt='html')}
<h4>
Here are your top decades:
</h4>
{tabulate(top_decade_played, headers = ['Decade','Times Played'], tablefmt='html')}
<h4>
Lastly your top danceable songs are here as the following:
</h4>
{tabulate(top_dance_songs, headers = ['Song Name', 'Artist', 'Danceability'], tablefmt='html')}
<h4>
Lets Dance
</h4>
<a href = {dance_album_url[0]}>
<img src= {dance_album_cover[0]} alt="idk" width="300" height="300" class = "center">
</a>
</p>
<br>
<h4>
Machine Learning using recent played music to recommend new music....coming soon...
</h4>
</br>
</body>
</html>"""
# Send Email!
part1 = MIMEText(text,"plain")
part2 = MIMEText(html,"html")
message.attach(part1)
message.attach(part2)
context = ssl.create_default_context()
with smtplib.SMTP_SSL("smtp.gmail.com", port, context = context) as server:
    server.login("devinjpowers@gmail.com", password)
    server.sendmail(sender_email, receiver_email, message.as_string())
When you click one of the 4 album cover images in the email, a hyperlink sends you to the album's Spotify page! Check it out below:
Example email sent:
Tweet your Top Played Songs
In order to tweet via a Twitter account, you need to set up a Twitter Developer account, create an app, and grab the API keys and tokens.
import psycopg2
import smtplib,ssl
import json
from tabulate import tabulate
from datetime import datetime, timedelta
#Importing the module for Twitter!
import tweepy
# Keep Secret!
consumer_key=""
consumer_secret_key=""
access_token = ""
access_token_secret = ""
# Connect to my database and then run Queries over
conn = psycopg2.connect(host = "localhost", user = "devinpowers",port="5433", dbname = "spotify")
cur = conn.cursor()
today = datetime.today().date()
# Top 5 Songs I listened to this past Week
top_5_songs_min = []
# callproc: this method calls the stored procedure named by the proc_name argument.
cur.callproc('function_last_7_days_top_5_songs_duration')
for row in cur.fetchall():
    song_name = row[0]
    min_listened = float(row[1])
    element = [song_name, min_listened]
    top_5_songs_min.append(element)
# Find Album cover photo
album_url = []
track_url= []
album_name = []
cur.callproc('function_albums')
for row in cur.fetchall():
    album_name.append(row[0])
    album_url.append(row[1])
    track_url.append(row[2])
# Pull each of the top 5 song names out of the list
song1 = top_5_songs_min[0][0]
song2 = top_5_songs_min[1][0]
song3 = top_5_songs_min[2][0]
song4 = top_5_songs_min[3][0]
song5 = top_5_songs_min[4][0]
# Call the function to return the total time listened this past week
cur.callproc('function_last_7_days_hrs_listened_')
total_time_listened_hrs = float(cur.fetchone()[0])
# Twitter Stuff!!
auth = tweepy.OAuthHandler(consumer_key, consumer_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
tweet1 = f"""\
This week I've listened to {total_time_listened_hrs} hours of music! Here are my top 5 songs of the week!
1. {song1}
2. {song2}
3. {song3}
4. {song4}
5. {song5}
"""
tweet2 = f"""\
Link to Top Album of the Week
{album_url[0]}
"""
# List of tweets to send
tweets = [tweet1, tweet2]
try:
    for tweet in tweets:
        api.update_status(tweet)
        print("Posted!")
except tweepy.error.TweepError as e:
    print(e)
Tweet Output:
This week i've listen to 22.35 of music! Here are my top 5 songs of the week!!
— Austin (@notdevinpowers) April 30, 2022
1. Ms. Jackson
2. Hey Ya!
3. So Fresh, So Clean
4. Whiskey In The Jar
5. Roses
Link to Top Album of the Week!
— Austin (@notdevinpowers) April 30, 2022
https://t.co/U8qOajsGnn
Automate with Airflow
We can automate all of our scripts using Apache Airflow!
Note: I haven't really tested these…
from datetime import timedelta
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
from spotify_etl_pipeline import etl_func1, etl_func2 # assuming we name our functions etl_func1 and etl_func2
my_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': days_ago(2),
'email': ['devinjpowers@gmail.com'],
'email_on_failure': True,
'email_on_retry': True,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
my_dag = DAG(
'spotify_dag',
default_args = my_args,
description= 'Spotify ETL Process',
schedule_interval= '0 14 * * *'
)
run_etl1 = PythonOperator(
task_id='spotify_etl_1',
python_callable= etl_func1,
dag=my_dag
)
run_etl2 = PythonOperator(
task_id='spotify_etl_2',
python_callable= etl_func2,
dag=my_dag
)
run_etl1 >> run_etl2 # Set the task order (it doesn't really matter which one runs first!)
Dag to Email
DAG to send our email!
from datetime import timedelta
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
from Spotify_Email import spotify_etl_email
my_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': days_ago(2),
'email': ['devinjpowers@gmail.com'],
'email_on_failure': True,
'email_on_retry': True,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
my_dag = DAG(
'spotify_etl_email',
default_args = my_args,
description= 'Spotify Weekly Email',
schedule_interval= '5 14 * * 0'
)
run_email = PythonOperator(
task_id='spotify_email_weekly_thing',
python_callable= spotify_etl_email,
dag=my_dag
)
run_email
Dag to Tweet
This DAG will run the function that sends the tweets!
from datetime import timedelta
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago
from Spotify_Twitter import spotify_etl_python # The function that has the twitter script
my_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': days_ago(2),
'email': ['devinjpowers@gmail.com'],
'email_on_failure': True,
'email_on_retry': True,
'retries': 1,
'retry_delay': timedelta(minutes=5),
}
my_dag = DAG(
'spotify_etl_twitter',
default_args = my_args,
description= 'Spotify Weekly Tweets',
schedule_interval= '5 14 * * 0'
)
# Use PythonOperator to send our program as a Dag!
run_twitter = PythonOperator(
task_id='twitter_function',
python_callable= spotify_etl_python,
dag=my_dag
)
run_twitter
Machine Learning to Create Playlists from our Spotify Recently Played Playlist
Note: this is a work in Progress
Let's add on to our project by using machine learning to recommend songs and include them in our weekly email!
I've seen many videos of people using audio stats from Spotify songs plus existing datasets (from Kaggle) to build music recommendation systems.
- For this part of the project I will use cosine similarity, which measures the cosine of the angle between two vectors projected in a multi-dimensional space; see the sketch below.
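To make that concrete, here is a minimal sketch: each song is a vector of audio features, and sim(a, b) = (a · b) / (||a|| ||b||), so values near 1 mean the songs point in nearly the same direction in feature space. The feature values below are hypothetical:
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # (a . b) / (||a|| * ||b||), ranges from -1 to 1
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature vectors (danceability, energy, valence) for two songs
song_a = np.array([0.80, 0.65, 0.70])
song_b = np.array([0.75, 0.60, 0.90])
print(cosine_similarity(song_a, song_b))  # ~0.99 -> very similar songs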
Step 0: Explore our Spotify Data
Connect to our Postgres Database
import pandas as pd
import sys
import psycopg2
from sqlalchemy import create_engine
conn = psycopg2.connect(host = "localhost", user = "devinpowers",port="5432", dbname = "spotify")
cur = conn.cursor()
engine = create_engine('postgresql+psycopg2://devinpowers@localhost:5432/spotify')
conn_eng = engine.raw_connection()
cur_eng = conn_eng.cursor()
Run a Query to Retrieve All Songs Played in the Past 7 Days
cur = conn.cursor()
sql = '''SELECT song_name,COUNT(spotify_track.*)::INT AS times_played, acousticness, danceability, energy, liveness, loudness, tempo, valence, speechiness
FROM spotify_audio_analysis
JOIN spotify_track ON spotify_audio_analysis.song_id = spotify_track.song_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY song_name, acousticness, danceability, energy, liveness, loudness, tempo, valence, speechiness
ORDER BY times_played DESC
'''
cur.execute(sql)
results = cur.fetchall()
Work with the dataframe to clean and prepare for exploring.
# Convert list in Python to a dataframe using Pandas
df = pd.DataFrame(results)
df.columns = ['Song Name','times_played', 'acousticness', 'danceability', 'energy', 'liveness', 'loudness', 'tempo','valence', 'speechiness']
# Optional: Can save dataframe as a CSV file
#df.to_csv(r"/Users/Devinpowers/Desktop/Spotify-Projects/testing-new.csv", index=False)
## Drop columns: Song Name and Times Played
df = df.drop(columns= ['Song Name', 'times_played'])
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 acousticness 84 non-null float64
1 danceability 84 non-null float64
2 energy 84 non-null float64
3 liveness 84 non-null float64
4 loudness 84 non-null float64
5 tempo 84 non-null float64
6 valence 84 non-null float64
7 speechiness 84 non-null float64
dtypes: float64(8)
memory usage: 5.4 KB
Standardize Data
The features are on very different scales (loudness is in dB, tempo in BPM, most others between 0 and 1), so we standardize each column to zero mean and unit variance:
df_new = (df - df.mean()) / df.std()
df_new.head()
Output:
acousticness danceability energy liveness loudness tempo valence speechiness label
0 -0.671343 -0.432540 0.720365 -0.708034 0.343690 0.040517 -1.230519 -0.327739 0.037475
1 -0.687017 -0.155617 0.594302 -0.716424 -0.323584 0.239459 0.240546 -0.311314 0.037475
2 -0.699557 -0.437669 1.263410 -0.183638 1.052299 0.596373 -0.120128 -0.353524 0.037475
3 -0.669103 0.300794 0.827035 0.105828 0.249208 -0.324517 0.471540 -0.341688 0.037475
4 -0.566994 1.649514 0.109440 -0.179443 1.323933 -0.729783 0.187863 -0.359562 -1.311636
Let's use the Plotly package
import plotly.express as px
fig = px.scatter_matrix(df, width=1200, height=1600)
fig.show()
Output:
Let's Create a Scatter Plot with 4 Features
fig1 = px.scatter(df, x="energy", y="liveness", color="tempo",size="valence")
fig1.update_layout(title="4 Features Representation")
fig1.show()
Output:
Let's Create a 3D Scatter Plot with 5 Features
fig2 = px.scatter_3d(df, x="energy", y="liveness", z="loudness",color="tempo",size="valence")
fig2.update_layout(title="5 Features Representation")
fig2.show()
Output:
Another Plot: K-Means Clusters
We can cluster the standardized songs with k-means and plot the cluster averages on a radar chart (this needs scikit-learn):
from sklearn.cluster import KMeans

X = df_new
kmeans = KMeans(
n_clusters=8, init="k-means++",
n_init=10,
tol=1e-04, random_state=42
)
kmeans.fit(X)
clusters=pd.DataFrame(X,columns=X.columns)
clusters['label']=kmeans.labels_
polar=clusters.groupby("label").mean().reset_index()
polar=pd.melt(polar,id_vars=["label"])
fig4 = px.line_polar(polar, r="value", theta="variable", color="label", line_close=True,height=800,width=1400)
fig4.show()
Output:
Step 1: Data Preparation
We will be using a dataset from Kaggle. Now we can explore this dataset and prepare it to build our machine learning model.
Import the libraries needed:
import pandas as pd
import numpy as np
import json
import re
Open the file in a pandas dataframe:
- Open the file downloaded from Kaggle
spotify_df = pd.read_csv('data.csv')
spotify_df.head()
Output:
 | valence | year | acousticness | artists | danceability | duration_ms | energy | explicit | id | instrumentalness | key | liveness | loudness | mode | name | popularity | release_date | speechiness | tempo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0594 | 1921 | 0.982 | [‘Sergei Rachmaninoff’, ‘James Levine’, ‘Berli… | 0.279 | 831667 | 0.211 | 0 | 4BJqT0PrAfrxzMOxytFOIz | 0.878000 | 10 | 0.665 | -20.096 | 1 | Piano Concerto No. 3 in D Minor, Op. 30: III. … | 4 | 1921 | 0.0366 | 80.954 |
1 | 0.9630 | 1921 | 0.732 | [‘Dennis Day’] | 0.819 | 180533 | 0.341 | 0 | 7xPhfUan2yNtyFG0cUWkt8 | 0.000000 | 7 | 0.160 | -12.441 | 1 | Clancy Lowered the Boom | 5 | 1921 | 0.4150 | 60.936 |
2 | 0.0394 | 1921 | 0.961 | [‘KHP Kridhamardawa Karaton Ngayogyakarta Hadi… | 0.328 | 500062 | 0.166 | 0 | 1o6I8BglA6ylDMrIELygv1 | 0.913000 | 3 | 0.101 | -14.850 | 1 | Gati Bali | 5 | 1921 | 0.0339 | 110.339 |
3 | 0.1650 | 1921 | 0.967 | [‘Frank Parker’] | 0.275 | 210000 | 0.309 | 0 | 3ftBPsC5vPBKxYSee08FDH | 0.000028 | 5 | 0.381 | -9.316 | 1 | Danny Boy | 3 | 1921 | 0.0354 | 100.109 |
4 | 0.2530 | 1921 | 0.957 | [‘Phil Regan’] | 0.418 | 166693 | 0.193 | 0 | 4d6HGyGT8e121BsdKmw9v6 | 0.000002 | 3 | 0.229 | -10.096 | 1 | When Irish Eyes Are Smiling | 2 | 1921 | 0.0380 | 101.665 |
Open File containing data with genres
data_w_genre = pd.read_csv('data_w_genres.csv')
data_w_genre.head()
Output:
 | genres | artists | acousticness | danceability | duration_ms | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence | popularity | key | mode | count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | [‘show tunes’] | “Cats” 1981 Original London Cast | 0.590111 | 0.467222 | 250318.555556 | 0.394003 | 0.011400 | 0.290833 | -14.448000 | 0.210389 | 117.518111 | 0.389500 | 38.333333 | 5 | 1 | 9 |
1 | [] | “Cats” 1983 Broadway Cast | 0.862538 | 0.441731 | 287280.000000 | 0.406808 | 0.081158 | 0.315215 | -10.690000 | 0.176212 | 103.044154 | 0.268865 | 30.576923 | 5 | 1 | 26 |
2 | [] | “Fiddler On The Roof” Motion Picture Chorus | 0.856571 | 0.348286 | 328920.000000 | 0.286571 | 0.024593 | 0.325786 | -15.230714 | 0.118514 | 77.375857 | 0.354857 | 34.857143 | 0 | 1 | 7 |
3 | [] | “Fiddler On The Roof” Motion Picture Orchestra | 0.884926 | 0.425074 | 262890.962963 | 0.245770 | 0.073587 | 0.275481 | -15.639370 | 0.123200 | 88.667630 | 0.372030 | 34.851852 | 0 | 1 | 27 |
4 | [] | “Joseph And The Amazing Technicolor Dreamcoat”… | 0.510714 | 0.467143 | 270436.142857 | 0.488286 | 0.009400 | 0.195000 | -10.236714 | 0.098543 | 122.835857 | 0.482286 | 43.000000 | 5 | 1 | 7 |
Observations about the dataset:
- This dataset is at the artist level: each row aggregates average audio features across an artist's tracks, along with the artist's genres.
Let's check the data types of this dataset:
data_w_genre.dtypes
data_w_genre['genres'].values[0]
Step 2: Feature Engineering
Step 3: Connect to our Postgres Database
- Connect to our Spotify database and run a query to extract our popular songs (from the past 7 days) to build a recommendation system from.