Spotify ETL Project


Extract Transform Load Project using Spotify API!

Link to Github Repository

  • Work in Progress

Goals:

  • Learn about the ETL process using Python
  • Create databases and design a warehouse/tables using a star schema
  • Write more complex SQL queries (joins, functions, subqueries)
  • Work with APIs (Spotify in this project)
  • Work with Apache Airflow to automate and schedule tasks
  • Learn about machine learning and recommending songs (not finished yet)

We are working with the Spotify API, so we’re going to use the Spotipy library.

Project Scope:

  • We’re going to build an ETL pipeline in Python that extracts our most recently played music from Spotify into a PostgreSQL database.

Tools used: Python (Spotipy, pandas, psycopg2, SQLAlchemy), PostgreSQL, and Apache Airflow.

Create an App in Spotify Developer to use

The first step to working with the Spotify API is to create a new app: go to the Spotify Developer Dashboard and follow the steps below.

  1. Copy the Spotify Client ID down
  2. Copy the Spotify Client Secret down
  3. Go to Edit Settings and under Redirect URIs put http://localhost and then save
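
To avoid pasting these secrets straight into the script later, one option (a sketch, assuming you export hypothetical SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET variables in your shell) is to read them from the environment:

import os

# Read the app credentials from environment variables instead of hardcoding them
spotify_client_id = os.environ["SPOTIFY_CLIENT_ID"]
spotify_client_secret = os.environ["SPOTIFY_CLIENT_SECRET"]
spotify_redirect_url = "http://localhost"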

Create Database and populate with our Tables in Postgres

First, create a database in Postgres to store our tables in; we named ours Spotify, but call it whatever you’d like.

CREATE DATABASE Spotify;

You can spend time looking at the different responses for different API calls; for this project we will make 4 tables to store the data.

  • We’re using a star schema for this database, since it is the most widely used design for data warehouses. We will create a fact table named spotify_track and three dimension tables named spotify_album, spotify_artists, and spotify_audio_analysis. The fact table references the spotify_album table on album_id, the spotify_artists table on artist_id, and the spotify_audio_analysis table on song_id.

"Insert E/R Diagram"

CREATE TABLE  spotify_track(
    unique_id TEXT PRIMARY KEY NOT NULL,
    song_id TEXT NOT NULL,
    song_name TEXT,
    duration_ms INTEGER,
    url TEXT,
    date_time_played TIMESTAMP,
    album_id TEXT,
    artist_id TEXT,
    date_time_inserted TIMESTAMP DEFAULT CURRENT_TIMESTAMP -- time we insert it into the table
    );

CREATE TABLE spotify_album(
    album_id TEXT NOT NULL PRIMARY KEY,
    album_name TEXT,
    release_date TEXT,
    url TEXT,
    image_url TEXT
    );

CREATE TABLE spotify_artists(
    artist_id TEXT PRIMARY KEY,
    artist_name TEXT,
    url TEXT);

CREATE TABLE  spotify_audio_analysis(
    song_id TEXT NOT NULL PRIMARY KEY,
    acousticness decimal(5,4),
    danceability decimal(5,3),
    energy decimal(5,3),
    liveness decimal(5,3),
    loudness decimal(5,2),
    tempo decimal(5,2),
    valence decimal(5,3),
    speechiness decimal(5,5),
    instrumentalness decimal(5,5) );

Now that we’ve created our database and tables, we can start writing a Python script to work with the Spotify API!

Add Libraries needed!

import spotipy
from spotipy.oauth2 import SpotifyOAuth
import pandas as pd
import sys
import psycopg2
from sqlalchemy import create_engine

We are going to use several Python libraries, including spotipy, which works with the Spotify API!

For our database we’re going to use PostgreSQL, so we will use psycopg2 and sqlalchemy to work with the database inside Python! The last library we will use is pandas, which lets us reshape and clean our data!

Python Code to extract Spotify Data

Connect to Spotify via API

Here we connect to the Spotify API using our client ID and client secret. For our project we call current_user_recently_played, which returns the 50 most recently played songs.

# Creating Instance to work with Spotify Api
spotify_client_id = " " ## ADD HERE
spotify_client_secret = ""     ## ADD HERE 
spotify_redirect_url = "http://localhost"  

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=spotify_client_id,
                                                client_secret=spotify_client_secret,
                                                redirect_uri=spotify_redirect_url,
                                                scope="user-read-recently-played"))

# Will return 50 most recent played songs
recently_played = sp.current_user_recently_played(limit=50)

The response comes back to us as a JSON object.

Note: Using a JSON converter to organize the JSON is amazing! JSON Converter

Otherwise it’s super confusing.
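
If you’d rather stay in Python, a quick way to inspect the structure (a sketch using the standard json module) is to pretty-print the response:

import json

# Pretty-print the nested response; slice it to keep the output manageable
print(json.dumps(recently_played, indent=2)[:2000])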

Iterate through JSON, Start Creating our Tables

After looking at the JSON, we can parse through the data. Here I am using a list for each category, appending to it as we iterate through the JSON response. At the end we combine all the lists into a dictionary keyed by column name.

# Creating the Album Data Structure:
album_dict = {}
album_id = []
album_name = []
album_release_date = []
album_url = []
album_image = []

for row in recently_played['items']:

    album_id.append(row['track']['album']['id'])
    album_name.append(row['track']['album']['name'])
    album_release_date.append(row['track']['album']['release_date'])
    album_url.append(row['track']['album']['external_urls']['spotify'])
    album_image.append(row['track']['album']['images'][1]['url'])

album_dict = {'album_id':album_id,'album_name':album_name,'release_date':album_release_date,'url':album_url, 'image_url': album_image }


# Creating the Artist Data Structure !

artist_dict = {}
id_list = []
name_list = []
url_list = []

for row in recently_played['items']:

    id_list.append(row['track']['artists'][0]['id'])
    name_list.append(row['track']['artists'][0]['name'])
    url_list.append(row['track']['artists'][0]['external_urls']['spotify'])

artist_dict = {'artist_id':id_list,'artist_name':name_list,'url':url_list} # Combine
 
# For Track Data Structure 

song_dict = {}
song_id = []
song_name = []
song_duration = []
song_url = []
song_time_played = []
album_id = []
song_element = []
artist_id = []

for row in recently_played['items']:

    song_id.append(row['track']['id'])
    song_name.append(row['track']['name'])
    song_duration.append(row['track']['duration_ms'])
    song_url.append(row['track']['external_urls']['spotify'])

    song_time_played.append(row['played_at'])
    album_id.append(row['track']['album']['id'])
    artist_id.append(row['track']['artists'][0]['id']) # track artists, to match the spotify_artists table

song_dict = {'song_id':song_id,'song_name':song_name,'duration_ms':song_duration,'url':song_url,
                    'date_time_played':song_time_played,'album_id':album_id,
                    'artist_id':artist_id
                    }

Convert Dictionaries to Dataframes

Now we have extracted data from Spotify into 3 Dictionaries.

We will look at the audio analysis table later…

Now we can convert the dictionaries to DataFrames using the pandas library and drop any duplicates using the drop_duplicates function.

Album_df = pd.DataFrame.from_dict(album_dict)
Album_df = Album_df.drop_duplicates(subset=['album_id'])

Artist_df = pd.DataFrame.from_dict(artist_dict)
Artist_df = Artist_df.drop_duplicates(subset=['artist_id'])

Song_df = pd.DataFrame.from_dict(song_dict)

Adjust the Song Dataframe

For Song_df, after we convert it to a DataFrame we have to adjust a few things:

  1. The time a song was played at is returned as “2022-04-25T17:53:17.481Z”; in order to convert this we must:

a. Use to_datetime, which is a function that converts a scalar, array-like, Series or DataFrame/dict-like to a pandas datetime object, which is a Timestamp.

From 2022-04-28T20:46:55.013Z to -> 2022-04-28 20:46:55.013000+00:00

b. Convert to the Eastern Time Zone

From 2022-04-28 20:46:55.013000+00:00 to -> 2022-04-28 16:46:55.013000-04:00

c. Remove the timezone part from the date/time/timezone.

From 2022-04-28 16:46:55.013000-04:00 to -> 2022-04-28 16:46:55.01300

d. Convert back to datetime datatype

e. Let’s create a Unix timestamp for the time played, which will help us build a unique identifier and adds another column to our dataframe.

f. Now we can create a value that is unique for each time a song is played! We name the new column unique_id; its first half is our song_id and its second half is the UNIX value we just created above.

Example: 7lQ8MOhq6IN2w8EYcFNSUk + 1651165753 = 7lQ8MOhq6IN2w8EYcFNSUk-1651165753

g. Finally, reorder the dataframe columns to match our spotify_track table.

#a
Song_df['date_time_played'] = pd.to_datetime(Song_df['date_time_played'])

#b
Song_df['date_time_played'] = Song_df['date_time_played'].dt.tz_convert('US/Eastern')

#c
Song_df['date_time_played'] = Song_df['date_time_played'].astype(str).str[:-7]

#d
Song_df['date_time_played'] = pd.to_datetime(Song_df['date_time_played'])

#e
Song_df['UNIX_Time_Stamp'] = (Song_df['date_time_played'] - pd.Timestamp("1970-01-01"))//pd.Timedelta('1s')

#f
Song_df['unique_id'] = Song_df['song_id'] + "-" + Song_df['UNIX_Time_Stamp'].astype(str)

#g
Song_df = Song_df[['unique_id','song_id','song_name','duration_ms','url','date_time_played','album_id','artist_id']]
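
As an aside, steps b–d can also be collapsed into one chain with tz_localize(None) (a sketch; note this keeps the full microseconds rather than trimming to five digits like the string slice above):

# b-d in one go: convert to Eastern time, then drop the timezone offset
Song_df['date_time_played'] = (
    pd.to_datetime(Song_df['date_time_played'])
      .dt.tz_convert('US/Eastern')
      .dt.tz_localize(None)   # removes the -04:00 suffix, keeps wall-clock time
)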

Connect to Postgres Database

Now with our Dataframes cleaned and ready, we can connect to our Spotify Database that we made earlier and populate it with the data from our Dataframes.

Let’s first connect to our Postgres database using psycopg2, where:

  1. host = localhost (on my laptop)

  2. dbname = name of the database we made above (Spotify)

  3. port = whatever port your Postgres instance is set to

  4. user = whatever user you used in your Postgres database for the DB you made above

In engine = create_engine('postgresql+psycopg2://@/') we fill in the blanks to get postgresql+psycopg2://user@host:port/dbname, substituting your own user, host, port, and dbname!

More on Psycopg2 Library here

conn = psycopg2.connect(host = "",user = "", port="", dbname = "")
cur = conn.cursor()

engine = create_engine('postgresql+psycopg2://@/') 
conn_eng = engine.raw_connection()
cur_eng = conn_eng.cursor()

Add Data to SQL Tables

Now that we are connected to our Postgres database, we can use the engine to execute some SQL queries.

A strategy we can use to load the dataframes into the SQL tables is to create a temp table (which mirrors its production table) and then insert from the temp table into the actual table.

In CRUD terms (Create, Read, Update, Delete), we create a temp table, stage our rows in it, and then left-join it against the actual table so that only rows we haven’t stored yet get inserted. We will use this method below for each of our tables; a reusable sketch of the pattern comes first.
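
Since we repeat this pattern four times, here is a sketch of a reusable helper (a hypothetical function of ours, not from any library; assumes SQLAlchemy 1.4+ for exec_driver_sql and a pandas version that accepts a Connection):

def upsert_via_temp(df, table, key, engine):
    """Stage df in a temp table, then insert only rows whose key is not
    already in the production table. Sketch only; columns missing from df
    are staged as NULL, so column DEFAULTs won't fire for them."""
    tmp = f"temp_{table}"
    # One connection/transaction so the session-scoped temp table stays visible
    with engine.begin() as conn:
        conn.exec_driver_sql(f"CREATE TEMP TABLE {tmp} AS SELECT * FROM {table} LIMIT 0")
        df.to_sql(tmp, con=conn, if_exists="append", index=False)
        conn.exec_driver_sql(f"""
            INSERT INTO {table}
            SELECT {tmp}.* FROM {tmp}
            LEFT JOIN {table} USING ({key})
            WHERE {table}.{key} IS NULL
        """)
        conn.exec_driver_sql(f"DROP TABLE {tmp}")

# e.g. upsert_via_temp(Song_df, "spotify_track", "unique_id", engine)

The step-by-step version of the same idea follows.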

# Importing the Song_df into the SQL table

cur_eng.execute(
"""
CREATE TEMP TABLE temp_track AS SELECT * FROM spotify_track LIMIT 0
""")
Song_df.to_sql("temp_track", con = engine, if_exists='append', index = False)

# Now we can move the data from the temp table into our actual table

cur.execute(
"""
INSERT INTO spotify_track (unique_id, song_id, song_name, duration_ms, url,
                           date_time_played, album_id, artist_id)
    SELECT temp_track.unique_id, temp_track.song_id, temp_track.song_name,
           temp_track.duration_ms, temp_track.url, temp_track.date_time_played,
           temp_track.album_id, temp_track.artist_id
    FROM   temp_track
    LEFT   JOIN spotify_track USING (unique_id)
    WHERE  spotify_track.unique_id IS NULL;
    -- listing the columns explicitly lets date_time_inserted fall back to its DEFAULT

    DROP TABLE temp_track
""")
conn.commit()

We can do the same for the other two tables.

cur_eng.execute(
"""
CREATE TEMP TABLE  temp_album AS SELECT * FROM spotify_album LIMIT 0
""")
Album_df.to_sql("temp_album", con = engine, if_exists='append', index = False)
conn_eng.commit() # running a query!
#Moving from Temp Table to Production Table
cur.execute(
"""
INSERT INTO spotify_album
SELECT temp_album.*
FROM   temp_album
LEFT   JOIN spotify_album USING (album_id)
WHERE  spotify_album.album_id IS NULL;

DROP TABLE temp_album""")
conn.commit()


# Artists Table
cur_eng.execute(
"""
CREATE TEMP TABLE temp_artist AS SELECT * FROM spotify_artists LIMIT 0
""")
Artist_df.to_sql("temp_artist", con = engine, if_exists='append', index = False) # command
conn_eng.commit()
#Moving data from temp table to production table
cur.execute(
"""
INSERT INTO spotify_artists
SELECT temp_artist.*
FROM   temp_artist
LEFT   JOIN spotify_artists USING (artist_id)
WHERE  spotify_artists.artist_id IS NULL;

DROP TABLE temp_artist""")
conn.commit()

Now we can move on to the remaining table in our database.

Spotify Audio Analysis

In a similar process we will connect to the Spotify API again, use the recently played tracks to grab the song_id and song_name, and place them into lists. We will then loop through the list of song IDs and look up each song with the .audio_features function from the Spotipy library, extracting audio features including acousticness, danceability, liveness, and energy for each song.

# Connect to Spotify Again
spotify_client_id = " " ## ADD HERE
spotify_client_secret = ""     ## ADD HERE 
spotify_redirect_url = "http://localhost"  

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=spotify_client_id,
                                                client_secret=spotify_client_secret,
                                                redirect_uri=spotify_redirect_url,
                                                scope="user-read-recently-played"))

recently_played = sp.current_user_recently_played(limit=50) 

# Return dataframe of song_id
song_id = []
song_name = []

for row in recently_played['items']:

    song_id.append(row['track']['id'])
    song_name.append(row['track']['name'])

track_analysis = {}
song_id_list = []
acousticness = []
danceability = []
energy = []
liveness = []
loudness = []
tempo = []
valence = []
instrumentalness = []
speechiness = []

for song in song_id:

    analysis_ = sp.audio_features(song)
    song_id_list.append(song)
    # Iterate through analysis and add columns to list!!!
    acousticness.append(analysis_[0]['acousticness'])
    danceability.append(analysis_[0]['danceability'])
    energy.append(analysis_[0]['energy'])
    liveness.append(analysis_[0]['liveness'])
    loudness.append(analysis_[0]['loudness'])
    tempo.append(analysis_[0]['tempo'])
    valence.append(analysis_[0]['valence'])
    speechiness.append(analysis_[0]['speechiness'])
    instrumentalness.append(analysis_[0]['instrumentalness'])
    

track_analysis = {'song_id':song_id_list,'acousticness':acousticness,'danceability':danceability, 'energy': energy,'liveness': liveness, 'loudness' : loudness, 'tempo': tempo,'valence': valence, 'instrumentalness': instrumentalness, 'speechiness': speechiness }

# Convert from dictionary to DataFrame using pandas, then drop duplicate song_ids
# (the default keep='first' retains one row per song; keep=False would throw away
#  every song that appears more than once)

track_analysis_df = pd.DataFrame.from_dict(track_analysis)
track_analysis_df = track_analysis_df.drop_duplicates(subset=['song_id'])
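
Calling audio_features once per song makes 50 round trips; Spotipy’s audio_features also accepts a list of up to 100 track IDs, so the same table can be built in a single request (a sketch):

# De-duplicate the ids while preserving order, then fetch all features at once
unique_ids = list(dict.fromkeys(song_id))
features = sp.audio_features(unique_ids)   # one dict per id (None if Spotify has no analysis)

track_analysis_df = pd.DataFrame([f for f in features if f])[
    ['id', 'acousticness', 'danceability', 'energy', 'liveness',
     'loudness', 'tempo', 'valence', 'speechiness', 'instrumentalness']
].rename(columns={'id': 'song_id'})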

Add Audio Features to Postgres Database

Just like we did for the other 3 tables, we will connect to the Postgres database and add the audio features to the table.

'''Connect to our Postgres Database'''
conn = psycopg2.connect(host = "localhost", user = "devinpowers",port="5433", dbname = "spotify")
cur = conn.cursor()
engine = create_engine('postgresql+psycopg2://devinpowers@localhost:5433/spotify') # this is 
conn_eng = engine.raw_connection()
cur_eng = conn_eng.cursor()


cur_eng.execute(
"""
CREATE TEMP TABLE IF NOT EXISTS tmp_spotify_audio_new AS SELECT * FROM spotify_audio_analysis LIMIT 0
""")

track_analysis_df.to_sql("tmp_spotify_audio_new", con = engine, if_exists='append', index = False)

#Moving data from temp table to production table
cur.execute(
"""
    INSERT INTO spotify_audio_analysis
    SELECT * FROM tmp_spotify_audio_new
    WHERE tmp_spotify_audio_new.song_id NOT IN (SELECT song_id FROM spotify_audio_analysis );

    DROP TABLE tmp_spotify_audio_new
    """)
conn.commit()

Now we have extracted and transformed all of our data from Spotify and loaded it into our Postgres database!!

SQL Queries to Extract information from our Spotify Database

With our database full of information, we can run queries (and turn them into functions) that show statistics from our most recently played music (e.g. our top 10 artists played over the past 7 days).

Let’s look at some sample queries that we can run against our database.

Top 10 Albums Played this week

Return the top 10 Albums that I listened to this week and a count of the number of times I listened to the album!

SELECT artist.artist_name, album.album_name, COUNT(track.*) AS number_of_plays
FROM spotify_track AS track
INNER JOIN spotify_album AS album ON track.album_id = album.album_id
INNER JOIN spotify_artists AS artist ON track.artist_id = artist.artist_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY album.album_name, artist.artist_name
ORDER BY number_of_plays DESC
LIMIT 10;

Output if we ran the query above:

artist_name             album_name                                  number_of_plays
Outkast                 Stankonia                                   34
Outkast                 Speakerboxxx/The Love Below                 29
The Rolling Stones      Aftermath                                   16
The Rolling Stones      Some Girls                                  14
The Rolling Stones      Exile On Main Street (2010 Re-Mastered)     11
Metallica               Garage, Inc.                                10
Outkast                 ATLiens                                     10
Janis Joplin            Pearl (Legacy Edition)                       9
Red Hot Chili Peppers   Stadium Arcadium                             8
Lil Durk                Broadway Girls (feat. Morgan Wallen)         7
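
These queries don’t have to stay in psql; with the SQLAlchemy engine from earlier, pandas can run them directly and hand back a DataFrame (a sketch):

import pandas as pd

query = """
SELECT artist.artist_name, album.album_name, COUNT(track.*)::INT AS number_of_plays
FROM spotify_track AS track
INNER JOIN spotify_album AS album ON track.album_id = album.album_id
INNER JOIN spotify_artists AS artist ON track.artist_id = artist.artist_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY album.album_name, artist.artist_name
ORDER BY number_of_plays DESC
LIMIT 10;
"""
top_albums_df = pd.read_sql(query, con=engine)  # same result as above, as a DataFrame
print(top_albums_df)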

Top 10 Artists Played this week

Returns the Top 10 Artists I listened to this week and a count of the number of times I listened to them!

SELECT artist.artist_name, COUNT(track.*) AS number_of_plays
FROM spotify_track AS track
INNER JOIN spotify_artists AS artist ON track.artist_id = artist.artist_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY artist.artist_name
ORDER BY number_of_plays DESC
LIMIT 10;

Top 5 Most Danceable Songs listened to

Returns the Top 5 songs that had the highest danceability score from the past week.

SELECT st.song_name, sa.artist_name, aas.danceability
FROM spotify_track AS st
INNER JOIN spotify_artists AS sa ON st.artist_id = sa.artist_id
JOIN spotify_audio_analysis AS aas ON st.song_id = aas.song_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY st.song_name, sa.artist_name, danceability
ORDER BY danceability DESC
LIMIT 5;

Return the top 4 albums of the week along with the album URL and the album cover image URL.

SELECT album.album_name, album.url AS album_url, album.image_url AS album_cover, COUNT(track.*)::INT AS number_plays
FROM spotify_track AS track
INNER JOIN spotify_album AS album ON track.album_id = album.album_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY album.album_name, album.url, album.image_url
ORDER BY number_plays DESC
LIMIT 4;

Total Time

Returns the total time spent listening to Spotify in the past week, in hours.

SELECT ROUND(SUM(CAST (duration_ms AS decimal)/3600000),2) AS total_hours_listened
FROM spotify_track
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days';

Returns the 5 most popular songs listened to this week along with the artist name and the count of times played.

SELECT st.song_name, sa.artist_name, COUNT(st.*) AS times_played
FROM spotify_track AS st
INNER JOIN spotify_artists AS sa 
ON st.artist_id = sa.artist_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY st.song_name, sa.artist_name
ORDER BY times_played DESC
LIMIT 5;

5 Loudest Songs of the Week

Return the top 5 loudest songs; loudness is measured in dB (values closer to 0 are louder), so we sort the loudness column in descending order.

SELECT st.song_name, sa.artist_name, aas.loudness
FROM spotify_track AS st
INNER JOIN spotify_artists AS sa ON st.artist_id = sa.artist_id
JOIN spotify_audio_analysis AS aas ON st.song_id = aas.song_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY st.song_name, sa.artist_name, loudness
ORDER BY loudness DESC
LIMIT 5;

Top 5 Good Feeling Songs listened to the last 7 days (High Valence)

Return the top 5 feel-good songs of the week!

SELECT st.song_name, sa.artist_name, aas.valence
FROM spotify_track AS st
INNER JOIN spotify_artists AS sa ON st.artist_id = sa.artist_id
JOIN spotify_audio_analysis AS aas ON st.song_id = aas.song_id
WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
GROUP BY st.song_name, sa.artist_name, valence
ORDER BY valence DESC
LIMIT 5;

Top Decades

Let’s create a view in SQL that uses CASE statements to find what decade our song(s) were released in.

We can transform our release_date column with:

CAST(SPLIT_PART(release_date,'-',1) AS INT), which trims dates from YYYY-MM-DD down to YYYY and saves the year as an integer.

Where:

PostgreSQL CAST is used to convert from one data type into another.

PostgreSQL SPLIT_PART is used to split a string -> SPLIT_PART(string, delimiter, position), where string is the string to be split, delimiter is the string used as the delimiter for splitting, and position is the part to return. For example, SPLIT_PART('2010-04-26', '-', 1) returns '2010'.

Then we wrap this in a subquery that the view selects over.

CREATE OR REPLACE VIEW track_decades AS
SELECT *,
CASE 
	WHEN subquery.release_year >= 1940 AND subquery.release_year <= 1949  THEN '1940''s'
	WHEN subquery.release_year >= 1950 AND subquery.release_year <= 1959  THEN '1950''s'
	WHEN subquery.release_year >= 1960 AND subquery.release_year <= 1969  THEN '1960''s'
	WHEN subquery.release_year >= 1970 AND subquery.release_year <= 1979  THEN '1970''s'
	WHEN subquery.release_year >= 1980 AND subquery.release_year <= 1989  THEN '1980''s'
	WHEN subquery.release_year >= 1990 AND subquery.release_year <= 1999  THEN '1990''s'
	WHEN subquery.release_year >= 2000 AND subquery.release_year <= 2009  THEN '2000''s'
	WHEN subquery.release_year >= 2010 AND subquery.release_year <= 2019  THEN '2010''s'
	WHEN subquery.release_year >= 2020 AND subquery.release_year <= 2029  THEN '2020''s'
ELSE 'Other'
END AS decade

FROM 

(SELECT album.album_id,album.album_name,album.release_date,track.unique_id,track.date_time_played,track.song_name,CAST(SPLIT_PART(release_date,'-',1) AS INT) AS release_year
FROM spotify_album AS album
INNER JOIN spotify_track AS track ON track.album_id = album.album_id) AS subquery;

Turning Queries into Functions

We can turn any query into a function (stored procedure) that we can reuse later when we’re sending out our email.

Use :: in Postgres to cast from one datatype to another; here COUNT(track.*)::INT casts the count to an INT.

CREATE FUNCTION function_last_7_days_artist_played() 
RETURNS TABLE (name TEXT, number_plays INT) LANGUAGE plpgsql AS $$ 
BEGIN 
  RETURN query 
 SELECT art.artist_name, COUNT(track.*)::INT AS number_plays
    FROM  spotify_track AS track
    INNER JOIN spotify_artists AS art ON track.artist_id=art.artist_id
    WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
    GROUP BY art.artist_name
    ORDER BY number_plays DESC
    LIMIT 5;
end;$$ 

Creating a Email to Send

In the next step of the ETL project we can write a Python script that will send us an email with our stats from the week. You can learn about sending emails from the Automate the Boring Stuff page.

To send an email via smtplib, you have to create a Google App Password, which gives you a generated password your script uses to log in to Gmail.
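
A minimal login sanity check looks like this (a sketch; 465 is Gmail’s SMTP-over-SSL port, and the credentials below are placeholders for your own address and generated app password):

import smtplib, ssl

context = ssl.create_default_context()
with smtplib.SMTP_SSL("smtp.gmail.com", 465, context=context) as server:
    server.login("you@gmail.com", "your-app-password")  # app password, not your account password
    print("Login OK")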

Connecting to Postgres Database

First we must connect to our database and then we can use the functions and run them on our database.

import psycopg2
conn = psycopg2.connect(host = "localhost", user = "devinpowers",port="5433", dbname = "spotify")
cur = conn.cursor()

Create lists to store information

Once connected to our database we can use the callproc function to run our functions (saved queries) against the database and store the returned values in lists.

We use the fetchall() function to pull each query’s output rows into lists, then print them out using tabulate.

from datetime import datetime, timedelta

today = datetime.today().date()
six_days_ago = today - timedelta(days=6)

#Top 5 Songs by Time Listened (MIN)
top_5_songs_min = []

cur.callproc('function_last_7_days_top_5_songs_duration')
for row in cur.fetchall():
    song_name = row[0]
    min_listened = float(row[1])
    element = [song_name,min_listened]
    top_5_songs_min.append(element)


# Finding the time in hours spent listening the past 7 days
cur.callproc('function_last_7_days_hrs_listened_')
total_time_listened_hrs = float(cur.fetchone()[0]) # Convert to float 


#Top 5 Songs and Artists by Times Played
top_songs_art_played = []
cur.callproc('function_last_7_days_songs_artist_played')
for row in cur.fetchall():
    song_name = row[0]
    artist_name = row[1]
    times_played = int(row[2])
    element = [song_name,artist_name,times_played]
    top_songs_art_played.append(element)


#Top Decades:
top_decade_played = []
cur.callproc('function_last_7_days_top_decades')
for row in cur.fetchall():
    decade = row[0]
    times_played = int(row[1])
    element = [decade,times_played]
    top_decade_played.append(element)

#Top Artists Played
top_art_played = []
cur.callproc('function_last_7_days_artist_played')
for row in cur.fetchall():
    artist_name = row[0]
    times_played = int(row[1])
    element = [artist_name,times_played]
    top_art_played.append(element)


#Collecting Top Albums of the Week
album_url = []
track_url= []
album_name = []
cur.callproc('function_albums')
for row in cur.fetchall():
    
    album_name.append(row[0])
    album_url.append(row[1])
    track_url.append(row[2])

Creating the Email

The email body is formatted using HTML. Note that the HTML below also references top_albums_played, top_dance_songs, dance_album_url, and dance_album_cover, which are collected with additional callproc calls just like the lists above.

import smtplib,ssl
import json
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from tabulate import tabulate


#Sending the Email:
port = 465  # Gmail's SMTP-over-SSL port
password = ""

sender_email = "devinjpowers@gmail.com"
receiver_email = "powers88@msu.edu"

message = MIMEMultipart("alternative")
message["Subject"] = f"Spotify - Weekly Roundup for - {today}"
message["From"] = sender_email
message["To"] = receiver_email


text = f"""\
Here are your stats for your weekly round up for Spotify. 
Dates included: {six_days_ago} - {today}:

"""
html = f"""\
<html>
    <body>
        <h4>
        Hello Mr.Powers, here are your stats for your weekly round up for Spotify.
        </h4>
        <h4>
        Dates included: {six_days_ago} - {today}
        </h4>
        <h4>
        Total Time Listened: {total_time_listened_hrs} hours.
        </h4>
        <h4>
        Here are your Top Albums of the Week:
        </h4>
        <br>
        1. {album_name[0]}
        <br>
        2. {album_name[1]}
        <br>
        3. {album_name[2]}
        <br>
        4. {album_name[3]} 
        </br>
        
        <a href = {album_url[0]}>
            <img src= {track_url[0]}  alt="idk" width="300" height="300" class = "center">
        </a>
        
        <a href = {album_url[1]}>
        <img src= {track_url[1]}  alt="idk" width="300" height="300" class = "center">
        </a>
        
        <a href = {album_url[2]}>
        <img src= {track_url[2]}  alt="idk" width="300" height="300" class = "center">
        </a>

        <a href = {album_url[3]}>
        <img src= {track_url[3]}  alt="idk" width="300" height="300" class = "center">
        </a>

        <p>

        <h4>
        You listened to these songs and artists a lot here are your top 5!
        </h4>
        {tabulate(top_songs_art_played, headers = ["Song Name", "Artist Name", "Times Played"], tablefmt='html')}
        <h4>
        You spend a lot of time listening to these songs!
        </h4>
        {tabulate(top_5_songs_min, headers = ['Song Name', 'Time (Min)'], tablefmt='html')}
        <h4>
        Here are your Top 10 played artists!
        </h4>
        {tabulate(top_art_played, headers = ['Artist Name','Times Played'], tablefmt='html')}
        
        <h4>
        Here are your Top 10 most played albums!
        </h4>
        {tabulate(top_albums_played, headers = ['Album Name', 'Times Played'], tablefmt='html')}
        <h4>
        Here are your top decades:
        </h4>
        {tabulate(top_decade_played, headers = ['Decade','Times Played'], tablefmt='html')}
            <h4>
            Lastly your top danceable songs are here as the following:
        </h4>
        {tabulate(top_dance_songs, headers = ['Song Name', 'Artist', 'Danceability'], tablefmt='html')}
            <h4>
            Lets Dance
            </h4>
            <a href = {dance_album_url[0]}>
            <img src= {dance_album_cover[0]}  alt="idk" width="300" height="300" class = "center">
        </a>
        </p>
        <br>
            <h4>
            Machine Learning using recent played music to recommend new music....coming soon... 
            </h4>
            </br>
        
        
    </body>
</html>"""


# Send Email!
part1 = MIMEText(text,"plain")
part2 = MIMEText(html,"html")


message.attach(part1)
message.attach(part2)

context = ssl.create_default_context()
with smtplib.SMTP_SSL("smtp.gmail.com",port,context = context) as server:
    server.login("devinjpowers@gmail.com",password)
    server.sendmail(sender_email,receiver_email,message.as_string())

When you click one of the 4 album cover images in the email, a hyperlink will send you to the album’s Spotify page! Check it out below:

Example Email Send:

Email

"insert"

Tweet your Top Played Songs

In order to tweet via a Twitter account, you need to set up a Twitter Developer account, create an app, and grab the API keys and tokens.

import psycopg2
import smtplib,ssl
import json
from tabulate import tabulate
from datetime import datetime, timedelta

#Importing the module for Twitter!
import tweepy

# Keep Secret!
consumer_key=""
consumer_secret_key=""
access_token = ""
access_token_secret = ""

# Connect to my database and then run Queries over
conn = psycopg2.connect(host = "localhost", user = "devinpowers",port="5433", dbname = "spotify")

cur = conn.cursor()


today = datetime.today().date()


# Top 5 Songs I listened to this past Week

top_5_songs_min = []



# callproc: this method calls the stored procedure named by the proc_name argument.
cur.callproc('function_last_7_days_top_5_songs_duration')
for row in cur.fetchall():
    song_name = row[0]
    min_listened = float(row[1])
    element = [song_name,min_listened]
    
    top_5_songs_min.append(element)

# Find Album cover photo

album_url = []
track_url= []
album_name = [] 
cur.callproc('function_albums')
for row in cur.fetchall():
    
    album_name.append(row[0])
    album_url.append(row[1])
    track_url.append(row[2])



# list to individual datatypes to store one song (so need 5)

song1 =  top_5_songs_min[0][0]
song2 =  top_5_songs_min[1][0]
song3 =  top_5_songs_min[2][0]
song4 =  top_5_songs_min[3][0]
song5 =  top_5_songs_min[4][0]



# Call function to return total time listened to this past week!!


cur.callproc('function_last_7_days_hrs_listened_')
total_time_listened_hrs = float(cur.fetchone()[0])


# Twitter Stuff!!

auth = tweepy.OAuthHandler(consumer_key, consumer_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

tweet1 = f"""\
This week I've listened to {total_time_listened_hrs} hours of music! Here are my top 5 songs of the week!

1. {song1}
2. {song2}
3. {song3}
4. {song4}
5. {song5}
"""


tweet2 =  f"""\
Link to Top Album of the Week

{album_url[0]}

"""

# List of tweets to send
tweets = [tweet1, tweet2]

try:
    for tweet in tweets:
        api.update_status(tweet)
        print("Posted!")

except tweepy.error.TweepError as e:
    print(e)

Tweet Output:

Automate with Airflow

We can automate all of our scripts using Apache Airflow!

Note: I haven’t really tested these…

from datetime import timedelta
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

from spotify_etl_pipeline import etl_func1, etl_func2 # assuming we name our functions etl_func1 and etl_func2


my_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['devinjpowers@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
my_dag = DAG(
    'spotify_dag',
    default_args = my_args,
    description= 'Spotify ETL Process',
    schedule_interval= '0 14 * * *'
)


run_etl1 = PythonOperator(
    task_id='spotify_etl_1',
    python_callable= etl_func1,
    dag=my_dag
)
run_etl2 = PythonOperator(
    task_id='spotify_etl_2',
    python_callable= etl_func2,
    dag=my_dag
)

run_etl1 >> run_etl2                          # Doesn't really matter which one gets run first!

Dag to Email

DAG to send our email!

from datetime import timedelta
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

from Spotify_Email import spotify_etl_email

my_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['devinjpowers@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
my_dag = DAG(
    'spotify_etl_email',
    default_args = my_args,
    description= 'Spotify Daily Email',
    schedule_interval= '5 14 * * 0'
)

run_email = PythonOperator(
    task_id='spotify_email_weekly_thing',
    python_callable= spotify_etl_email,
    dag=my_dag
)
run_email

Dag to Tweet

This DAG will run the function that sends the tweet!

from datetime import timedelta
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

from Spotify_Twitter import spotify_etl_python # The function that has the twitter script

my_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(2),
    'email': ['devinjpowers@gmail.com'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
my_dag = DAG(
    'spotify_etl_twitter',
    default_args = my_args,
    description= 'Spotify Weekly Tweet',
    schedule_interval= '5 14 * * 0'
)

# Use PythonOperator to send our program as a Dag!

run_twitter = PythonOperator(
    task_id='twitter_function',
    python_callable= spotify_etl_python,
    dag=my_dag
)
run_twitter

Machine Learning to Create Playlists from our Spotify Recently Played Playlist

Note: this is a work in progress

Let’s add on to our project by using machine learning to recommend songs and including them in our weekly email!

I’ve seen many videos of people using different audio stats from Spotify songs and existing datasets (Kaggle) to build music recommendation systems.

  • For this part of the project I will use cosine similarity, which measures the cosine of the angle between two vectors projected in a multi-dimensional space; a small sketch follows the figure below.

"insert"

Step 0: Explore our Spotify Data

Connect to our Postgres Database

import pandas as pd
import sys
import psycopg2
from sqlalchemy import create_engine


conn = psycopg2.connect(host = "localhost", user = "devinpowers",port="5432", dbname = "spotify")
cur = conn.cursor()
engine = create_engine('postgresql+psycopg2://devinpowers@localhost:5432/spotify')
conn_eng = engine.raw_connection()
cur_eng = conn_eng.cursor()

Run Query to Retrieve All Songs Played in the Past 7 Days

cur = conn.cursor()
  
sql = '''SELECT song_name,COUNT(spotify_track.*)::INT AS times_played, acousticness, danceability, energy, liveness, loudness, tempo, valence, speechiness
        FROM spotify_audio_analysis
        JOIN spotify_track ON spotify_audio_analysis.song_id = spotify_track.song_id
        
        WHERE date_time_played > CURRENT_DATE - INTERVAL '7 days'
        
        GROUP BY song_name, acousticness, danceability, energy, liveness, loudness, tempo, valence, speechiness
        ORDER BY times_played DESC
    
        '''
cur.execute(sql)
results = cur.fetchall()

Work with the dataframe to clean it and prepare it for exploration.

# Convert list in Python to a dataframe using Pandas
df = pd.DataFrame(results)
df.columns = ['Song Name','times_played', 'acousticness', 'danceability', 'energy', 'liveness', 'loudness', 'tempo','valence', 'speechiness']

# Optional: Can save dataframe as a CSV file
#df.to_csv(r"/Users/Devinpowers/Desktop/Spotify-Projects/testing-new.csv", index=False)

## Drop columns: Song Name and Times Played
df = df.drop(columns= ['Song Name', 'times_played'])
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   acousticness  84 non-null     float64
 1   danceability  84 non-null     float64
 2   energy        84 non-null     float64
 3   liveness      84 non-null     float64
 4   loudness      84 non-null     float64
 5   tempo         84 non-null     float64
 6   valence       84 non-null     float64
 7   speechiness   84 non-null     float64
dtypes: float64(8)
memory usage: 5.4 KB

Standardize Data

# Standardize each feature to zero mean and unit variance
df_new = (df - df.mean())/df.std()

df_new.head() 

Output:

	acousticness	danceability	energy	liveness	loudness	tempo	valence	speechiness	label
0	-0.671343	-0.432540	0.720365	-0.708034	0.343690	0.040517	-1.230519	-0.327739	0.037475
1	-0.687017	-0.155617	0.594302	-0.716424	-0.323584	0.239459	0.240546	-0.311314	0.037475
2	-0.699557	-0.437669	1.263410	-0.183638	1.052299	0.596373	-0.120128	-0.353524	0.037475
3	-0.669103	0.300794	0.827035	0.105828	0.249208	-0.324517	0.471540	-0.341688	0.037475
4	-0.566994	1.649514	0.109440	-0.179443	1.323933	-0.729783	0.187863	-0.359562	-1.311636
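
For reference, scikit-learn’s StandardScaler does the same kind of standardization (a sketch; the numbers differ very slightly because pandas’ std() uses the sample convention, ddof=1, while StandardScaler uses the population one, ddof=0):

from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()
df_new = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df_new.head()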

Let’s use the Plotly package

import plotly.express as px
fig = px.scatter_matrix(df, width=1200, height=1600)
fig.show()

Output:

"insert"

Let’s Create a Scatter Plot with 4 Features

fig1 = px.scatter(df, x="energy", y="liveness", color="tempo",size="valence")
fig1.update_layout(title="4 Features Representation")
fig1.show()

Output:

"insert"

Let’s Create a Scatter Plot with 5 Features

fig2 = px.scatter_3d(df, x="energy", y="liveness", z="loudness",color="tempo",size="valence")
fig2.update_layout(title="5 Features Representation")
fig2.show()

Output:

"insert"

Another Plot

from sklearn.cluster import KMeans

X = df_new
kmeans = KMeans(
        n_clusters=8, init="k-means++",
        n_init=10,
        tol=1e-04, random_state=42
    )

kmeans.fit(X)

clusters=pd.DataFrame(X,columns=X.columns)

clusters['label']=kmeans.labels_

polar=clusters.groupby("label").mean().reset_index()

polar=pd.melt(polar,id_vars=["label"])

fig4 = px.line_polar(polar, r="value", theta="variable", color="label", line_close=True,height=800,width=1400)
fig4.show()

Output:

"insert"

Step 1: Data Preparation

We will be using a dataset from Kaggle that you can “download here”. Now we can explore this dataset and prepare it to build our machine learning model.

Import libraries needed:

import pandas as pd
import numpy as np
import json
import re 

Open the file from Kaggle in a pandas dataframe:

spotify_df = pd.read_csv('data.csv')
spotify_df.head()

Output:

  valence year acousticness artists danceability duration_ms energy explicit id instrumentalness key liveness loudness mode name popularity release_date speechiness tempo
0 0.0594 1921 0.982 [‘Sergei Rachmaninoff’, ‘James Levine’, ‘Berli… 0.279 831667 0.211 0 4BJqT0PrAfrxzMOxytFOIz 0.878000 10 0.665 -20.096 1 Piano Concerto No. 3 in D Minor, Op. 30: III. … 4 1921 0.0366 80.954
1 0.9630 1921 0.732 [‘Dennis Day’] 0.819 180533 0.341 0 7xPhfUan2yNtyFG0cUWkt8 0.000000 7 0.160 -12.441 1 Clancy Lowered the Boom 5 1921 0.4150 60.936
2 0.0394 1921 0.961 [‘KHP Kridhamardawa Karaton Ngayogyakarta Hadi… 0.328 500062 0.166 0 1o6I8BglA6ylDMrIELygv1 0.913000 3 0.101 -14.850 1 Gati Bali 5 1921 0.0339 110.339
3 0.1650 1921 0.967 [‘Frank Parker’] 0.275 210000 0.309 0 3ftBPsC5vPBKxYSee08FDH 0.000028 5 0.381 -9.316 1 Danny Boy 3 1921 0.0354 100.109
4 0.2530 1921 0.957 [‘Phil Regan’] 0.418 166693 0.193 0 4d6HGyGT8e121BsdKmw9v6 0.000002 3 0.229 -10.096 1 When Irish Eyes Are Smiling 2 1921 0.0380 101.665

Open File containing data with genres

data_w_genre = pd.read_csv('data_w_genres.csv')
data_w_genre.head()

Output:

genres artists acousticness danceability duration_ms energy   instrumentalness liveness loudness speechiness tempo valence popularity key mode count
0 [‘show tunes’] “Cats” 1981 Original London Cast 0.590111 0.467222 250318.555556 0.394003 0.011400 0.290833 -14.448000 0.210389 117.518111 0.389500 38.333333 5 1 9
1 [] “Cats” 1983 Broadway Cast 0.862538 0.441731 287280.000000 0.406808 0.081158 0.315215 -10.690000 0.176212 103.044154 0.268865 30.576923 5 1 26
2 [] “Fiddler On The Roof” Motion Picture Chorus 0.856571 0.348286 328920.000000 0.286571 0.024593 0.325786 -15.230714 0.118514 77.375857 0.354857 34.857143 0 1 7
3 [] “Fiddler On The Roof” Motion Picture Orchestra 0.884926 0.425074 262890.962963 0.245770 0.073587 0.275481 -15.639370 0.123200 88.667630 0.372030 34.851852 0 1 27
4 [] “Joseph And The Amazing Technicolor Dreamcoat”… 0.510714 0.467143 270436.142857 0.488286 0.009400 0.195000 -10.236714 0.098543 122.835857 0.482286 43.000000 5 1 7

Observation about Dataset:

  • This dataset is at an artist level, so we see each artist’s audio features aggregated across their tracks rather than per song.

Let’s check the datatypes of this dataset and peek at the genres column:

data_w_genre.dtypes
data_w_genre['genres'].values[0]
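
One wrinkle worth noting before the feature-engineering step: the genres column comes back as a string that merely looks like a list, so we would parse it into a real Python list first (a sketch, assuming the strings are valid Python literals):

import ast

# "['show tunes']" (a string) -> ['show tunes'] (an actual list)
data_w_genre['genres'] = data_w_genre['genres'].apply(ast.literal_eval)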


Step 2: Feature Engineering

Step 3: Connect to our Postgres Database

  • Connect to our Spotify database and run a query to extract our popular songs (from the past 7 days) to build a recommendation system off of

Step 4: Create Playlist Vector

Step 5: Generate Recommendations

Step 6: Email us the Recommendation Playlists