Data Engineering: Data Warehouses

17 minute read

Key Takeaways:

  • Understand Data Warehousing Architecture [x]

  • Run an ETL process to denormalize a database (3NF to star schema) [x]

  • Create an OLAP Cube from Facts and Dimensions [x]

  • Compare Columnar vs. Row Oriented Approaches

Intro to Data Warehouses Part 1: Understanding Architecture of Data Warehouses

  • Create the pagila database from a Jupyter Notebook using the schema and data provided in the Data folder
!PGPASSWORD=student createdb -h 127.0.0.1 -U devinpowers pagila
!PGPASSWORD=student psql -q -h 127.0.0.1 -U devinpowers -d pagila -f Data/pagila-schema.sql
!PGPASSWORD=student psql -q -h 127.0.0.1 -U devinpowers -d pagila -f Data/pagila-data.sql
  • Connect to database pagila

In the Jupyter Notebook, use %load_ext sql to load the SQL magic into the notebook.

%load_ext sql

DB_ENDPOINT = "127.0.0.1"
DB = 'pagila'
DB_USER = 'devinpowers'
DB_PASSWORD = 'student'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

print(conn_string)
%sql $conn_string

Should print out:

postgresql://devinpowers:student@127.0.0.1:5432/pagila

Step 2: Explore 3NF Schema

Schema of Pagila

"insert link to schema"

Let's run some queries on the pagila database:

select count(*) from store;

Output:

2

select count(*) from film;

Output:

1000

select count(*) from country;

Output:

109

select count(*) from payment;

Output:

16049

More Queries

What time period do these payments cover?

select min(payment_date) as start, max(payment_date) as end from payment;

Output:

             start             |              end              
-------------------------------+-------------------------------
 2020-01-24 16:21:56.996577-05 | 2020-05-14 08:44:29.996577-04

When and where are these events taking place?

select district, count(1)
from address
group by district
order by count(1) desc
limit 5;

Output:

   district    | count 
---------------+-------
 Buenos Aires  |    10
 California    |     9
 Shandong      |     9
 West Bengali  |     9
 Uttar Pradesh |     8
(5 rows)

Step 3: Simple Data Analysis

Run some simple queries:

select film_id, title, release_year, rental_rate, rating  from film limit 5;

Output:

 film_id |      title       | release_year | rental_rate | rating 
---------+------------------+--------------+-------------+--------
       1 | ACADEMY DINOSAUR |         2006 |        0.99 | PG
       2 | ACE GOLDFINGER   |         2006 |        4.99 | G
       3 | ADAPTATION HOLES |         2006 |        2.99 | NC-17
       4 | AFFAIR PREJUDICE |         2006 |        2.99 | G
       5 | AFRICAN EGG      |         2006 |        2.99 | G

Get the movie title for every payment:

SELECT f.title, p.amount, p.payment_date, p.customer_id                                            
FROM payment p
JOIN rental r    ON ( p.rental_id = r.rental_id )
JOIN inventory i ON ( r.inventory_id = i.inventory_id )
JOIN film f ON ( i.film_id = f.film_id)
limit 10;

Output:

      title      | amount |         payment_date          | customer_id 
-----------------+--------+-------------------------------+-------------
 SWARM GOLD      |   1.99 | 2020-01-24 16:40:19.996577-05 |         269
 PACKER MADIGAN  |   0.99 | 2020-01-25 10:16:50.996577-05 |         269
 SOMETHING DUCK  |   6.99 | 2020-01-28 16:44:14.996577-05 |         269
 DRACULA CRYSTAL |   0.99 | 2020-01-28 19:58:02.996577-05 |         269
 CLOSER BANG     |   4.99 | 2020-01-29 03:10:06.996577-05 |         269
 DANCES NONE     |   2.99 | 2020-01-31 07:23:14.996577-05 |         269
 STRAIGHT HOURS  |   1.99 | 2020-01-26 00:10:14.996577-05 |         270
 MARRIED GO      |   4.99 | 2020-01-30 23:03:42.996577-05 |         270
 DIVORCE SHINING |   8.99 | 2020-01-31 06:59:15.996577-05 |         271
 GOLDMINE TYCOON |   0.99 | 2020-01-24 21:47:17.996577-05 |         272

Write a query that displays the amount of revenue from each title. Limit the results to the top 10 grossing titles

SELECT f.title, sum(p.amount) as revenue                                           
FROM payment p
JOIN rental r    ON ( p.rental_id = r.rental_id )
JOIN inventory i ON ( r.inventory_id = i.inventory_id )
JOIN film f ON ( i.film_id = f.film_id)
group by f.title
order by revenue desc
limit 10;

Output:

       title       | revenue 
-------------------+---------
 TELEGRAPH VOYAGE  |  231.73
 WIFE TURN         |  223.69
 ZORRO ARK         |  214.69
 GOODFELLAS SALUTE |  209.69
 SATURDAY LAMBS    |  204.72
 TITANS JERK       |  201.71
 TORQUE BOUND      |  198.72
 HARRY IDAHO       |  195.70
 INNOCENT USUAL    |  191.74
 HUSTLER PARTY     |  190.78

Top Grossing Cities!

SELECT ci.city,sum(p.amount) as revenue
FROM payment p
JOIN customer c  ON ( p.customer_id = c.customer_id )
JOIN address a ON ( c.address_id = a.address_id )
JOIN city ci ON ( a.city_id = ci.city_id )
group by ci.city
order by revenue desc
limit 10;

Output:

         city          | revenue 
-----------------------+---------
 Cape Coral            |  221.55
 Saint-Denis           |  216.54
 Aurora                |  198.50
 Molodetno             |  195.58
 Santa Bárbara d'Oeste |  194.61
 Apeldoorn             |  194.61
 Qomsheh               |  186.62
 London                |  180.52
 Ourense (Orense)      |  177.60
 Bijapur               |  175.61

Revenue by month:

SELECT sum(p.amount) as revenue, EXTRACT(month FROM p.payment_date) as month
from payment p
group by month
order by revenue desc
limit 10;

Output:

 revenue  | month 
----------+-------
 28559.46 |     4
 23237.11 |     3
 10281.33 |     2
  4824.43 |     1
   514.18 |     5

Each movie by customer city and by month (data cube)

SELECT f.title, p.amount, p.customer_id, ci.city, p.payment_date,EXTRACT(month FROM p.payment_date) as month
FROM payment p
JOIN rental r    ON ( p.rental_id = r.rental_id )
JOIN inventory i ON ( r.inventory_id = i.inventory_id )
JOIN film f ON ( i.film_id = f.film_id)
JOIN customer c  ON ( p.customer_id = c.customer_id )
JOIN address a ON ( c.address_id = a.address_id )
JOIN city ci ON ( a.city_id = ci.city_id )
order by p.payment_date
limit 10;

Output:

      title      | amount | customer_id |          city          |         payment_date          | month 
-----------------+--------+-------------+------------------------+-------------------------------+-------
 BLANKET BEVERLY |   2.99 |         130 | Águas Lindas de Goiás  | 2020-01-24 16:21:56.996577-05 |     1
 FREAKY POCUS    |   2.99 |         459 | Qomsheh                | 2020-01-24 16:22:59.996577-05 |     1
 GRADUATE LORD   |   3.99 |         408 | Jaffna                 | 2020-01-24 16:32:05.996577-05 |     1
 LOVE SUICIDES   |   4.99 |         333 | Baku                   | 2020-01-24 16:33:07.996577-05 |     1
 IDOLS SNATCHERS |   6.99 |         222 | Jaroslavl              | 2020-01-24 16:33:47.996577-05 |     1
 MYSTIC TRUMAN   |   0.99 |         549 | Santiago de Compostela | 2020-01-24 16:36:33.996577-05 |     1
 SWARM GOLD      |   1.99 |         269 | Salinas                | 2020-01-24 16:40:19.996577-05 |     1
 LAWLESS VISION  |   4.99 |         239 | Ciomas                 | 2020-01-24 17:00:12.996577-05 |     1
 MATRIX SNOWMAN  |   4.99 |         126 | Po                     | 2020-01-24 17:29:06.996577-05 |     1
 HANGING DEEP    |   5.99 |         399 | Okara                  | 2020-01-24 17:30:47.996577-05 |     1

Step 4: Creating Facts and Dimensions

Create Star Schema

Create dimDate Table


CREATE TABLE dimDate
(
  date_key serial PRIMARY KEY,
  date date NOT NULL,
  year smallint NOT NULL,
  quarter smallint NOT NULL,
  month smallint NOT NULL,
  day smallint NOT NULL,
  week smallint NOT NULL,
  is_weekend boolean
);

To check your work, run the following query to see a table with the field names and data types. The output should match the table below.

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'dimdate';

Output:

 column_name | data_type 
-------------+-----------
 date_key    | integer
 date        | date
 year        | smallint
 quarter     | smallint
 month       | smallint
 day         | smallint
 week        | smallint
 is_weekend  | boolean

Creating more Dimension tables

CREATE TABLE dimCustomer
(
  customer_key SERIAL PRIMARY KEY,
  customer_id  smallint NOT NULL,
  first_name   varchar(45) NOT NULL,
  last_name    varchar(45) NOT NULL,
  email        varchar(50),
  address      varchar(50) NOT NULL,
  address2     varchar(50),
  district     varchar(20) NOT NULL,
  city         varchar(50) NOT NULL,
  country      varchar(50) NOT NULL,
  postal_code  varchar(10),
  phone        varchar(20) NOT NULL,
  active       smallint NOT NULL,
  create_date  timestamp NOT NULL,
  start_date   date NOT NULL,
  end_date     date NOT NULL
);

CREATE TABLE dimMovie
(
  movie_key          SERIAL PRIMARY KEY,
  film_id            smallint NOT NULL,
  title              varchar(255) NOT NULL,
  description        text,
  release_year       year,
  language           varchar(20) NOT NULL,
  original_language  varchar(20),
  rental_duration    smallint NOT NULL,
  length             smallint NOT NULL,
  rating             varchar(5) NOT NULL,
  special_features   varchar(60) NOT NULL
);

CREATE TABLE dimStore
(
  store_key           SERIAL PRIMARY KEY,
  store_id            smallint NOT NULL,
  address             varchar(50) NOT NULL,
  address2            varchar(50),
  district            varchar(20) NOT NULL,
  city                varchar(50) NOT NULL,
  country             varchar(50) NOT NULL,
  postal_code         varchar(10),
  manager_first_name  varchar(45) NOT NULL,
  manager_last_name   varchar(45) NOT NULL,
  start_date          date NOT NULL,
  end_date            date NOT NULL
);

Creating the Fact Table

CREATE TABLE factSales
(
  sales_key serial PRIMARY KEY,
  date_key integer REFERENCES dimdate (date_key),
  customer_key integer REFERENCES dimcustomer (customer_key),
  movie_key integer REFERENCES dimmovie (movie_key),
  store_key integer REFERENCES dimstore (store_key),
  sales_amount numeric    
);

Let's check our work for the fact table:

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'factsales';

Output:

 column_name  | data_type 
--------------+-----------
 sales_key    | integer
 date_key     | integer
 customer_key | integer
 movie_key    | integer
 store_key    | integer
 sales_amount | numeric

Step 5: ETL the data from 3NF Table to Star Schema (Facts and Dimension Tables)

Let's create a test table called test_table:

CREATE TABLE test_table
(
  date timestamp,
  revenue  decimal(5,2)
);

With the test table in place, we can use an INSERT ... SELECT statement to populate it, as shown below:

INSERT INTO test_table (date, revenue)
SELECT payment_date AS date,
       amount AS revenue
FROM payment;

Let's look at our table now:

SELECT * FROM test_table LIMIT 10;

Output:

            date            | revenue 
----------------------------+---------
 2020-01-24 16:40:19.996577 |    1.99
 2020-01-25 10:16:50.996577 |    0.99
 2020-01-28 16:44:14.996577 |    6.99
 2020-01-28 19:58:02.996577 |    0.99
 2020-01-29 03:10:06.996577 |    4.99
 2020-01-31 07:23:14.996577 |    2.99
 2020-01-26 00:10:14.996577 |    1.99
 2020-01-30 23:03:42.996577 |    4.99
 2020-01-31 06:59:15.996577 |    8.99
 2020-01-24 21:47:17.996577 |    0.99

Now that we know how to populate a table from another table in SQL, let's drop the test table:

DROP TABLE test_table;

Populating the dimension tables

INSERT INTO dimDate (date_key, date, year, quarter, month, day, week, is_weekend)
SELECT DISTINCT(TO_CHAR(payment_date :: DATE, 'yyyyMMDD')::integer) AS date_key,
       date(payment_date)                                           AS date,
       EXTRACT(year FROM payment_date)                              AS year,
       EXTRACT(quarter FROM payment_date)                           AS quarter,
       EXTRACT(month FROM payment_date)                             AS month,
       EXTRACT(day FROM payment_date)                               AS day,
       EXTRACT(week FROM payment_date)                              AS week,
       CASE WHEN EXTRACT(ISODOW FROM payment_date) IN (6, 7) THEN true ELSE false END AS is_weekend
FROM payment;
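The date_key surrogate above is just the calendar date rendered as a yyyymmdd integer, so it stays human-readable and sorts chronologically. Here is a quick Python sketch of the same derivation (illustrative only, not part of the ETL pipeline):

```python
from datetime import date

def dim_date_row(d: date) -> dict:
    """Derive the dimDate attributes for a single calendar date."""
    return {
        "date_key": int(d.strftime("%Y%m%d")),   # e.g. 2020-01-24 -> 20200124
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day": d.day,
        "week": d.isocalendar()[1],              # ISO week, like EXTRACT(week ...)
        "is_weekend": d.isoweekday() in (6, 7),  # ISODOW: 6 = Saturday, 7 = Sunday
    }

print(dim_date_row(date(2020, 1, 24)))
```

The SQL INSERT applies exactly this mapping to every distinct payment_date.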

Populate the dimCustomer table with data from the customer, address, city, and country tables. Note the LIMIT 10: only ten customer rows are loaded here, so the fact table built later will cover only those customers' payments.

INSERT INTO dimCustomer (customer_id, first_name, last_name, email, address,
                         address2, district, city, country, postal_code, phone,
                         active, create_date, start_date, end_date)
SELECT DISTINCT c.customer_id, c.first_name, c.last_name, c.email, a.address, a.address2,
       a.district, ci.city, co.country, a.postal_code, a.phone,
       CASE WHEN c.activebool THEN 1 ELSE 0 END,
       (cast(c.create_date as text) || ' 00:00:01')::timestamp,
       date(now()) AS start_date,
       date(now()) AS end_date
FROM customer c
JOIN address a  ON (c.address_id = a.address_id)
JOIN city ci    ON (a.city_id = ci.city_id)
JOIN country co ON (ci.country_id = co.country_id)
LIMIT 10;

Populate the dimMovie table with data from the film and language tables.

INSERT INTO dimMovie (film_id, title, description, release_year, language,
                     original_language, rental_duration, length, rating,
                     special_features)
SELECT DISTINCT f.film_id, f.title, f.description, f.release_year, l.name,
                     orig_lang.name, f.rental_duration, f.length, f.rating,
                     f.special_features
FROM film f
JOIN language l              ON (f.language_id=l.language_id)
LEFT JOIN language orig_lang ON (f.original_language_id = orig_lang.language_id);

Populate the dimStore table with data from the store, staff, address, city, and country tables. This time, there’s no guide. You should write the query from scratch. Use the previous queries as a reference.

INSERT INTO dimStore (store_id, address, address2, district, city, country, postal_code,
                     manager_first_name, manager_last_name, start_date, end_date)
SELECT DISTINCT s.store_id, a.address, a.address2, a.district, ci.city, co.country, a.postal_code,
                     st.first_name, st.last_name, date(now()) AS start_date, date(now())
FROM store s
JOIN address a  ON (s.address_id=a.address_id)
JOIN city ci ON (a.city_id = ci.city_id)
JOIN country co ON (ci.country_id = co.country_id)
JOIN staff st ON (s.manager_staff_id = st.staff_id);

TODO: Populate the factSales table with data from the payment, rental, and inventory tables. This time, there’s no guide. You should write the query from scratch. Use the previous queries as a reference.

INSERT INTO factsales (date_key, customer_key, movie_key, store_key, sales_amount)
SELECT  TO_CHAR(p.payment_date :: DATE, 'yyyyMMDD')::integer, dc.customer_key, 
        dm.movie_key, ds.store_key, p.amount
FROM payment p
JOIN rental r ON (p.rental_id = r.rental_id)
JOIN inventory i ON (r.inventory_id = i.inventory_id)
JOIN customer c ON (p.customer_id=c.customer_id)
JOIN dimcustomer dc ON(p.customer_id = dc.customer_id)
JOIN dimstore ds ON (c.store_id = ds.store_id)
JOIN dimmovie dm ON (i.film_id = dm.film_id);

Step 6: Compare differences between Star and 3NF Schemas

The star schema is easier to understand and write queries against than the 3NF schema: the same analytical question typically requires fewer joins.

  • Write the same queries against both schemas and compare the joins required and the runtimes
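To make the contrast concrete, here is a minimal self-contained sketch using Python's built-in sqlite3 module and made-up rows (not the pagila data): in a star schema, every dimension is exactly one join away from the fact table, whereas the 3NF version of the same question chained payment → rental → inventory → film.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A minimal star schema: one fact table, two dimension tables.
cur.executescript("""
CREATE TABLE dimDate  (date_key INTEGER PRIMARY KEY, month INTEGER);
CREATE TABLE dimStore (store_key INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE factSales (
  date_key  INTEGER REFERENCES dimDate(date_key),
  store_key INTEGER REFERENCES dimStore(store_key),
  sales_amount REAL
);
INSERT INTO dimDate  VALUES (20200124, 1), (20200215, 2);
INSERT INTO dimStore VALUES (1, 'Canada'), (2, 'Australia');
INSERT INTO factSales VALUES
  (20200124, 1, 2.99), (20200124, 2, 4.99), (20200215, 1, 0.99);
""")

# Revenue by month and country: each dimension is one hop from the fact.
rows = cur.execute("""
SELECT d.month, s.country, ROUND(SUM(f.sales_amount), 2) AS revenue
FROM factSales f
JOIN dimDate  d ON f.date_key  = d.date_key
JOIN dimStore s ON f.store_key = s.store_key
GROUP BY d.month, s.country
ORDER BY revenue DESC
""").fetchall()
print(rows)  # list of (month, country, revenue) tuples
```

The shape of this query is identical to the pagila star-schema queries in Part 2; only the data is toy-sized.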

Part 2: Working with OLAP Cube

Goal:

Create an OLAP cube from facts and dimensions

OLAP Cube

An OLAP cube is a multi-dimensional array of data. Online analytical processing (OLAP) is a computer-based technique of analyzing data to look for insights.
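As a toy illustration in plain Python (structure and figures are made up for illustration, echoing the day/rating/city cube used below): a cube can be modeled as a mapping from dimension coordinates to a measure. Slicing fixes one coordinate and drops that dimension; dicing keeps all dimensions but restricts their values.

```python
# A tiny 3-D cube: (day, rating, city) -> revenue
cube = {
    (1,  "PG-13", "Lethbridge"): 25.94,
    (1,  "PG",    "Woodridge"):  15.94,
    (15, "PG-13", "Woodridge"):  21.98,
    (30, "PG",    "Lethbridge"):  7.96,
}

# Slice: fix rating to 'PG-13', dropping that dimension entirely (3-D -> 2-D).
slice_pg13 = {(d, c): v for (d, r, c), v in cube.items() if r == "PG-13"}

# Dice: keep all three dimensions, but restrict the values of two of them.
dice = {k: v for k, v in cube.items()
        if k[1] in ("PG", "PG-13") and k[0] in (1, 15)}

print(slice_pg13)
print(dice)
```

The SQL equivalents are a WHERE clause pinning one dimension to a single value (slice) and IN-lists over several dimensions (dice), as the queries below show.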

"insert OLAP"

Slicing and Dicing

Create the database and import the schema:

!PGPASSWORD=student createdb -h 127.0.0.1 -U devinpowers pagila_star
!PGPASSWORD=student psql -q -h 127.0.0.1 -U devinpowers -d pagila_star -f Data/pagila-star.sql

NOTE: This is the same data as before, loaded into a database containing only the star schema.

Connecting to the database:

%load_ext sql

DB_ENDPOINT = "127.0.0.1"
DB = 'pagila_star'
DB_USER = 'devinpowers'
DB_PASSWORD = 'student'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

print(conn_string)
%sql $conn_string

Slicing

Slicing reduces the dimensionality of a cube by one (e.g. from 3 dimensions to 2) by fixing one of the dimensions to a single value.

Large blocks of data are cut into smaller segments, and the process is repeated until the right level of detail is reached for analysis.

"Inserting Slicing"

Let's look at an example, where we have a 3-dimensional cube with the columns day, rating, and city.

Let's write a query that reduces the dimensionality by limiting the results to movies with a rating of 'PG-13':

SELECT t.day, m.rating, s.city, sum(f.sales_amount) as revenue
FROM factsales f 
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimmovie m ON (f.movie_key = m.movie_key)
JOIN dimstore s ON (f.store_key = s.store_key)
WHERE m.rating = 'PG-13'
GROUP BY t.day, m.rating, s.city
ORDER BY  revenue desc
LIMIT 10;

Output:

 day | rating |    city    | revenue 
-----+--------+------------+---------
  19 | PG-13  | Lethbridge |   47.92
  30 | PG-13  | Woodridge  |   35.92
  30 | PG-13  | Lethbridge |   25.94
   1 | PG-13  | Lethbridge |   25.94
  22 | PG-13  | Woodridge  |   25.94
  15 | PG-13  | Woodridge  |   21.98
  26 | PG-13  | Lethbridge |   21.96
   8 | PG-13  | Woodridge  |   19.96
  22 | PG-13  | Lethbridge |   19.96
   7 | PG-13  | Lethbridge |   17.98

Dicing

Dicing creates a subcube with the same dimensionality but with fewer values for two or more of the dimensions.

Let's write a query to create a subcube of the initial cube that includes movies with:

  • Rating of PG or PG-13
  • In the City of Lethbridge or Woodridge
  • Day equal to 1, 15, or 30

SELECT t.day, m.rating, s.city, sum(f.sales_amount) as revenue
FROM factsales f 
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimmovie m ON (f.movie_key = m.movie_key)
JOIN dimstore s ON (f.store_key = s.store_key)
WHERE m.rating in ('PG-13', 'PG') 
AND s.city in ('Lethbridge', 'Woodridge') 
AND t.day in (1, 15, 30)
GROUP BY t.day, m.rating, s.city
ORDER BY revenue desc
LIMIT 10;

Output:

 day | rating |    city    | revenue 
-----+--------+------------+---------
  30 | PG-13  | Woodridge  |   35.92
   1 | PG-13  | Lethbridge |   25.94
  30 | PG-13  | Lethbridge |   25.94
  15 | PG-13  | Woodridge  |   21.98
   1 | PG     | Lethbridge |   21.92
   1 | PG     | Woodridge  |   15.94
  15 | PG     | Lethbridge |   11.98
   1 | PG-13  | Woodridge  |    9.98
  30 | PG     | Lethbridge |    7.96

Slicing and dicing thus allow detailed analysis of the data from different angles.

OLAP Cubes: Roll-Up and Drill Down

Roll-Up

Roll-up summarizes data by climbing up a concept hierarchy for a dimension, e.g. grouping by country instead of by city.

"Roll-Up"

Write a query that calculates revenue (sales_amount) by day, rating, and country:

SELECT t.day, m.rating, s.country, sum(f.sales_amount) as revenue
FROM factsales f 
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimmovie m ON (f.movie_key = m.movie_key)
JOIN dimstore s ON (f.store_key = s.store_key)
GROUP BY t.day, m.rating, s.country
ORDER BY  revenue desc
LIMIT 10;

Output:

 day | rating |  country  | revenue 
-----+--------+-----------+---------
  19 | PG-13  | Canada    |   47.92
  17 | R      | Canada    |   47.92
  20 | NC-17  | Canada    |   43.94
  20 | R      | Australia |   39.94
  19 | NC-17  | Canada    |   35.92
  15 | R      | Australia |   35.92
  30 | PG-13  | Australia |   35.92
  22 | NC-17  | Australia |   33.94
  28 | PG     | Canada    |   31.94
  29 | R      | Canada    |   31.90

Drill-Down

Drill-down takes the user from a more general view of the data to a more specific one, e.g. breaking country down into district.

"Drill-Down"

Write a query that calculates revenue (sales_amount) by day, rating, and district. Sort the data by revenue in descending order, and limit the data to the top 10.

SELECT t.day, m.rating, s.district, sum(f.sales_amount) as revenue
FROM factsales f 
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimmovie m ON (f.movie_key = m.movie_key)
JOIN dimstore s ON (f.store_key = s.store_key)
GROUP BY t.day, m.rating,  s.district
ORDER BY  revenue desc
LIMIT 10;

Output:

 day | rating | district | revenue 
-----+--------+----------+---------
  17 | R      | Alberta  |   47.92
  19 | PG-13  | Alberta  |   47.92
  20 | NC-17  | Alberta  |   43.94
  20 | R      | QLD      |   39.94
  30 | PG-13  | QLD      |   35.92
  15 | R      | QLD      |   35.92
  19 | NC-17  | Alberta  |   35.92
  22 | NC-17  | QLD      |   33.94
  28 | PG     | Alberta  |   31.94
  29 | R      | Alberta  |   31.90

Review: Roll-Up vs Drill Down

"insert photo"

OLAP Cubes: Grouping Sets

Example Queries

Question: Write a query that calculates the total revenue.

SELECT sum(f.sales_amount) as Revenue
FROM factsales f;

Output:

 Revenue 
---------
 2422.26

Question: Write a query that calculates total revenue (sales_amount) by country.

SELECT s.country, sum(f.sales_amount) as revenue
FROM factsales f 
JOIN dimstore s ON (f.store_key = s.store_key)
GROUP BY s.country
ORDER BY  revenue desc;

Output:

  country  | revenue 
-----------+---------
 Canada    | 1330.96
 Australia | 1091.30

Question: Write a query that calculates total revenue (sales_amount) by month.

SELECT t.month, sum(f.sales_amount) as revenue
FROM factsales f 
JOIN dimdate t ON (f.date_key = t.date_key)
GROUP BY  t.month
ORDER BY  revenue desc;

Output:

 month | revenue 
-------+---------
     3 |  901.96
     4 |  853.92
     2 |  445.10
     1 |  199.38
     5 |   21.90

Question: Write a query that calculates total revenue (sales_amount) by month and country. Sort the data by month, country, and revenue in descending order.

SELECT t.month, s.country, sum(f.sales_amount) as revenue
FROM factsales f 
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimstore s ON (f.store_key = s.store_key)
GROUP BY t.month, s.country
ORDER BY  revenue desc;

Output:

 month |  country  | revenue 
-------+-----------+---------
     3 | Canada    |  530.80
     4 | Canada    |  492.84
     3 | Australia |  371.16
     4 | Australia |  361.08
     2 | Canada    |  231.58
     2 | Australia |  213.52
     1 | Australia |  135.60
     1 | Canada    |   63.78
     5 | Canada    |   11.96
     5 | Australia |    9.94

Question: Write a query that calculates total revenue at the various grouping levels done above (total, by month, by country, by month & country) all at once using the grouping sets function.

SELECT t.month, s.country, sum(f.sales_amount) as revenue
FROM factsales f 
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimstore s ON (f.store_key = s.store_key)
GROUP BY GROUPING SETS (
    (),
    t.month,
    s.country,
    (t.month, s.country)
)
LIMIT 10;

Output:

 month |  country  | revenue 
-------+-----------+---------
       |           | 2422.26
     5 | Canada    |   11.96
     3 | Canada    |  530.80
     2 | Australia |  213.52
     5 | Australia |    9.94
     2 | Canada    |  231.58
     1 | Australia |  135.60
     4 | Australia |  361.08
     3 | Australia |  371.16
     4 | Canada    |  492.84

OLAP Cubes: Cube

[“Insert Cube photos here”]

Write a query that calculates the various levels of aggregation done in the grouping sets exercise (total, by month, by country, by month & country) using the CUBE function.

SELECT t.month, s.country, sum(f.sales_amount) as revenue
FROM factsales f 
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimstore s ON (f.store_key = s.store_key)
GROUP BY CUBE (t.month, s.country)
LIMIT 20;

Output:

 month |  country  | revenue 
-------+-----------+---------
       |           | 2422.26
     5 | Canada    |   11.96
     3 | Canada    |  530.80
     2 | Australia |  213.52
     5 | Australia |    9.94
     2 | Canada    |  231.58
     1 | Australia |  135.60
     4 | Australia |  361.08
     3 | Australia |  371.16
     4 | Canada    |  492.84
     1 | Canada    |   63.78
     3 |           |  901.96
     5 |           |   21.90
     4 |           |  853.92
     2 |           |  445.10
     1 |           |  199.38
       | Canada    | 1330.96
       | Australia | 1091.30

Revenue total, by month, by country, and by month & country, all in one shot, the naive way:

The naive way to produce the same table as above is to write several queries and UNION ALL them together. GROUPING SETS and CUBE queries are shorter to write, easier to read, and more performant. Run the naive query below and compare its runtime to that of the CUBE query.

SELECT  NULL as month, NULL as country, sum(sales_amount) as revenue
FROM factSales
    UNION all 
SELECT NULL, dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by  dimStore.country
    UNION all 
SELECT cast(dimDate.month as text) , NULL, sum(sales_amount) as revenue
FROM factSales
JOIN dimDate on (dimDate.date_key = factSales.date_key)
GROUP by dimDate.month
    UNION all
SELECT cast(dimDate.month as text),dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by (dimDate.month, dimStore.country);

Output:

 month |  country  | revenue 
-------+-----------+---------
       |           | 2422.26
       | Canada    | 1330.96
       | Australia | 1091.30
 3     |           |  901.96
 5     |           |   21.90
 4     |           |  853.92
 2     |           |  445.10
 1     |           |  199.38
 5     | Canada    |   11.96
 3     | Canada    |  530.80
 2     | Australia |  213.52
 5     | Australia |    9.94
 2     | Canada    |  231.58
 1     | Australia |  135.60
 4     | Australia |  361.08
 3     | Australia |  371.16
 4     | Canada    |  492.84
 1     | Canada    |   63.78