Data Engineering: Data Warehouses
Key Takeaways:
- Understand data warehousing architecture [x]
- Run an ETL process to denormalize a database from 3NF to a star schema [x]
- Create an OLAP cube from facts and dimensions [x]
- Compare columnar vs. row-oriented approaches
Intro to Data Warehouses Part 1: Understanding the Architecture of Data Warehouses
- Create the pagila database from a Jupyter notebook, using the schema and data provided in the Data folder
!PGPASSWORD=student createdb -h 127.0.0.1 -U devinpowers pagila
!PGPASSWORD=student psql -q -h 127.0.0.1 -U devinpowers -d pagila -f Data/pagila-schema.sql
!PGPASSWORD=student psql -q -h 127.0.0.1 -U devinpowers -d pagila -f Data/pagila-data.sql
- Connect to the database pagila
In a Jupyter notebook, load the sql extension with %load_ext sql:
%load_ext sql
DB_ENDPOINT = "127.0.0.1"
DB = 'pagila'
DB_USER = 'devinpowers'
DB_PASSWORD = 'student'
DB_PORT = '5432'
# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
.format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)
print(conn_string)
%sql $conn_string
Should print out:
postgresql://devinpowers:student@127.0.0.1:5432/pagila
Step 2: Explore 3NF Schema
Schema of Pagila
Let's run some queries on the Pagila database.
select count(*) from store;
Output:
2
select count(*) from film;
Output:
1000
select count(*) from country;
Output:
109
select count(*) from payment;
Output:
16049
More Queries
What time period are we looking at for these payments?
select min(payment_date) as start, max(payment_date) as "end" from payment;
Output:
start | end
-------------------------------+-------------------------------
2020-01-24 16:21:56.996577-05 | 2020-05-14 08:44:29.996577-04
Where are these events taking place?
select district, count(1)
from address
group by district
order by count(1) desc
limit 5;
Output:
district | count
---------------+-------
Buenos Aires | 10
California | 9
Shandong | 9
West Bengali | 9
Uttar Pradesh | 8
(5 rows)
Step 3: Simple Data Analysis
Run some simple queries:
select film_id, title, release_year, rental_rate, rating from film limit 5;
Output:
film_id | title | release_year | rental_rate | rating
---------+------------------+--------------+-------------+--------
1 | ACADEMY DINOSAUR | 2006 | 0.99 | PG
2 | ACE GOLDFINGER | 2006 | 4.99 | G
3 | ADAPTATION HOLES | 2006 | 2.99 | NC-17
4 | AFFAIR PREJUDICE | 2006 | 2.99 | G
5 | AFRICAN EGG | 2006 | 2.99 | G
Get the movie title for every payment:
SELECT f.title, p.amount, p.payment_date, p.customer_id
FROM payment p
JOIN rental r ON ( p.rental_id = r.rental_id )
JOIN inventory i ON ( r.inventory_id = i.inventory_id )
JOIN film f ON ( i.film_id = f.film_id)
limit 10;
Output:
title | amount | payment_date | customer_id
-----------------+--------+-------------------------------+-------------
SWARM GOLD | 1.99 | 2020-01-24 16:40:19.996577-05 | 269
PACKER MADIGAN | 0.99 | 2020-01-25 10:16:50.996577-05 | 269
SOMETHING DUCK | 6.99 | 2020-01-28 16:44:14.996577-05 | 269
DRACULA CRYSTAL | 0.99 | 2020-01-28 19:58:02.996577-05 | 269
CLOSER BANG | 4.99 | 2020-01-29 03:10:06.996577-05 | 269
DANCES NONE | 2.99 | 2020-01-31 07:23:14.996577-05 | 269
STRAIGHT HOURS | 1.99 | 2020-01-26 00:10:14.996577-05 | 270
MARRIED GO | 4.99 | 2020-01-30 23:03:42.996577-05 | 270
DIVORCE SHINING | 8.99 | 2020-01-31 06:59:15.996577-05 | 271
GOLDMINE TYCOON | 0.99 | 2020-01-24 21:47:17.996577-05 | 272
Write a query that displays the amount of revenue from each title. Limit the results to the top 10 grossing titles
SELECT f.title, sum(p.amount) as revenue
FROM payment p
JOIN rental r ON ( p.rental_id = r.rental_id )
JOIN inventory i ON ( r.inventory_id = i.inventory_id )
JOIN film f ON ( i.film_id = f.film_id)
group by f.title
order by revenue desc
limit 10;
Output:
title | revenue
-------------------+---------
TELEGRAPH VOYAGE | 231.73
WIFE TURN | 223.69
ZORRO ARK | 214.69
GOODFELLAS SALUTE | 209.69
SATURDAY LAMBS | 204.72
TITANS JERK | 201.71
TORQUE BOUND | 198.72
HARRY IDAHO | 195.70
INNOCENT USUAL | 191.74
HUSTLER PARTY | 190.78
Top Grossing Cities!
SELECT ci.city,sum(p.amount) as revenue
FROM payment p
JOIN customer c ON ( p.customer_id = c.customer_id )
JOIN address a ON ( c.address_id = a.address_id )
JOIN city ci ON ( a.city_id = ci.city_id )
group by ci.city
order by revenue desc
limit 10;
Output:
city | revenue
---------------------+---------
Cape Coral | 221.55
Saint-Denis | 216.54
Aurora | 198.50
Molodetno | 195.58
Santa Bárbara d'Oeste | 194.61
Apeldoorn | 194.61
Qomsheh | 186.62
London | 180.52
Ourense (Orense) | 177.60
Bijapur | 175.61
Total revenue by month:
SELECT sum(p.amount) as revenue, EXTRACT(month FROM p.payment_date) as month
from payment p
group by month
order by revenue desc
limit 10;
Output:
revenue | month
----------+-------
28559.46 | 4
23237.11 | 3
10281.33 | 2
4824.43 | 1
514.18 | 5
Each movie by customer city and by month (data cube)
SELECT f.title, p.amount, p.customer_id, ci.city, p.payment_date,EXTRACT(month FROM p.payment_date) as month
FROM payment p
JOIN rental r ON ( p.rental_id = r.rental_id )
JOIN inventory i ON ( r.inventory_id = i.inventory_id )
JOIN film f ON ( i.film_id = f.film_id)
JOIN customer c ON ( p.customer_id = c.customer_id )
JOIN address a ON ( c.address_id = a.address_id )
JOIN city ci ON ( a.city_id = ci.city_id )
order by p.payment_date
limit 10;
Output:
title | amount | customer_id | city | payment_date | month
-----------------+--------+-------------+------------------------+-------------------------------+-------
BLANKET BEVERLY | 2.99 | 130 | Águas Lindas de Goiás | 2020-01-24 16:21:56.996577-05 | 1
FREAKY POCUS | 2.99 | 459 | Qomsheh | 2020-01-24 16:22:59.996577-05 | 1
GRADUATE LORD | 3.99 | 408 | Jaffna | 2020-01-24 16:32:05.996577-05 | 1
LOVE SUICIDES | 4.99 | 333 | Baku | 2020-01-24 16:33:07.996577-05 | 1
IDOLS SNATCHERS | 6.99 | 222 | Jaroslavl | 2020-01-24 16:33:47.996577-05 | 1
MYSTIC TRUMAN | 0.99 | 549 | Santiago de Compostela | 2020-01-24 16:36:33.996577-05 | 1
SWARM GOLD | 1.99 | 269 | Salinas | 2020-01-24 16:40:19.996577-05 | 1
LAWLESS VISION | 4.99 | 239 | Ciomas | 2020-01-24 17:00:12.996577-05 | 1
MATRIX SNOWMAN | 4.99 | 126 | Po | 2020-01-24 17:29:06.996577-05 | 1
HANGING DEEP | 5.99 | 399 | Okara | 2020-01-24 17:30:47.996577-05 | 1
Step 4: Creating Facts and Dimensions
Create Star Schema
Create dimDate Table
CREATE TABLE dimDate
(
date_key serial PRIMARY KEY,
date date NOT NULL,
year smallint NOT NULL,
quarter smallint NOT NULL,
month smallint NOT NULL,
day smallint NOT NULL,
week smallint NOT NULL,
is_weekend boolean
);
To check your work, run the following query to see a table with the field names and data types. The output should match the table below.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'dimdate';
Output:
column_name | data_type
-------------+-----------
date_key | integer
date | date
year | smallint
quarter | smallint
month | smallint
day | smallint
week | smallint
is_weekend | boolean
Creating more Dimension tables
CREATE TABLE dimCustomer
(
customer_key SERIAL PRIMARY KEY,
customer_id smallint NOT NULL,
first_name varchar(45) NOT NULL,
last_name varchar(45) NOT NULL,
email varchar(50),
address varchar(50) NOT NULL,
address2 varchar(50),
district varchar(20) NOT NULL,
city varchar(50) NOT NULL,
country varchar(50) NOT NULL,
postal_code varchar(10),
phone varchar(20) NOT NULL,
active smallint NOT NULL,
create_date timestamp NOT NULL,
start_date date NOT NULL,
end_date date NOT NULL
);
CREATE TABLE dimMovie
(
movie_key SERIAL PRIMARY KEY,
film_id smallint NOT NULL,
title varchar(255) NOT NULL,
description text,
release_year year,
language varchar(20) NOT NULL,
original_language varchar(20),
rental_duration smallint NOT NULL,
length smallint NOT NULL,
rating varchar(5) NOT NULL,
special_features varchar(60) NOT NULL
);
CREATE TABLE dimStore
(
store_key SERIAL PRIMARY KEY,
store_id smallint NOT NULL,
address varchar(50) NOT NULL,
address2 varchar(50),
district varchar(20) NOT NULL,
city varchar(50) NOT NULL,
country varchar(50) NOT NULL,
postal_code varchar(10),
manager_first_name varchar(45) NOT NULL,
manager_last_name varchar(45) NOT NULL,
start_date date NOT NULL,
end_date date NOT NULL
);
Creating the Fact Table
CREATE TABLE factSales
(
sales_key serial PRIMARY KEY,
date_key integer REFERENCES dimdate (date_key),
customer_key integer REFERENCES dimcustomer (customer_key),
movie_key integer REFERENCES dimmovie (movie_key),
store_key integer REFERENCES dimstore (store_key),
sales_amount numeric
);
Let's check our work for the fact table:
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'factsales';
Output:
column_name | data_type
--------------+-----------
sales_key | integer
date_key | integer
customer_key | integer
movie_key | integer
store_key | integer
sales_amount | numeric
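A fact row stores dimension surrogate keys rather than the source system's natural keys, so loading it means looking each natural key up in the corresponding dimension. A minimal Python sketch of that lookup step (the mappings, values, and to_fact_row helper are hypothetical, not part of pagila):

```python
# Hypothetical surrogate-key lookups, as they might look after the dimension
# tables are loaded (natural key -> surrogate key). Illustrative values only.
dim_customer = {269: 1, 270: 2}   # customer_id -> customer_key
dim_movie = {1: 10, 2: 11}        # film_id -> movie_key
dim_store = {1: 100, 2: 200}      # store_id -> store_key

def to_fact_row(payment):
    """Build a factSales-style row by swapping natural keys for surrogate keys."""
    return {
        "date_key": payment["date_key"],  # yyyymmdd integer, see the dimDate load below
        "customer_key": dim_customer[payment["customer_id"]],
        "movie_key": dim_movie[payment["film_id"]],
        "store_key": dim_store[payment["store_id"]],
        "sales_amount": payment["amount"],
    }

row = to_fact_row({"date_key": 20200124, "customer_id": 269,
                   "film_id": 2, "store_id": 1, "amount": 1.99})
print(row["customer_key"], row["movie_key"])  # 1 11
```

The factSales INSERT at the end of Step 5 does exactly this, but with SQL joins instead of dictionary lookups.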
Step 5: ETL the Data from the 3NF Tables to the Star Schema (Fact and Dimension Tables)
Let's create a test table called test_table:
CREATE TABLE test_table
(
date timestamp,
revenue decimal(5,2)
);
With the test table in place, we can use an INSERT INTO ... SELECT statement to populate it, as shown below:
INSERT INTO test_table (date, revenue)
SELECT payment_date AS date,
amount AS revenue
FROM payment;
Let's look at our table now:
SELECT * FROM test_table LIMIT 10;
Output:
date | revenue
----------------------------+---------
2020-01-24 16:40:19.996577 | 1.99
2020-01-25 10:16:50.996577 | 0.99
2020-01-28 16:44:14.996577 | 6.99
2020-01-28 19:58:02.996577 | 0.99
2020-01-29 03:10:06.996577 | 4.99
2020-01-31 07:23:14.996577 | 2.99
2020-01-26 00:10:14.996577 | 1.99
2020-01-30 23:03:42.996577 | 4.99
2020-01-31 06:59:15.996577 | 8.99
2020-01-24 21:47:17.996577 | 0.99
Now that we know how to populate a table from another table in SQL, let's drop the test table:
DROP TABLE test_table;
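The same INSERT INTO ... SELECT pattern can be tried without a Postgres server at all; here is a self-contained sketch using Python's built-in sqlite3 as a stand-in, with table names mirroring the example above:

```python
import sqlite3

# In-memory stand-in for the Postgres payment/test_table example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (payment_date TEXT, amount REAL)")
conn.executemany("INSERT INTO payment VALUES (?, ?)",
                 [("2020-01-24", 1.99), ("2020-01-25", 0.99)])
conn.execute("CREATE TABLE test_table (date TEXT, revenue REAL)")

# Populate one table straight from a query on another:
conn.execute("INSERT INTO test_table (date, revenue) "
             "SELECT payment_date, amount FROM payment")
print(conn.execute("SELECT COUNT(*) FROM test_table").fetchone()[0])  # 2
```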
Populating the dimension tables
INSERT INTO dimDate (date_key, date, year, quarter, month, day, week, is_weekend)
SELECT DISTINCT(TO_CHAR(payment_date :: DATE, 'yyyyMMDD')::integer) AS date_key,
date(payment_date) AS date,
EXTRACT(year FROM payment_date) AS year,
EXTRACT(quarter FROM payment_date) AS quarter,
EXTRACT(month FROM payment_date) AS month,
EXTRACT(day FROM payment_date) AS day,
EXTRACT(week FROM payment_date) AS week,
CASE WHEN EXTRACT(ISODOW FROM payment_date) IN (6, 7) THEN true ELSE false END AS is_weekend
FROM payment;
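The date_key and calendar attributes derived above can be mirrored in plain Python, which is a handy way to sanity-check the weekend logic (ISODOW 6 and 7 are Saturday and Sunday). A sketch with a hypothetical date_dimensions helper:

```python
from datetime import date

def date_dimensions(d: date) -> dict:
    """Mirror the dimDate derivation: a yyyymmdd surrogate key plus calendar attributes."""
    return {
        "date_key": int(d.strftime("%Y%m%d")),   # e.g. 20200125
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day": d.day,
        "week": d.isocalendar()[1],              # ISO week number
        "is_weekend": d.isoweekday() in (6, 7),  # same rule as EXTRACT(ISODOW ...)
    }

print(date_dimensions(date(2020, 1, 25)))  # a Saturday -> is_weekend True
```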
INSERT INTO dimCustomer (customer_id, first_name, last_name, email, address,
address2, district, city, country, postal_code, phone,
active, create_date, start_date, end_date)
SELECT DISTINCT c.customer_id, c.first_name, c.last_name, c.email, a.address, a.address2,
a.district,ci.city, co.country, a.postal_code, a.phone,
CASE WHEN c.activebool THEN 1 ELSE 0 END,
(cast(c.create_date as text) || ' 00:00:01'):: timestamp,
date(now()) AS start_date, date(now())
FROM customer c
JOIN address a ON (c.address_id = a.address_id)
JOIN city ci ON (a.city_id = ci.city_id)
JOIN country co ON (ci.country_id = co.country_id);
Populate the dimMovie table with data from the film and language tables:
INSERT INTO dimMovie (film_id, title, description, release_year, language,
original_language, rental_duration, length, rating,
special_features)
SELECT DISTINCT f.film_id, f.title, f.description, f.release_year, l.name,
orig_lang.name, f.rental_duration, f.length, f.rating,
f.special_features
FROM film f
JOIN language l ON (f.language_id=l.language_id)
LEFT JOIN language orig_lang ON (f.original_language_id = orig_lang.language_id);
Populate the dimStore table with data from the store, staff, address, city, and country tables. This time, there’s no guide. You should write the query from scratch. Use the previous queries as a reference.
INSERT INTO dimStore (store_id, address, address2, district, city, country, postal_code,
manager_first_name, manager_last_name, start_date, end_date)
SELECT DISTINCT s.store_id, a.address, a.address2, a.district, ci.city, co.country, a.postal_code,
st.first_name, st.last_name, date(now()) AS start_date, date(now())
FROM store s
JOIN address a ON (s.address_id=a.address_id)
JOIN city ci ON (a.city_id = ci.city_id)
JOIN country co ON (ci.country_id = co.country_id)
JOIN staff st ON (s.manager_staff_id = st.staff_id);
Populate the factSales table with data from the payment, rental, and inventory tables. This time, there's no guide. You should write the query from scratch. Use the previous queries as a reference.
INSERT INTO factsales (date_key, customer_key, movie_key, store_key, sales_amount)
SELECT TO_CHAR(p.payment_date :: DATE, 'yyyyMMDD')::integer, dc.customer_key,
dm.movie_key, ds.store_key, p.amount
FROM payment p
JOIN rental r ON (p.rental_id = r.rental_id)
JOIN inventory i ON (r.inventory_id = i.inventory_id)
JOIN customer c ON (p.customer_id=c.customer_id)
JOIN dimcustomer dc ON(p.customer_id = dc.customer_id)
JOIN dimstore ds ON (c.store_id = ds.store_id)
JOIN dimmovie dm ON (i.film_id = dm.film_id);
Step 6: Compare the Star and 3NF Schemas
The star schema is easier to understand and write queries against than the 3NF schema: most analytical questions need fewer joins, at the cost of some redundancy.
Part 2: Working with OLAP Cube
Goal:
Create an OLAP cube from facts and dimensions
OLAP Cube
An OLAP cube is a multi-dimensional array of data. Online analytical processing (OLAP) is a computer-based technique of analyzing data to look for insights.
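To make "multi-dimensional array of data" concrete, here is a toy in-memory version: fact rows keyed by (day, rating, city) that can be summed over any subset of those dimensions. Rows, values, and the aggregate helper are made up for illustration:

```python
from collections import defaultdict

# Toy fact rows: (day, rating, city, sales_amount) -- illustrative values only.
facts = [
    (19, "PG-13", "Lethbridge", 47.92),
    (30, "PG-13", "Woodridge", 35.92),
    (30, "PG", "Lethbridge", 7.96),
]

def aggregate(facts, dims):
    """Sum sales_amount grouped by the chosen dimension indices (0=day, 1=rating, 2=city)."""
    totals = defaultdict(float)
    for *keys, amount in facts:
        totals[tuple(keys[i] for i in dims)] += amount
    return dict(totals)

by_rating = aggregate(facts, dims=(1,))  # roll everything up to rating
print(by_rating)
```

Slicing, dicing, roll-up, and drill-down (below) are all just different choices of which dimensions to keep, filter, or collapse.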
Slicing and Dicing
Create the database and import the schema:
!PGPASSWORD=student createdb -h 127.0.0.1 -U devinpowers pagila_star
!PGPASSWORD=student psql -q -h 127.0.0.1 -U devinpowers -d pagila_star -f Data/pagila-star.sql
NOTE: This is the same data as before, but loaded with only the star schema.
Connecting to the database
%load_ext sql
DB_ENDPOINT = "127.0.0.1"
DB = 'pagila_star'
DB_USER = 'devinpowers'
DB_PASSWORD = 'student'
DB_PORT = '5432'
# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
.format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)
print(conn_string)
%sql $conn_string
Slicing
Slicing is the reduction of the dimensionality of a cube by one (e.g. from 3 dimensions to 2) by fixing one of the dimensions to a single value.
Large blocks of data are cut into smaller segments, and the process is repeated until the right level of detail is reached for analysis.
Let's look at an example where we have a 3-dimensional cube with the dimensions day, rating, and city.
Let's write a query that reduces the dimensionality by limiting the results to movies with a rating of 'PG-13':
SELECT t.day, m.rating, s.city, sum(f.sales_amount) as revenue
FROM factsales f
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimmovie m ON (f.movie_key = m.movie_key)
JOIN dimstore s ON (f.store_key = s.store_key)
WHERE m.rating = 'PG-13'
GROUP BY t.day, m.rating, s.city
ORDER BY revenue desc
LIMIT 10;
Output:
day | rating | city | revenue
-----+--------+------------+---------
19 | PG-13 | Lethbridge | 47.92
30 | PG-13 | Woodridge | 35.92
30 | PG-13 | Lethbridge | 25.94
1 | PG-13 | Lethbridge | 25.94
22 | PG-13 | Woodridge | 25.94
15 | PG-13 | Woodridge | 21.98
26 | PG-13 | Lethbridge | 21.96
8 | PG-13 | Woodridge | 19.96
22 | PG-13 | Lethbridge | 19.96
7 | PG-13 | Lethbridge | 17.98
Dicing
Dicing is creating a subcube with the same dimensionality but fewer values for two or more dimensions.
Let's write a query to create a subcube of the initial cube that includes movies with:
- Rating of PG or PG-13
- In the City of Lethbridge or Woodridge
- Day equal to 1, 15, or 30
SELECT t.day, m.rating, s.city, sum(f.sales_amount) as revenue
FROM factsales f
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimmovie m ON (f.movie_key = m.movie_key)
JOIN dimstore s ON (f.store_key = s.store_key)
WHERE m.rating in ('PG-13', 'PG')
AND s.city in ('Lethbridge', 'Woodridge')
AND t.day in (1, 15, 30)
GROUP BY t.day, m.rating, s.city
ORDER BY revenue desc
LIMIT 10;
Output:
day | rating | city | revenue
-----+--------+------------+---------
30 | PG-13 | Woodridge | 35.92
1 | PG-13 | Lethbridge | 25.94
30 | PG-13 | Lethbridge | 25.94
15 | PG-13 | Woodridge | 21.98
1 | PG | Lethbridge | 21.92
1 | PG | Woodridge | 15.94
15 | PG | Lethbridge | 11.98
1 | PG-13 | Woodridge | 9.98
30 | PG | Lethbridge | 7.96
Thus, slicing and dicing allow detailed analysis of the data from different angles.
OLAP Cubes: Roll-Up and Drill-Down
Roll-Up
Roll-up summarizes data by moving up a dimension hierarchy, e.g. aggregating cities into countries:
SELECT t.day, m.rating, s.country, sum(f.sales_amount) as revenue
FROM factsales f
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimmovie m ON (f.movie_key = m.movie_key)
JOIN dimstore s ON (f.store_key = s.store_key)
GROUP BY t.day, m.rating, s.country
ORDER BY revenue desc
LIMIT 10;
Output:
day | rating | country | revenue
-----+--------+-----------+---------
19 | PG-13 | Canada | 47.92
17 | R | Canada | 47.92
20 | NC-17 | Canada | 43.94
20 | R | Australia | 39.94
19 | NC-17 | Canada | 35.92
15 | R | Australia | 35.92
30 | PG-13 | Australia | 35.92
22 | NC-17 | Australia | 33.94
28 | PG | Canada | 31.94
29 | R | Canada | 31.90
Drill-Down
Drill-down is when you take the user from a more general view of the data to a more specific one, e.g. breaking countries down into districts. It lets users go deeper.
Write a query that calculates revenue (sales_amount) by day, rating, and district. Sort the data by revenue in descending order, and limit the data to the top 10.
SELECT t.day, m.rating, s.district, sum(f.sales_amount) as revenue
FROM factsales f
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimmovie m ON (f.movie_key = m.movie_key)
JOIN dimstore s ON (f.store_key = s.store_key)
GROUP BY t.day, m.rating, s.district
ORDER BY revenue desc
LIMIT 10;
Output:
day | rating | district | revenue
-----+--------+----------+---------
17 | R | Alberta | 47.92
19 | PG-13 | Alberta | 47.92
20 | NC-17 | Alberta | 43.94
20 | R | QLD | 39.94
30 | PG-13 | QLD | 35.92
15 | R | QLD | 35.92
19 | NC-17 | Alberta | 35.92
22 | NC-17 | QLD | 33.94
28 | PG | Alberta | 31.94
29 | R | Alberta | 31.90
Review: Roll-Up vs Drill Down
OLAP Cubes: Grouping Sets
Example Queries
Question: Write a query that calculates the total revenue.
SELECT sum(f.sales_amount) as Revenue
FROM factsales f;
Output:
Revenue
---------
2422.26
Question: Write a query that calculates total revenue (sales_amount) by country.
SELECT s.country, sum(f.sales_amount) as revenue
FROM factsales f
JOIN dimstore s ON (f.store_key = s.store_key)
GROUP BY s.country
ORDER BY revenue desc;
Output:
country | revenue
-----------+---------
Canada | 1330.96
Australia | 1091.30
Question: Write a query that calculates total revenue (sales_amount) by month.
SELECT t.month, sum(f.sales_amount) as revenue
FROM factsales f
JOIN dimdate t ON (f.date_key = t.date_key)
GROUP BY t.month
ORDER BY revenue desc;
Output:
month | revenue
-------+---------
3 | 901.96
4 | 853.92
2 | 445.10
1 | 199.38
5 | 21.90
Question: Write a query that calculates total revenue (sales_amount) by month and country. Sort the data by month, country, and revenue in descending order.
SELECT t.month, s.country, sum(f.sales_amount) as revenue
FROM factsales f
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimstore s ON (f.store_key = s.store_key)
GROUP BY t.month, s.country
ORDER BY revenue desc;
Output:
month | country | revenue
-------+-----------+---------
3 | Canada | 530.80
4 | Canada | 492.84
3 | Australia | 371.16
4 | Australia | 361.08
2 | Canada | 231.58
2 | Australia | 213.52
1 | Australia | 135.60
1 | Canada | 63.78
5 | Canada | 11.96
5 | Australia | 9.94
Question: Write a query that calculates total revenue at the various grouping levels done above (total, by month, by country, by month & country) all at once using the grouping sets function.
SELECT t.month, s.country, sum(f.sales_amount) as revenue
FROM factsales f
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimstore s ON (f.store_key = s.store_key)
GROUP BY GROUPING SETS (
(),
t.month,
s.country,
(t.month, s.country)
)
LIMIT 10;
Output:
month | country | revenue
-------+-----------+---------
| | 2422.26
5 | Canada | 11.96
3 | Canada | 530.80
2 | Australia | 213.52
5 | Australia | 9.94
2 | Canada | 231.58
1 | Australia | 135.60
4 | Australia | 361.08
3 | Australia | 371.16
4 | Canada | 492.84
OLAP Cubes: Cube
[“Insert Cube photos here”]
Write a query that calculates the various levels of aggregation done in the grouping sets exercise (total, by month, by country, by month & country) using the CUBE function.
SELECT t.month, s.country, sum(f.sales_amount) as revenue
FROM factsales f
JOIN dimdate t ON (f.date_key = t.date_key)
JOIN dimstore s ON (f.store_key = s.store_key)
GROUP BY CUBE (t.month, s.country)
LIMIT 20;
Output:
month | country | revenue
-------+-----------+---------
| | 2422.26
5 | Canada | 11.96
3 | Canada | 530.80
2 | Australia | 213.52
5 | Australia | 9.94
2 | Canada | 231.58
1 | Australia | 135.60
4 | Australia | 361.08
3 | Australia | 371.16
4 | Canada | 492.84
1 | Canada | 63.78
3 | | 901.96
5 | | 21.90
4 | | 853.92
2 | | 445.10
1 | | 199.38
| Canada | 1330.96
| Australia | 1091.30
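CUBE(month, country) is shorthand for grouping sets over every subset of the listed dimensions, 2^n of them for n dimensions. A small sketch that enumerates those subsets (the cube_grouping_sets helper is illustrative, not a real SQL API):

```python
from itertools import combinations

def cube_grouping_sets(dims):
    """Expand CUBE(d1..dn) into its 2**n grouping sets, from most to least detailed."""
    sets = []
    for r in range(len(dims), -1, -1):
        sets.extend(combinations(dims, r))
    return sets

print(cube_grouping_sets(("month", "country")))
# [('month', 'country'), ('month',), ('country',), ()]
```

The four sets match the four levels of aggregation in the cube output above: by month & country, by month, by country, and the grand total.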
Revenue total, by month, by country, and by month & country, all in one shot, the naive way:
The naive way to create the same table as above is to write several queries and UNION ALL them together. Grouping sets and cubes produce queries that are shorter to write, easier to read, and more performant. Run the naive query below and compare its run time to that of the cube query.
SELECT NULL as month, NULL as country, sum(sales_amount) as revenue
FROM factSales
UNION all
SELECT NULL, dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by dimStore.country
UNION all
SELECT cast(dimDate.month as text) , NULL, sum(sales_amount) as revenue
FROM factSales
JOIN dimDate on (dimDate.date_key = factSales.date_key)
GROUP by dimDate.month
UNION all
SELECT cast(dimDate.month as text),dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimDate on (dimDate.date_key = factSales.date_key)
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by (dimDate.month, dimStore.country);
Output:
month | country | revenue
-------+-----------+---------
| | 2422.26
| Canada | 1330.96
| Australia | 1091.30
3 | | 901.96
5 | | 21.90
4 | | 853.92
2 | | 445.10
1 | | 199.38
5 | Canada | 11.96
3 | Canada | 530.80
2 | Australia | 213.52
5 | Australia | 9.94
2 | Canada | 231.58
1 | Australia | 135.60
4 | Australia | 361.08
3 | Australia | 371.16
4 | Canada | 492.84
1 | Canada | 63.78