intermediate SQL

GROUP BY

What does GROUP BY do?

  • groups rows into subgroups, GROUP BY returns one row for each group.

  • reduces the number of rows in the result set

I’m using count(*) to show how many of each value of (shipped, Resolved, Cancelled, …) are in the table.

SELECT status, count(*)
FROM orders
GROUP BY status;
+------------+----------+
| status     | count(*) |
+------------+----------+
| Shipped    |      303 |
| Resolved   |        4 |
| Cancelled  |        6 |
| On Hold    |        4 |
| Disputed   |        3 |
| In Process |        6 |
+------------+----------+
  • We can see that the GROUP BY clause returns unique occurrences of status values

Could also just do this and GROUP BY Status

SELECT status
FROM orders
GROUP BY status;

Output:

+------------+
| status     |
+------------+
| Shipped    |
| Resolved   |
| Cancelled  |
| On Hold    |
| Disputed   |
| In Process |
+------------+
SELECT 
    status, 
    SUM(quantityOrdered * priceEach) AS amount
FROM
    orders
INNER JOIN orderdetails ON orders.orderNumber = orderdetails.orderNumber

GROUP BY 
    status;

Output:

+------------+------------+
| status     | amount     |
+------------+------------+
| Shipped    | 8865094.64 |
| Resolved   |  134235.88 |
| Cancelled  |  238854.18 |
| On Hold    |  169575.61 |
| Disputed   |   61158.78 |
| In Process |  135271.52 |
+------------+------------+

ORDER BY

  • Sorts rows in the result set that you want
SELECT
	contactLastname,
	contactFirstname
FROM
	customers
ORDER BY
	contactLastname;

Output:

+-----------------+------------------+
| contactLastname | contactFirstname |
+-----------------+------------------+
| Accorti         | Paolo            |
| Altagar,G M     | Raanan           |
| Andersen        | Mel              |
| Anton           | Carmen           |
| Ashworth        | Rachel           |
| Barajas         | Miguel           |
| Benitez         | Violeta          |
| Bennett         | Helen            |
| Berglund        | Christina        |

Difference Between GROUP BY and ORDER BY

  • ORDER BY clause’s purpose is to sort the query result by specific columns
  • GROUP BY clause’s purpose is summarize unique combinations of columns values.

Question: Differences in keyword WHERE and HAVING

You use WHERE before you use GROUP BY to organize data, and HAVING after you’ve used GROUP BY statement.

Case Statments

  • Case statement allows you to specify a condition and then it also allows you to specify what you want returned when that condition is met.

Lets First Describe the two tables that we will work with for these examples:

EmployeeDemographics Table

+------------+-----------+------------+------+--------+
| EmployeeID | FirstName | LastName   | Age  | Gender |
+------------+-----------+------------+------+--------+
|       1001 | Jim       | Halpert    |   30 | Male   |
|       1002 | Pam       | Beasley    |   30 | Female |
|       1003 | Dwight    | Schrute    |   29 | Male   |
|       1004 | Angela    | Martin     |   31 | Female |
|       1005 | Toby      | Flenderson |   32 | Male   |
|       1006 | Michael   | Scott      |   35 | Male   |
|       1007 | Meredith  | Palmer     |   32 | Female |
|       1008 | Stanley   | Hudson     |   38 | Male   |
|       1009 | Kevin     | Malone     |   31 | Male   |
|       1010 | Ryan      | Howard     |   26 | Male   |
+------------+-----------+------------+------+--------+

EmployeeSalary Table

+------------+--------------------+--------+
| EmployeeID | JobTitle           | Salary |
+------------+--------------------+--------+
|       1001 | Salesman           |  45000 |
|       1002 | Receptionist       |  36000 |
|       1003 | Salesman           |  63000 |
|       1004 | Accountant         |  47000 |
|       1005 | HR                 |  50000 |
|       1006 | Regional Manager   |  65000 |
|       1007 | Supplier Relations |  41000 |
|       1008 | Salesman           |  48000 |
|       1009 | Accountant         |  42000 |
+------------+--------------------+--------+
SELECT FirstName, LastName, Age
FROM EmployeeDemographics
WHERE Age is NOT NULL
ORDER BY Age;

Output

+-----------+------------+------+
| FirstName | LastName   | Age  |
+-----------+------------+------+
| Ryan      | Howard     |   26 |
| Dwight    | Schrute    |   29 |
| Jim       | Halpert    |   30 |
| Pam       | Beasley    |   30 |
| Angela    | Martin     |   31 |
| Kevin     | Malone     |   31 |
| Toby      | Flenderson |   32 |
| Meredith  | Palmer     |   32 |
| Michael   | Scott      |   35 |
| Stanley   | Hudson     |   38 |
+-----------+------------+------+

The syntax is simple,

CASE
    WHEN

    ELSE 
END  

Use END when we want to specify were down with the case statement.

First Case Statement

  • Use the Keyword CASE to start and end with END keyword. Have some statments in the case statement

NOTE: The very first condition at the top will be returned, so we can show this below. Stanley will be returned the word “old” because the first statement was evaluated.

SELECT FirstName, LastName, Age,
CASE

	WHEN Age > 30 THEN 'Old'
    WHEN Age = 38 THEN 'Stanley'
    ELSE 'Baby' 
END AS CaseStatment
FROM EmployeeDemographics
WHERE Age is NOT NULL
ORDER BY Age;

Output

+-----------+------------+------+--------------+
| FirstName | LastName   | Age  | CaseStatment |
+-----------+------------+------+--------------+
| Ryan      | Howard     |   26 | Baby         |
| Dwight    | Schrute    |   29 | Baby         |
| Jim       | Halpert    |   30 | Baby         |
| Pam       | Beasley    |   30 | Baby         |
| Angela    | Martin     |   31 | Old          |
| Kevin     | Malone     |   31 | Old          |
| Toby      | Flenderson |   32 | Old          |
| Meredith  | Palmer     |   32 | Old          |
| Michael   | Scott      |   35 | Old          |
| Stanley   | Hudson     |   38 | Old          |
+-----------+------------+------+--------------+

If we change the WHEN Clauses around like so:

SELECT FirstName, LastName, Age,
CASE 
    WHEN Age = 38 THEN 'Stanley'
    WHEN Age > 30 THEN 'Old'
    ELSE 'Baby' 
END AS CaseStatment
FROM EmployeeDemographics
WHERE Age is NOT NULL
ORDER BY Age;

Output

+-----------+------------+------+--------------+
| FirstName | LastName   | Age  | CaseStatment |
+-----------+------------+------+--------------+
| Ryan      | Howard     |   26 | Baby         |
| Dwight    | Schrute    |   29 | Baby         |
| Jim       | Halpert    |   30 | Baby         |
| Pam       | Beasley    |   30 | Baby         |
| Angela    | Martin     |   31 | Old          |
| Kevin     | Malone     |   31 | Old          |
| Toby      | Flenderson |   32 | Old          |
| Meredith  | Palmer     |   32 | Old          |
| Michael   | Scott      |   35 | Old          |
| Stanley   | Hudson     |   38 | Stanley      |
+-----------+------------+------+--------------+

Then, boom Stanley CaseStatment returns Stanley.

Another Example

SELECT FirstName, LastName, JobTitle, Salary,
CASE
	WHEN JobTitle = 'Salesman' THEN Salary + (Salary * .10)
	WHEN JobTitle = 'Accountant' THEN Salary + (Salary * .05)
	WHEN JobTitle = 'HR' THEN Salary + (Salary * .000001)
	ELSE Salary + (Salary * .03)
END AS SalaryAfterRaise

FROM EmployeeDemographics
JOIN EmployeeSalary ON EmployeeDemographics.EmployeeID = EmployeeSalary.EmployeeID;

Output

+-----------+------------+--------------------+--------+------------------+
| FirstName | LastName   | JobTitle           | Salary | SalaryAfterRaise |
+-----------+------------+--------------------+--------+------------------+
| Jim       | Halpert    | Salesman           |  45000 |         49500.00 |
| Pam       | Beasley    | Receptionist       |  36000 |         37080.00 |
| Dwight    | Schrute    | Salesman           |  63000 |         69300.00 |
| Angela    | Martin     | Accountant         |  47000 |         49350.00 |
| Toby      | Flenderson | HR                 |  50000 |     50000.050000 |
| Michael   | Scott      | Regional Manager   |  65000 |         66950.00 |
| Meredith  | Palmer     | Supplier Relations |  41000 |         42230.00 |
| Stanley   | Hudson     | Salesman           |  48000 |         52800.00 |
| Kevin     | Malone     | Accountant         |  42000 |         44100.00 |
+-----------+------------+--------------------+--------+------------------+

More Examples of Case Statements

  • Switching database

Having Clause

  • What it’s used for
SELECT JobTitle, COUNT(JobTitle)
FROM EmployeeDemographics 
JOIN EmployeeSalary ON EmployeeDemographics.EmployeeID = EmployeeSalary.EmployeeID
GROUP BY JobTitle
HAVING COUNT(JobTitle) > 1;

Output:

+------------+-----------------+
| JobTitle   | COUNT(JobTitle) |
+------------+-----------------+
| Salesman   |               3 |
| Accountant |               2 |
+------------+-----------------+
SELECT JobTitle, AVG(Salary)
FROM EmployeeDemographics 
JOIN EmployeeSalary ON EmployeeDemographics.EmployeeID = EmployeeSalary.EmployeeID
GROUP BY JobTitle
ORDER BY AVG(Salary)

Output:

+--------------------+-------------+
| JobTitle           | AVG(Salary) |
+--------------------+-------------+
| Receptionist       |  36000.0000 |
| Supplier Relations |  41000.0000 |
| Accountant         |  44500.0000 |
| HR                 |  50000.0000 |
| Salesman           |  52000.0000 |
| Regional Manager   |  65000.0000 |
+--------------------+-------------+
SELECT JobTitle, AVG(Salary)
FROM EmployeeDemographics 
JOIN EmployeeSalary ON EmployeeDemographics.EmployeeID = EmployeeSalary.EmployeeID
GROUP BY JobTitle
HAVING AVG(Salary) > 45000
ORDER BY AVG(Salary);

Output:

+------------------+-------------+
| JobTitle         | AVG(Salary) |
+------------------+-------------+
| HR               |  50000.0000 |
| Salesman         |  52000.0000 |
| Regional Manager |  65000.0000 |
+------------------+-------------+

Now we know what Jobs have a average salary above $45,000!

Partition By

  • Compared to the GROUP BY Statement

  • Divides into partitions

SELECT FirstName, LastName, Gender, Salary,
COUNT(Gender) OVER (PARTITION BY Gender) AS TotalGender
FROM EmployeeDemographics 
JOIN EmployeeSalary ON EmployeeDemographics.EmployeeID = EmployeeSalary.EmployeeID;

Output:

+-----------+------------+--------+--------+-------------+
| FirstName | LastName   | Gender | Salary | TotalGender |
+-----------+------------+--------+--------+-------------+
| Pam       | Beasley    | Female |  36000 |           3 |
| Angela    | Martin     | Female |  47000 |           3 |
| Meredith  | Palmer     | Female |  41000 |           3 |
| Jim       | Halpert    | Male   |  45000 |           6 |
| Dwight    | Schrute    | Male   |  63000 |           6 |
| Toby      | Flenderson | Male   |  50000 |           6 |
| Michael   | Scott      | Male   |  65000 |           6 |
| Stanley   | Hudson     | Male   |  48000 |           6 |
| Kevin     | Malone     | Male   |  42000 |           6 |
+-----------+------------+--------+--------+-------------+

We can see in our TotalGender theres 3

  • Isolate Columns using Partition BY

Lets see the Difference is:

SELECT Gender, COUNT(Gender)
FROM EmployeeDemographics 
GROUP BY Gender;

Output:

+--------+---------------+
| Gender | COUNT(Gender) |
+--------+---------------+
| Male   |             7 |
| Female |             3 |
+--------+---------------+

Difference Between GROUP BY and PARTITION BY

A GROUP BY normally reduces the number of rows returned by rolling them up and calculating averages or sums for each row. PARTITION BY does not affect the number of rows returned, but it changes how a window function’s result is calculated.

CTE: Common Table Expression

  • Only created in memory, acts like a subquery

  • CTE is a temporary result set that exists only within the execution scope of a single SQL statement

Syntax:

WITH cte_name (column_list) AS (
    query
) 
SELECT * FROM cte_name;
WITH CTE_Employee as
	(SELECT FirstName, LastName, Gender, Salary, COUNT(Gender) OVER (PARTITION BY Gender) AS TotalGender,
	AVG(Salary) OVER (PARTITION BY Gender) AS AvgSalary
    
	FROM EmployeeDemographics
    
JOIN EmployeeSalary on EmployeeDemographics.EmployeeID = EmployeeSalary.EmployeeID
WHERE Salary > '45000'
)
SELECT FirstName, AvgSalary
FROM
CTE_Employee;

Output:

+-----------+------------+
| FirstName | AvgSalary  |
+-----------+------------+
| Angela    | 47000.0000 |
| Dwight    | 56500.0000 |
| Toby      | 56500.0000 |
| Michael   | 56500.0000 |
| Stanley   | 56500.0000 |
+-----------+------------+

Subqueries

  • Query inside a query

  • Used to return data that will be used in the main query as a condtion to further restrict the data to be retrieved

  • Like a window function

Examples

SELECT EmployeeID, Salary, (SELECT AVG(Salary) FROM EmployeeSalary) AS AllAvgSalary
FROM EmployeeSalary;

Lets get a closer look at the subquery!

Select AVG(Salary) FROM EmployeeSalary;

Output

+-------------+
| AVG(Salary) |
+-------------+
|  48555.5556 |
+-------------+

What will happen when we run this?:

SELECT EmployeeID, Salary, (SELECT AVG(Salary) FROM EmployeeSalary) AS AllAvgSalary
FROM EmployeeSalary;

Output

+------------+--------+--------------+
| EmployeeID | Salary | AllAvgSalary |
+------------+--------+--------------+
|       1001 |  45000 |   48555.5556 |
|       1002 |  36000 |   48555.5556 |
|       1003 |  63000 |   48555.5556 |
|       1004 |  47000 |   48555.5556 |
|       1005 |  50000 |   48555.5556 |
|       1006 |  65000 |   48555.5556 |
|       1007 |  41000 |   48555.5556 |
|       1008 |  48000 |   48555.5556 |
|       1009 |  42000 |   48555.5556 |
+------------+--------+--------------+

Let’s show that we can do it with Partition By

SELECT EmployeeID, Salary, AVG(Salary) OVER () AS AllAvgSalary
FROM EmployeeSalary;

Output

+------------+--------+--------------+
| EmployeeID | Salary | AllAvgSalary |
+------------+--------+--------------+
|       1001 |  45000 |   48555.5556 |
|       1002 |  36000 |   48555.5556 |
|       1003 |  63000 |   48555.5556 |
|       1004 |  47000 |   48555.5556 |
|       1005 |  50000 |   48555.5556 |
|       1006 |  65000 |   48555.5556 |
|       1007 |  41000 |   48555.5556 |
|       1008 |  48000 |   48555.5556 |
|       1009 |  42000 |   48555.5556 |
+------------+--------+--------------+

GROUP BY doesn’t work for this, why?

Select EmployeeID, Salary, AVG(Salary) as AllAvgSalary
From EmployeeSalary
Group By EmployeeID, Salary
order by EmployeeID

Output

+------------+--------+--------------+
| EmployeeID | Salary | AllAvgSalary |
+------------+--------+--------------+
|       1001 |  45000 |   45000.0000 |
|       1002 |  36000 |   36000.0000 |
|       1003 |  63000 |   63000.0000 |
|       1004 |  47000 |   47000.0000 |
|       1005 |  50000 |   50000.0000 |
|       1006 |  65000 |   65000.0000 |
|       1007 |  41000 |   41000.0000 |
|       1008 |  48000 |   48000.0000 |
|       1009 |  42000 |   42000.0000 |
+------------+--------+--------------+
  • GROUP BY won’t work like it does with using a Subquery or Partition By

Subquery in the FROM statement

SELECT EmployeeID, AllAvgSalary
FROM 
	(SELECT EmployeeID, Salary, AVG(Salary) OVER () AS AllAvgSalary
	 FROM EmployeeSalary) A
ORDER BY A.EmployeeID

Note: A is just an alias, Every derived table must have its own alias!!

  • Creating a table and querying off of it!
  • Have to write out a subquery each time

Lets look at the Inner Query first

SELECT EmployeeID, Salary, AVG(Salary) OVER () AS AllAvgSalary
FROM EmployeeSalary

Output:

+------------+--------+--------------+
| EmployeeID | Salary | AllAvgSalary |
+------------+--------+--------------+
|       1001 |  45000 |   48555.5556 |
|       1002 |  36000 |   48555.5556 |
|       1003 |  63000 |   48555.5556 |
|       1004 |  47000 |   48555.5556 |
|       1005 |  50000 |   48555.5556 |
|       1006 |  65000 |   48555.5556 |
|       1007 |  41000 |   48555.5556 |
|       1008 |  48000 |   48555.5556 |
|       1009 |  42000 |   48555.5556 |
+------------+--------+--------------+

Now lets look at the full output of:

SELECT EmployeeID, AllAvgSalary
FROM 
	(SELECT EmployeeID, Salary, AVG(Salary) OVER () AS AllAvgSalary
	 FROM EmployeeSalary) A
ORDER BY A.EmployeeID

Output:

+------------+--------------+
| EmployeeID | AllAvgSalary |
+------------+--------------+
|       1001 |   48555.5556 |
|       1002 |   48555.5556 |
|       1003 |   48555.5556 |
|       1004 |   48555.5556 |
|       1005 |   48555.5556 |
|       1006 |   48555.5556 |
|       1007 |   48555.5556 |
|       1008 |   48555.5556 |
|       1009 |   48555.5556 |
+------------+--------------+

Subquery in WHERE Statement

  • We want age, but age is in another table, so we write a subquery to return.

Lets look at the inside query output

SELECT EmployeeID, JobTitle, Salary
FROM EmployeeSalary
WHERE EmployeeID IN (
	SELECT EmployeeID 
	FROM EmployeeDemographics
	WHERE Age > 30);
SELECT EmployeeID 
	FROM EmployeeDemographics
	WHERE Age > 30;

Output:

+------------+
| EmployeeID |
+------------+
|       1004 |
|       1005 |
|       1006 |
|       1007 |
|       1008 |
|       1009 |
+------------+

Now Lets run the Full Query Above:

Output:

+------------+--------------------+--------+
| EmployeeID | JobTitle           | Salary |
+------------+--------------------+--------+
|       1004 | Accountant         |  47000 |
|       1005 | HR                 |  50000 |
|       1006 | Regional Manager   |  65000 |
|       1007 | Supplier Relations |  41000 |
|       1008 | Salesman           |  48000 |
|       1009 | Accountant         |  42000 |
+------------+--------------------+--------+

Temp Tables

Windows Function

A MySQL window function is a function that uses basic queries to manipulate row values. Window functions must have an OVER clause.

Sales Tables

+----------------+-------------+--------+
| sales_employee | fiscal_year | sale   |
+----------------+-------------+--------+
| Alice          |        2016 | 150.00 |
| Alice          |        2017 | 100.00 |
| Alice          |        2018 | 200.00 |
| Bob            |        2016 | 100.00 |
| Bob            |        2017 | 150.00 |
| Bob            |        2018 | 200.00 |
| John           |        2016 | 200.00 |
| John           |        2017 | 150.00 |
| John           |        2018 | 250.00 |
+----------------+-------------+--------+

Remember our aggregate functions which are functions that performs a calculation on multiple values and returns a single value.

Always use the over function with a Window Function

Window Function Syntax:

window_function_name(expression) OVER ( 
   [partition_defintion]
   [order_definition]
   [frame_definition]
)

Lets use our Sales table in a example:

SELECT 
    fiscal_year, 
    sales_employee,
    sale,
    SUM(sale) OVER (PARTITION BY fiscal_year) total_sales
FROM
    sales;

Output

+-------------+----------------+--------+-------------+
| fiscal_year | sales_employee | sale   | total_sales |
+-------------+----------------+--------+-------------+
|        2016 | Alice          | 150.00 |      450.00 |
|        2016 | Bob            | 100.00 |      450.00 |
|        2016 | John           | 200.00 |      450.00 |
|        2017 | Alice          | 100.00 |      400.00 |
|        2017 | Bob            | 150.00 |      400.00 |
|        2017 | John           | 150.00 |      400.00 |
|        2018 | Alice          | 200.00 |      650.00 |
|        2018 | Bob            | 200.00 |      650.00 |
|        2018 | John           | 250.00 |      650.00 |
+-------------+----------------+--------+-------------+

Another Window Function Example with row_number

Lets say we have this table below:

+------+------+
| id   | name |
+------+------+
|    1 | A    |
|    2 | B    |
|    2 | B    |
|    3 | C    |
|    3 | C    |
|    3 | C    |
|    4 | D    |
+------+------+

Lets use row_number in a window function to divide the rows into partitions by all columns. The row number will restart for each unique set of rows.

SELECT 
    id,
    name,
    ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY id) AS row_num
FROM t;

Output:

+------+------+---------+
| id   | name | row_num |
+------+------+---------+
|    1 | A    |       1 |
|    2 | B    |       1 |
|    2 | B    |       2 |
|    3 | C    |       1 |
|    3 | C    |       2 |
|    3 | C    |       3 |
|    4 | D    |       1 |
+------+------+---------+

We can see that row_num is assigned (1,2 , and 3) to each name

Another row_number Example:

Person Table:

+---------+----------+---------+------+
| Name    | Product  | Country | Year |
+---------+----------+---------+------+
| Stephen | Computer | USA     | 2015 |
| Joseph  | Laptop   | India   | 2016 |
| John    | TV       | USA     | 2016 |
| Donald  | Laptop   | England | 2015 |
| Joseph  | Mobile   | India   | 2015 |
| Peter   | Mouse    | England | 2016 |
+---------+----------+---------+------+

We can use the ROW_NUMBER() function to assign a sequence number for each record

SELECT *,   
    ROW_NUMBER() OVER(PARTITION BY Year) AS row_num  
FROM Person;  

Output

+---------+----------+---------+------+---------+
| Name    | Product  | Country | Year | row_num |
+---------+----------+---------+------+---------+
| Stephen | Computer | USA     | 2015 |       1 |
| Donald  | Laptop   | England | 2015 |       2 |
| Joseph  | Mobile   | India   | 2015 |       3 |
| Joseph  | Laptop   | India   | 2016 |       1 |
| John    | TV       | USA     | 2016 |       2 |
| Peter   | Mouse    | England | 2016 |       3 |
+---------+----------+---------+------+---------+

Rank

Rank returns the position of any row inside the partition.

Lets look at our Employee Table

+-------------+-----------------+------------+-----------+
| employee_id | full_name       | department | salary    |
+-------------+-----------------+------------+-----------+
|         100 | Mary Johns      | SALES      | 1000.0000 |
|         101 | Sean Moldy      | IT         | 1500.0000 |
|         102 | Peter Dugan     | SALES      | 2000.0000 |
|         103 | Lilian Penn     | SALES      | 1700.0000 |
|         104 | Milton Kowarsky | IT         | 1800.0000 |
|         105 | Mareen Bisset   | ACCOUNTS   | 1200.0000 |
|         106 | Airton Graue    | ACCOUNTS   | 1100.0000 |
+-------------+-----------------+------------+-----------+

Lets start using the window function RANK and rank salaries within the departments.

SELECT	
		RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_ranking,
		department,
		employee_id, 
		full_name, 
		salary
FROM Employee;

Output:

+--------------+------------+-------------+-----------------+-----------+
| dept_ranking | department | employee_id | full_name       | salary    |
+--------------+------------+-------------+-----------------+-----------+
|            1 | ACCOUNTS   |         105 | Mareen Bisset   | 1200.0000 |
|            2 | ACCOUNTS   |         106 | Airton Graue    | 1100.0000 |
|            1 | IT         |         104 | Milton Kowarsky | 1800.0000 |
|            2 | IT         |         101 | Sean Moldy      | 1500.0000 |
|            1 | SALES      |         102 | Peter Dugan     | 2000.0000 |
|            2 | SALES      |         103 | Lilian Penn     | 1700.0000 |
|            3 | SALES      |         100 | Mary Johns      | 1000.0000 |
+--------------+------------+-------------+-----------------+-----------+