intermediate SQL
GROUP BY
What does GROUP BY do?
-
groups rows into subgroups, GROUP BY returns one row for each group.
-
reduces the number of rows in the result set
I’m using count(*) to show how many of each value of (shipped, Resolved, Cancelled, …) are in the table.
SELECT status, count(*)
FROM orders
GROUP BY status;
+------------+----------+
| status | count(*) |
+------------+----------+
| Shipped | 303 |
| Resolved | 4 |
| Cancelled | 6 |
| On Hold | 4 |
| Disputed | 3 |
| In Process | 6 |
+------------+----------+
- We can see that the GROUP BY clause returns unique occurrences of status values
Could also just do this and GROUP BY Status
SELECT status
FROM orders
GROUP BY status;
Output:
+------------+
| status |
+------------+
| Shipped |
| Resolved |
| Cancelled |
| On Hold |
| Disputed |
| In Process |
+------------+
SELECT
status,
SUM(quantityOrdered * priceEach) AS amount
FROM
orders
INNER JOIN orderdetails ON orders.orderNumber = orderdetails.orderNumber
GROUP BY
status;
Output:
+------------+------------+
| status | amount |
+------------+------------+
| Shipped | 8865094.64 |
| Resolved | 134235.88 |
| Cancelled | 238854.18 |
| On Hold | 169575.61 |
| Disputed | 61158.78 |
| In Process | 135271.52 |
+------------+------------+
ORDER BY
- Sorts rows in the result set that you want
SELECT
contactLastname,
contactFirstname
FROM
customers
ORDER BY
contactLastname;
Output:
+-----------------+------------------+
| contactLastname | contactFirstname |
+-----------------+------------------+
| Accorti | Paolo |
| Altagar,G M | Raanan |
| Andersen | Mel |
| Anton | Carmen |
| Ashworth | Rachel |
| Barajas | Miguel |
| Benitez | Violeta |
| Bennett | Helen |
| Berglund | Christina |
Difference Between GROUP BY and ORDER BY
- ORDER BY clause’s purpose is to sort the query result by specific columns
- GROUP BY clause’s purpose is summarize unique combinations of columns values.
Question: Differences in keyword WHERE and HAVING
You use WHERE before you use GROUP BY to organize data, and HAVING after you’ve used GROUP BY statement.
Case Statments
- Case statement allows you to specify a condition and then it also allows you to specify what you want returned when that condition is met.
Lets First Describe the two tables that we will work with for these examples:
EmployeeDemographics Table
+------------+-----------+------------+------+--------+
| EmployeeID | FirstName | LastName | Age | Gender |
+------------+-----------+------------+------+--------+
| 1001 | Jim | Halpert | 30 | Male |
| 1002 | Pam | Beasley | 30 | Female |
| 1003 | Dwight | Schrute | 29 | Male |
| 1004 | Angela | Martin | 31 | Female |
| 1005 | Toby | Flenderson | 32 | Male |
| 1006 | Michael | Scott | 35 | Male |
| 1007 | Meredith | Palmer | 32 | Female |
| 1008 | Stanley | Hudson | 38 | Male |
| 1009 | Kevin | Malone | 31 | Male |
| 1010 | Ryan | Howard | 26 | Male |
+------------+-----------+------------+------+--------+
EmployeeSalary Table
+------------+--------------------+--------+
| EmployeeID | JobTitle | Salary |
+------------+--------------------+--------+
| 1001 | Salesman | 45000 |
| 1002 | Receptionist | 36000 |
| 1003 | Salesman | 63000 |
| 1004 | Accountant | 47000 |
| 1005 | HR | 50000 |
| 1006 | Regional Manager | 65000 |
| 1007 | Supplier Relations | 41000 |
| 1008 | Salesman | 48000 |
| 1009 | Accountant | 42000 |
+------------+--------------------+--------+
SELECT FirstName, LastName, Age
FROM EmployeeDemographics
WHERE Age is NOT NULL
ORDER BY Age;
Output
+-----------+------------+------+
| FirstName | LastName | Age |
+-----------+------------+------+
| Ryan | Howard | 26 |
| Dwight | Schrute | 29 |
| Jim | Halpert | 30 |
| Pam | Beasley | 30 |
| Angela | Martin | 31 |
| Kevin | Malone | 31 |
| Toby | Flenderson | 32 |
| Meredith | Palmer | 32 |
| Michael | Scott | 35 |
| Stanley | Hudson | 38 |
+-----------+------------+------+
The syntax is simple,
CASE
WHEN
ELSE
END
Use END when we want to specify were down with the case statement.
First Case Statement
- Use the Keyword CASE to start and end with END keyword. Have some statments in the case statement
NOTE: The very first condition at the top will be returned, so we can show this below. Stanley will be returned the word “old” because the first statement was evaluated.
SELECT FirstName, LastName, Age,
CASE
WHEN Age > 30 THEN 'Old'
WHEN Age = 38 THEN 'Stanley'
ELSE 'Baby'
END AS CaseStatment
FROM EmployeeDemographics
WHERE Age is NOT NULL
ORDER BY Age;
Output
+-----------+------------+------+--------------+
| FirstName | LastName | Age | CaseStatment |
+-----------+------------+------+--------------+
| Ryan | Howard | 26 | Baby |
| Dwight | Schrute | 29 | Baby |
| Jim | Halpert | 30 | Baby |
| Pam | Beasley | 30 | Baby |
| Angela | Martin | 31 | Old |
| Kevin | Malone | 31 | Old |
| Toby | Flenderson | 32 | Old |
| Meredith | Palmer | 32 | Old |
| Michael | Scott | 35 | Old |
| Stanley | Hudson | 38 | Old |
+-----------+------------+------+--------------+
If we change the WHEN Clauses around like so:
SELECT FirstName, LastName, Age,
CASE
WHEN Age = 38 THEN 'Stanley'
WHEN Age > 30 THEN 'Old'
ELSE 'Baby'
END AS CaseStatment
FROM EmployeeDemographics
WHERE Age is NOT NULL
ORDER BY Age;
Output
+-----------+------------+------+--------------+
| FirstName | LastName | Age | CaseStatment |
+-----------+------------+------+--------------+
| Ryan | Howard | 26 | Baby |
| Dwight | Schrute | 29 | Baby |
| Jim | Halpert | 30 | Baby |
| Pam | Beasley | 30 | Baby |
| Angela | Martin | 31 | Old |
| Kevin | Malone | 31 | Old |
| Toby | Flenderson | 32 | Old |
| Meredith | Palmer | 32 | Old |
| Michael | Scott | 35 | Old |
| Stanley | Hudson | 38 | Stanley |
+-----------+------------+------+--------------+
Then, boom Stanley CaseStatment returns Stanley.
Another Example
SELECT FirstName, LastName, JobTitle, Salary,
CASE
WHEN JobTitle = 'Salesman' THEN Salary + (Salary * .10)
WHEN JobTitle = 'Accountant' THEN Salary + (Salary * .05)
WHEN JobTitle = 'HR' THEN Salary + (Salary * .000001)
ELSE Salary + (Salary * .03)
END AS SalaryAfterRaise
FROM EmployeeDemographics
JOIN EmployeeSalary ON EmployeeDemographics.EmployeeID = EmployeeSalary.EmployeeID;
Output
+-----------+------------+--------------------+--------+------------------+
| FirstName | LastName | JobTitle | Salary | SalaryAfterRaise |
+-----------+------------+--------------------+--------+------------------+
| Jim | Halpert | Salesman | 45000 | 49500.00 |
| Pam | Beasley | Receptionist | 36000 | 37080.00 |
| Dwight | Schrute | Salesman | 63000 | 69300.00 |
| Angela | Martin | Accountant | 47000 | 49350.00 |
| Toby | Flenderson | HR | 50000 | 50000.050000 |
| Michael | Scott | Regional Manager | 65000 | 66950.00 |
| Meredith | Palmer | Supplier Relations | 41000 | 42230.00 |
| Stanley | Hudson | Salesman | 48000 | 52800.00 |
| Kevin | Malone | Accountant | 42000 | 44100.00 |
+-----------+------------+--------------------+--------+------------------+
More Examples of Case Statements
- Switching database
Having Clause
- What it’s used for
SELECT JobTitle, COUNT(JobTitle)
FROM EmployeeDemographics
JOIN EmployeeSalary ON EmployeeDemographics.EmployeeID = EmployeeSalary.EmployeeID
GROUP BY JobTitle
HAVING COUNT(JobTitle) > 1;
Output:
+------------+-----------------+
| JobTitle | COUNT(JobTitle) |
+------------+-----------------+
| Salesman | 3 |
| Accountant | 2 |
+------------+-----------------+
SELECT JobTitle, AVG(Salary)
FROM EmployeeDemographics
JOIN EmployeeSalary ON EmployeeDemographics.EmployeeID = EmployeeSalary.EmployeeID
GROUP BY JobTitle
ORDER BY AVG(Salary)
Output:
+--------------------+-------------+
| JobTitle | AVG(Salary) |
+--------------------+-------------+
| Receptionist | 36000.0000 |
| Supplier Relations | 41000.0000 |
| Accountant | 44500.0000 |
| HR | 50000.0000 |
| Salesman | 52000.0000 |
| Regional Manager | 65000.0000 |
+--------------------+-------------+
SELECT JobTitle, AVG(Salary)
FROM EmployeeDemographics
JOIN EmployeeSalary ON EmployeeDemographics.EmployeeID = EmployeeSalary.EmployeeID
GROUP BY JobTitle
HAVING AVG(Salary) > 45000
ORDER BY AVG(Salary);
Output:
+------------------+-------------+
| JobTitle | AVG(Salary) |
+------------------+-------------+
| HR | 50000.0000 |
| Salesman | 52000.0000 |
| Regional Manager | 65000.0000 |
+------------------+-------------+
Now we know what Jobs have a average salary above $45,000!
Partition By
-
Compared to the GROUP BY Statement
-
Divides into partitions
SELECT FirstName, LastName, Gender, Salary,
COUNT(Gender) OVER (PARTITION BY Gender) AS TotalGender
FROM EmployeeDemographics
JOIN EmployeeSalary ON EmployeeDemographics.EmployeeID = EmployeeSalary.EmployeeID;
Output:
+-----------+------------+--------+--------+-------------+
| FirstName | LastName | Gender | Salary | TotalGender |
+-----------+------------+--------+--------+-------------+
| Pam | Beasley | Female | 36000 | 3 |
| Angela | Martin | Female | 47000 | 3 |
| Meredith | Palmer | Female | 41000 | 3 |
| Jim | Halpert | Male | 45000 | 6 |
| Dwight | Schrute | Male | 63000 | 6 |
| Toby | Flenderson | Male | 50000 | 6 |
| Michael | Scott | Male | 65000 | 6 |
| Stanley | Hudson | Male | 48000 | 6 |
| Kevin | Malone | Male | 42000 | 6 |
+-----------+------------+--------+--------+-------------+
We can see in our TotalGender theres 3
- Isolate Columns using Partition BY
Lets see the Difference is:
SELECT Gender, COUNT(Gender)
FROM EmployeeDemographics
GROUP BY Gender;
Output:
+--------+---------------+
| Gender | COUNT(Gender) |
+--------+---------------+
| Male | 7 |
| Female | 3 |
+--------+---------------+
Difference Between GROUP BY and PARTITION BY
A GROUP BY normally reduces the number of rows returned by rolling them up and calculating averages or sums for each row. PARTITION BY does not affect the number of rows returned, but it changes how a window function’s result is calculated.
CTE: Common Table Expression
-
Only created in memory, acts like a subquery
-
CTE is a temporary result set that exists only within the execution scope of a single SQL statement
Syntax:
WITH cte_name (column_list) AS (
query
)
SELECT * FROM cte_name;
WITH CTE_Employee as
(SELECT FirstName, LastName, Gender, Salary, COUNT(Gender) OVER (PARTITION BY Gender) AS TotalGender,
AVG(Salary) OVER (PARTITION BY Gender) AS AvgSalary
FROM EmployeeDemographics
JOIN EmployeeSalary on EmployeeDemographics.EmployeeID = EmployeeSalary.EmployeeID
WHERE Salary > '45000'
)
SELECT FirstName, AvgSalary
FROM
CTE_Employee;
Output:
+-----------+------------+
| FirstName | AvgSalary |
+-----------+------------+
| Angela | 47000.0000 |
| Dwight | 56500.0000 |
| Toby | 56500.0000 |
| Michael | 56500.0000 |
| Stanley | 56500.0000 |
+-----------+------------+
Subqueries
-
Query inside a query
-
Used to return data that will be used in the main query as a condtion to further restrict the data to be retrieved
-
Like a window function
Examples
SELECT EmployeeID, Salary, (SELECT AVG(Salary) FROM EmployeeSalary) AS AllAvgSalary
FROM EmployeeSalary;
Lets get a closer look at the subquery!
Select AVG(Salary) FROM EmployeeSalary;
Output
+-------------+
| AVG(Salary) |
+-------------+
| 48555.5556 |
+-------------+
What will happen when we run this?:
SELECT EmployeeID, Salary, (SELECT AVG(Salary) FROM EmployeeSalary) AS AllAvgSalary
FROM EmployeeSalary;
Output
+------------+--------+--------------+
| EmployeeID | Salary | AllAvgSalary |
+------------+--------+--------------+
| 1001 | 45000 | 48555.5556 |
| 1002 | 36000 | 48555.5556 |
| 1003 | 63000 | 48555.5556 |
| 1004 | 47000 | 48555.5556 |
| 1005 | 50000 | 48555.5556 |
| 1006 | 65000 | 48555.5556 |
| 1007 | 41000 | 48555.5556 |
| 1008 | 48000 | 48555.5556 |
| 1009 | 42000 | 48555.5556 |
+------------+--------+--------------+
Let’s show that we can do it with Partition By
SELECT EmployeeID, Salary, AVG(Salary) OVER () AS AllAvgSalary
FROM EmployeeSalary;
Output
+------------+--------+--------------+
| EmployeeID | Salary | AllAvgSalary |
+------------+--------+--------------+
| 1001 | 45000 | 48555.5556 |
| 1002 | 36000 | 48555.5556 |
| 1003 | 63000 | 48555.5556 |
| 1004 | 47000 | 48555.5556 |
| 1005 | 50000 | 48555.5556 |
| 1006 | 65000 | 48555.5556 |
| 1007 | 41000 | 48555.5556 |
| 1008 | 48000 | 48555.5556 |
| 1009 | 42000 | 48555.5556 |
+------------+--------+--------------+
GROUP BY doesn’t work for this, why?
Select EmployeeID, Salary, AVG(Salary) as AllAvgSalary
From EmployeeSalary
Group By EmployeeID, Salary
order by EmployeeID
Output
+------------+--------+--------------+
| EmployeeID | Salary | AllAvgSalary |
+------------+--------+--------------+
| 1001 | 45000 | 45000.0000 |
| 1002 | 36000 | 36000.0000 |
| 1003 | 63000 | 63000.0000 |
| 1004 | 47000 | 47000.0000 |
| 1005 | 50000 | 50000.0000 |
| 1006 | 65000 | 65000.0000 |
| 1007 | 41000 | 41000.0000 |
| 1008 | 48000 | 48000.0000 |
| 1009 | 42000 | 42000.0000 |
+------------+--------+--------------+
- GROUP BY won’t work like it does with using a Subquery or Partition By
Subquery in the FROM statement
SELECT EmployeeID, AllAvgSalary
FROM
(SELECT EmployeeID, Salary, AVG(Salary) OVER () AS AllAvgSalary
FROM EmployeeSalary) A
ORDER BY A.EmployeeID
Note: A is just an alias, Every derived table must have its own alias!!
- Creating a table and querying off of it!
- Have to write out a subquery each time
Lets look at the Inner Query first
SELECT EmployeeID, Salary, AVG(Salary) OVER () AS AllAvgSalary
FROM EmployeeSalary
Output:
+------------+--------+--------------+
| EmployeeID | Salary | AllAvgSalary |
+------------+--------+--------------+
| 1001 | 45000 | 48555.5556 |
| 1002 | 36000 | 48555.5556 |
| 1003 | 63000 | 48555.5556 |
| 1004 | 47000 | 48555.5556 |
| 1005 | 50000 | 48555.5556 |
| 1006 | 65000 | 48555.5556 |
| 1007 | 41000 | 48555.5556 |
| 1008 | 48000 | 48555.5556 |
| 1009 | 42000 | 48555.5556 |
+------------+--------+--------------+
Now lets look at the full output of:
SELECT EmployeeID, AllAvgSalary
FROM
(SELECT EmployeeID, Salary, AVG(Salary) OVER () AS AllAvgSalary
FROM EmployeeSalary) A
ORDER BY A.EmployeeID
Output:
+------------+--------------+
| EmployeeID | AllAvgSalary |
+------------+--------------+
| 1001 | 48555.5556 |
| 1002 | 48555.5556 |
| 1003 | 48555.5556 |
| 1004 | 48555.5556 |
| 1005 | 48555.5556 |
| 1006 | 48555.5556 |
| 1007 | 48555.5556 |
| 1008 | 48555.5556 |
| 1009 | 48555.5556 |
+------------+--------------+
Subquery in WHERE Statement
- We want age, but age is in another table, so we write a subquery to return.
Lets look at the inside query output
SELECT EmployeeID, JobTitle, Salary
FROM EmployeeSalary
WHERE EmployeeID IN (
SELECT EmployeeID
FROM EmployeeDemographics
WHERE Age > 30);
SELECT EmployeeID
FROM EmployeeDemographics
WHERE Age > 30;
Output:
+------------+
| EmployeeID |
+------------+
| 1004 |
| 1005 |
| 1006 |
| 1007 |
| 1008 |
| 1009 |
+------------+
Now Lets run the Full Query Above:
Output:
+------------+--------------------+--------+
| EmployeeID | JobTitle | Salary |
+------------+--------------------+--------+
| 1004 | Accountant | 47000 |
| 1005 | HR | 50000 |
| 1006 | Regional Manager | 65000 |
| 1007 | Supplier Relations | 41000 |
| 1008 | Salesman | 48000 |
| 1009 | Accountant | 42000 |
+------------+--------------------+--------+
Temp Tables
Windows Function
A MySQL window function is a function that uses basic queries to manipulate row values. Window functions must have an OVER clause.
Sales Tables
+----------------+-------------+--------+
| sales_employee | fiscal_year | sale |
+----------------+-------------+--------+
| Alice | 2016 | 150.00 |
| Alice | 2017 | 100.00 |
| Alice | 2018 | 200.00 |
| Bob | 2016 | 100.00 |
| Bob | 2017 | 150.00 |
| Bob | 2018 | 200.00 |
| John | 2016 | 200.00 |
| John | 2017 | 150.00 |
| John | 2018 | 250.00 |
+----------------+-------------+--------+
Remember our aggregate functions which are functions that performs a calculation on multiple values and returns a single value.
Always use the over function with a Window Function
Window Function Syntax:
window_function_name(expression) OVER (
[partition_defintion]
[order_definition]
[frame_definition]
)
Lets use our Sales table in a example:
SELECT
fiscal_year,
sales_employee,
sale,
SUM(sale) OVER (PARTITION BY fiscal_year) total_sales
FROM
sales;
Output
+-------------+----------------+--------+-------------+
| fiscal_year | sales_employee | sale | total_sales |
+-------------+----------------+--------+-------------+
| 2016 | Alice | 150.00 | 450.00 |
| 2016 | Bob | 100.00 | 450.00 |
| 2016 | John | 200.00 | 450.00 |
| 2017 | Alice | 100.00 | 400.00 |
| 2017 | Bob | 150.00 | 400.00 |
| 2017 | John | 150.00 | 400.00 |
| 2018 | Alice | 200.00 | 650.00 |
| 2018 | Bob | 200.00 | 650.00 |
| 2018 | John | 250.00 | 650.00 |
+-------------+----------------+--------+-------------+
Another Window Function Example with row_number
Lets say we have this table below:
+------+------+
| id | name |
+------+------+
| 1 | A |
| 2 | B |
| 2 | B |
| 3 | C |
| 3 | C |
| 3 | C |
| 4 | D |
+------+------+
Lets use row_number in a window function to divide the rows into partitions by all columns. The row number will restart for each unique set of rows.
SELECT
id,
name,
ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY id) AS row_num
FROM t;
Output:
+------+------+---------+
| id | name | row_num |
+------+------+---------+
| 1 | A | 1 |
| 2 | B | 1 |
| 2 | B | 2 |
| 3 | C | 1 |
| 3 | C | 2 |
| 3 | C | 3 |
| 4 | D | 1 |
+------+------+---------+
We can see that row_num is assigned (1,2 , and 3) to each name
Another row_number Example:
Person Table:
+---------+----------+---------+------+
| Name | Product | Country | Year |
+---------+----------+---------+------+
| Stephen | Computer | USA | 2015 |
| Joseph | Laptop | India | 2016 |
| John | TV | USA | 2016 |
| Donald | Laptop | England | 2015 |
| Joseph | Mobile | India | 2015 |
| Peter | Mouse | England | 2016 |
+---------+----------+---------+------+
We can use the ROW_NUMBER() function to assign a sequence number for each record
SELECT *,
ROW_NUMBER() OVER(PARTITION BY Year) AS row_num
FROM Person;
Output
+---------+----------+---------+------+---------+
| Name | Product | Country | Year | row_num |
+---------+----------+---------+------+---------+
| Stephen | Computer | USA | 2015 | 1 |
| Donald | Laptop | England | 2015 | 2 |
| Joseph | Mobile | India | 2015 | 3 |
| Joseph | Laptop | India | 2016 | 1 |
| John | TV | USA | 2016 | 2 |
| Peter | Mouse | England | 2016 | 3 |
+---------+----------+---------+------+---------+
Rank
Rank returns the position of any row inside the partition.
Lets look at our Employee Table
+-------------+-----------------+------------+-----------+
| employee_id | full_name | department | salary |
+-------------+-----------------+------------+-----------+
| 100 | Mary Johns | SALES | 1000.0000 |
| 101 | Sean Moldy | IT | 1500.0000 |
| 102 | Peter Dugan | SALES | 2000.0000 |
| 103 | Lilian Penn | SALES | 1700.0000 |
| 104 | Milton Kowarsky | IT | 1800.0000 |
| 105 | Mareen Bisset | ACCOUNTS | 1200.0000 |
| 106 | Airton Graue | ACCOUNTS | 1100.0000 |
+-------------+-----------------+------------+-----------+
Lets start using the window function RANK and rank salaries within the departments.
SELECT
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_ranking,
department,
employee_id,
full_name,
salary
FROM Employee;
Output:
+--------------+------------+-------------+-----------------+-----------+
| dept_ranking | department | employee_id | full_name | salary |
+--------------+------------+-------------+-----------------+-----------+
| 1 | ACCOUNTS | 105 | Mareen Bisset | 1200.0000 |
| 2 | ACCOUNTS | 106 | Airton Graue | 1100.0000 |
| 1 | IT | 104 | Milton Kowarsky | 1800.0000 |
| 2 | IT | 101 | Sean Moldy | 1500.0000 |
| 1 | SALES | 102 | Peter Dugan | 2000.0000 |
| 2 | SALES | 103 | Lilian Penn | 1700.0000 |
| 3 | SALES | 100 | Mary Johns | 1000.0000 |
+--------------+------------+-------------+-----------------+-----------+