allows creating segments that will aggregate independent from one another
allows to take the sum of data limited to each account rather than across the entire dataset
goes in between the WHERE and ORDER clause
you should always see any columns in the SELECT statement that are not being aggregated on, in the GROUP BY statement
The key takeaways here:
GROUP BY can be used to aggregate data within subsets of the data. For example, grouping for different accounts, different regions, or different sales representatives.
Any column in the SELECT statement that is not within an aggregator must be in the GROUP BY clause.
The GROUP BY always goes between WHERE and ORDER BY.
ORDER BY works like SORT in spreadsheet software.
GROUP BY - Expert Tip
Before we dive deeper into aggregations using GROUP BY statements, it is worth noting that SQL evaluates the aggregations before the LIMIT clause. If you don’t group by any columns, you’ll get a 1-row result—no problem there. If you group by a column with enough unique values that it exceeds the LIMIT number, the aggregates will be calculated, and then some rows will simply be omitted from the results.
This is actually a nice way to do things because you know you’re going to get the correct aggregates. If SQL cuts the table down to 100 rows, then performed the aggregations, your results would be substantially different. The above query’s results exceed 100 rows, so it’s a perfect example. In the next concept, use the SQL environment to try removing the LIMIT and running it again to see what changes.
Quiz
Which account (by name) placed the earliest order? Your solution should have the account name and the date of the order.
SELECT a.name, o.occurred_at
FROM accounts a
JOIN orders o
ON a.id = o.account_id
ORDER BY o.occurred_at DESC
LIMIT 1;
Answer Key:
SELECT a.name, o.occurred_at
FROM accounts a
JOIN orders o
ON a.id = o.account_id
ORDER BY occurred_at
LIMIT 1;
I learned here that "earliest" is the opposite of "most recent."</code></pre>
<!-- /wp:code -->
Find the total sales in usd for each account. You should include two columns - the total sales for each company's orders in usd and the company name.
SELECT a.name, SUM(o.total_amt_usd) AS TotalPerAccount
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name;
Via what channel did the most recent (latest) web_event occur, which account was associated with this web_event? Your query should return only three values - the date, channel, and account name.
SELECT a.name AS AccountName, w.channel AS Channel, w.occurred_at AS Date
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
ORDER BY w.occurred_at DESC
LIMIT 1;
Molina Healthcare</code></pre>
<!-- /wp:code -->
Find the total number of times each type of channel from the web_events was used. Your final table should have two columns - the channel and the number of times the channel was used.
SELECT w.channel AS Channel, COUNT(w.channel) AS count
FROM web_events w
GROUP BY Channel;
Answer Key:
SELECT w.channel, COUNT(*)
FROM web_events w
GROUP BY w.channel
Who was the primary contact associated with the earliest web_event?
SELECT a.primary_poc AS PrimaryContact
FROM web_events w
JOIN accounts a
ON a.id = w.account_id
ORDER BY w.occurred_at DESC
LIMIT 1;
Hilde Klopfer
I learned again here that "earliest" is the opposite of "the most recent", hence should have been ASC and not DESC.</code></pre>
<!-- /wp:code -->
What was the smallest order placed by each account in terms of total usd. Provide only two columns - the account name and the total usd. Order from smallest dollar amounts to largest.
SELECT a.name AS Name, SUM(o.total_amt_usd) AS TotalUSD
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY Name
ORDER BY TotalUSD;
Nike 390.25
This is in total per account.
If I get the minimum to get the smallest order placed by an account:
SELECT a.name AS Name, min(o.total_amt_usd) AS TotalUSD
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY Name
ORDER BY TotalUSD;
My second answer is correct. The missing value is the amount of smallest order placed of which account, and NOT the account with smallest orders in total.
Answer Key:
SELECT a.name, MIN(total_amt_usd) smallest_order
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY smallest_order;
Find the number of sales reps in each region. Your final table should have two columns - the region and the number of sales_reps. Order from the fewest reps to most reps.
SELECT r.name AS Region, COUNT(s.name) AS SalesReps
FROM region r
JOIN sales_reps s
ON r.id = s.region_id
GROUP BY Region
ORDER BY SalesReps;
region
salesreps
Midwest
9
Southeast
10
West
10
Northeast
21
Key takeaways:
You can GROUP BY multiple columns at once, as we showed here. This is often useful to aggregate across a number of different segments.
The order of columns listed in the ORDER BY clause does make a difference. You are ordering the columns from left to right.
GROUP BY - Expert Tips
The order of column names in your GROUP BY clause doesn’t matter—the results will be the same regardless. If we run the same query and reverse the order in the GROUP BY clause, you can see we get the same results.
As with ORDER BY, you can substitute numbers for column names in the GROUP BY clause. It’s generally recommended to do this only when you’re grouping many columns, or if something else is causing the text in the GROUP BY clause to be excessively long.
A reminder here that any column that is not within an aggregation must show up in your GROUP BY statement. If you forget, you will likely get an error. However, in the off chance that your query does work, you might not like the results!
SELECT account_id,
channel,
COUNT(id) as events
FROM web_events
GROUP BY account_id, channel
ORDER BY account_id DESC
account_id
channel
events
4501
direct
8
4501
organic
1
4491
organic
5
4491
direct
26
4491
banner
3
4491
adwords
12
4491
twitter
3
4491
facebook
8
4481
adwords
1
4481
organic
1
Quiz
For each account, determine the average amount of each type of paper they purchased across their orders. Your result should have four columns - one for the account name and one for the average quantity purchased for each of the paper types for each account.
SELECT a.name AS Name, AVG(o.standard_qty) AS Standard, AVG(o.gloss_qty) AS Gloss, AVG(o.poster_qty) AS Poster
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY Name;
Answer Key:
SELECT a.name, AVG(o.standard_qty) avg_stand, AVG(o.gloss_qty) avg_gloss, AVG(o.poster_qty) avg_post
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name;
name
standard
gloss
poster
Boeing
360.2666666666666667
222.3333333333333333
107.9333333333333333
Western Digital
192.7846153846153846
239.8307692307692308
170.2615384615384615
Sysco
191.0588235294117647
231.1764705882352941
173.8676470588235294
Southern
347.5000000000000000
27.7500000000000000
24.9000000000000000
Altria Group
381.0980392156862745
22.4901960784313725
25.5686274509803922
Energy Transfer Equity
222.1666666666666667
158.5000000000000000
118.3333333333333333
Aramark
120.2500000000000000
126.8750000000000000
143.0000000000000000
Cognizant Technology Solutions
253.8666666666666667
260.8666666666666667
116.5333333333333333
Time Warner
254.3333333333333333
246.8888888888888889
91.3333333333333333
For each account, determine the average amount spent per order on each paper type. Your result should have four columns - one for the account name and one for the average amount spent on each paper type.
SELECT a.name AS Name, AVG(o.standard_amt_usd) AS Standard, AVG(o.gloss_amt_usd) AS Gloss, AVG(o.poster_amt_usd) AS Poster
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY Name;
Answer Key:
SELECT a.name, AVG(o.standard_amt_usd) avg_stand, AVG(o.gloss_amt_usd) avg_gloss, AVG(o.poster_amt_usd) avg_post
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name;
name
standard
gloss
poster
Boeing
1797.7306666666666667
1665.2766666666666667
876.4186666666666667
Western Digital
961.9952307692307692
1796.3324615384615385
1382.5236923076923077
Sysco
953.3835294117647059
1731.5117647058823529
1411.8052941176470588
Southern
1734.0250000000000000
207.8475000000000000
202.1880000000000000
Altria Group
1901.6792156862745098
168.4515686274509804
207.6172549019607843
Energy Transfer Equity
1108.6116666666666667
1187.1650000000000000
960.8666666666666667
Aramark
600.0475000000000000
950.2937500000000000
1161.1600000000000000
Cognizant Technology Solutions
1266.7946666666666667
1953.8913333333333333
946.2506666666666667
Time Warner
1269.1233333333333333
1849.1977777777777778
741.6266666666666667
Marsh & McLennan
868.2600000000000000
1896.8425000000000000
1299.2000000000000000
3M
1563.8303571428571429
2095.3275000000000000
910.3100000000000000
ConAgra Foods
880.2360000000000000
1806.5880000000000000
1141.6720000000000000
National Oilwell Varco
842.5707407407407407
1778.4588888888888889
2242.6237037037037037
Cummins
1033.4290000000000000
1390.8930000000000000
1059.6600000000000000
Unum Group
1696.6000000000000000
586.7166666666666667
242.6977777777777778
Twenty-First Century Fox
476.2677777777777778
408.0662962962962963
673.6592592592592593
Determine the number of times a particular channel was used in the web_events table for each sales rep. Your final table should have three columns - the name of the sales rep, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.
SELECT s.name AS SalesRepName, w.channel AS Channel, COUNT(w.channel) AS ChannelCount
FROM sales_reps s
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN web_events w
ON a.id = w.account_id
GROUP BY SalesRepName, Channel
ORDER BY ChannelCount DESC;
Answer Key:
SELECT s.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name, w.channel
ORDER BY num_events DESC;
salesrepname
channel
channelcount
Earlie Schleusner
direct
234
Vernita Plump
direct
232
Moon Torian
direct
194
Georgianna Chisholm
direct
188
Tia Amato
direct
185
Maren Musto
direct
184
Nelle Meaux
direct
179
Maryanna Fiorentino
direct
168
Dorotha Seawell
direct
161
Determine the number of times a particular channel was used in the web_events table for each region. Your final table should have three columns - the region name, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.
SELECT r.name AS RegionName, w.channel AS Channel, COUNT(w.channel) AS ChannelCount
FROM region r
JOIN sales_reps s
ON r.id = s.region_id
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN web_events w
ON a.id = w.account_id
GROUP BY RegionName, Channel
ORDER BY ChannelCount DESC;
Answer Key:
SELECT r.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name, w.channel
ORDER BY num_events DESC;
regionname
channel
channelcount
Northeast
direct
1800
Southeast
direct
1548
West
direct
1254
Midwest
direct
696
Northeast
facebook
335
Northeast
organic
317
Northeast
adwords
300
DISTINCT is always used in SELECT statements, and it provides the unique rows for all columns written in the SELECT statement. Therefore, you only use DISTINCT once in any particular SELECT statement.
You could write:
SELECTDISTINCT column1, column2, column3
FROM table1;
which would return the unique (or DISTINCT) rows across all three columns.
You would not write:
SELECTDISTINCT column1, DISTINCT column2, DISTINCT column3
FROM table1;
You can think of DISTINCT the same way you might think of the statement "unique".
DISTINCT - Expert Tip
It’s worth noting that using DISTINCT, particularly in aggregations, can slow your queries down quite a bit.
SELECT account_id,
channel
FROM web_events
GROUPBY account_id, channel
ORDERBY account_id
Query 3:
SELECTDISTINCT account_id,
channel
FROM web_events
ORDERBY account_id
Quiz
Use DISTINCT to test if there are any accounts associated with more than one region.
SELECT DISTINCT r.name AS Region, a.name AS Accounts
FROM region r
JOIN sales_reps s
ON r.id = s.region_id
JOIN accounts a
ON s.id = a.sales_rep_id
ORDER BY Accounts;
Answer:
SELECT DISTINCT id, name
FROM accounts;
What happened here?
This is the simplest way to code this. There's no need to join different tables, as the number of rows will already return the value. Therefore, will only have to count the distinct account id and account name. </code></pre>
<!-- /wp:code -->
Answer Key:
The below two queries have the same number of resulting rows (351), so we know that every account is associated with only one region. If each account was associated with more than one region, the first query should have returned more rows than the second query.
SELECT a.id as "account id", r.id as "region id",
a.name as "account name", r.name as "region name"
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
JOIN region r
ON r.id = s.region_id;
and
SELECT DISTINCT id, name
FROM accounts;</code></pre>
<!-- /wp:code -->
Have any sales reps worked on more than one account?
SELECT DISTINCT s.name AS SalesRep, a.name AS Accounts
FROM sales_reps s
JOIN accounts a
ON s.id = a.sales_rep_id
ORDER BY SalesRep;
Answer Key:
Actually, all of the sales reps have worked on more than one account. The fewest number of accounts any sales rep works on is 3. There are 50 sales reps, and they all have more than one account. Using DISTINCT in the second query assures that all of the sales reps are accounted for in the first query.
SELECT s.id, s.name, COUNT(*) num_accounts
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.id, s.name
ORDER BY num_accounts;
and
SELECT DISTINCT id, name
FROM sales_reps;</code></pre>
<!-- /wp:code -->
SQL HAVING
HAVING is the “clean” way to filter a query that has been aggregated, but this is also commonly done using a subquery. Essentially, any time you want to perform a WHERE on an element of your query that was created by an aggregate, you need to use HAVING instead.
Query 1:
SELECT account_id,
SUM(total_amt_usd) AS sum_total_amt_usd
FROM orders
GROUPBY 1
ORDERBY 2 DESC
Query 2: Results in an Error
SELECT account_id,
SUM(total_amt_usd) AS sum_total_amt_usd
FROM orders
WHERESUM(total_amt_usd) >= 250000
GROUPBY 1
ORDERBY 2 DESC
Query 3:
SELECT account_id,
SUM(total_amt_usd) AS sum_total_amt_usd
FROM orders
GROUPBY 1
HAVINGSUM(total_amt_usd) >= 250000
WHERE subsets the returned data based on a logical condition.
WHERE appears after the FROM, JOIN, and ON clauses, but before GROUP BY.
HAVING appears after the GROUP BY clause, but before the ORDER BY clause
HAVING is like WHERE, but it works on logical statements involving aggregations.
Quiz
How many of the sales reps have more than 5 accounts that they manage?
SELECT sales_rep_id, COUNT(name)
FROM accounts
GROUP BY sales_rep_id
HAVING COUNT(name) > 5;
sales_rep_id
count
321740
7
321980
11
321950
9
321580
6
321840
11
321520
7
321680
7
I am out here...
Answer Key:
SELECT s.id, s.name, COUNT(*) num_accounts
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUPBY s.id, s.nameHAVINGCOUNT(*) > 5
ORDERBY num_accounts;
and technically, we can get this using a SUBQUERY as shown below. This same logic can be used for the other queries, but this will not be shown.
SELECTCOUNT(*) num_reps_above5
FROM(SELECT s.id, s.name, COUNT(*) num_accounts
FROM accounts a
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUPBY s.id, s.nameHAVINGCOUNT(*) > 5
ORDERBY num_accounts) AS Table1;
My answer didn't directly show the answer to the question, while this second query from the Answer Key does. What happened there?
On the second query, it seems like it first counted the number of sales_rep more then 5 in a parenthesis (subquery), then on the SELECT created one table to display the total. I have to try this to the second problem below:
How many accounts have more than 20 orders?
SELECT COUNT(*) TotalOrders
FROM (SELECT account_id, COUNT(id)
FROM orders
GROUP BY account_id
HAVING COUNT(id) > 20) AS Table1;
totalorders
120
Answer Key
SELECT a.id, a.name, COUNT(*) num_orders
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUPBY a.id, a.nameHAVINGCOUNT(*) > 20
ORDERBY num_orders;
In their answer here, they didn't use a subquery. 😞 It returned a table, but have to look at the total to see and still matched with my answer - 120.
Which account has the most orders?
SELECT account_id, COUNT(id)
FROM orders
GROUP BY account_id
ORDER BY COUNT(id) DESC
LIMIT 1;
SELECT a.name, COUNT(o.id)
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY COUNT(o.id) DESC
LIMIT 1;
name
count
Leucadia National
71
Answer Key:
SELECT a.id, a.name, COUNT(*) num_orders
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUPBY a.id, a.nameORDERBY num_orders DESCLIMIT 1;
I think my answer is still correct, except they added the account ID column.
Which accounts spent more than 30,000 usd total across all orders?
SELECT a.name, SUM(o.total_amt_usd)
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
HAVING SUM(o.total_amt_usd) > '30000'
ORDER BY a.name;
name
sum
3M
127945.10
Abbott Laboratories
96819.92
ADP
163579.18
Aetna
237781.30
Aflac
117862.77
Allstate
96779.18
Ally Financial
37653.96
Answer Key:
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.id, a.name
HAVING SUM(o.total_amt_usd) > 30000
ORDER BY total_spent;
My answer is still correct, except they sorted by total USD spent, and added Account ID column.
Which accounts spent less than 1,000 usd total across all orders?
SELECT a.name, SUM(o.total_amt_usd)
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
HAVING SUM(o.total_amt_usd) < '1000'
ORDER BY a.name;
name
sum
Delta Air Lines
859.64
Level 3 Communications
881.73
Nike
390.25
Answer Key:
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUPBY a.id, a.nameHAVINGSUM(o.total_amt_usd) < 1000
ORDERBY total_spent;
My answer is still correct, except they sorted by total usd spent and added account ID column.
Which account has spent the most with us?
SELECT a.name, SUM(o.total_amt_usd)
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY SUM(o.total_amt_usd) DESC
LIMIT 1;
name
sum
EOG Resources
382873.30
Answer Key:
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUPBY a.id, a.nameORDERBY total_spent DESCLIMIT 1;
My answer is still correct, except they added Account ID column.
Which account has spent the least with us?
SELECT a.name, SUM(o.total_amt_usd)
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY SUM(o.total_amt_usd) ASC
LIMIT 1;
name
sum
Nike
390.25
Answer Key:
SELECT a.id, a.name, SUM(o.total_amt_usd) total_spent
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUPBY a.id, a.nameORDERBY total_spent
LIMIT 1;
My answer is correct, except they added Account ID column and I forgot that I didn't have to add ASC.
Which accounts used facebook as a channel to contact customers more than 6 times?
SELECT a.name, w.channel, COUNT(w.channel)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
WHERE w.channel = 'facebook'
GROUP BY a.name, w.channel
HAVING COUNT(w.channel) > 6;
name
channel
count
Core-Mark Holding
facebook
8
Cameron International
facebook
9
ADP
facebook
9
Massachusetts Mutual Life Insurance
facebook
9
BlackRock
facebook
10
PayPal Holdings
facebook
8
eBay
facebook
7
Answer Key:
SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUPBY a.id, a.name, w.channel
HAVINGCOUNT(*) > 6 AND w.channel = 'facebook'
ORDERBY use_of_channel;
My query returned the same results, but two things I learned here:
I didn't have to use the WHERE clause, as I could have used AND clause with HAVING clause to filter facebook.
HAVING clause also works with categorical data.
Which account used facebook most as a channel?
SELECT a.name, w.channel, COUNT(w.channel)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
WHERE w.channel = 'facebook'
GROUP BY a.name, w.channel
ORDER BY COUNT(w.channel) DESC
LIMIT 1;
name
channel
count
Gilead Sciences
facebook
16
Answer Key:
SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
WHERE w.channel = 'facebook'
GROUPBY a.id, a.name, w.channel
ORDERBY use_of_channel DESCLIMIT 1;
Note: This query above only works if there are no ties for the account that used facebook the most. It is a best practice to use a larger limit number first such as 3 or 5 to see if there are ties before using LIMIT 1.
Which channel was most frequently used by most accounts?
SELECT a.name, w.channel, COUNT(w.channel)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUP BY a.name, w.channel
ORDER BY COUNT(w.channel) DESC
LIMIT 1;
name
channel
count
Leucadia National
direct
52
Answer Key:
SELECT a.id, a.name, w.channel, COUNT(*) use_of_channel
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
GROUP BY a.id, a.name, w.channel
ORDER BY use_of_channel DESC
LIMIT 10;
My answer is correct, "direct", the instructions didn't specify to display data for 10 accounts.
SQL DATE Functions
GROUPing BY a date column is not usually very useful in SQL, as these columns tend to have transaction data down to a second. Keeping date information at such granular levels is both a blessing and a curse, as it gives really precise information (a blessing), but it makes grouping information together directly difficult (a curse).
DATE_TRUNC
DATE_TRUNC allows you to truncate your date to a particular part of your date-time column. Common truncations are day, month, and year. Here is a great blog post by Mode Analytics on the power of this function.
DATE_PART
DATE_PART can be useful for pulling a specific portion of a date, but notice pulling month or day of the week (dow) means that you are no longer keeping the years in order. Rather you are grouping for certain components regardless of which year they belonged in.
For additional functions you can use with dates, check out the documentation here, but the DATE_TRUNC and DATE_PART functions definitely give you a great start!
You can reference the columns in your select statement in GROUP BY and ORDER BY clauses with numbers that follow the order they appear in the select statement. For example
SELECT standard_qty, COUNT(*)
FROM orders
GROUP BY 1 (this 1 refers to standard_qty since it is the first of the columns included in the select statement)
ORDER BY 1 (this 1 refers to standard_qty since it is the first of the columns included in the select statement)
SELECT DATE_PART('dow',occurred_at) AS day_of_week,
account_id,
occurred_at,
total
FROM orders
SELECT DATE_PART('dow',occurred_at) AS day_of_week,
SUM(total) AS total_qty
FROM orders
GROUPBY 1
ORDERBY 2
dow - day of the week
Quiz
Find the sales in terms of total dollars for all orders in each year, ordered from greatest to least. Do you notice any trends in the yearly sales totals?
SELECT DATE_PART('year', occurred_at) AS Year, SUM(total_amt_usd)
FROM orders
GROUP BY DATE_PART('year', occurred_at)
ORDER BY SUM(total_amt_usd) DESC;
year
sum
2016
12864917.92
2015
5752004.94
2014
4069106.54
2013
377331.00
2017
78151.43
There's a continued yearly growth from 2013-2016, but it suddenly fell down to it's lowest in 2017.
When we look at the yearly totals, you might notice that 2013 and 2017 have much smaller totals than all other years. If we look further at the monthly data, we see that for 2013 and 2017 there is only one month of sales for each of these years (12 for 2013 and 1 for 2017). Therefore, neither of these is evenly represented. Sales have been increasing year over year, with 2016 being the largest sales to date. At this rate, we might expect 2017 to have the largest sales.
Which month did Parch & Posey have the greatest sales in terms of total dollars? Are all months evenly represented by the dataset?
SELECT DATE_TRUNC('month', occurred_at) AS Month, SUM(total_amt_usd)
FROM orders
GROUP BY DATE_TRUNC('month', occurred_at)
ORDER BY SUM(total_amt_usd) DESC;
*Month specific to a year.
Answer is December of 2016.</code></pre>
<!-- /wp:code -->
month
sum
2016-12-01T00:00:00.000Z
1770282.62
SELECT DATE_PART('month', occurred_at) AS Month, SUM(total_amt_usd)
FROM orders
GROUP BY DATE_PART('month', occurred_at)
ORDER BY SUM(total_amt_usd) DESC;
*Month regardless of the year. </code></pre>
<!-- /wp:code -->
month
sum
12
3129411.98
Answer Key:
SELECT DATE_PART('month', occurred_at) ord_month, SUM(total_amt_usd) total_spent
FROM orders
WHERE occurred_at BETWEEN '2014-01-01' AND '2017-01-01'
GROUPBY 1
ORDERBY 2 DESC;
They had to remove the data from 2013 and 2017 to be fair, as there were only one order for each of those years. The data for the years were not evenly represented.
Which year did Parch & Posey have the greatest sales in terms of the total number of orders? Are all years evenly represented by the dataset?
SELECT DATE_PART('year', occurred_at) AS Year, COUNT(id) AS TotalOrders
FROM orders
GROUP BY DATE_PART('year', occurred_at)
ORDER BY COUNT(id) DESC;
Which month did Parch & Posey have the greatest sales in terms of the total number of orders? Are all months evenly represented by the dataset?
SELECT DATE_PART('month', occurred_at) AS Month, COUNT(id) AS TotalOrders
FROM orders
GROUP BY DATE_PART('month', occurred_at)
ORDER BY COUNT(id) DESC;
*Month regardless of Year</code></pre>
<!-- /wp:code -->
month
totalorders
12
882
SELECT DATE_TRUNC('month', occurred_at) AS Month, COUNT(id) AS TotalOrders
FROM orders
GROUP BY DATE_TRUNC('month', occurred_at)
ORDER BY COUNT(id) DESC;
*Month specific to the year, December of 2016. </code></pre>
<!-- /wp:code -->
month
totalorders
2016-12-01T00:00:00.000Z
463
In which month of which year did Walmart spend the most on gloss paper in terms of dollars?
SELECT DATE_TRUNC('month', occurred_at) AS Month, SUM(gloss_amt_usd) AS Total_Gloss_in_USD
FROM orders
GROUP BY DATE_TRUNC('month', occurred_at)
ORDER BY SUM(gloss_amt_usd) DESC;
*They specified that they need the month of which year.</code></pre>
<!-- /wp:code -->
month
total_gloss_in_usd
2016-12-01T00:00:00.000Z
506825.83
Answer Key:
SELECT DATE_TRUNC('month', o.occurred_at) ord_date, SUM(o.gloss_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
WHERE a.name = 'Walmart'
GROUPBY 1
ORDERBY 2 DESCLIMIT 1;
I am completely wrong here as I missed the condition - Walmart.
CASE Statements
Derive
take data from existing columns and modify them
CASE
handles "IF" "THEN" logic
must end with the word "END"
ELSE
captures values not specified in "WHEN" and "THEN" statements
SELECT account_id,
occurred_at,
total,
CASE WHEN total > 500 THEN 'Over 500'
WHEN total > 300 AND total <= 500 THEN '301 - 500'
WHEN total > 100 AND total <=300 THEN '101 - 300'
ELSE '100 or under' END AS total_group
FROM orders
THEN adds the value to enter when conditions are met on the column added via END AS.
account_id
occurred_at
total
total_group
1001
2015-10-06T17:31:14.000Z
169
101 - 300
1001
2015-11-05T03:34:33.000Z
288
101 - 300
1001
2015-12-04T04:21:55.000Z
132
101 - 300
1001
2016-01-02T01:18:24.000Z
176
101 - 300
1001
2016-02-01T19:27:27.000Z
165
101 - 300
1001
2016-03-02T15:29:32.000Z
173
101 - 300
CASE - Expert Tip
The CASE statement always goes in the SELECT clause.
CASE must include the following components: WHEN, THEN, and END. ELSE is an optional component to catch cases that didn’t meet any of the other previous CASE conditions.
You can make any conditional statement using any conditional operator (like WHERE) between WHEN and THEN. This includes stringing together multiple conditional statements using AND and OR.
You can include multiple WHEN statements, as well as an ELSE statement again, to deal with any unaddressed conditions.
Example
In a quiz question in the previous Basic SQL lesson, you saw this question:
Create a column that divides the standard_amt_usd by the standard_qty to find the unit price for standard paper for each order. Limit the results to the first 10 orders, and include the id and account_id fields. NOTE - you will be thrown an error with the correct solution to this question. This is for a division by zero. You will learn how to get a solution without an error to this query when you learn about CASE statements in a later section.
Let's see how we can use the CASE statement to get around this error.
SELECTid, account_id, standard_amt_usd/standard_qty AS unit_price
FROM orders
LIMIT 10;
Now, let's use a CASE statement. This way any time the standard_qty is zero, we will return 0, and otherwise, we will return the unit_price.
SELECT account_id, CASEWHEN standard_qty = 0 OR standard_qty IS NULL THEN 0
ELSE standard_amt_usd/standard_qty ENDAS unit_price
FROM orders
LIMIT 10;
Now the first part of the statement will catch any of those divisions by zero values that were causing the error, and the other components will compute the division as necessary. You will notice, we essentially charge all of our accounts 4.99 for standard paper. It makes sense this doesn't fluctuate, and it is more accurate than adding 1 in the denominator like our quick fix might have been in the earlier lesson.
You can try it yourself using the environment below.
Query 1:
SELECTid,
account_id,
occurred_at,
channel,
CASEWHEN channel = 'facebook' THEN 'yes' ENDAS is_facebook
FROM web_events
ORDERBY occurred_at
Query 2:
SELECTid,
account_id,
occurred_at,
channel,
CASEWHEN channel = 'facebook' THEN 'yes' ELSE 'no' ENDAS is_facebook
FROM web_events
ORDERBY occurred_at
Query 3:
SELECTid,
account_id,
occurred_at,
channel,
CASEWHEN channel = 'facebook' OR channel = 'direct' THEN 'yes'
ELSE 'no' ENDAS is_facebook
FROM web_events
ORDERBY occurred_at
Query 4:
SELECT account_id,
occurred_at,
total,
CASEWHEN total > 500 THEN 'Over 500'
WHEN total > 300 THEN '301 - 500'
WHEN total > 100 THEN '101 - 300'
ELSE '100 or under' ENDAS total_group
FROM orders
CASE & Aggregations
SELECT CASE WHEN total > 500 THEN 'Over 500'
ELSE '500 or under' END AS total_group,
COUNT(*) AS order_count
FROM orders
GROUP BY 1
total_group
order_count
500 or under
3716
Over 500
3196
This is the same as:
SELECT CASE WHEN total > 500 THEN 'Over 500'
ELSE '500 or under' END AS total_group,
COUNT(*) AS order_count
FROM orders
GROUP BY CASE WHEN total > 500 THEN 'Over 500'
ELSE '500 or under' END;
Quiz
Write a query to display for each order, the account ID, the total amount of the order, and the level of the order - ‘Large’ or ’Small’ - depending on if the order is $3000 or more, or smaller than $3000.
SELECT account_id, SUM(total_amt_usd), CASE WHEN SUM(total_amt_usd) >= 3000 THEN 'Large' ELSE 'Small' END AS Level
FROM orders
GROUP BY account_id
ORDER BY SUM(total_amt_usd) DESC;
account_id
sum
level
4211
382873.30
Large
4151
345618.59
Large
1301
326819.48
Large
1871
300694.79
Large
4111
293861.14
Large
3411
291047.25
Large
I think this is the correct answer:
SELECT account_id, total_amt_usd, CASE WHEN total_amt_usd >= 3000 THEN 'Large' ELSE 'Small' END AS Level
FROM orders
GROUP BY CASE WHEN total_amt_usd >= 3000 THEN 'Large' ELSE 'Small' END, total_amt_usd, account_id
ORDER BY total_amt_usd DESC
*It asks for for each order and not for orders in total per account.
Answer Key:
SELECT account_id, total_amt_usd,
CASEWHEN total_amt_usd > 3000 THEN 'Large'
ELSE 'Small' ENDAS order_level
FROM orders;
Write a query to display the number of orders in each of three categories, based on the total number of items in each order. The three categories are: 'At Least 2000', 'Between 1000 and 2000' and 'Less than 1000'.
SELECT CASE WHEN total >= 2000 THEN 'At Least 2000'
WHEN total BETWEEN 1000 AND 2000 THEN 'Between 1000 and 2000'
WHEN total < 1000 THEN 'Less than 1000' END AS Category,
COUNT(*) Num_Orders
FROM orders
GROUP BY CASE WHEN total >= 2000 THEN 'At Least 2000'
WHEN total BETWEEN 1000 AND 2000 THEN 'Between 1000 and 2000'
WHEN total < 1000 THEN 'Less than 1000' END;
category
num_orders
At Least 2000
70
Between 1000 and 2000
511
Less than 1000
6331
Answer Key:
SELECTCASEWHEN total >= 2000 THEN 'At Least 2000'
WHEN total >= 1000 AND total < 2000 THEN 'Between 1000 and 2000'
ELSE 'Less than 1000' ENDAS order_category,
COUNT(*) AS order_count
FROM orders
GROUPBY 1;
This returns the same value.
We would like to understand 3 different levels of customers based on the amount associated with their purchases. The top-level includes anyone with a Lifetime Value (total sales of all orders) greater than 200,000 usd. The second level is between 200,000 and 100,000 usd. The lowest level is anyone under 100,000 usd. Provide a table that includes the level associated with each account. You should provide the account name, the total sales of all orders for the customer, and the level. Order with the top spending customers listed first.
SELECT a.name AccountName, SUM(o.total_amt_usd),
CASE WHEN SUM(o.total_amt_usd) > 200000 THEN 'Top Level' WHEN SUM(o.total_amt_usd) <= 200000 AND SUM(o.total_amt_usd) >= 100000 THEN 'Second Level' WHEN SUM(o.total_amt_usd) < 100000 THEN 'Lowest Level' END AS Level
FROM accounts a
JOIN orders o
ON a.id = o.account_id
GROUP BY a.name
ORDER BY SUM(o.total_amt_usd) DESC;
accountname
sum
level
EOG Resources
382873.30
Top Level
Mosaic
345618.59
Top Level
IBM
326819.48
Top Level
General Dynamics
300694.79
Top Level
Republic Services
293861.14
Top Level
Leucadia National
291047.25
Top Level
Arrow Electronics
281018.36
Top Level
Answer Key
SELECT a.name, SUM(total_amt_usd) total_spent,
CASEWHENSUM(total_amt_usd) > 200000 THEN 'top'
WHENSUM(total_amt_usd) > 100000 THEN 'middle'
ELSE 'low' ENDAS customer_level
FROM orders o
JOIN accounts a
ON o.account_id = a.idGROUPBY a.nameORDERBY 2 DESC;
The same results.
We would now like to perform a similar calculation to the first, but we want to obtain the total amount spent by customers only in 2016 and 2017. Keep the same levels as in the previous question. Order with the top spending customers listed first.
SELECT DATE_PART('year', o.occurred_at) AS Year, a.name AccountName, SUM(o.total_amt_usd),
CASE WHEN SUM(o.total_amt_usd) > 200000 THEN 'Top Level' WHEN SUM(o.total_amt_usd) <= 200000 AND SUM(o.total_amt_usd) >= 100000 THEN 'Second Level' WHEN SUM(o.total_amt_usd) < 100000 THEN 'Lowest Level' END AS Level
FROM accounts a
JOIN orders o
ON a.id = o.account_id
WHERE DATE_PART('year', o.occurred_at) = '2016' OR DATE_PART('year', o.occurred_at) = '2017'
GROUP BY a.name, DATE_PART('year', o.occurred_at)
ORDER BY SUM(o.total_amt_usd) DESC;
year
accountname
sum
level
2016
Pacific Life
255319.18
Top Level
2016
Mosaic
172180.04
Second Level
2016
CHS
163471.78
Second Level
2016
Core-Mark Holding
148105.93
Second Level
2016
Disney
129157.38
Second Level
2016
National Oilwell Varco
121873.16
Second Level
2016
Sears Holdings
114003.21
Second Level
Answer Key
SELECT a.name, SUM(total_amt_usd) total_spent,
CASEWHENSUM(total_amt_usd) > 200000 THEN 'top'
WHENSUM(total_amt_usd) > 100000 THEN 'middle'
ELSE 'low' ENDAS customer_level
FROM orders o
JOIN accounts a
ON o.account_id = a.idWHERE occurred_at > '2015-12-31'
GROUPBY 1
ORDERBY 2 DESC;
This returns same results, but simpler queries. I realize I would need to work on strategy on manipulating data. Here, instead of giving a condition with DATE_PART, it simply uses > DATE since 2016 and 2017 is beyond 2015-12-31 and data available is only till 2017.
We would like to identify top-performing sales reps, which are sales reps associated with more than 200 orders. Create a table with the sales rep name, the total number of orders, and a column with top or not depending on if they have more than 200 orders. Place the top salespeople first in your final table.
SELECT s.name SalesRepName, COUNT(o.id) TotalOrder, CASE WHEN COUNT(o.id) > 200 THEN 'Top' WHEN COUNT(o.id) <= 200 THEN 'Not' END AS TopPerformer
FROM sales_reps s
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN orders o
ON a.id = o.account_id
GROUP BY s.name
ORDER BY COUNT(o.id) DESC;
salesrepname
totalorder
topperformer
Earlie Schleusner
335
Top
Vernita Plump
299
Top
Tia Amato
267
Top
Georgianna Chisholm
256
Top
Moon Torian
250
Top
Nelle Meaux
241
Top
Maren Musto
224
Top
Answer Key
SELECT s.name, COUNT(*) num_ords,
CASEWHENCOUNT(*) > 200 THEN 'top'
ELSE 'not' ENDAS sales_rep_level
FROM orders o
JOIN accounts a
ON o.account_id = a.idJOIN sales_reps s
ON s.id = a.sales_rep_id
GROUPBY s.nameORDERBY 2 DESC;
This also returns the same results.
The previous didn't account for the middle, nor the dollar amount associated with the sales. Management decides they want to see these characteristics represented as well. We would like to identify top-performing sales reps, which are sales reps associated with more than 200 orders or more than 750000 in total sales. The middle group has any rep with more than 150 orders or 500000 in sales. Create a table with the sales rep name, the total number of orders, total sales across all orders, and a column with top, middle, or low depending on these criteria. Place the top salespeople based on the dollar amount of sales first in your final table. You might see a few upset salespeople by this criteria!
SELECT s.name SalesRepName, COUNT(o.id) TotalOrder, SUM(o.total_amt_usd) TotalOrderUSD, CASE WHEN COUNT(o.id) > 200 OR SUM(o.total_amt_usd) > 750000 THEN 'Top' WHEN COUNT(o.id) = 200 OR COUNT(o.id) > 150 OR SUM(o.total_amt_usd) = 750000 OR SUM(o.total_amt_usd) > 500000 THEN 'Middle' ELSE 'Low' END AS TopPerformer
FROM sales_reps s
JOIN accounts a
ON s.id = a.sales_rep_id
JOIN orders o
ON a.id = o.account_id
GROUP BY s.name
ORDER BY SUM(o.total_amt_usd) DESC;
salesrepname
totalorder
totalorderusd
topperformer
Earlie Schleusner
335
1098137.72
Top
Tia Amato
267
1010690.60
Top
Vernita Plump
299
934212.93
Top
Georgianna Chisholm
256
886244.12
Top
Arica Stoltzfus
186
810353.34
Top
Dorotha Seawell
208
766935.04
Top
Nelle Meaux
241
749076.16
Top
Sibyl Lauria
193
722084.27
Middle
Maren Musto
224
702697.29
Top
Brandie Riva
167
675917.64
Middle
Charles Bidwell
205
675637.19
Top
Elwood Shutt
191
662500.24
Middle
Maryanna Fiorentino
204
655954.74
Top
RECAP
Each of the sections has been labeled to assist if you need to revisit a particular topic. Intentionally, the solutions for a particular section are actually not in the labeled section, because my hope is this will force you to practice if you have a question about a particular topic we covered.
You have now gained a ton of useful skills associated with SQL. The combination of JOINs and Aggregations is one of the reasons SQL is such a powerful tool.
If there was a particular topic you struggled with, I suggest coming back and revisiting the questions with a fresh mind. The more you practice the better, but you also don't want to get stuck on the same problem for an extended period of time!
In this lesson, we covered and you can now:
Deal with NULL values
Create aggregations in your SQL Queries including
COUNT
SUM
MIN & MAX
AVG
GROUP BY
DISTINCT
HAVING
Create DATE functions
Implement CASE statements
Up next are Subqueries and Temporary Tables. See you there!
KeyTerm
Definition
DISTINCT
Always used in SELECT statements, and it provides the unique rows for all columns written in the SELECT statement.
GROUP BY
Used to aggregate data within subsets of the data. For example, grouping for different accounts, different regions, or different sales representatives.
HAVING
is the “clean” way to filter a query that has been aggregated
NULLs
A datatype that specifies where no data exists in SQL
0 comments