5 Advanced SQL queries every Data Engineer & Scientist should know!

Over the years, I’ve worked on dozens of data engineering projects and these 5 advanced SQL queries have queries have gotten me out of few jams! Hope you find them useful

1. Using the SQL APPLY Operator

Ever find yourself trying to find the last transaction, a set of customers made? Struggled with trying to use combinations of the row_number(), top 1, along with a group and order by? There is an easier way with the APPLY operator and it’s the first of our 5 advanced SQL queries!

APPLY is a table operator used in the FROM clause and can be either a CROSS APPLY or an OUTER APPLY. The OUTER APPLY is similar to a LEFT OUTER JOIN except that the right table expression will get executed once for each row in the left table.

Going back to the original question of how to get the latest transactions for a set of customers, Using the AdventureWorks DB, we have a Customer and a SalesOrderHeader table. Below is a sample of the data present in each:

Advanced SQL query - base data

For each customer record in the first result we would like to know the 5 latest orders based on the second result set:

---Get the last 5 orders for every customer
select  c.customerid 
        ,c.[AccountNumber]
        ,o.OrderDate
        ,o.SalesOrderID
from sales.customer as c
outer apply 
/* This expression will get executed once for each row in the outer set (i.e. customer records 29516 & 29913)  */
(
 --Now we can say get the last 5 orders for the current customer in the outer set 
    select top (5) soh.orderDate, soh.SalesOrderID
    from [Sales].[SalesOrderHeader] as soh
    where soh.CustomerID = c.CustomerID
    order by soh.orderDate desc
) as o
where c.customerid in (29516,29913)
order by c.customerid

Here’s the desired output:

Advanced SQL query - apply operator results

Conceptually, you can think of this as looping through the customer table and running the inner select query for the current CustomerID.

2. Recursive CTEs

Second up on our advanced SQL queries works with recursive CTEs (Common Table Expressions). Recursive CTEs come into their own when dealing with hierarchical data such as company org structures. Recursive CTEs are CTEs that reference themselves and contain at least two CTE query definitions, an anchor member and a recursive member.

For example, lets say you have an employee org structure as the below where DimEmployee table is self referencing and you want to get a list of each manager and their direct reports.

In order to get a table with each line manager and their direct reports, we need to recursively search the table. We’ll start at the top of the org chart and select the record for the CEO (it will be the one with no line manager i.e. where ParentEmployeeKey = NULL). This will be our anchor query that we will recursively join to as we work down through the org.

Anchor result

Next we will add in the recursive query. It must be combined with the anchor query with a UNION ALL and it must also refer only once to the anchor query. There are 2 very important things to note about this.

  • Because it’s a UNION ALL, we need to adhere to the UNION ALL criteria and have the same number of columns in the anchor query and the recursive query.
  • Secondly, it means that each time the recursive query is executed, the results get appended to the anchor query and any subsequent executions of the recursive query will refer to this updated result set ( I guess that’s why they call them recursive CTEs!).

  with Manager as
  (
     --Create anchor query
     select employeekey,
         firstname as EmployeeFirstName, 
         lastname as EmployeeLastName, 
         [Title] as EmployeeTitle,  
         ParentEmployeeKey, 
         cast(Null as nvarchar(50)) as ManagerFirstName, 
         cast(Null as nvarchar(50)) as ManagerLastName, 
         cast(Null as nvarchar(50)) as ManagerTitle , 
         1 as ManagerLevel
     FROM [dbo].[DimEmployee]
     --most senior manager will not have a line manager
     where ParentEmployeeKey is null 

     --Create the recursive member that will reference the anchor query. This will get executed for every level in the hierarchy i.e. until no more rows are returned
     union all
     select e.EmployeeKey,
            e.firstname,
            e.lastname, 
            e.title, 
            e.ParentEmployeeKey,
            m.EmployeeFirstName,
            m.EmployeeLastName, 
            m.EmployeeTitle, 
            M.ManagerLevel +1  --this will give the manager level
     from [dbo].[DimEmployee] e 
join Manager M on m.EmployeeKey = e.Parentemployeekey --refer once to anchor query
)
select * from manager order by ManagerLevel

The desired result set shows the manager and their reportees.

Advanced SQL queries - recursive CTE results

3. UNPIVOT

More regularly than I would like, I get get excel files with poorly normalized financial data, that I need to load into the database and transform. An example of such is the below income statement for Apple that has the line items as each row and the year of revenue bucketed into columns. Needless to say, adding an additional column into excel for 2021 is a significantly simpler task than doing so in a DB.

Advanced SQL query - sample apple income statement

The UNPIVOT relational operator will quickly rotate the columns into column values and transform it into something that’s more normalized and a lot more workable for a Data Engineer.

With UNPIVOT, you need to specify two additional columns. The column that will contain the column values that you are rotating (2020, 2019,….) will be called “Year” and the column that will hold the values will be called “IncomeLineItemAmount”

  --create the normalised results set
  select IncomeLine, [Year],IncomelineItemAmount
  from 

	(
	--create the derived table that will be pivoted
	select IncomeLine, [2020],[2019],[2018],[2017],[2016] 
	from [dbo].[AppleIncomeStatement] 
	) as pvt
	--unpivot the above derived table pvt
	UNPIVOT 
		(
		incomelineItemAmount for [Year] in ( [2020],[2019],[2018],[2017],[2016]  )
		) as unpvt

We now have a more normalized schema that we can store the data as and work with more easily

Advanced SQL query unpivot result

4. INTERSECT and EXCEPT

The sometimes undervalued cousin to the UNION operator, the INTERSECT and EXCEPT set operators have gotten me out of a few binds over the years. If you’re not familiar with the EXCEPT or INTERCEPT operators, they follow the same syntactical rules as UNION. The below Venn diagrams illustrate the result set to expect from each operator:

  • EXCEPT returns distinct rows from the left input query that aren’t output by the right input query.
  • INTERSECT returns distinct rows that are output by both the left and right input queries operator.

EXCEPT and INTERSECT have saved me a few times when a few tables have gotten out of sync and I needed to do some exploratory work to see how bad the damage was but a simpler example of its usefulness is to see which employees are also customers:

select  [FirstName],  [LastName] from [dbo].[DimEmployee]
intersect
select  [FirstName], [LastName] from [dbo].[DimCustomer]
advanced sql query - unpivot result

It should be noted that you can in theory reproduce the same result set for EXCEPT and INTERSECT using either outer and inner joins respectively but if there is no unique key and the join would require numerous columns, it’s simpler and cleaner to use EXCEPT and INTERSECT

5. Running Totals

Finally, knowing how to find a running total at a particular point in time is a must for any Data Analyst and window functions really simplify it. Window functions, if you aren’t familiar with them are similar to the GROUP BY aggregate clause, except that instead of working on the entire result set, they operate on a set of rows and return a single aggregated value for each row. There’s more information about them in Microsoft Books Online

The syntax for window functions is:

window_function() --can be an aggregate, ranking, offset or distribution function
OVER (   

       --optional arguments
       [ <PARTITION BY clause> ]  
       [ <ORDER BY clause> ]   
       [ <ROW or RANGE clause> ]  
      )  
  • OVER specifies the window clause for the window function
  • PARTITION BY divides the query result set into partitions (i.e. creates the frame or window)
  • ORDER BY defines the order of the rows within the partition
  • ROW or RANGE limits the rows within the partition by specifying start and end points within the partition

For example, let’s say we wanted to know the total spend for each customer. We can use the sum() aggregate window function and partition the result set by customerID to achieve this.

SELECT CustomerID
      ,OrderDate
      ,TotalDue as TransactionTotal
     ,sum(TotalDue) over (partition by customerid) as CustomerTotal
  FROM [Sales].[SalesOrderHeader]

You can see from the above result set that the CustomerTotal column contains the sum of spending for each customer. The PARTITION BY clause has limited the sum() function result set to rows with a common customerID.

Now let’s add in a running total spend for a particular customer. To add in the running total, we want get the sum of transactions up until the current transaction. To do that we add an ORDER BY OrderDate clause after the PARTITION BY clause in the previous example.

SELECT CustomerID
       ,OrderDate
       ,TotalDue as TransactionTotal
       ,sum(TotalDue) over (partition by CustomerID order by orderdate, salesorderid) as CustomerRunningTotal 
       ,sum(TotalDue) over (partition by customerid) as CustomerTotal
  FROM [Sales].[SalesOrderHeader]
  where CustomerID = 29994
advanced sql query - running totals

You might think it strange that adding an ORDER BY generates a running total but it’s part of the ANSI SQL standard that if you say ORDER BY it is implicit that you are using a framing clause to only include a certain amount of rows. When you write:

 sum(TotalDue) over (partition by CustomerID order by orderdate, salesorderid)

you are implicitly writing:

sum(TotalDue) over (partition by CustomerID order by orderdate, salesorderid)
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

*It should be noted that the ORDER BY in the partition only orders the data in the partition. It’s just coincidence if the total result set is ordered the same way.

Conclusions

To conclude, I hope you found this article on advanced SQL queries useful. If you did please help us out and give it a share or check out some others at www.dataengineeringacademy.com

1 thought on “5 Advanced SQL queries every Data Engineer & Scientist should know!”

  1. This is this kind of a fantastic resource that you are providing and also you give it absent for free. I adore seeing web sites that understand the value of offering a top quality resource free of charge. It?s the old what goes around comes about routine.

Leave a Comment

Your email address will not be published. Required fields are marked *

Share via
Copy link
Powered by Social Snap