Fun Uses for Windowing Functions

Standard

I’ve recently come to love (read:be obsessed with) windowing functions in my coding. They’re just so useful and practical.

For those who haven’t experienced the joys of windowing, here’s the deal. They allow you to do calculations across multiple rows without actually having to group, thereby storing aggregate info on each record. That means you keep all the data associated with the row and can add calculated fields that rely on interaction with other rows. Pretty swiffy, huh?

Below are just a few funky functions that I’ve found helpful. I’m not saying that these aren’t resource intensive, but they may just save you from having to join to some crazy aggregation sub-queries and then export to Excel for further manipulation to get the same result.

Number an Account’s Transactions by Recency

This has a mess of useful applications. For one, you can quickly identify not only which specific date the customer had their first transaction (not the only way to do this… we’ll get to that in a minute), but also which one it was and all of its associated properties. For two, you can calculate intra-account indices for things like purchase size (i.e. their second purchase was 50% smaller than their own average…). There are more, but you probably get the point by now.

row_number() OVER (PARTITION BY account_id ORDER BY txn_date DESC, txn_number DESC)

Here’s the Step-by-Step:

  1. Use the ROW_NUMBER() function to generate a serial sequence starting with 1
  2. OVER starts the windowing statement – the part that identifies how it will calculate the row number
  3. Specify what to PARTITION BY based upon where you want the function to start counting from 1. In this case, we want to get row numbers for each account_id so that a given account_id will have values 1 – n where n is their number of transactions, presumably
  4. If the order of counting the rows matters, specify that you will ORDER BY one or more fields. The example uses txn_date DESC AND txn_number DESC to have the function start at one for the last transaction from the customer based on date and transaction number. This could be helpful if you have a date with no timestamp and an increasing transaction number each day, for instance.

Caveats:

  • Technically, the code above is just one step in the process. You would use this in a subquery and then select where whatever you call the row_number calculation equals 1 to get the first transaction.
  • PostgreSQL sorts the data for output based on one of the PARTITION BY / ORDER BY sets. Specify your order by for output if you want to see it displayed a different way.

Finding Fiscal Week from Year, Month and Week of Month

This came up at a client site when we realized that the date dimension in a data warehouse was missing the fiscal week of the year. This client operates on a non-standard calendar that makes finding the week number a bit tricky. What we did have was a fiscal year, fiscal month and week of fiscal month to go by. Thanks to row_number (this time in Netezza rather than PostgreSQL), we were able to quickly generate a fiscal week number to use in analysis.

row_number() over (partition by fiscal_year order by fiscal_month, fiscal_week_of_month)

Here’s the Step-by-Step:

  1. Use the ROW_NUMBER() function to generate a serial sequence starting with 1
  2. OVER starts the windowing statement – the part that identifies how it will calculate the row number
  3. Specify what to PARTITION BY based upon where you want the function to start counting from 1. In this case, we want to restart the counting at the beginning of each year
  4. Specify that you will ORDER BY fiscal_month and fiscal_week_of_month to count up appropriately

Caveats:

  • What you don’t see is a step in which we selected the distinct fiscal_year, fiscal_month and fiscal_week_of_month to avoid getting a new fiscal week of year for each date. The need for this would vary based on the structure of the dimension.

More Windowing Functions

Thus far, we’ve only talked about the row number function. This is but one of many windowing functions available. Here’s the skinny on some more:

  1. Rank: The rank function does pretty much what it sounds like – it ranks records from 1 to n. But there is a key difference between RANK and ROW_NUMBER – namely that RANK allows for ties whereas ROW_NUMBER is a simple sequence. For instance, if you ranked your sales people by total sales and had three people each sold $25,000 they would have the same rank. This eliminates the need to order by another element to break ties.
  2. Sum: While you may not have used it this way before, the lowly SUM function can also be windowed. This allows you to calculate the total within a window.
  3. Avg: Much like SUM, AVG can also be used to calculate the mean within a window.
  4. Ntile: The NTILE function calculates a percentile and can be used to assign records to the percentile group. See my post on Calculating Percentiles in PostgreSQL for more info on this bad boy.
  5. Lag: This allows you to pull the value from another record within the window based on an offset from the current record. There’s a snazzy post, When Lag is Awesome, where you can get more details.

This article is subject to updates over time as I get more opportunity to work on it or people hit me up with questions.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.