Practitioners of SQL and other data manipulation languages encounter predictable issues when they’re asked to create a report that needs a unique key where one may not exist in the data. This post gives detailed steps of how to solve this problem in a real-world context with SQL window functions.
A Salesforce sales funnel question that wasn’t so simple to answer
Opening up with SQL window functions
Since I knew that Salesforce had a data object table called “Opportunity History,” I approached this problem using SQL window functions. “Opportunity History” consists of rows of data that each represent a change on a particular deal (Opportunity). An example of this dataset can be seen below.
The question seemed simple enough to answer…until I got into the data transformation.
Data transformation hurdles
First, I filtered the history data by joining it to the current table of Opportunities and filtering to just opportunity records where Stage = Legal. This got me only the history records for Opportunities currently in the Legal Stage.
Then, I calculated a column that referenced the next opportunity history record’s created date to get the “To Date” for the current record. I partitioned the data for this calculation by Opportunity ID to ensure that it was only applying to the appropriate Opportunities. If the history record was the most up-to-date record for the opportunity, then the “To Date” was calculated as the current day. I also formatted the Created Date of the record to a new column called “From Date”.
This gave me the time period that each record was the most up-to-date record for each Opportunity. All that was left was to count the number of days that those records were valid in an additional column by calculating the number of days between the “From Date” and “To Date” and to sum up the total number of days for each Opportunity to get my total time in Legal by Opportunity. Easy right?
Once I completed this first pass, I did what any good practitioner would and checked the data on a few opportunities to ensure I was getting the correct numbers. After I spot-checked two or three, I came upon an Opportunity that raised a question for my client. Most deals looked to end in the “Legal” stage, but sometimes they fell apart and reverted to a previous stage before going back into Legal.
The client confirmed that that could happen and asked that I only count the number of days that an Opportunity was in Legal Status for the most recent Legal stage change.
Identifying the "gaps and islands" problem
What I quickly realized was that I would have to partition the dataset to get the legal records I needed. The challenge that I needed to solve was how to partition on the Stage names when they were not separated by unique values.
I searched the internet for “partitioning on non-unique values” and found several articles explaining a concept called the “gaps and islands” problem.
The concept involves using the dense_rank function on multiple columns in a dataset, then taking the difference of those ranks to get to the desired partition. This circumvents the problem of trying to partition data based on non-unique columns by artificially creating the partition through ranking functions.
Solving the gaps and islands problem using SQL window functions
First, I needed to create two columns. The first column combined the Opportunity ID—that is, the unique identifier that represented every Opportunity—and concatenated that ID with the Stage Name on each history record. This gave me the partition needed to separate each Opportunity History record by Stage and was called “Opportunity Stage ID”.
The second column I needed to create also used the Opportunity ID and the “Created Date” of the history record, which was formatted as a date timestamp and could not have identical values for the same Opportunity. This gave me a unique field that allowed me to sort every history record by date in descending order and was called “Opportunity Created Date Rank”.
The resulting two columns are shown below.
The next step was to calculate two multi-row functions using dense_rank functions to create two new columns in the dataset, each using the new fields I created in the previous step. Dense rank returns the rank of each record in the partition based on order. Ranks are consecutive. They don’t repeat when the values match.
The resulting two new columns appeared in the table below. “Opportunity Stage Rank” used the Opportunity Stage ID as its ranking value and “Opportunity Row Rank” used Opportunity Created Date Rank as its ranking value.
Finally, the last step was to subtract the two dense rank columns to get the correct partition rank values. This allowed me to only include the Opportunity History record values that had the most recent Legal Stage values. I called this column “Opportunity Sequence Rank.”
All that was left to do after that was to filter the dataset so that only records with the new ranking value of 0 remained, and then to sum up the ‘Time In Stage” column calculation from my original work. The reason I filtered to the Sequence rank value of 0 was because when you compare the two dense rank columns, the numbers should match for all records that are in the most recent Stage. In our case, this is always the most recent Legal stage, which is what we are trying to solve for.
Solving the Gaps and Islands Problem in Salesforce CRM Analytics
The examples above show my solution to this problem using SQL (Specifically in Google BigQuery) however, the actual solution was completed in Salesforce’s Business Intelligence platform called Salesforce CRM Analytics.
In order to show the capabilities of this tool, I have included images from the tool’s ETL user interface (internally referred to as a ‘Recipe’) below to show my method for applying dense ranks to solve the gaps and islands problem inside of CRM Analytics.
The first step taken was to Calculate the Time_In_Stage value for all opportunities in the Opportunity History table. This table as previously mentioned had a row for each Opportunity record change, including Stage changes.
The next step was to filter the current Opportunity table, a dataset with a row for each Opportunity with that Opportunities present values, to only contain current Opportunities in the Legal stage.
Now that we had these two tables, it was time to join them together on Opportunity ID, performing a left join with the Opportunity dataset on the left and the Opportunity History Dataset on the right.
The final transformations were all done in the same ETL node and consisted of creating the Stage and Created ID columns, calculating the dense rank values, and subtracting them to get the sequence rank.
The image above does not show how the dense ranks are calculated, so below are the details of how those calculations are set up starting with the Opportunity_Row_Rank and then the Opportunity_Stage_Rank.
The filtering to the Opportunity_Sequence_Rank of ‘0’ and summing up Time_In_Stage was done inside the CRM Analytics dashboard that was created for the client.
The Gaps and Islands problem was a tough nut to crack. Only after testing out ten wrong ways to solve it using multiple ranking methods and trying to filter by specific stages and dates (each time getting close to the answer but falling short) did I ultimately give up and start looking for a solution via the World Wide Web. After finding several articles that nudged me in the right direction, I was finally able to come up with this solution that got the client what they needed to answer their business question.
Although it was difficult, I ended up standing at my desk after everyone else had gone home for the day, shouting, “I did it! I’m the greatest in the world!” (Yes, I really did that, and no, not everyone had actually gone home for the day.) Solving the problem was satisfying!
These kinds of moments make the daily grind of my job worth it, and they fuel my passion for using data to solve complex problems to drive value for the businesses that I get to work with at Search Discovery. At the end of the day, I knew I now had another tool at my disposal to solve similar use cases, and that made staying late to get this done more than worth it.
When I presented this solution to the client, they were extremely happy with the results. The solution that I provided was something that they had never been able to automate in the past, and it saves them an enormous amount of time in report creation and re-creation. Plus, the reports now inform their management team of all the deals that they can expect to see closing once they’ve passed their legal stage, which allows stakeholders to budget and plan more accurately to make better business decisions.
Read some of Sean’s other helpful posts: