Solving the Gaps and Islands Problem in Salesforce with SQL Window Functions

Sr. Business Intelligence Associate Sean McClain solves a client's "gaps and islands" problem in Salesforce using SQL window functions. Learn how.

Practitioners of SQL and other data manipulation languages encounter predictable issues when they’re asked to create a report that needs a unique key where one may not exist in the data. This post gives detailed steps of how to solve this problem in a real-world context with SQL window functions.

Every once in a while, a data practitioner comes across a problem that leaves them stumped. They search for answers down endless  internet rabbit holes, bang their heads against a wall hoping to glean some enlightenment, or endlessly ponder the problem with co-workers until they find a solution. At least, this is how I handle these situations.

A Salesforce sales funnel question that wasn’t so simple to answer

I came across just this problem recently while helping a client answer a seemingly simple question using data. The client was using Salesforce to track the progression of their deals  for various businesses as they passed through stages in the sales funnel.  The question they wanted to know was “How many deals are currently in Legal, and how many days has each deal been in the Legal stage?”

Opening up with SQL window functions

Since I knew that Salesforce had a data object table called “Opportunity History,” I approached this problem using SQL window functions. “Opportunity History” consists of rows of data that each represent a change on a particular deal (Opportunity). An example of this dataset can be seen below.

image18

image2
Each Opportunity History ID identifies a unique row, this example only shows one, but there are multiple opportunities in this table identified by their Opportunity ID.

The question  seemed simple enough to answer…until I got into the data transformation.

Data transformation hurdles

First, I filtered the history data by joining it to the current table of Opportunities and filtering to just opportunity records where Stage = Legal. This got me only the history records for Opportunities currently in the Legal Stage.

Then, I calculated a column that referenced the next opportunity history record’s created date to get the “To Date” for the current record. I partitioned the data for this calculation by Opportunity ID to ensure that it was only applying to the appropriate Opportunities. If the history record was the most up-to-date record for the opportunity, then the “To Date” was calculated as the current day. I also formatted the Created Date of the record to a new column called “From Date”.

image17image9

This gave me the time period that each record was the most up-to-date record for each Opportunity. All that was left was to count the number of days that those records were valid in an additional column by calculating the number of days between the “From Date” and “To Date” and to sum up the total number of days for each Opportunity to get my total time in Legal by Opportunity. Easy right?

image7

image1

Once I completed this first pass, I did what any good practitioner would and checked the data on a few opportunities to ensure I was getting the correct numbers. After I spot-checked two or three, I came upon an Opportunity that raised a question for my client. Most deals looked to end in the “Legal” stage, but sometimes they fell apart and reverted to a previous stage before going back into Legal.

image4
Table above showing an opportunity that moves to legal then back to Possible Deal, then On Market, then back to Possibly Deal, and then finally back into Legal.

The client confirmed that that could happen and asked that I only count the number of days that an Opportunity was in Legal Status for the most recent Legal stage change.

Identifying the "gaps and islands" problem

What I quickly realized was that I would have to partition the dataset to get the legal records I needed. The challenge that I needed to solve was how to partition on the Stage names when they were not separated by unique values.

I searched the internet for “partitioning on non-unique values” and found several articles explaining a concept called the “gaps and islands” problem.

The concept involves using the dense_rank function on multiple columns in a dataset, then taking the difference of those ranks to get to the desired partition. This circumvents the problem of trying to partition data based on non-unique columns by artificially creating the partition through ranking functions.

Solving the gaps and islands problem using SQL window functions

First, I needed to create two columns. The first column combined the Opportunity ID—that is, the unique identifier that represented every Opportunity—and concatenated that ID with the Stage Name on each history record. This gave me the partition needed to separate each Opportunity History record by Stage and was called “Opportunity Stage ID”.

The second column I needed to create also used the Opportunity ID and the “Created Date” of the history record, which was formatted as a date timestamp and could not have identical values for the same Opportunity. This gave me a unique field that allowed me to sort every history record by date in descending order and was called “Opportunity Created Date Rank”. 

The resulting two columns are shown below.

image9 1

image6

The next step was to calculate two multi-row functions using dense_rank functions to create two new columns in the dataset, each using the new fields I created in the previous step. Dense rank returns the rank of each record in the partition based on order. Ranks are consecutive. They don’t repeat when the values match.

The resulting two new columns appeared in the table below. “Opportunity Stage Rank” used the Opportunity Stage ID as its ranking value and “Opportunity Row Rank” used Opportunity Created Date Rank as its ranking value.

image9 2

image8
Example with two columns and dense ranks.

Finally, the last step was to subtract the two dense rank columns to get the correct partition rank values. This allowed me to only include the Opportunity History record values that had the most recent Legal Stage values. I called this column “Opportunity Sequence Rank.” 

All that was left to do after that was to filter the dataset so that only records with the new ranking value of 0 remained, and then to sum up the ‘Time In Stage” column calculation from my original work. The reason I filtered to the Sequence rank value of 0 was because when you compare the two dense rank columns, the numbers should match for all records that are in the most recent Stage. In our case, this is always the most recent Legal stage, which is what we are trying to solve for.

image14

image3

Solving the Gaps and Islands Problem in Salesforce CRM Analytics

The examples above show my solution to this problem using SQL (Specifically in Google BigQuery) however, the actual solution was completed in Salesforce’s Business Intelligence platform called Salesforce CRM Analytics.

In order to show the capabilities of this tool, I have included images from the tool’s ETL user interface (internally referred to as a ‘Recipe’) below to show my method for applying dense ranks to solve the gaps and islands problem inside of CRM Analytics.

 

image20
Above is the UI for a Recipe in CRM Analytics, the nodes shown contain the necessary functions to achieve the same result as the SQL transformations previously shown in this article.

The first step taken was to Calculate the Time_In_Stage value for all opportunities in the Opportunity History table. This table as previously mentioned had a row for each Opportunity record change, including Stage changes.

image5

The next step was to filter the current Opportunity table, a dataset with a row for each Opportunity with that Opportunities present values, to only contain current Opportunities in the Legal stage.

image19

Now that we had these two tables, it was time to join them together on Opportunity ID, performing a left join with the Opportunity dataset on the left and the Opportunity History Dataset on the right.

image21

The final transformations were all done in the same ETL node and consisted of creating the Stage and Created ID columns, calculating the dense rank values, and subtracting them to get the sequence rank.

image12

The image above does not show how the dense ranks are calculated, so below are the details of how those calculations are set up starting with the Opportunity_Row_Rank and then the Opportunity_Stage_Rank.

image4 1

image22

The filtering to the Opportunity_Sequence_Rank of ‘0’ and summing up Time_In_Stage was done inside the CRM Analytics dashboard that was created for the client.

If you would like to know more about how I solved this problem in CRM Analytics, please reach out to me at sean.mcclain@searchdiscovery.com or fill out the “contact us” form.

Nailed it!

The Gaps and Islands problem was a tough nut to crack. Only after testing out ten wrong ways to solve it using multiple ranking methods and trying to filter by specific stages and dates (each time getting close to the answer but falling short) did I ultimately give up and start looking for a solution via the World Wide Web. After finding several articles that nudged me in the right direction, I was finally able to come up with this solution that got the client what they needed to answer their business question.

Although it was difficult, I ended up standing at my desk after everyone else had gone home for the day,  shouting, “I did it! I’m the greatest in the world!” (Yes, I really did that, and no, not everyone had actually gone home for the day.) Solving the problem was satisfying! 

These kinds of moments make the daily grind of my job worth it, and they fuel my passion for using data to solve complex problems to drive value for the businesses that I get to work with at Search Discovery. At the end of the day, I knew I now had another tool at my disposal to solve similar use cases, and that made staying late to get this done more than worth it.

When I presented this solution to the client, they were extremely happy with the results. The solution that I provided was something that they had never been able to automate in the past, and it saves them an enormous amount of time in report creation and re-creation. Plus, the reports now inform their management team of all the deals that they can expect to see closing once they’ve passed their legal stage, which allows stakeholders to budget and plan more accurately to make better business decisions.

Read some of Sean’s other helpful posts:

If you want to learn more about how Search Discovery can help solve your toughest data transformation problems, reach out in the “Contact us” form. We're eager to help!

Leave a Comment

Your email address will not be published.

Contact Us

Related Posts

Join the Conversation

Check out Kelly Wortham’s Optimization based YouTube channel: Test & Learn Community.

Search Discovery
Education Community

Join Search Discovery’s new education community and keep up with the latest tools, technologies, and trends in analytics.

Follow Us

Share

Share on facebook
Share on twitter
Share on linkedin
Scroll to Top