Market-Basket-Analysis

Market Basket Analysis (MBA) is a popular rule-based machine learning technique that can provide product recommendations to customers. This post shows you how to pull and wrangle the transactional data from the Google Analytics API.  

What is Market Basket Analysis?

In short, MBA takes a look at all the items purchased together and counts how often each item is purchased with others. This allows us to look at our transactional data and answer basic questions like, What are our top products? It also allows us to probe deeper and suggest other products based on the ones already purchased. 

I won’t go over the math behind this model because that has been written about ad nauseam. I will go over the output variables, how to interpret them, and what that means for your business. If you are just here for the code, you can skip this whole post and visit this GitHub repo with the code.

Market basket analysis shares a common DNA with recommender systems: systems that use machine learning to personalize recommendations for customers at scale (see below). However, these systems are mathematically and computationally complex. Market basket analysis is a great entry point to recommender systems because:

  • All the data we need is available through the GA API (and Adobe) and is easily obtained.
  • The data wrangling for this model is minimal, and there is no PII to cleanse.
  • The Market Basket rules and output are extremely simple to explain and interpret.

Google Analytics R

Pulling

If you are using Google Analytics to capture web data for your eCommerce site, then you can perform this analysis. The market basket analysis requires us to know two things: all of the products a user bought and a unique identifier for each cart. A nice-to-know thing would be item price. To pull all of these from the GA API, you need to pull the following:

Metrics
-itemQuantity – How many units of this item were purchased.
-itemRevenue – The total revenue from the units of this item purchased.

Dimensions
-productName – Friendly name of the product purchased.
-transactionId – Unique identifier for each cart.

This sample GA pull will get you what you need.

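library(googleAnalyticsR)

# Authenticate first (e.g., with ga_auth()); gaId is your GA view ID.
df <- google_analytics(gaId,
                date_range = c("2020-01-01", "2020-01-31"),
                metrics = c("itemQuantity", "itemRevenue"),
                dimensions = c("productName", "transactionId"),
                anti_sample = TRUE)  # split the pull into batches to avoid sampling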

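The result should be a data frame with one row for each product in each transaction, holding the productName, transactionId, itemQuantity, and itemRevenue columns.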

Wrangling

Now that we have some data, we need to do some wrangling. We need to calculate our item price. This is simple: we just divide revenue by quantity. We also need to make each purchased item its own row of data. For this, we leverage the uncount function from the tidyr package in R.

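library(dplyr)
library(tidyr)

df <- df %>%
  # Unit price: line revenue divided by units purchased
  mutate(itemCost = itemRevenue / itemQuantity) %>%
  # Repeat each row once per unit; uncount() drops itemQuantity afterwards
  uncount(weights = itemQuantity)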


Next, we need to take the data and pivot it wider. This means collapsing each transaction into a single row whose items are joined into one comma-separated string. We will then blow those comma-separated items into individual columns, one per item. The code below will do the trick.

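## Largest number of items in any single transaction
max_n <- max(table(df$transactionId))

## Collapse each transaction's items into one comma-separated string (column V1)
df <- plyr::ddply(df, c('transactionId'),
      function(tf1) paste(tf1$productName,
                        collapse = ',')) %>%
  print()

## Separate items to columns
df <- tidyr::separate(df, 'V1',
                into = paste('item', 1:max_n, sep = "_"),
                sep = ',',
                fill = 'right')  # pad shorter baskets with NA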

Write this to a CSV file, and we will be ready to run a market basket analysis with it.
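Here is a minimal sketch of that export step. The toy data frame, its values, and the output path are all assumptions made for illustration; swap in the wide data frame from the previous step and your own file name.

```r
# Toy stand-in for the wide data frame built above (values are made up).
baskets <- data.frame(
  transactionId = c("T1", "T2"),
  item_1 = c("avocado", "banana"),
  item_2 = c("banana", NA),
  stringsAsFactors = FALSE
)

# Hypothetical output path; a temp file keeps the sketch self-contained.
out_file <- file.path(tempdir(), "baskets.csv")

# row.names = FALSE drops R's row index; na = "" leaves padded cells empty
# instead of writing the literal string "NA" as if it were a product.
write.csv(baskets, out_file, row.names = FALSE, na = "")

readLines(out_file)
```

Leaving the padded cells empty matters because anything written into them would otherwise be read back as a product.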

Bringing it all together

Creating Rules

Now that we have a clean set of data to work with, we are ready to run the model. We need to read the file in using the read.transactions function from the arules package in R. This will create the necessary basket object. With rm.duplicates = TRUE, it will also de-duplicate items within their individual baskets. Now, we can run the apriori function on the data to create a list of rules.
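As a rough sketch of those two steps, assuming the arules package is installed: the toy basket file, its contents, and the support and confidence thresholds below are made up for illustration, so tune them against your own data.

```r
library(arules)

# Toy basket file standing in for the CSV from the wrangling step (made-up data).
# The first column is the transaction ID; the rest are items.
basket_file <- tempfile(fileext = ".csv")
writeLines(c(
  "transactionId,item_1,item_2,item_3",
  "T1,avocado,banana",
  "T2,avocado,banana,bread",
  "T3,banana,bread",
  "T4,avocado,banana,banana"
), basket_file)

trans <- read.transactions(
  basket_file,
  format = "basket",     # one transaction per line
  sep = ",",
  header = TRUE,         # skip the header row
  cols = 1,              # column 1 holds the transaction ID
  rm.duplicates = TRUE   # drop repeated items within a basket
)

# Illustrative thresholds only; minlen = 2 skips rules with an empty left side.
rules <- apriori(trans,
                 parameter = list(support = 0.25, confidence = 0.5, minlen = 2))

# Show the rules, strongest lift first.
inspect(sort(rules, by = "lift"))
```

On real data, very low support thresholds can produce a huge number of rules, so it may help to start high and work down.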

Interpreting Rules

You should come out with a set of rules, each with a support, confidence, lift, and count value. Let’s talk about what we see, what it means to us, and how to explain it to other folks.

Support

Plainly put, the support is how often the itemset in question appears in the dataset, which is important because it helps us understand whether or not items are frequently purchased together. Support is calculated by counting the transactions that contain every item in the itemset and then dividing by the total number of transactions. It’s a ratio.
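To make the ratio concrete, here is a small base R sketch. The five baskets and the support helper are made up for illustration; in practice the apriori output gives you this number directly.

```r
# Five made-up baskets, one character vector of items per transaction.
baskets <- list(
  c("avocado", "banana"),
  c("avocado", "banana", "bread"),
  c("banana", "bread"),
  c("banana"),
  c("bread", "milk")
)

# Support = transactions containing every item in the itemset / all transactions.
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

support(c("avocado", "banana"), baskets)  # 2 of 5 baskets -> 0.4
support("banana", baskets)                # 4 of 5 baskets -> 0.8
```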

The size of the support is going to be relative to the data set. The higher the support, the more often people purchased these items together. Though the numbers for support may be in decimals or seem low, remember this can be the case if you sell many disparate items. We truly only care about this number in its relation to the other itemsets. We can even scale this number from 0 to 100 to help make sense of it in analysis. Items with high support are often purchased together and can be both priced and marketed with the other item in mind.

Support is also important because it gives you context. When I see that avocados and bananas were bought together, what does that mean to me? Answer: it all depends on the context. If you’re a grocery store and you see something happening 2% of the time, that’s huge, but in a business where there are fewer option combinations that might not be as big of a deal.

To capitalize on this insight, again, context is king: You need to compare support to the other metrics (confidence, lift, and count). In fact, that’s why it’s called support: it’s a fine number for placing into context the reliability of items being purchased together.

Confidence

Confidence tells us how often the itemset rule is found to be true. In other words, of all the baskets that contain the item(s) on the left, confidence is the share that also contain the item(s) on the right. This number is important because it helps us do a better job of recommending items. The higher the confidence, the better the item(s) on the left predict the item(s) on the right.
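A small base R sketch shows the calculation and why the direction of a rule matters. The baskets and helper functions are made up for illustration.

```r
# Five made-up baskets, one character vector of items per transaction.
baskets <- list(
  c("avocado", "banana"),
  c("avocado", "banana", "bread"),
  c("banana", "bread"),
  c("banana"),
  c("bread", "milk")
)

# TRUE for each basket that holds every item in the itemset.
contains <- function(itemset, baskets) {
  vapply(baskets, function(b) all(itemset %in% b), logical(1))
}

# Confidence = baskets with both sides / baskets with the left side.
confidence <- function(lhs, rhs, baskets) {
  has_lhs <- contains(lhs, baskets)
  sum(has_lhs & contains(rhs, baskets)) / sum(has_lhs)
}

confidence("avocado", "banana", baskets)  # 2 of the 2 avocado baskets -> 1
confidence("banana", "avocado", baskets)  # 2 of the 4 banana baskets -> 0.5
```

Avocado buyers always took bananas in this toy data, but only half of banana buyers took avocados, so the same pair gives two different confidence values depending on which item is on the left.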

Lift

Lift tells us how dependent the ruleset is, i.e., how much the right-hand side is dependent on the left-hand side. If the lift is 1, the items are independent of each other. If it’s below 1, then the items are seen together less often than chance would predict, either because they’re completely different (a square box and a rectangle lid won’t be shown together) or because they’re substitutes of each other (strawberries and organic strawberries). If the number is above 1, the items are purchased together more often than chance would predict, and the higher the number, the stronger the association (peanut butter and jelly). Lift is important for brands since they get to do neat things with it. In this example, I’m thinking, Wow! Now that we have a lift of 2.75, let’s start making sure that the avocados are next to the bananas.
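Here is a base R sketch of lift using the standard definition: the support of both sides together, divided by the product of each side's own support. The baskets and helper names are made up for illustration.

```r
# Five made-up baskets, one character vector of items per transaction.
baskets <- list(
  c("avocado", "banana"),
  c("avocado", "banana", "bread"),
  c("banana", "bread"),
  c("banana"),
  c("bread", "milk")
)

# Share of baskets that hold every item in the itemset (i.e., support).
share <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Lift = support(lhs and rhs) / (support(lhs) * support(rhs)).
lift <- function(lhs, rhs) share(c(lhs, rhs)) / (share(lhs) * share(rhs))

lift("avocado", "banana")  # 0.4 / (0.4 * 0.8) = 1.25, together more than chance
lift("bread", "banana")    # 0.4 / (0.6 * 0.8), about 0.83, a bit less than chance
lift("milk", "banana")     # never bought together -> 0
```

Note that the lowest possible value is 0, for items that never share a basket.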

Count

Count is exactly as it seems. It shows us how many times the ruleset appears. Contrast this with support, which shows us as a ratio how often a ruleset occurs. Count is just the raw number of transactions in which the ruleset appears. Support is derived from count, and if you know the support and the total number of transactions, you can back into the count, but count is handy because it shows you raw numbers. It’s important if you want to be able to say something happens x number of times, but it doesn’t give a lot of context on its own: there’s nothing to compare a count to.
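The relationship between the two can be sketched in a few lines of base R. The baskets here are made up for illustration.

```r
# Five made-up baskets, one character vector of items per transaction.
baskets <- list(
  c("avocado", "banana"),
  c("avocado", "banana", "bread"),
  c("banana", "bread"),
  c("banana"),
  c("bread", "milk")
)

# TRUE for each basket that holds both avocado and banana.
has_both <- vapply(baskets,
                   function(b) all(c("avocado", "banana") %in% b),
                   logical(1))

count <- sum(has_both)                    # raw number of baskets -> 2
support <- count / length(baskets)        # same fact as a ratio -> 0.4

# Backing into the count from support and the number of transactions.
round(support * length(baskets))          # -> 2
```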

So, how do I use it?

So now that we have performed our market basket analysis, what can we do with it? First, we can use this to cross-sell. If a customer buys all the items on the left-hand side of a rule, we can follow up with an email (and maybe even a coupon) for the item on the right-hand side. Secondly, we can use our market basket analysis to establish a recommender system. That is, when someone views the items from the left-hand side of a rule (or even has them in the cart), we can use the rule to recommend the item on the right-hand side on either a product detail page or at checkout.
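A bare-bones version of that lookup might look like the sketch below. The rules table, its lift values, and the recommend function are all hypothetical stand-ins; in practice the table would come from your own apriori output.

```r
# Hypothetical rules table with made-up lift values.
# Multi-item left-hand sides are stored as comma-separated strings.
rules <- data.frame(
  lhs = c("avocado", "bread", "avocado,bread"),
  rhs = c("banana", "milk", "banana"),
  lift = c(1.25, 1.67, 1.25),
  stringsAsFactors = FALSE
)

recommend <- function(cart, rules) {
  lhs_items <- strsplit(rules$lhs, ",", fixed = TRUE)
  # A rule applies when the cart holds every left-hand item
  # but does not already hold the right-hand item.
  applies <- vapply(lhs_items, function(l) all(l %in% cart), logical(1)) &
    !(rules$rhs %in% cart)
  hits <- rules[applies, ]
  # Highest lift first, with repeated suggestions removed.
  unique(hits$rhs[order(hits$lift, decreasing = TRUE)])
}

recommend(c("avocado"), rules)           # "banana"
recommend(c("avocado", "bread"), rules)  # "milk" "banana"
```

Sorting by lift is one reasonable choice here; ranking by confidence instead would favor the suggestions most likely to be accepted.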

From Market Basket Analysis to Recommender System

Market basket analysis can be a gateway to a recommender system, which can use real-time data to recommend complementary products to users live. To do this, we use machine learning to suggest products, actions, or content to users based on similar content or the behavior of other similar users. Recommender systems are a workhorse method with multiple applications across businesses to provide benefits including improved retention, increased sales, reduced costs, and persuasion (they nudge customers to form purchasing habits). 

But among all these benefits, recommender systems have two particular core areas of importance. First, a recommender system can improve customer engagement in order to lead to higher-value interactions. Next, a recommender system can increase basket size, which leads to higher-probability conversions at purchase time. 

At Search Discovery, we offer customized recommendation systems that integrate fully into a client’s system so that you’re able to collect customer data and automatically analyze this data to generate customized recommendations. These systems rely on both implicit data such as browsing history and purchases and explicit data such as ratings provided by the user to drive revenue and generate demand for our clients.

Please fill out the inquiry form to learn how Search Discovery can help you to engage shoppers, increase order value, and ultimately increase revenue for your business.  

Summary

There you have it—a quick, simple way to measure eCommerce GA data using R and the API. Now you have an easily interpreted and explained method to analyze your purchase data and product sets. Using this, you can now make recommendations, offer deals, and learn about your users. Take this code for a spin and give it a shot yourself.

Want to learn more about market basket analysis and recommender systems? Reach out here.
