How Principal Component Analysis Helps Get the Data You Actually Need
Data visualization brings datasets to life and enables people to understand what the data says.
However, if your data contains more than three dimensions (e.g. age, gender, eye color and hair color) it is extremely difficult to create scatter plots or histograms — you need more sophisticated forms of data visualization to draw useful conclusions.
Principal Component Analysis (PCA) reduces the number of dimensions in a dataset so you can eliminate clutter and pay attention to the dimensions that really matter. To illustrate how it works, let’s take a look at a commonplace business example. Here’s how PCA helped me prepare invitations for a business dinner.
I’m organizing a dinner for a small group of Search Discovery executives, clients, and prospects. I can’t invite everyone, so I have to decide whom to invite. To help me decide, I’ve created a dataset that scores each person on the following six features with a number between 0 and 1:
- Reliability
- Location
- Level of seniority
- Potential revenue
- Stage of sales cycle
- Likelihood to become a customer
Below is a sample from my dataset.
| Name of prospect | Reliability | Location | Level of seniority | Potential revenue | Stage of sales cycle | Likelihood to become a customer |
| --- | --- | --- | --- | --- | --- | --- |
Click here to see my dataset in detail
This is an unsupervised learning problem because I do not have a specific type of person in mind to invite to the dinner. My primary objective is to invite the group of people who will generate the most revenue for my company (as determined by these six dimensions). To do this, I will run the analysis and then select the best cluster to receive invitations.
Principal Component Analysis (PCA) is a popular technique for reducing the number of dimensions in a dataset. In short, the goal of PCA is to find the big patterns in the data, and discard the noise.
It works by finding a small number of new axes, called principal components, that account for the greatest variation in the data. Each component is a weighted combination of the original features rather than a single one of them. For example, facial recognition software converts the color of each pixel of an image into a number. Those numbers are then ordered into vectors (vectorisation), effectively making each image a row in a spreadsheet where the number representing the color of each pixel is a separate column.
However, images of faces contain millions of pixels and comparing all of them would take too long to be useful.
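To make the spreadsheet picture concrete, here is a minimal sketch of vectorisation using NumPy. The tiny 4x4 grayscale arrays are hypothetical stand-ins for real photos, and the variable names are illustrative rather than taken from any facial recognition library.

```python
import numpy as np

# Hypothetical stand-in for a batch of images: 3 tiny 4x4 grayscale "photos".
# Real face images are far larger; the pixel values here are arbitrary.
images = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)

# Vectorisation: flatten each image into one row, one column per pixel.
rows = images.reshape(len(images), -1)
print(rows.shape)  # (3, 16)

# The column count grows quickly with resolution: a 1000x1000 RGB image
# needs one column per pixel per colour channel.
columns_per_megapixel_image = 1000 * 1000 * 3
print(columns_per_megapixel_image)  # 3000000
```

Even a one-megapixel colour photo becomes a row with three million columns, which is why some form of dimensionality reduction is needed before comparing images.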
When you apply PCA to a group of images, information that doesn’t help tell them apart is discarded. For example, PCA might discard the fact that the person in an image has two eyes, because that is true of almost every face and doesn’t help match two images. However, data relating to the size of the space between the eyes will be retained because it varies from image to image.
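The two-eyes example can be sketched in a few lines. This is a toy illustration with made-up measurements, not how a real face pipeline works: one column is identical for every face, the other varies, and PCA (computed here through NumPy’s singular value decomposition) puts all of its weight on the column that varies.

```python
import numpy as np

# Hypothetical measurements for four faces (made-up numbers).
# Column 0: number of eyes (the same for everyone).
# Column 1: distance between the eyes in cm (differs per face).
faces = np.array([
    [2.0, 6.1],
    [2.0, 6.8],
    [2.0, 5.9],
    [2.0, 7.2],
])

# PCA starts by centering each column on its mean.
# A column that never changes becomes all zeros at this step.
centered = faces - faces.mean(axis=0)

# The rows of vt are the principal directions, strongest first.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
first_component = vt[0]

# Weight of each original column in the first principal component.
# The constant "number of eyes" column gets zero weight.
print(np.round(np.abs(first_component), 6))
```

The absolute value is taken only for display, because the sign of a principal component is arbitrary.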
Now back to the dinner party. For our purposes, we want to see if PCA can be useful for reducing the dimensions in our dataset from six to three so it can be visualized and understood.
I’ve used TensorFlow’s Embedding Projector to apply PCA to my dataset. The tool finds the few vectors (principal components) that account for the greatest variation and then plots each data point against the top three. See below for the results, or click here to see them in 3D!
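If you would rather see the reduction in code than in a point-and-click tool, the sketch below shows one common way to project six columns onto the top three principal components with NumPy. It is not the Embedding Projector’s implementation; the function name `pca_project` is hypothetical, and the data is random placeholder data rather than my prospect list.

```python
import numpy as np

def pca_project(data, n_components=3):
    """Project the rows of `data` onto their top principal components."""
    centered = data - data.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    # Variance captured by each component, as a share of the total.
    variance = singular_values ** 2
    explained_ratio = variance / variance.sum()
    projected = centered @ vt[:n_components].T
    return projected, explained_ratio[:n_components]

# Synthetic placeholder data: 50 "prospects" x 6 scores between 0 and 1.
rng = np.random.default_rng(0)
scores = rng.random((50, 6))

points_3d, ratio = pca_project(scores, n_components=3)
print(points_3d.shape)  # (50, 3)
print(ratio)            # share of variance captured by each of the 3 axes
```

Each row of `points_3d` is one prospect’s position on the three new axes, which is the kind of input a 3D scatter plot needs.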
Within the embedding projector, you can use the selector tool at the top left of the chart to select a group of points and look at the metadata associated with each point. When you do this, you’ll notice that prospects who score highly across the six dimensions appear high on the X-axis, and those who have low scores appear further down the X-axis. This is because 94.4% of the variance in this dataset was captured by the first vector that PCA found.
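A first vector that captures most of the variance is what you would expect when all the scores tend to rise and fall together. The sketch below illustrates that with synthetic data, under the assumption that a single hidden "overall quality" value drives all six scores. It does not reproduce the 94.4% figure from my dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic assumption: one hidden "overall quality" value per prospect
# drives all six scores, plus a little independent noise per score.
quality = rng.random((200, 1))
noise = rng.normal(scale=0.05, size=(200, 6))
scores = quality + noise

centered = scores - scores.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)

# Share of total variance captured by each principal component.
print(np.round(explained_ratio, 3))
# The first direction weights all six scores about equally.
print(np.round(np.abs(vt[0]), 2))
```

When the first share is this dominant, a point’s position along the first axis is effectively a summary of its scores across all six columns.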
As a result, I can be confident that the prospects in the cluster that sits high on the X-axis are both likely to become customers and compatible with each other across the six dimensions. They are the perfect people to invite to the dinner.
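Picking the invitees can also be sketched in code. The names and scores below are invented for illustration, and choosing the top two is an arbitrary cutoff. One detail worth knowing is that the sign of a principal component is arbitrary, so the sketch orients the first axis so that higher raw scores mean a higher position.

```python
import numpy as np

# Hypothetical prospects and scores (made up for illustration).
names = ["Prospect A", "Prospect B", "Prospect C", "Prospect D", "Prospect E"]
scores = np.array([
    [0.90, 0.80, 0.90, 0.85, 0.90, 0.95],
    [0.20, 0.10, 0.30, 0.20, 0.15, 0.10],
    [0.50, 0.60, 0.50, 0.40, 0.50, 0.55],
    [0.80, 0.90, 0.85, 0.90, 0.80, 0.90],
    [0.10, 0.20, 0.10, 0.15, 0.20, 0.10],
])

centered = scores - scores.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
first_component = vt[0]

# The sign of a principal component is arbitrary, so flip it if needed
# so that higher raw scores give a higher position on the first axis.
if first_component.sum() < 0:
    first_component = -first_component

position = centered @ first_component

# Invite the two prospects furthest along the first axis.
top_two = [names[i] for i in np.argsort(position)[::-1][:2]]
print(top_two)
```

With these made-up numbers, the two prospects with the highest scores across the board end up furthest along the first axis.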
A word of caution
There are cases where PCA is not great for identifying clusters. For example, if our sample dataset contained many more dimensions (imagine 200 columns rather than 6), squeezing it down to three components would likely wash out the clusters. Also, in my dataset, PCA was able to identify three vectors that describe 98.5% of the variance, which makes it ideal for plotting in three dimensions. However, not all datasets can be described so accurately by the top three vectors found by PCA.
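Before trusting a three-dimensional PCA plot, it is worth checking how much of the variance the top three vectors actually capture. The sketch below compares two synthetic extremes: 200 unrelated noise columns, and six columns that are all mixes of two hidden factors. Both datasets are invented, and the helper name `cumulative_explained_variance` is hypothetical.

```python
import numpy as np

def cumulative_explained_variance(data):
    """Running share of variance captured by the top 1, 2, 3... components."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return np.cumsum(variance) / variance.sum()

rng = np.random.default_rng(1)

# Case 1 (synthetic): 200 unrelated columns of pure noise.
wide_noise = rng.normal(size=(500, 200))
share_top3_wide = cumulative_explained_variance(wide_noise)[2]

# Case 2 (synthetic): 6 columns that are all mixes of 2 hidden factors.
factors = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
structured = factors @ mixing
share_top3_structured = cumulative_explained_variance(structured)[2]

# Share of variance captured by the top three components in each case.
print(round(share_top3_wide, 3), round(share_top3_structured, 3))
```

In the first case the top three components capture only a small fraction of the variance, so a 3D plot would hide most of what is going on. In the second they capture essentially all of it.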
To resolve those two problems, you should take a look at t-SNE. But that’s a blog post for another day. In the meantime, use this blog post as a guide for working some PCA magic on your datasets! Feel free to reach out to me if you’ve got more questions about PCA (or any of your data needs) and I’d be happy to chat.