How Principal Component Analysis Helps Get the Data You Actually Need


Data visualization brings datasets to life and enables people to understand what the data says.

However, if your data contains more than three dimensions (e.g., age, gender, eye color, and hair color), it is extremely difficult to create scatter plots or histograms; you need more sophisticated forms of data visualization to draw useful conclusions.

Principal Component Analysis (PCA) reduces the number of dimensions in a dataset so you can eliminate clutter and pay attention to the dimensions that really matter. To illustrate how it works, let’s take a look at a commonplace business example. Here’s how PCA helped me prepare invitations for a business dinner.

The challenge:

I’m organizing a dinner for a small group of Search Discovery executives, clients, and prospects. I can’t invite everyone, so I have to identify who to invite. To help me decide, I’ve created a dataset that scores each person on the following features with a number between 0 and 10:

  • Reliability
  • Location
  • Level of seniority
  • Potential revenue
  • Stage of sales cycle
  • Likelihood to become a customer

Below is a sample from my dataset.

| Name of prospect | Reliability | Location | Level of Seniority | Potential Revenue | Stage of Sales Cycle | Likelihood to Become a Customer |
|---|---|---|---|---|---|---|
| John Doe | 5.0 | 4.5 | 3.2 | 6.2 | 5.5 | 4.7 |
| Jane Smith | 4.3 | 5.4 | 6.3 | 5.8 | 3.1 | 3.6 |
| Rachel Geller | 5.0 | 2.1 | 4.9 | 5.3 | 6.8 | 5.0 |
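If you’d like to follow along in code, here’s a minimal sketch of that sample in Python with pandas (the scores are the ones from the table above; the full dataset has many more rows):

```python
import pandas as pd

# The three sample prospects from the table above, one row per person.
prospects = pd.DataFrame(
    {
        "Reliability": [5.0, 4.3, 5.0],
        "Location": [4.5, 5.4, 2.1],
        "Level of Seniority": [3.2, 6.3, 4.9],
        "Potential Revenue": [6.2, 5.8, 5.3],
        "Stage of Sales Cycle": [5.5, 3.1, 6.8],
        "Likelihood to become a customer": [4.7, 3.6, 5.0],
    },
    index=["John Doe", "Jane Smith", "Rachel Geller"],
)
print(prospects)
```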



This is an unsupervised learning problem because I do not have a specific type of person in mind to invite to the dinner. My primary objective is to invite a group of people who will generate the most revenue for my company (as determined by these six dimensions). To do this, I’ll run the analysis and then select the best cluster to receive invitations.

The solution:

Principal Component Analysis (PCA) is a popular technique for reducing the number of dimensions in a dataset. In short, the goal of PCA is to find the big patterns in the data and discard the noise.

It works by finding the few directions (combinations of features) that account for the greatest variation. For example, facial recognition software converts the color of each pixel of an image into a number. Those numbers are then ordered into vectors (vectorization), effectively making each image a row in a spreadsheet where the number representing the color of each pixel is a separate column.

However, images of faces contain millions of pixels, and comparing all of them would take too long to be useful.

When you apply PCA to a group of images, information that doesn’t matter is removed. For example, PCA might discard the fact that the person in an image has two eyes, because that is true of almost every face and doesn’t hold the key to matching two images. However, data relating to the size of the space between the eyes will be retained, because that varies from image to image.
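Here’s a sketch of that pipeline on toy data (random pixels standing in for real photos; in practice you’d load actual images):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for real photos: 100 random 64x64 grayscale "images".
images = np.random.rand(100, 64, 64)

# Vectorization: flatten each image into one row, so the number for
# each pixel becomes its own column.
rows = images.reshape(len(images), -1)  # shape (100, 4096)

# PCA keeps the directions of greatest variation and discards the rest,
# compressing 4,096 pixel columns down to 50 components.
components = PCA(n_components=50).fit_transform(rows)
print(components.shape)  # (100, 50)
```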

Now back to the dinner party. For our purposes, we want to see if PCA can be useful for reducing the dimensions in our dataset from six to three so it can be visualized and understood.
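In code, that reduction is only a few lines. This is a sketch using scikit-learn, assuming the full dataset is loaded into a DataFrame shaped like the `prospects` sample above (standardizing first is a common precaution when columns sit on different scales):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Put the six columns on a comparable scale, then project each prospect
# onto the three directions of greatest variation.
X = StandardScaler().fit_transform(prospects)
pca = PCA(n_components=3)
coords_3d = pca.fit_transform(X)  # one (x, y, z) point per prospect, ready to plot
```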

Results

I’ve used the Embedding Projector from TensorFlow.org to apply PCA to my dataset. The tool finds the few vectors that account for the greatest variation and then plots each data point against the top three; see below for the results.

Within the Embedding Projector, you can use the selector tool at the top left of the chart to select a group of points and look at the metadata associated with each point. When you do this, you’ll notice that prospects who score highly across the six dimensions appear high on the X-axis, and those who have low scores appear further down it. This is because 94.4% of the variance in this dataset was captured by the first vector that PCA found.
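You can check that split yourself: after fitting, scikit-learn’s `explained_variance_ratio_` attribute reports how much of the variance each component captures (continuing from the sketch above):

```python
# Share of total variance captured by each of the three components.
print(pca.explained_variance_ratio_)           # first entry ~0.944 for this dataset
print(pca.explained_variance_ratio_.cumsum())  # running total across components
```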

As a result, I can be confident that the cluster of prospects sitting high on the X-axis are both likely to become customers and compatible with each other across the six dimensions: the perfect people to invite to the dinner.
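If you’d rather pick that group programmatically than with the selector tool, a quick clustering pass over the projected points works too. This is my illustration using scikit-learn’s KMeans, not something the Embedding Projector does for you; note that the sign of a principal component is arbitrary, so first check which end of the axis corresponds to high scores:

```python
from sklearn.cluster import KMeans

# Split the projected points into two groups, then take the cluster whose
# members sit highest on the first principal component.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(coords_3d)
best = max(set(labels), key=lambda k: coords_3d[labels == k, 0].mean())
invitees = prospects.index[labels == best]
print(list(invitees))
```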

A word of caution

There are cases where PCA is not great at identifying clusters. For example, if our sample dataset contained many more dimensions (imagine 200 columns rather than 6), PCA would likely wash out the clusters. Also, in my dataset, PCA was able to identify three vectors that describe 98.5% of the variance, which makes it ideal for plotting in three dimensions. However, not all datasets can be so accurately described by their top three vectors.
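A quick way to check whether your own dataset falls into that second trap: fit PCA with all components and count how many you need before the cumulative explained variance crosses a threshold (a sketch, reusing the scaled matrix `X` from above; the 95% cutoff is arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

# Count the components needed to explain 95% of the variance.
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.95)) + 1
print(n_needed)  # 3 or fewer means a 3-D plot tells most of the story
```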

To resolve those two problems, you should take a look at t-SNE. But that’s a blog post for another day. In the meantime, use this blog post as a guide for working some PCA magic on your datasets! Feel free to reach out to me if you’ve got more questions about PCA (or any of your data needs), and I’d be happy to chat.