How Principal Component Analysis Helps Get the Data You Actually Need

Jan 19, 2018

Data visualization brings datasets to life and enables people to understand what the data says.

However, if your data contains more than three dimensions (e.g., age, gender, eye color, and hair color), it is extremely difficult to create scatter plots or histograms; you need more sophisticated forms of data visualization to draw useful conclusions.

Principal Component Analysis (PCA) reduces the number of dimensions in a dataset so you can eliminate clutter and pay attention to the dimensions that really matter. To illustrate how it works, let's take a look at a commonplace business example. Here's how PCA helped me prepare invitations for a business dinner.

The challenge:

I'm organizing a dinner for a small group of Search Discovery executives, clients, and prospects. I can't invite everyone, so I have to identify whom to invite. To help me decide, I've created a dataset that scores each person on the following features:

  • Reliability
  • Location
  • Level of seniority
  • Potential revenue
  • Stage of sales cycle
  • Likelihood to become a customer

Below is a sample from my dataset.

| Name of prospect | Reliability | Location | Level of Seniority | Potential Revenue | Stage of Sales Cycle | Likelihood to become a customer |
|---|---|---|---|---|---|---|
| John Doe | 5.0 | 4.5 | 3.2 | 6.2 | 5.5 | 4.7 |
| Jane Smith | 4.3 | 5.4 | 6.3 | 5.8 | 3.1 | 3.6 |
| Rachel Geller | 5.0 | 2.1 | 4.9 | 5.3 | 6.8 | 5.0 |


This is an unsupervised learning problem because I do not have a specific type of person in mind to invite to the dinner. My primary objective is to invite a group of people who will generate the most revenue for my company (as determined by these six dimensions). To do this, I will run the analysis and then select the best cluster to receive invitations.
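To make this concrete, I'll sketch each step in Python as we go; the tooling choices (pandas here, scikit-learn later) are my own, not something the analysis requires. Here's the sample above loaded as a small table:

```python
import pandas as pd

# The three sample prospects from the table above, scored across the six
# features used to decide who gets a dinner invitation.
features = [
    "Reliability", "Location", "Level of Seniority",
    "Potential Revenue", "Stage of Sales Cycle",
    "Likelihood to become a customer",
]
prospects = pd.DataFrame(
    [[5.0, 4.5, 3.2, 6.2, 5.5, 4.7],
     [4.3, 5.4, 6.3, 5.8, 3.1, 3.6],
     [5.0, 2.1, 4.9, 5.3, 6.8, 5.0]],
    index=["John Doe", "Jane Smith", "Rachel Geller"],
    columns=features,
)
```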

The solution:

Principal Component Analysis (PCA) is a popular technique for reducing the number of dimensions in a dataset. In short, the goal of PCA is to find the big patterns in the data and discard the noise.

It works by finding the few features that account for the greatest variation. For example, facial recognition software converts the color of each pixel of an image into a number. Those numbers are then ordered into vectors (vectorization), effectively making each image a row in a spreadsheet where the number representing the color of each pixel is a separate column.
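If that's hard to picture, here's a minimal sketch of the vectorization step with NumPy, using random arrays as stand-ins for real photos:

```python
import numpy as np

# Stand-ins for 100 grayscale 64x64 face photos (random pixels, purely
# illustrative). Flattening each 2-D image into one long row turns the
# whole set into a spreadsheet-like matrix: one row per image, one
# column per pixel.
images = np.random.rand(100, 64, 64)
rows = images.reshape(len(images), -1)  # shape (100, 4096)
```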

However, images of faces contain millions of pixels, and comparing all of them would take too long to be useful.

When you apply PCA to a group of images, information that doesn't matter is removed. For example, PCA might discard the fact that the person in an image has two eyes, because that is common to nearly every face and doesn't hold the key to matching two images. However, data relating to the size of the space between the eyes will be retained, because that varies from image to image.

Now back to the dinner party. For our purposes, we want to see if PCA can be useful for reducing the dimensions in our dataset from six to three so it can be visualized and understood.
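Before we turn to the Embedding Projector, here's roughly what that reduction looks like in code, continuing the earlier sketch with scikit-learn. Standardizing the features first is my own assumed preprocessing step, so that no feature dominates purely because of its scale:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# In practice X would be the full prospect table, not just the three
# sample rows shown earlier.
X = StandardScaler().fit_transform(prospects)
pca = PCA(n_components=3)
coords_3d = pca.fit_transform(X)  # shape (n_prospects, 3), ready for a 3-D plot
```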

Results

I've used the Embedding Projector from TensorFlow.org to apply PCA to my dataset. The tool finds the few vectors that account for the greatest variation and then plots each data point against the top three; the results are shown below.

Within the Embedding Projector, you can use the selection tool at the top left of the chart to select a group of points and look at the metadata associated with each point. When you do this, you'll notice that prospects who score highly across the six dimensions appear high on the x-axis, and those who have low scores appear further down it. This is because 94.4% of the variance in this dataset was captured by the first vector (principal component) that PCA found.
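If you run the reduction yourself with scikit-learn instead of the Embedding Projector, you can read the same variance figures off the fitted model from the sketch above:

```python
# Share of variance captured by each principal component, and the
# cumulative share; on the full dataset the first entry is the ~94.4%
# figure reported above.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum())
```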

As a result, I can be confident that the cluster of prospects high on the x-axis are both likely to become customers and compatible with each other across the six dimensions: the perfect people to invite to the dinner.
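Continuing the sketch, the shortlist itself could be pulled straight from the scores on that first component (the cutoff of ten invitees below is a hypothetical choice):

```python
import pandas as pd

# The sign of a principal component is arbitrary, so check which end of
# the axis corresponds to high raw scores before ranking by it.
pc1 = pd.Series(coords_3d[:, 0], index=prospects.index)
invitees = pc1.sort_values(ascending=False).head(10)
print(invitees)
```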

A word of caution

There are cases where PCA is not great for identifying clusters. For example, if our sample dataset contained a lot of dimensions (imagine 200 columns rather than 6), it would likely wash out the clusters. Also, in my dataset, PCA was able to identify three vectors that describe 98.5% of the variance, which makes it ideal for plotting in three dimensions. However, not all datasets can be so accurately described by the top three vectors found by PCA.

To resolve those two problems, you should take a look at t-SNE. But that's a blog post for another day. In the meantime, use this blog post as a guide for working some PCA magic on your datasets! Feel free to reach out to me if you've got more questions about PCA (or any of your data needs), and I'd be happy to chat.