Wel­come to the Web­site and Con­tent Pro­tec­tion – Scraper Pre­ven­tion ses­sion at Pub­Con Las Vegas.

Ever had your content stolen? Ever seen your hard-earned work rank higher on a competitor's site than on yours? This session will look at ways you can prevent getting your site ripped.

William Atchi­son, Founder, Crawl Wall

Brett Tabke, CEO, WebmasterWorld.com

Up first is Bill Atchi­son

Tran­si­tion­ing from free-for-all bot abuse to a tight­ly con­trolled site access.

My web­site was under attack 
— bots from Chi­na were scrap­ing 10% more dai­ly page views than Google, Yahoo and MSN

  • defin­ing good bots 
    • motives behind bad bots 
    • some bots mask themselves, for example Getty

Good bot identification
— Good bots ask for robots.txt
— don’t hit the site abusively fast

Bad bots
— go to any link 
— spoof engines 
— if you don’t validate that a request really comes from Google, you’ll find spoofed Googlebot requests arriving via translation pages
— Crawl as fast as pos­si­ble
— Crawl from lots of IP address­es

The idea is to violate your copyrights and repackage your site

Moti­va­tion is some­thing for noth­ing
— build web­site using your con­tent
— mine infor­ma­tion
— get traf­fic
— make your mon­ey

7.8 billion dollars in scraping
WSJ reported Nielsen broke into PatientsLikeMe and scraped private patient data.

Who are all the bots? 
— intel­li­gence gath­er­ing
— copy­right com­pli­ance
— brand­ing
— secu­ri­ty
— media mon­i­tor­ing
— safe site solu­tions

  • Con­tent scrap­ers (theft)
    • data log­gers

Stealth bots vs Vis­i­ble bots

Bill’s chart shows how stealth bots screw ana­lyt­ics.

How scraper bots use your con­tent?
— inter­cept traf­fic

  • scraped pages scrambled together to make new content and avoid duplicate content penalties. This was found by feeding the scraper its own IP. Bill shows illustrations…

  • Cloaked scrap­ers hide your con­tent from users and only show it to Google­bot. Bill sub­mit­ted this via Google Web­mas­ter Tools. Total­ly unre­lat­ed to the scraped con­tent. Scraper activ­i­ty can direct­ly impact your rep­u­ta­tion.

How to Get bots under con­trol?
— opt in vs opt-out bot block­ing
— opt in traffic analysis
— profiling and detecting stealth bots vs visitors
— set­ting spi­der traps 
— avoid search engine pit­falls

Be sure not to block things like Feedburner, but it’s possible to block bad bots.

Robots.txt won’t stop a bot unless the bot honors robots.txt

IP block­ing at fire­wall degrades serv­er per­for­mance

OPT-IN Bot Par­a­dign Shift

  • Authorize good bots only; no more blacklists, as everything is blocked by default

  • Nar­row search engine access by IP range to pre­vent spoof­ing and page hijack­ing via proxy sites

  • Block­ing traf­fic is risky, under­stand the risks.

  • Review traf­fic

  • Google ana­lyt­ics uses JavaScript to track traf­fic thus elim­i­nat­ing bots from reports.
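The opt-in idea above can be sketched roughly as follows. This is a minimal illustration, not Bill's implementation: the bot names and the IP ranges are hypothetical placeholders (documentation-only ranges), and a real list would come from each search engine's published information. It only covers requests that claim to be bots; ordinary visitors would be handled by profiling.

```python
import ipaddress

# Hypothetical allowlist: user-agent token -> networks that crawler may use.
# These ranges are documentation placeholders (RFC 5737), NOT real crawler ranges.
ALLOWED_BOTS = {
    "examplebot": [ipaddress.ip_network("192.0.2.0/24")],
    "otherbot": [ipaddress.ip_network("198.51.100.0/24")],
}


def is_authorized_bot(user_agent: str, remote_ip: str) -> bool:
    """Return True only if the user agent names an allowed bot AND the
    request comes from one of the networks listed for that bot."""
    ua = user_agent.lower()
    ip = ipaddress.ip_address(remote_ip)
    for token, networks in ALLOWED_BOTS.items():
        if token in ua:
            # Right name, but the IP must match too (prevents spoofing).
            return any(ip in net for net in networks)
    # Opt-in: anything not explicitly listed is denied by default.
    return False
```

Because the default answer is "no", a new or renamed bad bot never needs to be added to a blacklist; only the short list of approved crawlers has to be maintained.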

Tech­niques to detect humans: 
— some bots use cook­ies
— few exe­cute JavaScript 
— bots hardly examine CSS
— rarely do bots down­load images 
— mon­i­tor speed 
— observe quan­ti­ty of page results 
— watch for access to robots.txt
— val­i­date page requests 
— ver­i­fy if user agents are valid 
— check IPs
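Several of the signals in this list can be combined into a rough score per visitor session. The sketch below is only an assumption about how that might look: the field names, the weights and the 30 pages per minute threshold are all invented for illustration, and treating a robots.txt request as a bot signal is an interpretation of "watch for access to robots.txt" (browsers rarely ask for it).

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    pages: int                 # HTML pages requested in the session
    seconds: float             # time span covered by those requests
    sent_cookies: bool         # returned a cookie the site set
    ran_javascript: bool       # requested a JavaScript-only beacon
    fetched_css: bool
    fetched_images: bool
    fetched_robots_txt: bool


def bot_score(s: SessionStats, max_pages_per_minute: float = 30.0) -> int:
    """Higher score = more bot-like. Weights are illustrative guesses."""
    score = 0
    if not s.sent_cookies:
        score += 1
    if not s.ran_javascript:
        score += 2             # few bots execute JavaScript
    if not s.fetched_css:
        score += 1
    if not s.fetched_images:
        score += 1
    if s.fetched_robots_txt:
        score += 1
    pages_per_minute = s.pages / max(s.seconds, 1.0) * 60.0
    if pages_per_minute > max_pages_per_minute:
        score += 2             # crawling faster than a person reads
    return score
```

No single signal is decisive (some bots use cookies, some people block images), which is why the sketch adds them up instead of blocking on any one of them.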

  • Don’t allow engines to archive pages via noarchive

  • Tell unau­tho­rized robots that crawl­ing is for­bid­den by dynam­i­cal­ly insert­ing no-crawl direc­tives noin­dex nofol­low

  • even with archive cache dis­abled, scrap­ers extract lists of valid page names from search engines to defeat spi­der traps
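The noarchive and dynamic noindex/nofollow points could be handled by choosing the robots meta tag per request. A minimal sketch, assuming some other check has already decided whether the requester is an authorized crawler (the function name and its parameter are hypothetical):

```python
def robots_meta_tag(is_authorized_crawler: bool) -> str:
    """Build the robots meta tag to place in the page <head>."""
    if is_authorized_crawler:
        # Approved engines may index the page but not keep a cached copy.
        content = "noarchive"
    else:
        # Everyone else is told that crawling and indexing are forbidden.
        content = "noindex, nofollow, noarchive"
    return f'<meta name="robots" content="{content}">'
```

As with robots.txt, these directives only bind bots that choose to honor them, so this complements blocking and does not replace it.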

Ways to pro­tect site. 
— reverse DNS 
— for­ward DNS to be sure reverse isn’t faked 
— dynam­ic robots.txt
— Block entire IP ranges for web hosts that facilitate access for scraper sites
— for block­ing large lists of IPs such as proxy lists, use PHP
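The reverse DNS plus forward DNS check from this list might look like the sketch below. It uses Python's standard socket module; the hostname suffixes are an assumption based on what Google has documented for Googlebot and should be checked against current documentation. The two lookup parameters exist only so the function can be exercised without network access.

```python
import socket

# Assumed hostname suffixes for the crawler being verified.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def verify_crawler_ip(ip, suffixes=GOOGLEBOT_SUFFIXES,
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to an expected hostname AND that
    hostname forward-resolves back to the same ip."""
    try:
        hostname = reverse_lookup(ip)[0]          # reverse DNS (PTR)
    except OSError:
        return False
    if not hostname.lower().endswith(suffixes):
        return False
    try:
        addresses = forward_lookup(hostname)[2]   # forward DNS (A records)
    except OSError:
        return False
    # Anyone can publish a fake PTR record for their own IP, so the forward
    # lookup must lead back to the address that made the request.
    return ip in addresses
```

DNS lookups are slow, so in practice the result would likely be cached per IP instead of being repeated on every request.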

Tight­en site access 
— opt in 
— spi­der traps 
— stealth bot pro­fil­ing
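A spider trap can be as simple as a URL that no human and no well-behaved bot should ever request: it is linked invisibly and disallowed in robots.txt, so whoever fetches it gets banned. A rough sketch, where the trap path and the in-memory ban list are assumptions made for illustration:

```python
# Hypothetical trap URL; it would also be listed under Disallow in robots.txt.
TRAP_PATH = "/trap/do-not-follow/"


class SpiderTrap:
    def __init__(self):
        self.banned = set()   # IPs that have hit the trap

    def handle(self, ip: str, path: str) -> int:
        """Return the HTTP status code to send for this request."""
        if ip in self.banned:
            return 403
        if path.startswith(TRAP_PATH):
            # Only a bot ignoring robots.txt and following hidden links gets here.
            self.banned.add(ip)
            return 403
        return 200
```

As the notes point out, scrapers that pull lists of valid page names from search engines never touch the trap, so this works best alongside opt-in access and profiling.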

Get bet­ter results: 
— tighter con­trol
— improve search engine rank­ings after remov­ing unwant­ed com­pe­ti­tion
— bet­ter serv­er per­for­mance for vis­i­tors

THANKS BILL!

Brett is up next…

WebmasterWorld.com cloaks an executable script as robots.txt so legit bots are out.
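One way a script could stand in for a static robots.txt is to return different rules depending on who asks. This is not WebmasterWorld's actual script, just a sketch of the general idea; the crawler token and the paths are hypothetical.

```python
# Hypothetical user-agent tokens of crawlers the site has approved.
ALLOWED_CRAWLER_TOKENS = ("examplebot",)


def robots_txt_for(user_agent: str) -> str:
    """Return the robots.txt body to serve for this user agent."""
    ua = user_agent.lower()
    if any(token in ua for token in ALLOWED_CRAWLER_TOKENS):
        # Approved crawlers get the normal rules.
        return "User-agent: *\nDisallow: /private/\n"
    # Everyone else is told the whole site is off limits.
    return "User-agent: *\nDisallow: /\n"
```

The user agent string is easy to spoof, so a script like this would normally be paired with an IP or DNS check before trusting the name.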

HTACCESS bans

Tar­get high rank­ing pages via cut and paste. Webmaster.com HTML link to URL via css to make link invis­i­ble, often humans will go in the oth­er direc­tion. Web­mas­ter cloaks nofol­low in an ETHICAL way that engines know about.

(THIS is uber advanced, I’d sug­gest not doing this unless you know exact­ly what you are doing)

With HTACCESS and reverse DNS, it’s possible to block nearly everything.

High val­ue con­tent served via AJAX

Brett finds that bot stuff is high­ly relat­ed to rank­ings.

Be aware,

Fake Email to remove inbound links…

Google Notice….

Inbound link spam

Pro­file attacks 
Rep­u­ta­tion attacks

Don’t allow folks to com­ment on PPC land­ing pages….

Report site by AntiVirus

Report sites to black lists

Spoof you on site

Send Google rein­clu­sion request for a site that isn’t banned

Google has 1,000 folks + review­ing qual­i­ty.

The publicly exposed UI of Google is only the tip of the iceberg…. there’s lots of interesting information that we know how to extract but haven’t figured out how to present it to users….

Try and cre­ate a bad neigh­bor­hood inten­tion­al­ly involv­ing anoth­er site.

Who-is-Who Tan­go
— find web­host
— sign up for an account in the same IP block
— reg your domain

Domain Name Attack: Playbook

  • buy links from trace­able link pro­grams

Oth­er attacks 
— buy tlds then send requests 
— spam search sug­ges­tions
— DDoS attacks
— AdSense attacks

XSS (cross-site scripting) attacks
— form software vulnerabilities

Sniff­ing wifi at con­fer­ence

Word­Press, con­trol pan­els, forums, social sites, Google Web­mas­ter Tools

Busi­ness Hacks 
— send out resumes for competitors’ IT staff

  • send project director bad press about IT press

Defens­es:
IT IS NOT IF BUT WHEN

  • mon­i­tor trade­mark
    • affil­i­ate ids Google Alerts 
    • per­son­al names 
    • prod­ucts

— DomainTools
— Watch top com­peti­tors reg­is­tra­tions as well

Pro­tect brand and trade­marks
— reg­is­ter your trade­marks

knowem.com

  • Never allow UGC on PPC landing page

Anti-scraper scripts…

Print IP/host to screen and html com­ments and head­ers
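Stamping the requester's IP and host into every page means a scraped copy carries the scraper's own address. A minimal sketch of that idea; the function names, the CSS class and the X-Served-To header name are all made up for illustration.

```python
import html


def stamp_page(page_html: str, remote_ip: str, remote_host: str = "") -> str:
    """Insert the requester's IP/host as visible text and as an HTML comment
    just before </body>. Pages without </body> are returned unchanged."""
    label = html.escape(f"{remote_ip} {remote_host}".strip())
    visible = f'<p class="served-to">Served to: {label}</p>'
    comment = f"<!-- served-to: {label} -->"
    return page_html.replace("</body>", f"{visible}\n{comment}\n</body>", 1)


def stamp_headers(headers: dict, remote_ip: str) -> dict:
    """Return a copy of the response headers with the requester's IP added."""
    stamped = dict(headers)
    stamped["X-Served-To"] = remote_ip   # hypothetical header name
    return stamped
```

When the content later turns up on another site, searching that page for the stamp shows which IP fetched it, and that IP can then be blocked.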

Use copy­scape for high val­ue con­tent

Use self-referral links, content, words and copy

Hide self links with css blind links