Welcome to the Website and Content Protection – Scraper Prevention session at PubCon Las Vegas.

Ever had your content stolen? Ever seen your hard-earned work rank higher on a competitor's site than on yours? This session looks at ways to keep your site from getting ripped.

William Atchison, Founder, Crawl Wall

Brett Tabke, CEO, WebmasterWorld.com

Up first is Bill Atchison.

Transitioning from free-for-all bot abuse to tightly controlled site access.

My website was under attack
— bots from China were scraping 10% more daily page views than Google, Yahoo and MSN

  • defining good bots
  • motives behind bad bots
  • some bots mask themselves, for example Getty

Good bot identification
— good bots ask for robots.txt
— they don't hammer the site abusively fast

Bad bots
— go to any link
— spoof engines
— if you don't validate Googlebot, you'll find spoofed Googlebots coming in via translation pages
— crawl as fast as possible
— crawl from lots of IP addresses

The idea is to violate your copyrights and repackage your site.

Motivation is something for nothing
— build a website using your content
— mine information
— get traffic
— make money that should be yours

$7.8 billion in scraping
The WSJ reported Nielsen broke into PatientsLikeMe and scraped private patient data.

Who are all the bots?
— intelligence gathering
— copyright compliance
— branding
— security
— media monitoring
— safe site solutions

  • content scrapers (theft)
  • data loggers

Stealth bots vs Visible bots

Bill's chart shows how stealth bots distort analytics.

How do scraper bots use your content?
— intercept traffic

  • Scraped pages get scrambled together to make new content and avoid duplicate-content penalties. This was detected by feeding the scraper its own IP. Bill shows illustrations…

  • Cloaked scrapers hide your content from users and only show it to Googlebot. Bill submitted this via Google Webmaster Tools. The cloaked page was totally unrelated to the scraped content. Scraper activity can directly impact your reputation.

How to get bots under control
— opt-in vs opt-out bot blocking
— opt-in traffic analytics
— profiling and detecting stealth bots vs visitors
— setting spider traps
— avoiding search engine pitfalls

Be sure not to block things like FeedBurner, but it's possible to block bad bots.

robots.txt won't stop a bot unless the bot honors robots.txt.

IP blocking at the firewall degrades server performance.

OPT-IN Bot Paradigm Shift

  • Authorize good bots only; no more blacklists, as everything is blocked by default

  • Narrow search engine access by IP range to prevent spoofing and page hijacking via proxy sites
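A minimal sketch of the IP-range narrowing idea in Python. The CIDR ranges below are illustrative only; pull current ranges from each engine's published lists before using them.

```python
import ipaddress

# Illustrative allowlist of crawler IP ranges. These CIDRs are examples;
# always fetch the current ranges each engine publishes.
ALLOWED_RANGES = [
    ipaddress.ip_network("66.249.64.0/19"),
    ipaddress.ip_network("157.55.39.0/24"),
]

def ip_in_allowed_range(ip: str) -> bool:
    """True if the claimed crawler IP falls inside an authorized range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ALLOWED_RANGES)
```

A request claiming to be Googlebot from outside the allowed ranges is then treated as a spoof and denied.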

  • Blocking traffic is risky; understand the risks.

  • Review traffic

  • Google Analytics uses JavaScript to track traffic, thus eliminating bots from reports.

Techniques to detect humans:
— some bots use cookies
— few bots execute JavaScript
— bots hardly ever examine CSS
— bots rarely download images
— monitor request speed
— observe the quantity of pages requested
— watch for access to robots.txt
— validate page requests
— verify that user agents are valid
— check IPs
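The signals above could be tallied into a rough bot score, sketched below. The field names and thresholds are made up for illustration, not a real API.

```python
def bot_score(session: dict) -> int:
    """Tally bot signals; higher score = more bot-like.
    Field names and thresholds are illustrative only."""
    score = 0
    if not session.get("accepts_cookies"):
        score += 1          # most human browsers keep cookies
    if not session.get("ran_javascript"):
        score += 1          # few bots execute JavaScript
    if not session.get("fetched_css"):
        score += 1          # bots hardly examine CSS
    if not session.get("fetched_images"):
        score += 1          # bots rarely download images
    if session.get("pages_per_minute", 0) > 60:
        score += 2          # abusively fast crawling
    if session.get("requested_robots_txt"):
        score += 1          # humans almost never fetch robots.txt
    return score

# A browser-like session scores 0; an empty profile looks bot-like.
human = {"accepts_cookies": True, "ran_javascript": True,
         "fetched_css": True, "fetched_images": True,
         "pages_per_minute": 3}
```

No single signal is conclusive; it's the combination across a session that separates stealth bots from visitors.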

  • Don't allow engines to archive pages, via noarchive

  • Tell unauthorized robots that crawling is forbidden by dynamically inserting no-crawl directives (noindex, nofollow)
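One way to sketch the dynamic insertion: serve a no-crawl directive via the X-Robots-Tag response header (which the major engines honor like an in-page robots meta tag) whenever the requester isn't a verified bot. The function name is hypothetical.

```python
def robots_headers(bot_is_authorized: bool) -> dict:
    """Build extra response headers for a page request. Unverified
    crawlers get a no-crawl directive via X-Robots-Tag, which major
    engines honor like a robots meta tag."""
    if bot_is_authorized:
        return {}
    # Dynamically tell unauthorized robots not to index, follow, or archive:
    return {"X-Robots-Tag": "noindex, nofollow, noarchive"}
```

Verified engines see the page normally; everything else gets told to stay out of the index.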

  • Even with the archive cache disabled, scrapers extract lists of valid page names from search engines to defeat spider traps
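The spider-trap idea itself can be sketched like this: robots.txt disallows a trap path that no human-visible link points to, so only robots.txt-ignoring bots ever request it. The trap path and ban store are hypothetical.

```python
# Spider-trap sketch: robots.txt disallows a trap path that no visible
# link points to; only robots.txt-ignoring bots ever request it.
# The path and ban store below are hypothetical.
TRAP_PATH = "/trap-8f3a/"
banned_ips = set()

ROBOTS_TXT = "User-agent: *\nDisallow: " + TRAP_PATH + "\n"

def handle_request(path, ip):
    """Return an HTTP status code, banning any IP that hits the trap."""
    if ip in banned_ips:
        return 403                      # previously trapped
    if path.startswith(TRAP_PATH):
        banned_ips.add(ip)              # robots.txt said keep out
        return 403
    return 200
```

As the bullet above warns, scrapers who harvest valid page names from engine results never request the trap URL, which is why traps are only one layer of defense.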

Ways to protect your site:
— reverse DNS
— forward DNS, to be sure the reverse lookup isn't faked
— dynamic robots.txt
— block entire IP ranges of web hosts that facilitate access for scraper sites
— for blocking large lists of IPs, such as proxy lists, use PHP
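The reverse-plus-forward DNS check from the list above, sketched in Python. The googlebot.com / google.com suffixes are the ones Google documents for Googlebot; adjust the suffixes per engine.

```python
import socket

def verify_crawler(ip, suffixes=(".googlebot.com", ".google.com")):
    """Reverse-resolve the IP, check the hostname suffix, then
    forward-resolve the hostname to confirm the reverse record
    wasn't faked. Suffixes shown are Google's documented ones."""
    try:
        host = socket.gethostbyaddr(ip)[0]   # reverse DNS
    except OSError:
        return False
    if not host.endswith(suffixes):
        return False
    try:
        # forward DNS: the hostname must map back to the same IP
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

Doing both lookups matters: a scraper controls the reverse record for its own IPs, but it can't make `crawl-x.googlebot.com` forward-resolve to an IP it doesn't own.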

Tighten site access
— opt in
— spider traps
— stealth bot profiling

Get better results:
— tighter control
— improved search engine rankings after removing unwanted competition
— better server performance for visitors

THANKS BILL!

Brett is up next…

WebmasterWorld.com cloaks an executable script as robots.txt, so even legit bots are kept out.

HTACCESS bans

Scrapers target high-ranking pages via cut and paste. WebmasterWorld hides an HTML link to a URL via CSS, making the link invisible; humans will often go in the other direction. WebmasterWorld cloaks the nofollow in an ETHICAL way that the engines know about.

(THIS is uber advanced, I’d suggest not doing this unless you know exactly what you are doing)

With HTACCESS plus reverse DNS it's possible to block nearly everything.

High value content served via AJAX

Brett finds that bot activity is highly related to rankings.

Be aware of attacks such as:

Fake emails asking webmasters to remove your inbound links…

Google Notice….

Inbound link spam

Profile attacks
Repu­ta­tion attacks

Don’t allow folks to comment on PPC landing pages….

Report your site to antivirus vendors

Report your site to blacklists

Spoof you on site

Send Google a reinclusion request for a site that isn't banned

Google has 1,000+ folks reviewing quality.

The publicly exposed UI of Google is only the tip of the iceberg…. there's lots of interesting information we know how to extract but haven't figured out how to present to users….

Try to create a bad neighborhood, intentionally involving another site.

Who-is-Who Tango
— find the webhost
— sign up for an account in the same IP block
— register your domain

Domain Name Attack: Playbook

  • buy links from traceable link programs

Other attacks
— buy TLDs then send requests
— spam search suggestions
— DDoS attacks
— AdSense attacks

XSS (cross-site scripting) attacks
— forum software vulnerabilities

Sniffing wifi at the conference

WordPress, control panels, forums, social sites, Google Webmaster Tools

Business Hacks
— send out resumes for competitors' IT staff

  • send the project director bad press about the IT staff

Defenses:
IT IS NOT IF BUT WHEN

  • monitor your trademarks
    • affiliate IDs (Google Alerts)
    • personal names
    • products

— DomainTools
— watch top competitors' registrations as well

Protect brand and trademarks
— register your trademarks

knowem.com

  • Never allow UGC (user-generated content) on PPC landing pages

Anti-scraper scripts…

Print the visitor's IP/host to the screen, in HTML comments, and in headers
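The watermarking idea sketched in Python: embed the requester's address in an HTML comment so a verbatim scrape carries the scraper's own fingerprint. The function name and marker format are hypothetical.

```python
def tag_page(html, ip, host):
    """Embed the requester's IP/host in an HTML comment just before
    </body>; a verbatim scrape then identifies who fetched the page.
    (Function name and marker format are hypothetical.)"""
    marker = "<!-- served-to: %s (%s) -->" % (ip, host)
    return html.replace("</body>", marker + "\n</body>", 1)

page = tag_page("<html><body>content</body></html>",
                "203.0.113.9", "crawler.example.net")
```

When a stolen copy turns up, the comment (or the same marker in a header) tells you which IP pulled it.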

Use Copyscape for high-value content

Use self-referral links, content, words and copy

Hide self-links with CSS blind links