Website Content Protection Scraper Prevention: Live Blog PubCon Las Vegas



Welcome to the Website and Content Protection – Scraper Prevention session at PubCon Las Vegas.

Ever had your content stolen? Ever seen your hard-earned work rank higher on a competitor's site than on yours? This session will look at ways you can prevent getting your site ripped.

William Atchison, Founder, Crawl Wall

Brett Tabke, CEO, WebmasterWorld.com

Up first is Bill Atchison

Transitioning from free-for-all bot abuse to tightly controlled site access.

My website was under attack
- bots from China were scraping 10% more daily page views than Google, Yahoo and MSN

- defining good bots
- motives behind bad bots
- some bots mask themselves, for example Getty

Google bot identification
- Good bots ask for robots.txt
- don’t hit the site abusively fast

Bad bots
- go to any link
- spoof engines
- if you don’t validate that a request really comes from Google, you’ll find spoofed Googlebots coming in via translation pages
- crawl as fast as possible
- crawl from lots of IP addresses

The idea is to violate your copyrights and repackage your site

Motivation is something for nothing
- build website using your content
- mine information
- get traffic
- make your money

7.8 billion dollars in scraping
- WSJ reported Nielsen broke into PatientsLikeMe and scraped private patient data.

Who are all the bots?
- intelligence gathering
- copyright compliance
- branding
- security
- media monitoring
- safe site solutions

- Content scrapers (theft)
- data loggers

Stealth bots vs Visible bots

Bill’s chart shows how stealth bots screw up analytics.

How do scraper bots use your content?
- intercept traffic

- scraped pages are scrambled together to make new content and avoid duplicate content penalties. Bill traced this by feeding the scraper its own IP. Bill shows illustrations…

- Cloaked scrapers hide your content from users and only show it to Googlebot. Bill submitted this via Google Webmaster Tools. Totally unrelated to the scraped content. Scraper activity can directly impact your reputation.

How to get bots under control?
- opt-in vs opt-out bot blocking
- opt-in traffic analytics
- profiling and detecting stealth bots vs visitors
- setting spider traps
- avoid search engine pitfalls

Be sure not to block things like Feedburner, but it’s possible to block bad bots.

Robots.txt won’t stop a bot unless the bot honors robots.txt

IP blocking at firewall degrades server performance

OPT-IN Bot Paradigm Shift

- Authorize good bots only. No more blacklists, as everything is blocked by default

- Narrow search engine access by IP range to prevent spoofing and page hijacking via proxy sites

- Blocking traffic is risky, understand the risks.

- Review traffic

- Google Analytics uses JavaScript to track traffic, thus eliminating bots that don’t run JavaScript from reports.
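
To make the opt-in idea concrete, here is a rough sketch (not Bill's actual implementation) of "authorize good bots only, narrowed by IP range". The bot names and networks are hypothetical placeholders; the ranges are reserved documentation addresses, not real crawler ranges.

```python
import ipaddress

# Hypothetical allowlist: bot name -> networks it is allowed to crawl from.
# These ranges are documentation-only placeholders, not real crawler ranges.
ALLOWED_BOTS = {
    "examplebot": [ipaddress.ip_network("192.0.2.0/24")],
    "otherbot": [ipaddress.ip_network("198.51.100.0/25")],
}

def is_authorized_bot(user_agent: str, remote_ip: str) -> bool:
    """Opt-in check: True only if the user agent names an allowlisted bot
    AND the request comes from one of that bot's registered networks."""
    ua = user_agent.lower()
    addr = ipaddress.ip_address(remote_ip)
    for name, networks in ALLOWED_BOTS.items():
        if name in ua:
            return any(addr in net for net in networks)
    # Anything not on the list is blocked by default.
    return False
```

A spoofer who copies the user agent string but crawls from some other address fails the check, which is the point of narrowing access by IP range.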

Techniques to detect humans:
- some bots use cookies
- few execute JavaScript
- bots rarely fetch CSS
- rarely do bots download images
- monitor speed
- observe quantity of page results
- watch for access to robots.txt
- validate page requests
- verify if user agents are valid
- check IPs
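
A few of the signals in this list can be combined into a crude score. The sketch below is only an illustration: the field names, the one-point-per-signal weighting and the 30 pages per minute threshold are all assumptions, not numbers from the talk.

```python
from dataclasses import dataclass, field

@dataclass
class ClientProfile:
    """What we have observed for one client (hypothetical fields)."""
    pages: int = 0
    images: int = 0
    css: int = 0
    ran_javascript: bool = False
    asked_robots_txt: bool = False
    timestamps: list = field(default_factory=list)  # page request times, seconds

def bot_score(p: ClientProfile, max_pages_per_minute: float = 30.0) -> int:
    """Count how many bot-like signals a client shows (0 = looks human)."""
    score = 0
    if p.pages >= 3 and p.images == 0:
        score += 1  # browsers download images
    if p.pages >= 3 and p.css == 0:
        score += 1  # browsers fetch stylesheets
    if not p.ran_javascript:
        score += 1  # few bots execute JavaScript
    if p.asked_robots_txt:
        score += 1  # humans don't ask for robots.txt
    if len(p.timestamps) >= 2:
        elapsed = p.timestamps[-1] - p.timestamps[0]
        rate = (len(p.timestamps) - 1) / max(elapsed, 1e-9) * 60
        if rate > max_pages_per_minute:
            score += 1  # crawling faster than a person reads
    return score
```

No single signal is proof, so a site would pick a cutoff it is comfortable with before challenging or blocking a client.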

- Don’t allow engines to archive pages via noarchive

- Tell unauthorized robots that crawling is forbidden by dynamically inserting the no-crawl directives noindex and nofollow

- even with archive cache disabled, scrapers extract lists of valid page names from search engines to defeat spider traps
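
The noarchive and noindex/nofollow points above can be sketched as a tiny helper that picks the robots meta tag per request. The function name and the idea of passing in an "authorized" flag are my assumptions; how you decide who is authorized is up to your own checks.

```python
def robots_meta_tag(is_authorized_crawler: bool) -> str:
    """Build the robots meta tag to put in the page <head>.

    Authorized crawlers may index the page but not keep a cached copy;
    everything else is told not to index or follow links at all.
    """
    if is_authorized_crawler:
        content = "noarchive"
    else:
        content = "noindex, nofollow, noarchive"
    return f'<meta name="robots" content="{content}">'
```

As with robots.txt, this only works on bots that honor the directive, so it complements blocking rather than replacing it.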

Ways to protect site.
- reverse DNS
- forward DNS to be sure reverse isn’t faked
- dynamic robots.txt
- Block entire IP ranges of web hosts that facilitate access for scraper sites
- for blocking large lists of IPs such as proxy lists, use PHP
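
The reverse DNS plus forward DNS check from the list above might look like the sketch below. The hostname suffixes are the ones commonly given for Googlebot and should be checked against Google's current documentation; the injectable lookup parameters are my addition so the logic can be exercised without a network, and this version handles IPv4 only.

```python
import socket

def verify_crawler_ip(ip, allowed_suffixes=(".googlebot.com", ".google.com"),
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Return True if `ip` reverse-resolves to an allowed hostname AND
    that hostname forward-resolves back to the same IP."""
    try:
        hostname = reverse_lookup(ip)[0]          # step 1: reverse DNS
    except OSError:
        return False
    if not hostname.lower().rstrip(".").endswith(allowed_suffixes):
        return False
    try:
        addresses = forward_lookup(hostname)[2]   # step 2: forward DNS
    except OSError:
        return False
    # The forward lookup proves the reverse record wasn't faked.
    return ip in addresses
```

DNS lookups are slow, so in practice you would cache the result per IP instead of resolving on every request.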

Tighten site access
- opt in
- spider traps
- stealth bot profiling
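
For anyone wondering what a spider trap looks like in practice, here is a bare-bones sketch. The trap path and the in-memory block set are hypothetical; a real site would link to the trap invisibly and persist the blocked list somewhere.

```python
TRAP_PATH = "/trap/do-not-crawl/"  # hypothetical URL no human should ever visit

# Well-behaved bots read this and stay out of the trap.
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"

class SpiderTrap:
    """Block any client that requests the disallowed trap URL."""

    def __init__(self):
        self.blocked = set()

    def handle(self, ip: str, path: str) -> int:
        """Return the HTTP status to send for this request."""
        if ip in self.blocked:
            return 403
        if path.startswith(TRAP_PATH):
            # Ignored robots.txt and followed a hidden link: bad bot.
            self.blocked.add(ip)
            return 403
        return 200
```

Once a client trips the trap, every later request from that IP gets a 403, including requests for normal pages.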

Get better results:
- tighter control
- improve search engine rankings after removing unwanted competition
- better server performance for visitors

THANKS BILL!

Brett is up next…

WebmasterWorld.com cloaks robots.txt: an executable script is served as robots.txt so legit bots are out.

HTACCESS bans

Scrapers target high-ranking pages via cut and paste. WebmasterWorld.com adds an HTML link to the URL and uses CSS to make the link invisible; humans will often go in the other direction. WebmasterWorld cloaks nofollow in an ETHICAL way that engines know about.

(THIS is uber advanced, I’d suggest not doing this unless you know exactly what you are doing)

With HTACCESS and reverse DNS, it’s possible to block nearly everything.

High value content served via AJAX

Brett finds that bot stuff is highly related to rankings.

Be aware,

Fake Email to remove inbound links…

Google Notice….

Inbound link spam

Profile attacks
Reputation attacks

Don’t allow folks to comment on PPC landing pages….

Report site by AntiVirus

Report sites to black lists

Spoof you on site

Send Google reinclusion request for a site that isn’t banned

Google has 1,000+ folks reviewing quality.

“The publicly exposed UI of Google is only the tip of the iceberg…. there’s lots of interesting information that we know how to extract but haven’t figured out how to present it to users….”

Try and create a bad neighborhood intentionally involving another site.

Who-is-Who Tango
- find webhost
- sign up for an account in the same IP block
- reg your domain

Domain Name Attack: Playbook

- buy links from traceable link programs

Other attacks
- buy TLDs then send requests
- spam search suggestions
- DDoS attacks
- AdSense attacks

XSS (cross-site scripting) attacks
- form software vulnerabilities

Sniffing wifi at conference

WordPress, control panels, forums, social sites, Google Webmaster Tools

Business Hacks
- send out resumes for competitors’ IT staff

- send project director bad press about IT press

Defenses:
IT IS NOT IF BUT WHEN

- monitor trademark
- affiliate IDs via Google Alerts
- personal names
- products

- DomainTools
- Watch top competitors registrations as well

Protect brand and trademarks
- register your trademarks

knowem.com

- Never allow UGC on a PPC landing page

Anti-scraper scripts…

Print IP/host to screen and html comments and headers
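
Here is one way the "print IP/host" trick could work, as a sketch only: stamp every page with the requester's IP and hostname so a scraped copy shows who fetched it. The X-Served-To header name and the comment wording are made up for the example.

```python
import re

# Keep only characters that can appear in an IP address or hostname,
# so the stamp cannot break out of the HTML comment.
_UNSAFE = re.compile(r"[^0-9A-Za-z.:\-]")

def stamp_response(body: str, headers: dict, remote_ip: str,
                   remote_host: str = "") -> str:
    """Add the requester's IP/host to the headers and an HTML comment."""
    ip = _UNSAFE.sub("", remote_ip)
    host = _UNSAFE.sub("", remote_host)
    label = f"{ip} {host}".strip()
    headers["X-Served-To"] = label  # hypothetical header name
    return body + f"\n<!-- served to {label} -->\n"
```

When your page turns up on a scraper site, viewing its source reveals the address the scraper crawled from, which you can then block.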

Use copyscape for high value content

Use self-referral links, content, words and copy

Hide self links with css blind links
