Welcome to the Website and Content Protection – Scraper Prevention session at PubCon Las Vegas.
Ever had your content stolen? Ever seen your hard-earned work rank higher on a competitor's site than on yours? This session will look at ways you can prevent your site from getting ripped.
William Atchison, Founder, Crawl Wall
Brett Tabke, CEO, WebmasterWorld.com
Up first is Bill Atchison.
Transitioning from free-for-all bot abuse to tightly controlled site access.
My website was under attack
— bots from China were scraping 10% more daily page views than Google, Yahoo and MSN
- defining good bots
- motives behind bad bots
- some bots mask themselves, for example Getty
Google bot identification
— Good bots ask for robots.txt
— don’t bang site abusively fast
— go to any link
— spoof engines
- if you don’t validate that requests really come from Google, you’ll find spoofed Googlebots arriving via translation pages
— Crawl as fast as possible
— Crawl from lots of IP addresses
The idea is to violate your copyrights and repackage your site.
Motivation is something for nothing
— build website using your content
— mine information
— get traffic
— make your money
7.8 billion dollars in scraping
- WSJ reported Nielsen broke into PatientsLikeMe and scraped private patient data.
Who are all the bots?
— intelligence gathering
— copyright compliance
— media monitoring
— safe site solutions
- Content scrapers (theft)
- data loggers
Stealth bots vs Visible bots
Bill’s chart shows how stealth bots screw up analytics.
How do scraper bots use your content?
— intercept traffic
Scraped pages are scrambled together to make new content and avoid duplicate content penalties. This was done by feeding the scraper its own IP. Bill shows illustrations…
Cloaked scrapers hide your content from users and only show it to Googlebot. Bill submitted this via Google Webmaster Tools. Totally unrelated to the scraped content. Scraper activity can directly impact your reputation.
How to get bots under control
- opt-in vs opt-out bot blocking
- opt-in traffic analytics
- profiling and detecting stealth bots vs visitors
— setting spider traps
— avoid search engine pitfalls
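The spider trap idea from the list above can be sketched roughly as follows. This is a minimal illustration, not Bill's implementation: the trap URL, the class name and the in-memory set of flagged IPs are all made up for the example. The assumption is that the trap URL is disallowed in robots.txt and only reachable through a link humans cannot see, so anything that requests it is neither a polite bot nor a person.

```python
from dataclasses import dataclass, field

# Hypothetical trap URL; it must also be disallowed in robots.txt.
TRAP_PATH = "/trap/do-not-follow/"


@dataclass
class SpiderTrap:
    """Flags any client that requests a URL no human or polite bot should reach."""
    trap_path: str = TRAP_PATH
    flagged_ips: set = field(default_factory=set)

    def hidden_link_html(self) -> str:
        # Link hidden from human visitors with CSS; link-following bots still see it.
        return f'<a href="{self.trap_path}" style="display:none" rel="nofollow">archive</a>'

    def robots_txt_rule(self) -> str:
        # Polite bots read this rule and stay away, so they never trip the trap.
        return f"User-agent: *\nDisallow: {self.trap_path}\n"

    def observe(self, ip: str, path: str) -> bool:
        """Record one request; return True if this client should now be blocked."""
        if path.startswith(self.trap_path):
            self.flagged_ips.add(ip)
        return ip in self.flagged_ips
```

Calling `observe()` on every request keeps the check cheap. A real site would persist the flagged IPs and expire them after a while, since addresses get reassigned.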
Be sure not to block things like Feedburner, but it’s possible to block bad bots.
Robots.txt won’t stop a bot unless the bot honors robots.txt.
IP blocking at firewall degrades server performance
OPT-IN Bot Paradigm Shift
Authorize good bots only; no more blacklists, as everything is blocked by default.
Narrow search engine access by IP range to prevent spoofing and page hijacking via proxy sites
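A rough sketch of the opt-in idea, assuming a hand-maintained allowlist: a client that claims to be a crawler is accepted only if its name is on the list and its IP falls inside a range registered for that name. The bot names and ranges below are placeholders (documentation-only addresses), not real crawler ranges, and `is_authorized_bot` is a hypothetical helper name.

```python
import ipaddress

# Hypothetical allowlist: crawler name -> networks it is allowed to crawl from.
# These are documentation-only ranges (RFC 5737), not real search engine ranges.
ALLOWED_BOTS = {
    "examplebot": [ipaddress.ip_network("192.0.2.0/24")],
    "otherbot": [ipaddress.ip_network("198.51.100.0/25")],
}


def is_authorized_bot(user_agent: str, ip: str) -> bool:
    """Opt-in check: unknown bots and known names from the wrong IP are both refused."""
    addr = ipaddress.ip_address(ip)
    ua = user_agent.lower()
    for name, networks in ALLOWED_BOTS.items():
        if name in ua:
            # Known name: accept only if the request comes from its own range.
            return any(addr in net for net in networks)
    # Not on the list at all: blocked by default.
    return False
```

This only covers requests that identify themselves as crawlers. Telling human browsers apart from stealth bots is a separate profiling step.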
Blocking traffic is risky, understand the risks.
Techniques to detect humans:
— bots hardly examine css
— rarely do bots download images
— monitor speed
— observe quantity of page results
— watch for access to robots.txt
— validate page requests
— verify if user agents are valid
— check ips
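Several of the signals above can be combined into a simple per-session score. The sketch below is only an illustration: the field names, the three-page minimum and the 30 pages per minute threshold are assumptions chosen for the example, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    pages: int                 # HTML pages requested in the session
    css_requests: int          # stylesheet requests
    image_requests: int        # image requests
    seconds: float             # time between first and last request
    fetched_robots_txt: bool   # did the client ask for robots.txt?


def bot_score(s: SessionStats, max_pages_per_minute: float = 30.0) -> int:
    """Count how many bot-like signals a session shows (0 = looks human)."""
    score = 0
    if s.pages >= 3 and s.css_requests == 0:
        score += 1  # bots hardly ever fetch CSS
    if s.pages >= 3 and s.image_requests == 0:
        score += 1  # bots rarely download images
    minutes = max(s.seconds, 1.0) / 60.0
    if s.pages / minutes > max_pages_per_minute:
        score += 1  # crawling faster than a person could read
    if s.fetched_robots_txt:
        score += 1  # browsers do not normally ask for robots.txt
    return score
```

A site would pick its own cutoff, for example challenging or blocking sessions that score 2 or more. Blocking is risky, so the thresholds need tuning against real traffic.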
Don’t allow engines to archive pages via noarchive
Tell unauthorized robots that crawling is forbidden by dynamically inserting no-crawl directives (noindex, nofollow).
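One way to picture the noarchive and no-crawl advice above: pick the robots directives per request, depending on whether the client has been authorized. This is a sketch only; `robots_directives` is a hypothetical helper, and how a client gets classified as authorized is left to the site.

```python
def robots_directives(is_authorized_crawler: bool) -> dict:
    """Return the robots meta tag and the matching header for one page response."""
    if is_authorized_crawler:
        # Authorized engines may index, but should not keep a cached copy.
        content = "noarchive"
    else:
        # Everyone else is told not to index or follow anything.
        content = "noindex, nofollow, noarchive"
    return {
        "meta": f'<meta name="robots" content="{content}">',
        "header": ("X-Robots-Tag", content),
    }
```

The meta tag goes into the page head and the header onto the response. Only bots that honor these directives will obey them, so this complements blocking rather than replacing it.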
Even with the archive cache disabled, scrapers extract lists of valid page names from search engines to defeat spider traps.
Ways to protect site.
— reverse DNS
— forward DNS to be sure reverse isn’t faked
— dynamic robots.txt
- Block entire IP ranges for web hosts that facilitate access for scraper sites
— for blocking large lists of IPs such as proxy lists, use PHP
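The reverse DNS plus forward DNS check from the list above could look roughly like this. It is a sketch under assumptions: the hostname suffixes should be checked against what each search engine documents, the function name is made up, and `socket.gethostbyname_ex` only covers IPv4. The resolver functions are passed in as parameters so the logic can be tried without touching the network.

```python
import socket
from typing import Callable, Iterable, List


def _reverse(ip: str) -> str:
    # Reverse DNS (PTR) lookup: IP address -> hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    # Forward DNS lookup: hostname -> list of IPv4 addresses.
    return socket.gethostbyname_ex(host)[2]


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Iterable[str] = (".googlebot.com", ".google.com"),
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """True only if reverse DNS gives an allowed hostname AND that hostname resolves back to ip."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    if not host.endswith(tuple(allowed_suffixes)):
        return False  # hostname is not in an allowed domain
    try:
        # The forward lookup guards against a faked reverse record.
        return ip in forward(host)
    except OSError:
        return False
```

DNS lookups are slow, so results would normally be cached per IP instead of being repeated on every request.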
Tighten site access
— opt in
— spider traps
— stealth bot profiling
Get better results:
— tighter control
— improve search engine rankings after removing unwanted competition
— better server performance for visitors
Brett is up next…
WebmasterWorld.com cloaks robots.txt, serving an executable script as robots.txt so legit bots are out.
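A script-generated robots.txt can be sketched along these lines. This is not WebmasterWorld's actual script, just one possible reading of the idea: verified crawlers get permissive rules, and everyone else is told to stay out. The function name and the `is_verified` callback are hypothetical.

```python
from typing import Callable

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # an empty Disallow permits everything
DENY_ALL = "User-agent: *\nDisallow: /\n"   # a bare slash forbids the whole site


def dynamic_robots_txt(
    user_agent: str,
    ip: str,
    is_verified: Callable[[str, str], bool],
) -> str:
    """Build the robots.txt body for one request instead of serving a static file."""
    if is_verified(user_agent, ip):
        return ALLOW_ALL
    return DENY_ALL
```

The web server would route requests for /robots.txt to this function. The `is_verified` check is where an allowlist or DNS verification would plug in.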
Target high-ranking pages via cut and paste. WebmasterWorld.com hides an HTML link to a URL via CSS to make the link invisible; humans will usually go in the other direction. WebmasterWorld cloaks nofollow in an ETHICAL way that the engines know about.
(THIS is uber advanced, I’d suggest not doing this unless you know exactly what you are doing)
With .htaccess and reverse DNS it’s possible to block nearly everything.
High value content served via AJAX
Brett finds that bot stuff is highly related to rankings.
Fake Email to remove inbound links…
Inbound link spam
Don’t allow folks to comment on PPC landing pages….
Report site by AntiVirus
Report sites to black lists
Spoof you on site
Send Google reinclusion request for a site that isn’t banned
Google has 1,000+ folks reviewing quality.
“The publicly exposed UI of Google is only the tip of the iceberg… there’s lots of interesting information that we know how to extract but haven’t figured out how to present to users…”
Try and create a bad neighborhood intentionally involving another site.
— find webhost
- sign up for an account in the same IP block
- register your domain
Domain Name Attack: Playbook
- buy links from traceable link programs
— buy tlds then send requests
— spam search suggestions
XSS (cross-site scripting) attacks
- form software vulnerabilities
Sniffing wifi at conference
WordPress, control panels, forums, social sites, Google Webmaster Tools
- send out resumes for competitor’s IT staff
- send the project director bad press about IT
IT IS NOT IF BUT WHEN
- monitor trademark
- affiliate ids Google Alerts
- personal names
— Watch top competitors registrations as well
Protect brand and trademarks
— register your trademarks
- Never allow UGC on a PPC landing page
Print IP/host to screen and html comments and headers
Use copyscape for high value content
Use self-referral links, content, words and copy
Hide self links with css blind links
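The tip about printing the IP/host to the screen, HTML comments and headers can be sketched as below, so that a scraped copy carries a trace of who fetched it. This is an illustration only: `stamp_page`, the `served-to` class and the `X-Served-To` header name are made up for the example.

```python
import html
from datetime import datetime, timezone
from typing import Tuple


def stamp_page(
    page_html: str,
    ip: str,
    host: str = "",
    timestamp: str = "",
) -> Tuple[str, Tuple[str, str]]:
    """Return (stamped HTML, response header) identifying who the page was served to."""
    ts = timestamp or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    who = " ".join(part for part in (ip, host) if part)
    tag = html.escape(f"served-to {who} {ts}")
    visible = f'<p class="served-to">{tag}</p>'   # printed to the screen
    comment = f"<!-- {tag} -->"                   # survives in copied markup
    stamp = f"{visible}\n{comment}\n"
    if "</body>" in page_html:
        body = page_html.replace("</body>", stamp + "</body>", 1)
    else:
        body = page_html + "\n" + stamp
    return body, ("X-Served-To", who)
```

If the stamped text later turns up on another site, the IP and timestamp point back to the request that scraped it.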