Friday, August 21, 2009

Brand Dimensions still won't stop scraping my site

Despite a robots.txt directive that disallows BDFetch, Brand Dimensions has downloaded hundreds of my original pages with photos on them. None of these pages mention any company's brand names, so I'm curious what BD is really doing. I'm guessing they could also provide some serious competitive intelligence to their clients. I just wonder what happens when they represent competing companies, like Coke and Pepsi. Here are some representative entries from my log files:

/var/log/httpd/access_log.1:72.14.164.139 - - [11/Aug/2009:07:25:27 -0400] "GET /carleton/reunionweb/WebPage-Full.00001.html HTTP/1.1" 200 1394 "-" "LinkWalker/2.0"
/var/log/httpd/access_log.1:72.14.164.140 - - [11/Aug/2009:07:25:42 -0400] "GET /carleton/reunionweb/WebPage-Full.00015.html HTTP/1.1" 200 1468 "-" "LinkWalker/2.0"
/var/log/httpd/access_log.1:72.14.164.197 - - [11/Aug/2009:07:25:57 -0400] "GET /carleton/reunionweb/WebPage-Full.00018.html HTTP/1.1" 200 1468 "-" "LinkWalker/2.0"
/var/log/httpd/access_log.1:72.14.164.157 - - [11/Aug/2009:07:26:12 -0400] "GET /carleton/reunionweb/WebPage-Thumb.00023.html HTTP/1.1" 200 3648 "-" "LinkWalker/2.0"
/var/log/httpd/access_log.1:72.14.164.179 - - [11/Aug/2009:07:26:27 -0400] "GET /carleton/reunionweb/WebPage-Full.00013.html HTTP/1.1" 200 1468 "-" "LinkWalker/2.0"
/var/log/httpd/access_log.1:72.14.164.193 - - [11/Aug/2009:07:26:42 -0400] "GET /skiing/webdest/WebPage-Full.00011.html HTTP/
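
To see how much of a log one crawler accounts for, entries like these can be tallied by user agent. What follows is a minimal Python sketch, assuming Apache's combined log format with an optional grep-style "filename:" prefix. The names LOG_RE, count_by_agent and SAMPLE are made up for illustration. Lines that do not parse, such as truncated ones, are simply skipped.

```python
import re
from collections import Counter

# Apache combined log format. search() is used below so that a leading
# "filename:" prefix (as produced by grep across several logs) is ignored.
LOG_RE = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_by_agent(lines):
    """Return (hits per user agent, set of client IPs per user agent)."""
    hits = Counter()
    ips = {}
    for line in lines:
        match = LOG_RE.search(line)
        if match is None:
            continue  # truncated or malformed entry
        agent = match.group("agent")
        hits[agent] += 1
        ips.setdefault(agent, set()).add(match.group("ip"))
    return hits, ips


# Two complete entries and one truncated entry, copied from the log above.
SAMPLE = [
    '/var/log/httpd/access_log.1:72.14.164.139 - - [11/Aug/2009:07:25:27 -0400] '
    '"GET /carleton/reunionweb/WebPage-Full.00001.html HTTP/1.1" 200 1394 "-" "LinkWalker/2.0"',
    '/var/log/httpd/access_log.1:72.14.164.140 - - [11/Aug/2009:07:25:42 -0400] '
    '"GET /carleton/reunionweb/WebPage-Full.00015.html HTTP/1.1" 200 1468 "-" "LinkWalker/2.0"',
    '/var/log/httpd/access_log.1:72.14.164.193 - - [11/Aug/2009:07:26:42 -0400] '
    '"GET /skiing/webdest/WebPage-Full.00011.html HTTP/',
]

hits, ips = count_by_agent(SAMPLE)
for agent, n in hits.most_common():
    print(agent, n, sorted(ips[agent]))
```

In practice the lines would come from the log file itself, for example by passing an open file object to count_by_agent instead of SAMPLE.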

Brand Dimensions appears to have switched the name of their bot from BDFetch to LinkWalker to sidestep robots.txt directives. Based on my own Google Analytics data, I can safely say a lot of people are interested in what Brand Dimensions is doing and how to stop it. More LinkWalker info here. Other webmasters report that the LinkWalker agent is also used by spambots harvesting email addresses for phishing attacks and the like.

Here are my latest robots.txt lines:

User-agent: BDFetch
Disallow: /

User-agent: BPImageWalker
Disallow: /

User-agent: VoilaBot
Disallow: /

User-agent: LinkWalker/2.0
Disallow: /

User-agent: LinkWalker
Disallow: /
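
Rules like these can be sanity-checked offline before relying on them. Here is a minimal sketch, assuming Python 3 and its standard urllib.robotparser module. The helper name is_blocked is hypothetical, and the check only shows how a compliant parser reads the file, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post, one group per crawler.
ROBOTS_TXT = """\
User-agent: BDFetch
Disallow: /

User-agent: BPImageWalker
Disallow: /

User-agent: VoilaBot
Disallow: /

User-agent: LinkWalker/2.0
Disallow: /

User-agent: LinkWalker
Disallow: /
"""


def is_blocked(user_agent, path="/"):
    """Return True if a robots.txt-compliant parser would refuse `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return not parser.can_fetch(user_agent, path)


for agent in ("BDFetch", "LinkWalker/2.0", "Mozilla/5.0"):
    print(agent, "blocked" if is_blocked(agent) else "allowed")
```

Python's parser drops the version suffix from the requesting agent before comparing, so as far as I can tell the plain LinkWalker group is the one doing the work here. Either way, the file only affects crawlers that choose to honor it.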



1 comment:

  1. I only know unix. So for .htaccess:
    # domains
    # RewriteEngine On is needed unless it is already enabled elsewhere
    RewriteEngine On
    RewriteCond %{HTTP_REFERER} cbwatch\.com [NC,OR]
    RewriteCond %{HTTP_REFERER} copyscape\.com [NC]
    RewriteRule ^(.*)$ - [F]
    # user agents: tag matching requests, then deny anything tagged
    BrowserMatchNoCase LinkWalker\/2\.0 bad_bot
    BrowserMatchNoCase Mon_httpDownload bad_bot
    BrowserMatchNoCase ZmEu bad_bot
    Order Deny,Allow
    Deny from env=bad_bot
    robots.txt is only for nice bots
