Misbehaving bots, script kiddies, etc.

AussieDave

24 years & still going!
Joined
Nov 28, 2013
Messages
5,103
Reaction score
3,607
I'm sure I don't need to remind anyone running a WP site about all the degenerates trying to gain access on a daily basis. To try to counteract this I used Wordfence, but that doesn't stop it, especially if you're getting hit with a botnet. Then there's Baidu, which sucks bandwidth, along with other scrapers like Majestic and so on that I just don't want on the site. I have since trashed WF and replaced it with ZBBlock, which is set up outside WP but is pulled in via an include in wp-load.php; this keeps all the nasties at bay before they ever reach a page.
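
For anyone wondering what that hook-in looks like, it's roughly a one-line require near the top of wp-load.php; the path below is a placeholder for wherever ZBBlock is actually installed, so check its own install notes:

// Near the top of wp-load.php: pull the blocker in before WordPress
// loads anything, so flagged requests die before reaching a single page.
// The path is a placeholder for your own ZBBlock install location.
require_once( dirname( __FILE__ ) . '/zbblock/zbblock.php' );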

Do others take a proactive stance against these misbehaving bots, especially ones such as Majestic and co, who scrape your content and make it available for sale to your competitors?
 

Online18Casino

Affiliate Guard Dog Member
Joined
Sep 7, 2009
Messages
596
Reaction score
58
I have been using 'Better WP Security' for about six months now and it seems to have eliminated all of the above problems. Give it a try: it didn't affect my rankings, got rid of the bot attacks, and blocks known bad bots. It also renames /wp-admin so bots can't find the login page to probe.

When you install it, do the basic set up.

Then enable #9 on the dashboard to "Block known bad hosts and agents with HackRepair.com's blacklist.. "
Then enable #11 to "Hide admin login area"

That should help; it helped me a great deal. My host kept threatening to shut me down over bandwidth problems caused by the botnet attacks. This put a stop to that.
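
For anyone curious what that blacklist option boils down to, it's essentially deny rules keyed on user-agent (and host) strings. A rough PHP sketch of the user-agent half of the idea (the agent strings are illustrative placeholders, not the actual HackRepair.com list):

<?php
// Illustrative only: refuse requests whose User-Agent contains a known-bad string.
// These patterns are placeholders, NOT the real HackRepair.com blacklist.
$bad_agents = array( 'sqlmap', 'libwww-perl', 'masscan' );

$ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? strtolower( $_SERVER['HTTP_USER_AGENT'] ) : '';

foreach ( $bad_agents as $bad ) {
    if ( $ua !== '' && strpos( $ua, $bad ) !== false ) {
        header( 'HTTP/1.1 403 Forbidden' );
        exit( 'Access denied.' );
    }
}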
 

Vladi

Affiliate Guard Dog Member
Joined
Feb 4, 2008
Messages
772
Reaction score
115
How do you keep the "bad bots" out while letting the "good bots" (search engine spiders) in?

LOL, if only someone had a simple answer to that one! A well-programmed bot that mimics a search engine can be very difficult to keep out. It comes down to your time-and-effort versus reward tradeoff.

One of the best ways to catch scrapers is with a honeypot. You include a hidden link on your site that humans never see (hidden via CSS) and exclude it from being spidered via robots.txt. All "good bots" follow the instructions in robots.txt, so they will never follow the link. The bad bots usually just follow every link on your site regardless of robots.txt, especially if they are scrapers, so as soon as one hits your honeypot page you know it can be banned. The advantage of this approach is that you don't have to keep lists of bots that are always out of date; instead they are blocked by their own behaviour. There are tools to help with this, such as Project Honey Pot, which can let you ban bad bots that other sites have detected.
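
A minimal sketch of that trap in PHP, assuming you pick your own trap URL and log file (both names below are placeholders):

<?php
// honeypot.php - hypothetical trap page; file names are placeholders.
//
// 1. Link to it from every page with a link humans never see, e.g.
//      <a href="/honeypot.php" style="display:none" rel="nofollow">&nbsp;</a>
// 2. Tell well-behaved crawlers to stay away in robots.txt:
//      User-agent: *
//      Disallow: /honeypot.php
//
// Anything that still requests this URL ignored robots.txt, so record it.
$banlist = __DIR__ . '/banned-ips.txt';
$ip      = $_SERVER['REMOTE_ADDR'];
$ua      = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '-';

file_put_contents( $banlist, date( 'c' ) . "\t" . $ip . "\t" . $ua . "\n", FILE_APPEND | LOCK_EX );

header( 'HTTP/1.1 403 Forbidden' );
exit( 'Forbidden.' );

Your front-end include (ZBBlock-style or otherwise) can then refuse any IP that shows up in that file.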

However, if someone is smart enough to scrape your site using a bot that follows robots.txt directives and advertises itself as Googlebot or Bing or something else legitimate via its user-agent identifier, then you're going to have trouble blocking it unless you want to go through the painful and never-ending process of keeping lists of valid IP addresses for each known good bot.
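
For what it's worth, one alternative to maintaining IP lists that doesn't go stale as quickly is the reverse-then-forward DNS check Google documents for verifying Googlebot. A rough PHP sketch (IPv4 only, function name illustrative; the hostname suffixes are the ones Google publishes):

<?php
// Verify a visitor claiming to be Googlebot via reverse + forward DNS.
function looks_like_real_googlebot( $ip ) {
    $host = gethostbyaddr( $ip );                      // reverse lookup
    if ( $host === false || $host === $ip ) {
        return false;                                  // no usable rDNS record
    }
    if ( ! preg_match( '/\.(googlebot|google)\.com$/i', $host ) ) {
        return false;                                  // not a Google hostname
    }
    // Forward-confirm: the hostname must resolve back to the same IP,
    // otherwise the rDNS record itself could be forged.
    return gethostbyname( $host ) === $ip;
}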

One thing not many people are aware of: if you have allowed Google to cache your site, there is nothing you can ever do to stop someone scraping that cached copy unless you tell Google to stop caching it via your meta tags.
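
For reference, the directive involved is noarchive. In a WordPress theme or tiny plugin it could be emitted roughly like this (a sketch, not any particular plugin's setting):

<?php
// Ask search engines not to keep a cached copy of pages on this site.
add_action( 'wp_head', function () {
    echo '<meta name="robots" content="noarchive">' . "\n";
} );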

In addition, there are software solutions that aim to detect bad bots from the headers they send and other misconfigurations that give them away even when they try to cloak themselves as Google, for example Bad Behavior. But these can be unreliable and run the risk of blocking legitimate users who are behind proxies and the like.
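
As a flavour of the kind of sanity check such tools run (an illustrative rule, not Bad Behavior's actual rule set): ordinary browsers virtually always send an Accept header, so a "browser" without one is a cheap red flag.

<?php
// Illustrative header sanity check, in the spirit of tools like Bad Behavior.
// NOT that plugin's real rule set.
$ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';

// A client sending a browser-style User-Agent but no Accept header
// is behaving unlike any real browser.
if ( $ua !== '' && ! isset( $_SERVER['HTTP_ACCEPT'] ) ) {
    header( 'HTTP/1.1 403 Forbidden' );
    exit;
}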
 

AussieDave

24 years & still going!
Joined
Nov 28, 2013
Messages
5,103
Reaction score
3,607
there are software solutions that aim to detect bad bots from the headers they send and other misconfigurations that give them away even when they try to cloak themselves as Google, for example Bad Behavior. But these can be unreliable and run the risk of blocking legitimate users who are behind proxies and the like.

That's essentially what ZBBlock (http://www.spambotsecurity.com/zbblock.php) does, but it also ships a *.csv file (updated regularly) with thousands of known blacklisted IPs. It goes further and checks against places like StopForumSpam and a few other blacklists, blocks Tor, and runs a number of other checks. It also has *.ini files that let you customise further what it flags. All operational files are stored in a protected vault environment.
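
The IP side of that is conceptually just a lookup against the shipped list. A rough sketch of what such a check amounts to (the file name and one-IP-per-row format are placeholders, not ZBBlock's actual layout):

<?php
// Check the visitor's IP against a CSV blacklist; format and file name are placeholders.
$ip      = $_SERVER['REMOTE_ADDR'];
$csvfile = __DIR__ . '/blacklisted-ips.csv';

if ( ( $fh = fopen( $csvfile, 'r' ) ) !== false ) {
    while ( ( $row = fgetcsv( $fh ) ) !== false ) {
        if ( isset( $row[0] ) && trim( $row[0] ) === $ip ) {
            fclose( $fh );
            header( 'HTTP/1.1 403 Forbidden' );
            exit( 'Access denied.' );
        }
    }
    fclose( $fh );
}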

Though I am wondering whether it's blocking legitimate users, which I obviously don't want to happen. That's why I started this thread: to get feedback :)

There's no good reason for Majestic and those other bots to scrape your site if all they do is make that data available for sale to your competitors. Sure, you can purchase software to spy on what another site is doing in the way of SEO and link building, but you can block those IPs too ;)
 

rak

Affiliate Guard Dog Member
Joined
Dec 2, 2010
Messages
60
Reaction score
2
However, if someone is smart enough to scrape your site using a bot that follows robots.txt directives and advertises itself as Googlebot or Bing or something else legitimate via its user-agent identifier, then you're going to have trouble blocking it unless you want to go through the painful and never-ending process of keeping lists of valid IP addresses for each known good bot.

My thoughts exactly.
And I don't think it's too difficult to do.

e.g.

use curl and send the user agent as Googlebot
read robots.txt at the directory level
follow its allow/deny instructions
spider the site (grabbing local-domain links from each page spidered), with every request sent via curl as the Googlebot user agent

If that basic spider algorithm didn't check robots.txt, the honeypot would work.
 

AussieDave

24 years & still going!
Joined
Nov 28, 2013
Messages
5,103
Reaction score
3,607
My thoughts exactly.
And I don't think it's too difficult to do.

e.g.

use curl and send the user agent as Googlebot
read robots.txt at the directory level
follow its allow/deny instructions
spider the site (grabbing local-domain links from each page spidered), with every request sent via curl as the Googlebot user agent

If that basic spider algorithm didn't check robots.txt, the honeypot would work.

Yep, it's a simple no-brainer.

But what if the bot identifies itself as ABC, checks robots.txt and obeys its allow/deny rules, yet is still there for dubious reasons?

Here are a few kill logs from ZBBlock (even fake Baidu bots are banned):

#: 3332 @: Wed, 15 Jan 2014 21:06:26 +1100 Running: 0.4.10a3 / 73
Host: baiduspider-180-76-5-203.crawl.baidu.com
IP: 180.76.5.203
Score: 2
Violation count: 1 INSTA-BANNED
Why blocked: Baidu access is not allowed on this site. (CUST-IB-HN-CN) INSTA-BAN. No access allowed from China (CUST-IP-1235/CN-20130910) You have been instantly banned due to extremely hazardous behavior!
Query:
Referer:
User Agent: Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2
Reconstructed URL: http:// mydomain .com /robots.txt


#: 3333 @: Wed, 15 Jan 2014 21:29:19 +1100 Running: 0.4.10a3 / 73
Host: 192.3.182.181
IP: 192.3.182.181
Score: 1
Violation count: 1
Why blocked: Illegal Character, Illegal Configuration, Empty Field, or Too Many Characters in RDNS (HNB-000). Dangerous hostname detected! Neutralized (HNB-FIX).
Query:
Referer: http:// mydomain .com/
User Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Reconstructed URL: http:// mydomain .com /

#: 3329 @: Wed, 15 Jan 2014 19:25:32 +1100 Running: 0.4.10a3 / 73
Host: ns224303.ovh.net
IP: 46.105.115.184
Score: 1
Violation count: 1
Why blocked: Cloud Services. Not an access provider ISP. Allows IP hopping. (CLD-0210).
Query:
Referer:
User Agent:
Reconstructed URL: http:// mydomain .com /atom.xml


The point of this thread was to ask whether other webmasters block bots and whatnot that are up to mischief, or do you let anything and everything crawl your sites?
 