Blocking Robots from crawling site

I have a customer on a Linux server at Jodo.

He studies his logs religiously, sees all kinds of bots crawling his site, and wants to stop them.

Here are a couple of snippets from the log file he sent:

HTML:
124.115.6.14 - - [17/Mar/2013:02:44:56 -0500] "GET /robots.txt HTTP/1.1" 200 115 "-" "Mozilla/5.0 (compatible; Sosospider/2.0; http://help.soso.com/webspider.htm)"
199.21.99.80 - - [17/Mar/2013:02:45:37 -0500] "GET /images_iob/reserve_it1.gif HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; YandexImages/3.0; http://yandex.com/bots)"
108.59.8.80 - - [17/Mar/2013:03:16:57 -0500] "GET /robots.txt HTTP/1.0" 200 115 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.3; http://www.majestic12.co.uk/bot.php?)"
208.115.111.71 - - [17/Mar/2013:04:12:59 -0500] "GET /robots.txt HTTP/1.1" 200 115 "-" "Mozilla/5.0 (compatible; Ezooms/1.0; [email protected])"


Here is what he says is his current .htaccess file:

HTML:
order deny,allow

deny from all

SetEnvIfNoCase User-Agent "^bingbot/2.0" bad_bot
SetEnvIfNoCase User-Agent "^ezooms/1.0" bad_bot
SetEnvIfNoCase User-Agent "^Ezooms/1.0" bad_bot
SetEnvIfNoCase User-Agent "^Mail.RU_Bot/2.0" bad_bot
SetEnvIfNoCase User-Agent "^MJ12bot" bad_bot
SetEnvIfNoCase User-Agent "^MJ12bot/v1.4.3" bad_bot
SetEnvIfNoCase User-Agent "^Sosospider" bad_bot
SetEnvIfNoCase User-Agent "^Sosospider/2.0" bad_bot
SetEnvIfNoCase User-Agent "^YandexBot" bad_bot
SetEnvIfNoCase User-Agent "^Yandex/1\.01\.001" bad_bot
SetEnvIfNoCase User-Agent "^YandexBot/3\.0" bad_bot
SetEnvIfNoCase User-Agent "^YandexBot/3.0" bad_bot

order allow,deny
allow from all
deny from env=bad_bot
deny from 93.63.195.11
deny from 91.232.96.40

He says it is not working and bots are still coming through. He would like to know how to stop them. Is he missing something? I'm not sure why he wants to block Bing as well, but he does.

Thanks,
Greg
 
Bad bots do ignore robots.txt, but good bots like Bing don't. Yandex is also a legitimate bot, though it brings mostly Russian traffic, so if he has no need for it, it's fine to block. I'd also make sure he has a robots.txt in place.
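One thing worth checking in that .htaccess: the `^` at the start of patterns like "^Sosospider/2.0" anchors the match to the beginning of the User-Agent string, but as his logs show, these bots all identify as "Mozilla/5.0 (compatible; Sosospider/2.0; ...)", so the anchored rules never match. Here's a minimal sketch of a corrected version, assuming Apache 2.2-style access control with mod_setenvif, with the duplicate and version-specific rules collapsed into one per bot:

Code:
# Match the bot name anywhere in the User-Agent string, not just at the start.
SetEnvIfNoCase User-Agent "Sosospider" bad_bot
SetEnvIfNoCase User-Agent "Ezooms" bad_bot
SetEnvIfNoCase User-Agent "MJ12bot" bad_bot
SetEnvIfNoCase User-Agent "Mail\.RU_Bot" bad_bot
SetEnvIfNoCase User-Agent "Yandex" bad_bot
SetEnvIfNoCase User-Agent "bingbot" bad_bot

# Keep a single Order block; the stray "order deny,allow" / "deny from all"
# at the top of his file should be removed, since only one Order applies.
Order allow,deny
Allow from all
Deny from env=bad_bot
Deny from 93.63.195.11
Deny from 91.232.96.40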
 

This is all he had in robots.txt:

HTML:
User-agent: msnbot
Disallow: /

He said he felt the others ignored robots.txt anyway, so he did NOT add them. He also said he added an XML sitemap a while back.
 
If he wants no crawling at all, robots.txt should contain:

User-agent: *
Disallow: /
 
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /

would allow Google only, but as previously stated, many bots just ignore it anyhow. If some are being a particular pain, you can often block their IPs.
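If he goes the IP route, the same Deny syntax already in his file works. A short sketch, using addresses taken from his log snippet as examples (these go inside the same Order allow,deny block as his existing rules):

Code:
# Block single addresses seen hammering the site
Deny from 124.115.6.14
Deny from 208.115.111.71
# Or block a whole range with a partial address or CIDR notation
Deny from 199.21.99
Deny from 108.59.8.0/24

Bots rotate IPs, though, so User-Agent matching usually catches more than blocking individual addresses.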
 