Odd entries in web log

Discussion in 'TechTalk' started by riley, Apr 24, 2004.

  1. riley

    riley Perch

    I didn't know where to post this, so I put it here.

    Over the last couple months I've seen entries like this in my log files (I obfuscated my domain name *****):
    date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
    2004-04-24 18:18:10 GET / - 80 guest Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+NT+5.0) - - www.***********.com 401 1 1326 1812 237 109
    There are a few for each of my domains, and the client was refused access with a 401 error in each case, but I'm concerned about the persistence of the attempts. They are all from the same IP address:
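
    For anyone who wants to pick entries like this apart, here is a minimal Python sketch that maps one log line onto the field names from the header line above. The sample entry is hypothetical: its two IP addresses come from the reserved documentation ranges and www.example.com is a placeholder, since the real values are not shown in this post. A line with fields removed (like the one quoted above) will not match the header and is returned as None.

```python
# Field names copied from the header line quoted in the post.
FIELDS = (
    "date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username "
    "c-ip cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status "
    "sc-substatus sc-win32-status sc-bytes cs-bytes time-taken"
).split()


def parse_log_line(line):
    """Map one space-separated log line onto the header field names.

    Returns None when the number of values does not match the header,
    for example when fields have been stripped from the line.
    """
    values = line.split()
    if len(values) != len(FIELDS):
        return None
    return dict(zip(FIELDS, values))


# Hypothetical entry: both IP addresses and the host name are placeholders.
SAMPLE = (
    "2004-04-24 18:18:10 192.0.2.10 GET / - 80 guest 198.51.100.7 "
    "Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+NT+5.0) - - "
    "www.example.com 401 1 1326 1812 237 109"
)

entry = parse_log_line(SAMPLE)
print(entry["c-ip"], entry["cs-uri-stem"], entry["sc-status"], entry["sc-substatus"])
```

    Grouping the parsed entries by c-ip and sc-status is then a quick way to see how often one client is being refused.
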

    ARIN shows the following for that address:
    OrgName:    ClearBlue Technologies
    OrgID:      CLEAR-1
    Address:    125 Elwood Davis Road
    City:       Syracuse
    StateProv:  NY
    PostalCode: 13219
    Country:    US
    NetRange: -
    NetName:    NAVI-A84B0000-16-0
    NetHandle:  NET-168-75-0-0-1
    Parent:     NET-168-0-0-0-0
    NetType:    Direct Allocation
    Updated:    2004-02-26
    Has anybody else seen log entries like this?

  2. riley

    riley Perch

    After investigating this IP address I have discovered that it is a bad bot (a misbehaving spider). I was not able to find out what information this bot looks for; it might be completely harmless and legitimate, or it could be harvesting email addresses for spam lists. But the way it tries to access web sites is clearly out of bounds for any bot these days.

    Bots are supposed to look for and read a robots.txt file in the site's root. In that text file, you can disallow folders and pages; i.e., tell the bot where you don't want it to go.
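
    As a rough illustration of what a well-behaved bot does with that file, here is a small Python sketch using the standard library's urllib.robotparser. The robots.txt content, the folder names, the bot name "ExampleBot" and the www.example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the folder names are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_text):
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite bot asks before requesting each URL.
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/list.html")
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
print(blocked, allowed)  # False True
```

    Keep in mind that robots.txt is only a request. Nothing in it stops a bot that chooses to ignore it.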

    Short of using the robots.txt file, a bot should at least look for meta tags in the default page that opens when it accesses the site. The meta tags can direct the bot to index or not index (process or ignore) the page and direct the bot to follow or not follow any links that might be found in the page.
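
    The meta tag check can be sketched the same way. This Python example uses the standard library's html.parser to read the robots meta tag from a page and decide whether the page may be indexed and its links followed. The sample page and the helper names (RobotsMetaParser, robots_policy) are hypothetical, and only the common index/noindex, follow/nofollow and none directives are handled.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for item in attributes.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_policy(html):
    """Return (may_index, may_follow) for a page; both default to True."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = "noindex" not in found and "none" not in found
    may_follow = "nofollow" not in found and "none" not in found
    return may_index, may_follow


# Hypothetical default page that asks bots not to index it but to follow links.
PAGE = (
    '<html><head><meta name="robots" content="noindex, follow"></head>'
    "<body></body></html>"
)
print(robots_policy(PAGE))  # (False, True)
```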

    This bot does neither. Instead it simply tries to browse the root folder. This behavior is unacceptable by today's standards. But with "folder browsing" turned off on the Windows server, the bot gets a "401 Unauthorized" error (as you can see in the log listing in the previous post), so there is no problem. Of course, if you have asked tech support to enable "folder browsing" for your site, this bot will browse its way through your site. On a Linux server, you can ban this IP address with the .htaccess file, but I don't know how to (or if you can) do that on a Windows server. This cannot be done via any scripting code (ASP, ASP.NET, etc.) because the bot is browsing the folders, not executing any scripts.

    I hope that sheds a little light on the issue.
  3. SubSpace

    SubSpace Bass

    Just always have folder browsing disabled.
    If for some odd reason you need to browse a folder, make a script that lists all the files in the directory.
  4. yorri

    yorri Perch

    JodoHost should BAN THE SUCKER!!!! for us all hehe
