Googlebot slurping up bandwidth!

antic

I spotted a client of mine using 2-3GB of bandwidth each month, which seemed very unusual - they shouldn't have that much traffic! I checked AWStats to make sure HSphere was reporting correctly, and noticed something interesting for the current month...

Bots like Googlebot, MSN and Yahoo's Slurp were using a heap of it:

Googlebot = 751 MB
MSNBot = 188 MB
Yahoo Slurp = 108 MB
TOTAL = 1047 MB

Under the breakdown of traffic by file type, Adobe Acrobat (PDF) is racking up 742 MB of the downloads. Google indexes PDFs, so that coincides with the Googlebot usage. However, the client only has about 380 MB of PDF files stored in total, so the same files must be getting downloaded over and over again.

This client uses a CMS which delivers the files to the browser, using dynamic URLs. Perhaps the CMS is making every file "expire" immediately, causing the bots to re-download them every time they crawl the site?

Is there an easy way to tell bots not to download PDF files? That would do the trick for now, while I investigate document expiry...

Any other suggestions?
 
It does seem it is re-downloading them, for sure.

What about putting the PDFs in a certain folder and denying legit spider activity there with a robots.txt?
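Something like this would do it, assuming you shifted them all into (say) a /pdf/ folder - the folder name here is just an example:

User-agent: *
Disallow: /pdf/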
 

Howdy Stephen. I'd rather not - it's a medical research site, so their docs (presentations, scientific papers, etc.) are pretty important to them and need to be indexed.

I'm fiddling with the CMS code now... it does seem that everything is set to immediate expiry (e.g. the classic ASP Response.Expires = 0), so I'm going to force all downloads (css, js, pdf, ppt, etc. - even gif and jpg) to expire in 7 days and then see what happens.
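In other words, something like this wherever the CMS sets its response headers (simplified - where exactly it goes depends on the CMS):

<%
' Was: Response.Expires = 0   (expire immediately)
Response.Expires = 7 * 24 * 60      ' Response.Expires takes minutes, so 10080 = 7 days
Response.CacheControl = "public"    ' let bots and proxies cache it as well
%>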

Do you think 7 days is too long? What's the normal/usual document expiry time?
 

7 days sounds fine. I'd imagine Google is seeing that immediate expiration and wanting to re-download, since it 'thinks' the document may have changed.
 
Hi,

Sometimes it is worth creating and submitting a sitemap to Google; it can indicate the frequency of updates to individual pages/content. There is quite a lot of info in the Webmaster Tools section of Google.
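For example, an entry in sitemap.xml can look something like this (the URL and values are just placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/docs/some-paper.pdf</loc>
    <lastmod>2008-01-15</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>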

Might be worth looking at if your bandwidth becomes a problem.

Cheers,

Paul.
 
Stephen, could you investigate this please? Sorry for the long explanation. :)
I have written a demo script to show this issue in action - will PM links if you need em.

I've been investigating the esoteric and arcane art of interpreting Googlebot's request headers. And there's an issue with them on the Jodo servers (more likely with HSphere), at least in the way one of my sites works.

Some background info:

Googlebot and other crawlers, when requesting a file from a server, send a special HTTP header named "if-modified-since". This header is very important for both bots and websites, as it's used to save heaps of bandwidth during crawling. The header contains the time the bot last downloaded the requested file. The server compares that with the file's timestamp on disk: if the file has been updated since, a fresh copy is served; if it hasn't, the server simply sends a "304 Not Modified" with no content, so the bot knows it already has the latest copy and moves on.
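In classic ASP terms, a download script that honours it would look roughly like this (a simplified sketch - the date parsing and timezone handling are glossed over, and strPath just stands for wherever the CMS resolves the physical file path):

<%
' Sketch only: honour if-modified-since in a script that serves files dynamically.
Dim fso, fileModified, ims
Set fso = Server.CreateObject("Scripting.FileSystemObject")
fileModified = fso.GetFile(strPath).DateLastModified   ' strPath = physical path to the document

ims = Request.ServerVariables("HTTP_IF_MODIFIED_SINCE")
If Len(ims) > 0 Then
    ' Naive parse of the HTTP date: drop the weekday and " GMT" so CDate can cope.
    ims = Replace(ims, " GMT", "")
    If InStr(ims, ",") > 0 Then ims = Trim(Mid(ims, InStr(ims, ",") + 1))
    If IsDate(ims) Then
        If CDate(ims) >= fileModified Then
            Response.Status = "304 Not Modified"
            Response.End
        End If
    End If
End If

' Otherwise send the file as normal, with caching headers so the next crawl can be conditional.
Response.ContentType = "application/pdf"
Response.Expires = 7 * 24 * 60   ' minutes
' ... stream the file out with ADODB.Stream / Response.BinaryWrite ...
%>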

The problem:

This particular site uses a 404 page to process requests for PDFs and other documents. The linked URLs don't physically exist - requests for them are handled by a 404 error page, which checks user credentials before sending the file to the browser. What's happening is that the "if-modified-since" HTTP header is being lost somewhere during the 404 redirection process on the server. All the other headers (user-agent, etc.) come through fine, but that one goes missing. And it's an important one!

The result is that when the site is crawled, I can't tell whether a bot has the latest version or not, so every file is sent to Google et al. every single time it's requested. Google alone clocks up about 3GB each month downloading the same files over and over when it doesn't need to. However, this whole process works 100% fine on my dev server, where I set up the 404 error document directly in IIS (as opposed to in the HSphere CP).

Sooo.. something on the server - perhaps in the way HSphere applies error documents - is dropping this header when it passes the request on to the 404 error script. Can Stephen or someone please investigate this? I have written a test script to clearly demonstrate the problem - please PM me for links.

Thanks! :thumb:
(edited to improve my explanation)
 
Something I've just discovered... any extra headers I add to the client request go through OK. So if I add a header to my browser request called "foobar", the 404 error script can see it no problem. I can even add "if-modified-since-then" and it works fine.

So something is actively stripping "if-modified-since" from the client headers. Very weird!
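For anyone following along, the check boils down to something like this sitting behind the 404 error document (a simplified sketch, not the actual test script):

<%
' Dump what the 404 error script actually receives.
' On the dev box, HTTP_IF_MODIFIED_SINCE shows up whenever the client sends a
' conditional request; behind the HSphere-configured 404 it comes back empty.
Response.ContentType = "text/plain"
Response.Write "if-modified-since: [" & Request.ServerVariables("HTTP_IF_MODIFIED_SINCE") & "]" & vbCrLf
Response.Write "user-agent:        [" & Request.ServerVariables("HTTP_USER_AGENT") & "]" & vbCrLf & vbCrLf
Response.Write "All raw headers:" & vbCrLf & Request.ServerVariables("ALL_RAW")
%>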
 