I spotted a client of mine using 2-3 GB of bandwidth each month, which seemed very unusual - they shouldn't have anywhere near that much traffic! I checked AWStats to make sure HSphere was reporting correctly, and noticed something interesting for the current month...
Bots like Googlebot, MSN and Yahoo's Slurp were using a heap of it:
Googlebot = 751 MB
MSNBot = 188 MB
Yahoo Slurp = 108 MB
TOTAL = 1047 MB
Under the breakdown of traffic by file type, Adobe Acrobat files are racking up 742 MB of the downloads. Google indexes PDFs, so that fits with the Googlebot usage. However, the client only has about 380 MB of PDF files stored in total, so the same files must be getting downloaded over and over again.
This client uses a CMS that delivers the files to the browser via dynamic URLs. Perhaps the CMS is making every file "expire" immediately, causing the bots to re-download them every time they crawl the site?
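To test that theory, I'm going to check what cache headers the CMS actually sends with a PDF. A quick Python sketch like this should show them - the URL below is just a placeholder for one of the CMS's dynamic download links:

    # Quick check of the cache headers the CMS sends with a PDF download.
    # The URL is a placeholder - substitute a real document URL from the site.
    import urllib.request

    url = "http://www.example.com/cms/download.php?doc=123"  # hypothetical dynamic URL

    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        for header in ("Cache-Control", "Expires", "Last-Modified", "ETag", "Content-Type"):
            print(header + ":", resp.headers.get(header, "(not sent)"))

If there's no Last-Modified or ETag, or Cache-Control says no-cache, that would explain why the bots keep fetching everything from scratch.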
Is there an easy way to tell bots not to download PDF files? That would do the trick for now, while I investigate document expiry...
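For the record, the kind of thing I have in mind is a robots.txt rule like this - I believe Googlebot, Slurp and msnbot all understand the * and $ wildcard extensions, though the exact pattern would depend on how the CMS builds its download URLs:

    # Block the major crawlers from fetching PDFs (the pattern is a guess at the URL scheme)
    User-agent: Googlebot
    Disallow: /*.pdf$

    User-agent: Slurp
    Disallow: /*.pdf$

    User-agent: msnbot
    Disallow: /*.pdf$

Ideally I'd still fix the expiry headers instead, so the documents stay indexed.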
Any other suggestions?