Web 10 Sites up and down for several days

dman

Perch
I'm having issues with all my sites on web 10. They have been going down for 5-45 minutes at a time, 10-12 times over the last 48 hours. I have a monitor that tells me if the site goes down, and I've then checked it manually on various connections and it is unresponsive. I've been working with a support ticket (AHW-33854-987) but they don't seem to see the outages when they check. Any idea what is happening on web 10? Thanks!
 
Little more info... One of these sites had issues with IP address usage instead of host name for a DB, but that was resolved several weeks ago and I can't find any IP references in the code. The one site with a monitor shows the following since 9/4/2011:

09/07/2011 01:04 AM (PDT) 13 min(s)
09/06/2011 08:09 PM (PDT) 2 min(s)
09/06/2011 07:27 PM (PDT) 4 min(s)
09/06/2011 05:39 PM (PDT) 32 min(s)
09/06/2011 03:40 PM (PDT) 4 min(s)
09/06/2011 02:14 PM (PDT) 9 min(s)
09/06/2011 01:32 PM (PDT) 2 min(s)
09/06/2011 11:08 AM (PDT) 3 min(s)
09/06/2011 07:09 AM (PDT) 7 min(s)
09/06/2011 03:11 AM (PDT) 10 min(s)
09/06/2011 01:52 AM (PDT) 13 min(s)
09/06/2011 01:01 AM (PDT) 4 min(s)
09/05/2011 11:08 PM (PDT) 15 min(s)
09/05/2011 10:07 PM (PDT) 3 min(s)
09/05/2011 09:07 PM (PDT) 13 min(s)
09/05/2011 03:10 PM (PDT) 21 min(s)
09/05/2011 02:09 PM (PDT) 14 min(s)
09/05/2011 01:08 PM (PDT) 20 min(s)
09/05/2011 11:30 AM (PDT) 6 min(s)
09/05/2011 10:17 AM (PDT) 3 min(s)
09/05/2011 07:03 AM (PDT) 1 hr(s) 8 min(s)
09/05/2011 06:16 AM (PDT) 5 min(s)
09/05/2011 03:46 AM (PDT) 35 min(s)
09/05/2011 01:10 AM (PDT) 1 hr(s) 5 min(s)
09/04/2011 11:12 PM (PDT) 56 min(s)
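
For anyone who wants to total up a log like the one above, here is a minimal Python sketch. It assumes each line follows exactly the format shown in this post; `parse_outage` and `total_downtime` are hypothetical helper names, not part of any monitoring service's API.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "09/05/2011 07:03 AM (PDT) 1 hr(s) 8 min(s)".
# The format is assumed from the log posted in this thread.
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2} [AP]M) \(PDT\) "
    r"(?:(\d+) hr\(s\) )?(\d+) min\(s\)$"
)

def parse_outage(line):
    """Return (start, duration) for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    start = datetime.strptime(m.group(1), "%m/%d/%Y %I:%M %p")
    duration = timedelta(hours=int(m.group(2) or 0), minutes=int(m.group(3)))
    return start, duration

def total_downtime(lines):
    """Sum the durations of every line that parses; skip the rest."""
    total = timedelta()
    for line in lines:
        parsed = parse_outage(line)
        if parsed is not None:
            total += parsed[1]
    return total

# Three entries copied from the log above.
sample = [
    "09/07/2011 01:04 AM (PDT) 13 min(s)",
    "09/05/2011 07:03 AM (PDT) 1 hr(s) 8 min(s)",
    "09/04/2011 11:12 PM (PDT) 56 min(s)",
]
print(total_downtime(sample))  # 2:17:00
```

Feeding in the full log instead of `sample` gives the total downtime for the whole period.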

The other site is not even live yet and does not use a local DB.

Whether it's hiccups, a flapping NIC, or some other issue, the down alerts have been bugging me constantly for two days. There has to be some issue on this server. Any help is appreciated!
 
This may be purely your site now. With the migration we've moved to a more isolated means of running PHP, Apache, etc., where one error doesn't impact all sites like it used to, unless load goes really high. However, the patterns you see there aren't consistent with high load and look more like site-level Apache load issues.
 
Hey Stephen,

I have a development site on this server using an instant access domain alias as well, and it is affected each time the live site fails. So it isn't just one site. The live site has gone down 7 times today for 3-15 minute intervals according to my monitor, which checks the site every 2-3 minutes, and I have verified this with a manual check. These do appear to be occurring in 1-2 hour intervals at all times of the day and night as well, which probably wouldn't be a traffic or load issue specific to the site. I've seen this type of thing with a bad NIC on other servers. I've updated my ticket as well.
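
One way to check the "1-2 hour intervals" impression against the log posted earlier in this thread is to compute the gaps between consecutive outage start times. This is only a sketch: `start_gaps` is a hypothetical helper, and the four timestamps are copied from the 09/06/2011 entries in that log.

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M %p"

def start_gaps(starts):
    """Return the gaps, in whole minutes, between consecutive outage starts."""
    ordered = sorted(starts)
    return [
        int((later - earlier).total_seconds() // 60)
        for earlier, later in zip(ordered, ordered[1:])
    ]

# Start times taken from the monitor log earlier in the thread.
starts = [
    datetime.strptime(s, FMT)
    for s in (
        "09/06/2011 01:01 AM",
        "09/06/2011 01:52 AM",
        "09/06/2011 03:11 AM",
        "09/06/2011 07:09 AM",
    )
]
print(start_gaps(starts))  # [51, 79, 238]
```

Gaps that cluster around a fixed value would point toward something scheduled; widely varying gaps like these are harder to pin on a timer.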

This only started Monday morning... Did something happen this weekend in terms of migrations? Anyone else having issues on web 10? Could this be some sort of DOS attack? Any help is appreciated. Thanks!
 
I can thankfully answer this: NO!

This week I was able to rest a little on the hardware side, which is good as I was downright sick on Sunday/Monday.

I had given staff off until Tuesday lunch, and that turned out to be a mistake :D as I had to do a hardware recovery (one that advanced, I'd have been in anyway).

There is/was a MAJOR Apache bug out that we got notice of, and we have applied a temporary fix to prevent DDoS from leaving the server until an official patch is out. It is a security rule that puts some limits on Apache, and I have a feeling it may be causing this. It was a pretty major bug and is being exploited all over, so it has to be patched. I cannot be sure that is THE CAUSE, but I will mention it to the Linux team and have them investigate.
 
What are you using to monitor your sites?

I've tried a couple, both services and downloads, but they didn't supply an organized log like the one you posted here.
 
Bunchadogs - I use Pingdom.com, Binarycanary.com, and others. That particular log is from Binarycanary.com's report feature. I cleaned it up a little but they have very affordable services and great functionality including content monitoring.
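
If you want a rough do-it-yourself check alongside a hosted service, the core of an uptime probe is small. The sketch below uses only the Python standard library; `is_up` is a hypothetical function name, the `opener` parameter exists only so the check can be exercised without a network, and this is not how Pingdom or BinaryCanary work internally.

```python
import urllib.error
import urllib.request

def is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx/3xx status within the timeout.

    urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses and
    OSError-based exceptions for timeouts and refused connections, so all of
    those count as "down" here.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False
```

Calling `is_up("http://www.example.com/")` from a scheduled job every few minutes and appending the result with a timestamp to a file would give a basic log of your own. Running it from a machine outside the hosting network matters, since a check from the same server would not see network-level outages.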
 
Hey Stephen,

So any idea when this will stop? Below is a log from the monitors for the last 24+ hours:

09/08/2011 08:01 AM (PDT) 20 min(s)
09/08/2011 05:47 AM (PDT) 1 min(s)
09/08/2011 04:13 AM (PDT) 5 min(s)
09/08/2011 01:48 AM (PDT) 18 min(s)
09/08/2011 01:07 AM (PDT) 3 min(s)
09/07/2011 09:55 PM (PDT) 3 min(s)
09/07/2011 09:04 PM (PDT) 14 min(s)
09/07/2011 07:08 PM (PDT) 2 min(s)
09/07/2011 05:43 PM (PDT) 12 min(s)
09/07/2011 03:59 PM (PDT) 13 min(s)
09/07/2011 03:05 PM (PDT) 4 min(s)
09/07/2011 01:26 PM (PDT) 3 min(s)
09/07/2011 07:15 AM (PDT) 5 min(s)
09/07/2011 03:23 AM (PDT) 9 min(s)
09/07/2011 01:57 AM (PDT) 3 min(s)
09/07/2011 01:04 AM (PDT) 15 min(s)

I haven't gotten a reply to my tickets since Tuesday or very early Wednesday and it kinda seems like no one clearly knows why this is occurring. I have a new site that I want to launch on that service plan but it will look bad to the client if the site is going down frequently.

Are you not seeing any issues with this server that could cause this problem? Is this a problem on all Linux/Apache servers? Should I spend the time to migrate the failing site and new site to another service plan with a different server in the cluster? What can be done? Thanks!
 
I am trying to get these answers. I was a bit under the weather and had some urgent personal matters to attend to the last 2 days, which have kept me a bit more out of the loop than I'd wish.
 
Thanks - I'll check these out!

I feel like I used Pingdom before, but they didn't have the mobile app support or features they list now.

I just looked back at Uptrends and Site24x7 - they were both pretty lacking when I used them, but it looks like everyone has made some major improvements in their services.
 
I totally understand, I was under the weather for a month and a half this summer. Hope you feel better soon.

When you can, please let me know an ETA for a fix, or if I need to consider other options like moving to another server that doesn't have this issue. The latter would be a little time consuming, and I need to be sure it will not be an issue on other Apache servers if it is the temp bug fix you mentioned that is causing the problem.
 
Glad to help! I'm using BinaryCanary mostly due to price and the feature set. The others I looked at were expensive for some notifications and either didn't have or charged more for content monitoring. I'm not sure about mobile app support though and not sure I really need it. It just sends me the alerts so I know if an outage is persisting or has been resolved. I'll check out the others you mentioned as well.
 
...and the site just went down again... that's twice in the last hour. Hope a fix comes soon. Showing a 95.191 Uptime % currently...
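
For a sense of what an uptime figure like that means in minutes, the arithmetic is simple. The monitor's reporting window isn't stated in this thread, so the 7-day window below is purely an assumption for illustration; `uptime_percent` and `implied_downtime` are hypothetical helper names.

```python
def uptime_percent(down_minutes, window_minutes):
    """Uptime as a percentage of the reporting window."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return 100.0 * (1 - down_minutes / window_minutes)

def implied_downtime(uptime_pct, window_minutes):
    """Minutes of downtime implied by an uptime percentage over a window."""
    return window_minutes * (1 - uptime_pct / 100.0)

# ASSUMPTION: a 7-day reporting window (7 * 24 * 60 = 10080 minutes).
window = 7 * 24 * 60
print(round(implied_downtime(95.191, window), 1))  # 484.7
```

Under that assumed window, 95.191% works out to roughly eight hours of downtime; a different window scales the figure proportionally.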
 
We made some changes just today. As for your domain, do you have a ticket in about it? Also, do you have error logs enabled?
 
Hi Tanmaya,

Ok, hopefully the changes have helped... I do have a ticket about it with the specific domains affected, ticket # AHW-33854-987. This is also occurring on more than one domain, so I'd be surprised if it was a domain or web site specific issue.

I do have error logs turned on, and most of the items I see are 404 errors for misc items that should not affect the entire domain. But I'm not clear if the errors are logged at the same time the web sites have gone down. A few of the errors that were logged in the last 3-4 days that are not 404 are shown below:

Directory index forbidden by Options directive:
(4)Interrupted system call: FastCGI: comm with server
File does not exist: /xxx/xxx/xxx/xxx/styles/macFFBgHack.png
FastCGI: incomplete headers (0 bytes) received from server "/xxx/xxx/php5/bin/php"
(104)Connection reset by peer: FastCGI: comm with server "/xxx/xxx/php5/bin/php" aborted: read failed
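
To see at a glance which kinds of errors dominate, the messages above can be bucketed with a few substring checks. This is a rough sketch: `classify` and its category labels are made up for illustration, and the substrings are simply the ones that appear in the lines quoted here. Apache logs "File does not exist" when it serves a 404 and "Directory index forbidden" when it denies a directory listing.

```python
from collections import Counter

def classify(message):
    """Assign an error-log message to a coarse, illustrative category."""
    if "File does not exist" in message:
        return "missing file (404)"
    if "FastCGI" in message:
        return "FastCGI/PHP backend"
    if "Directory index forbidden" in message:
        return "directory listing denied (403)"
    return "other"

# The five messages quoted in this post, paths elided as in the original.
messages = [
    "Directory index forbidden by Options directive:",
    "(4)Interrupted system call: FastCGI: comm with server",
    "File does not exist: /xxx/xxx/xxx/xxx/styles/macFFBgHack.png",
    'FastCGI: incomplete headers (0 bytes) received from server "/xxx/xxx/php5/bin/php"',
    '(104)Connection reset by peer: FastCGI: comm with server "/xxx/xxx/php5/bin/php" aborted: read failed',
]

counts = Counter(classify(m) for m in messages)
for category, n in counts.most_common():
    print(n, category)
```

Three of the five quoted lines fall in the FastCGI bucket, which is the group worth lining up against the monitor's outage times.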

Let me know if any of this helps. Thanks!
 
I'm not saying it is a website-specific issue. I just needed domains/logs so I can get to the issue quickly. Can you please see which page was being accessed when this error occurred, and what can be done to improve the response time of the page?
For example, quite a few customers were using direct IPs, and with the old IPs down now, their scripts timed out.
 
Hey Tanmaya,

The web site monitor checks the home page only for www.abetterxxx.com. I have also had the site open in a browser, and none of the pages will load when this issue is occurring, so I don't believe it is page specific either. The home page is a little heavy at 300 KB+ with jQuery, Ajax, AddThis and other scripts, and could be optimized some, but the site had no issues until this Monday. No changes to the actual pages have occurred in several months either. Do you think this is an issue with the response time? Is there some throttling that will occur if the page size is too large?

I did have the direct IP address issue two weeks ago which caused an hour or two of downtime but I quickly corrected it by using host names instead. This shouldn't be an issue now.
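
For anyone else double-checking their code for leftover direct IPs, a simple scan for IPv4-looking literals can help. The sketch below is an assumption-laden illustration: `find_ip_literals` is a hypothetical helper, the sample PHP line and the documentation address 192.0.2.15 are invented, and the pattern will also flag dotted version strings such as "1.2.3.4".

```python
import re

# Four dot-separated numbers, each 0-255.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b")

def find_ip_literals(text):
    """Return (line_number, address) for every IPv4-looking literal in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IPV4_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits

# Made-up sample: one hard-coded address, one host name.
sample = (
    '$db = mysql_connect("192.0.2.15", $user, $pass);\n'
    '$db2 = mysql_connect("db.example.com", $user, $pass);\n'
)
print(find_ip_literals(sample))  # [(1, '192.0.2.15')]
```

Running this over each source and config file, and reviewing the hits by hand, catches addresses that a search for one specific IP would miss.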

I haven't received a down alert for a couple of hours now so I'm hopeful that whatever changes or adjustments were made have corrected the issue. If you have any other specific suggestions for me to look at, let me know. Thanks!
 
Web 10 still seems to be experiencing periodic outages every day. It was down for a total of two hours today, 1 hour yesterday and 2 hours the day before. Every time I submit a ticket I just get a response that it is back online now, but no details are provided and the issues continue. I've got a new site I was going to add to this reseller plan but can't with these ongoing issues.

Any updates on the cause of the ongoing hour-long outages on web10? What's the best way to move a client from one plan to another so they can get on another server, hopefully with fewer issues?
 
I'm still seeing multiple outages daily on WEB 10. I've submitted tickets (AHW-33854-987) but there is still no resolution and the support staff doesn't seem to know what is going on with this server. The sites on this server go down for 15-60 minutes multiple times a day. Will this ever end?!

I am also now experiencing severe slowness and server timeouts while trying to FTP to WEB 10. I've been trying to move an image folder totaling 2.7 MB for an hour and I am getting "waiting for server" and timeout responses. I've tried this using local copies and copying it directly on the server with the same results. What is causing this issue?

Please let me know what is going on with WEB 10 and if there is any resolution forthcoming. Are these issues specific to WEB 10? If I move the sites to another Linux/Apache server will it be corrected? Please respond!!! Thanks!
 
The Unix servers' sites now run in an isolation mode, so unless there is a server-wide issue that we post about (or one that lasts less than 2 minutes and is fixed before we get to the point of posting), a problem is generally isolated to a single site.
Unix isn't otherwise my field, and I don't know enough about it to answer you in more detail.
I do know that the isolated mode is a bit slower than the integrated PHP mode of before, but it is more secure on the server overall. I know it can cause issues sometimes with certain requests, but now, for the most part, items should not be blocked by mod_security either.
 