Two Words - Disaster Recovery (DR)

Status
Not open for further replies.

tr1stan

Perch
Although this applies to a million current threads at the moment, I felt that it was important enough to have its own.

I can't believe that JH does not have any form of disaster recovery, and if you do... you certainly cannot implement it very well!

I am the webmaster for a University here in the UK, and have the responsibility of looking after about 15 separate web servers. Now that doesn't make me a guru, but I do have experience in hosting important websites, and keeping the customers happy.

It is every webmaster's nightmare: getting that call from the helpdesk telling you that a certain web application cannot be seen, or that a number of websites are currently down. Suddenly you think to yourself, "I hope it's not a corrupt file system" or "Wouldn't it be great if our new RAID array has decided to fall over again". But none of this is hugely detrimental, as we have DR procedures in place that reduce the amount of possible downtime by a large amount.

Why is it, JH, that you do not have an unused server for each platform you run, just sitting around with the OS and web services installed, waiting for a particular disaster to occur? Wouldn't it have been great to have been able to do a restore to the DR box the moment you knew that there was even a hint of excessive downtime, run them in parallel for a moment - giving you a chance to make sure everything was 100% on the new box - and then make the switch? The only major downtime experienced is the time it takes to switch the servers... which, if done correctly, should be nowhere near 15 minutes, let alone 2-3 hours, or whatever you guys "estimated".

I was lucky in the fact that my sites are not hosted on WIN5, but that doesn't make me any less uneasy about the whole situation.

I understand JH might try and use the excuse "But that would cost us too much in hardware", but I really think you need to have a look at how many of your customers, and your customers' customers, are losing patience with your level of service. It's not like you have to buy a new server per disaster, as the server that was causing the problem in the first place gets turned into a DR server after it has been swapped out and sorted.
I also haven't mentioned the amount of overtime you might have to pay your staff (assuming that technicians stay on a job until it's finished) or the amount of compensation heading in your customers' direction.

It's about being prepared for the worst and planning ahead... try it, it might stop you going grey too early!

T
 
Oh no, we definitely have had a standby server. In fact, we had a Win2003 and a RedHat standby server available, but not Win2000. The Win2000 box was actually absorbed into our backup servers and more dual Xeon boxes were ordered. After this we are going to make sure we always have boxes sitting out; that has never been a problem though.

The OS install took less than 30 minutes, that wasn't the issue either.

For disaster recovery, we had recently introduced a new backup system that would allow us to restore complete hard disks within short periods of time. Unfortunately, it wasn't fully tested (we only just introduced it!) when this occurred. We did however give it a shot, and unfortunately it could not work with the RAID drivers.

Cost is never an issue here. Yes, mistakes were made in this incident and lessons learned.
 
I'd really like to see JH offer dedicated servers. I for one would love to pay extra $ for this service. No Hsphere though please ;)
 
More Smoke and Mirrors... At this point, with the fact that my site has been down for well over 24 hours, you'd be hard pressed to convince me that you have a disaster plan, backup servers, etc. Where are they then? Instead it's yet another excuse... Might as well tell me the dog ate it, it's just about as believable...

edit - On further looking around I found this stuff on the JH site... I find the bold points interesting...

Superior Hardware
Our servers run only Dual Xeon or Pentium 4 (Hyper-Threaded) processors for maximum efficiency. A minimum of 2GB of server-grade DDRAM is fitted into each of these machines. Our web servers run top-of-the-line Ultra320 SCSI hard disks (10 times more efficient than ordinary SATA hard disks) in a RAID1 configuration. This means that 2 SCSI hard disks are setup in an array which remain an exact mirror of each other at all times. If one hard disk fails, the other picks up immediately ensuring maximum reliability. All our servers are backed up to secure NAS servers every 24 to 48 hours. This along with RAID1 ensures that your data is more safer with us than on your hard disk! We use a powerful internal CISCO firewall to maintain network security and protect our systems from harmful internet traffic.

Things that make ya go, hmmmm...
 
Coldfusionman,
We never lost any data

We made absolutely no mistake in the recovery, it was the fastest our disaster recovery plan currently supported.

This was not a hard-disk failure. It was a major software failure
We however do take responsibility. We introduced our new backup scheme a few days before this actually happened. We will be putting that to full swing now

EDIT: if a hard disk failed, you wouldn't even notice it!
 
Yash said:
Coldfusionman,
We never lost any data....EDIT: if a hard disk failed, you wouldn't even notice it!
Restoring data from days or weeks previous to the crash is considered data loss from a customer perspective. The files you put back on my site were days old -- thank God I maintain my own code locally. You replied to another customer in the Forum here somewhere that their latest files were overwritten by a script, so you restored from a backup days or weeks older. Unless you can fully restore data to the minute it was lost (which is very difficult without clusters and mirrors) or, at the very least, to the night before (which is accepted as the industry standard), it's considered data loss. In fact, restoring from the night before is also considered data loss, but usually at an accepted rate.
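
The point about "days old" versus "the night before" can be put into numbers. What follows is only a minimal sketch, not anything JH actually runs: the function names, the 24-hour default (taken from the "night before" remark above) and the timestamps are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_backup: datetime, failure_time: datetime) -> timedelta:
    """Time span of changes that a restore from this backup cannot bring back."""
    if last_good_backup > failure_time:
        raise ValueError("backup is newer than the failure; check the timestamps")
    return failure_time - last_good_backup


def meets_rpo(last_good_backup: datetime, failure_time: datetime,
              rpo: timedelta = timedelta(hours=24)) -> bool:
    """True if the lost window is within the agreed recovery point (assumed 24h)."""
    return data_loss_window(last_good_backup, failure_time) <= rpo


# Hypothetical timestamps, for illustration only.
failure = datetime(2000, 1, 10, 12, 0)
nightly = datetime(2000, 1, 10, 1, 0)   # last night's backup
stale = datetime(2000, 1, 6, 1, 0)      # a backup several days old

print(data_loss_window(nightly, failure), meets_rpo(nightly, failure))
print(data_loss_window(stale, failure), meets_rpo(stale, failure))
```

Even the nightly case loses some hours of work; the check only says whether that loss falls inside the window customers were told to expect.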

Which brings me to the 10-millionth suggestion on how to run your business (OK, so I exaggerate a bit). Plan recovery tests. Taking a backup is useless unless you can restore from it. Recovery tests should be performed on a regular basis (weekly or monthly are good). If the concept is foreign to you, let me explain. You take last night's backup and some spare hardware similar, if not identical, to production, and you restore the data from the backup media. You then compare the server with the backup media and confirm that all has been restored. You then do functional testing to ensure that the files restored are usable. If you can compare the restored data to the live data, this is good too, but this is often impossible since the live data has since changed.
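
The "compare the server with the backup media" step can be sketched roughly as below. This assumes the backup can be read as a plain directory tree next to the restored one, and it only checks file contents by checksum; it is not a substitute for the functional testing. The names `compare_restore` and `restore_ok` are made up for this example.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_restore(backup_root: Path, restored_root: Path) -> Dict[str, List[str]]:
    """Report files that are missing, unexpected, or different after a restore."""
    backup = tree_digests(backup_root)
    restored = tree_digests(restored_root)
    common = backup.keys() & restored.keys()
    return {
        "missing": sorted(set(backup) - set(restored)),
        "unexpected": sorted(set(restored) - set(backup)),
        "corrupt": sorted(k for k in common if backup[k] != restored[k]),
    }


def restore_ok(report: Dict[str, List[str]]) -> bool:
    """A restore passes only if every category in the report is empty."""
    return not any(report.values())
```

Run on a schedule against the spare box, `restore_ok(compare_restore(Path(backup_dir), Path(restored_dir)))` gives a simple pass/fail to record for each recovery test; the two directory arguments are placeholders for wherever the backup and the restored copy actually live.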

This takes effort, but the cost of lost data and lost reputation is far more.

Do you guys have any experience in managing an enterprise? And don't quote me no MSCEMOUSE crap. Microsoft doesn't teach you how to run an enterprise, they teach you (if you call it that, it's more like brainwashing IMHO) how to operate their software. Running an enterprise (and when I say this I mean servers and network in a server farm (or more than one), ensuring all the processes, procedures, and equipment are set up to ensure reliability and continued quality of service to the customer) is only learned by experience and not in a classroom. I'm convinced that you guys don't have this experience. It is abundantly clear that many of your customers do, as they have been in the industry for as long as you claim to be, or more (myself included -- managing my first server farm more than 10 years ago).

How does it feel to know less than your customers?

EDIT: Oh, and yeah, about that hard disk crash. Last time it happened, it was on win5 and we did notice. I know, I know, it wasn't your fault, it was a faulty RAID controller. Time will tell if you can restore your reputation, words will not.

EDIT: Sorry for the lengthy post, it wasn't my original intention, but once I got going it was hard to stop. I've stopped myself now.
 
1) We do recovery tests locally
2) The issue was not with recovery, we did that fine. Almost all accounts were restored with files that were less than 12 hours old. There were 15 accounts that had been corrupted by the PSOFT. We promptly recovered from backup
3) We know how to run an enterprise. We have people with a lot of experience working for us. Atul has headed some large multi-million dollar businesses in the past, so I'm not sure what you mean

We understand this has dented our reputation. We will show progress with action right after this. We are going to be creating a new crisis team, publishing and monitoring our backup procedures and ensuring disaster recovery is practised on a more frequent basis
 
Yash said:
We understand this has dented our reputation. We will show progress with action right after this. We are going to be creating a new crisis team, publishing and monitoring our backup procedures and ensuring disaster recovery is practised on a more frequent basis
I don't want to see more words. I want to see actions, and an improved track record. Time will tell if I'm here to see it.

In the words of Nike, Just Do It.
 
I am sorry, but I must chime in here. If you don't like the service then find somebody else!

I don't get it. Do you really think that the *****ing and moaning will make JodoHost change anything?

I for one understand that nobody and no company is perfect. JodoHost, you do an excellent job, and I know you make mistakes, but I have seen you do a wonderful job admitting your mistakes and working to prevent them in the future. Keep going!!

Adam
 
I agree. Although I am not on Win5 (am on Win6), the overall service here at JodoHost has been very good and I don't think one incident at JodoHost should mean such heavy criticism
 
Uhm, I have been part of 3 major issues in a year...

Not complaining though, I too appreciate all the time and work that Yash and the Gang put in! They just have some room for improvement (as do most things)...
 
I agree JH is a great host with great pricing. C'mon, the fact that Yash takes the time to post replies here is a tribute to this. These guys are really putting in the time. But let's remember the squeaky wheel gets the oil too.



atomi
 