Yash, in April you posted "Sure... I do not take anything as an attack against us. We had these 2 failures on Win5 we accept it and take responsibility for it.
We use a system called RAID1 on our Windows servers and RAID5 on our Linux servers. RAID1 and RAID5 are disk arrays that can sustain the loss of a disk without the loss of any data. For example, our win servers have 73GBx2 SCSI Hot-swap hard disks. x2 means that there are two hard disks on the system. These 2 hard disks are exact mirrors of each other. This mirroring is done with the help of a RAID1 controller from Adaptec. If one hard disk fails, there would be no data loss and the system would continue to run.
Apart from this system, we take full system backups every 3 to 4 days and move them to a remote server for safe storage.
We plan to introduce a new system that would help us to increase uptime. This system is based on a secondary server concept where we have a secondary server to take over if the primary server (for example win5) fails, without any downtime or data-loss. A program would run on the primary server that would monitor the primary server for any file changes and immediately update the secondary server with the latest files. Therefore, the secondary server would theoretically have an exact mirror of all data on the primary server at all times, and would be ready to take over from the primary server at any time, automatically.
We are hoping we can deploy this system in the middle of June or earlier"
In August you posted about new backup procedures at
http://support.jodohost.com/showthread.php?t=2644 which talks about hot-swappable disks, including for Win5.
You actually mentioned "Through our new backup strategy, we keep a complete mirror of the server's hard disk on a separate SCSI disk which is kept off-site. If a server goes down with a OS or system crash, recovering the server would simply involve swapping the current hard disk with its mirror." after commenting "The main issue with these forms of backup is recovery. After a system crash or OS corruption, we have to prepare a fresh system. Recreating the resources of over 1000 sites on a fresh standby server can take a good hour or two and then restoring the data can take another 6 to 8 hours at least (although the entire process is automated)."
More recently, another post explained that backups are kept in a remote datacentre. So you have been acutely aware for quite some time of the need for backups and for data recovery that is as close to immediate as possible.
The question is: why was the August plan not implemented? As explained in that post, it would have minimised the downtime and spared us all the agony.