Mail Server Failure & Recovery

Yash

Bass
As all of you know, we had a major hardware failure on our mail server about 12 hours ago that left our RAID 1 disk array unrecoverable. As a result, we were forced to do a complete recovery of the mail server from backup. 90% of the recovery was complete 6 hours ago, at which point mail services were restored for a majority of customers. For the remaining 10% of customers, recovery was completed 30 minutes ago. Customers who were advised to turn mail services on or off are requested to ignore that advice.

During this entire period, our backup mail server (which we set up recently) was running and was therefore able to collect a majority of the email that was missed. That email is going to be piped back into the mail server shortly. Emails that were not received during this period were bounced back to the senders with technical error messages.

We at JodoHost would like to offer our apologies for this episode of downtime. A RAID array failure is the WORST thing that can happen to a webhost and results in complete loss of the server's hard disks. Restoring tens of thousands of mail boxes on the mail server took longer than expected. Affected customers can write to [email protected] for credit under our SLA.

After this incident, we are planning a new hot-backup and standby mail server that is synchronised with the main server at all times, and possibly load-balanced, to ensure 100% reliability of email. We understand that this incident has hurt our otherwise excellent reputation. We'd like to assure all customers and resellers that we will be taking very strong steps to ensure email always remains reliable with JodoHost in the future.
 
The handful of domains that customers/resellers reported to us have been fixed. The authentication files for those mail boxes had not been properly created.

All issues are 100% fixed. Over 1.2 million files, 500,000 emails and 10,000+ mail boxes were restored during this massive restoration effort, with our admin team working continuously for 24 hours.

We once again apologize for the downtime. This has been the most complex restore we have ever performed. We will be going back to the drawing board (after we all get some sleep, of course) to rework our mail-server backup/restoration strategy and to introduce a hot backup within the next 24 to 48 hours.
 