TvE 2100

At 2100 feet above Santa Barbara

Surviving a MySQL Master DB Crash

Nothing is more heart-arresting than to find out that your database machine has died. Site down. Data gone. Life s…

That’s what happened to one of our customers yesterday morning, right when they were featured on some prominent sites. The Amazon EC2 instance hosting their master DB died. Fortunately they had tested the master-slave setup using our Manager for MySQL, so everything was set up to recover quickly. They IM’d me so I could help should things go wrong. We waited a couple of minutes to see whether the machine was just rebooting, but to no avail. So we hit the “promote to master” button on the slave instance, and here’s the log of what happened:

[2007-08-21 16:24:45] [ServerActionsWorker] : Executing: 'Executing action: DB promote to master'
[2007-08-21 16:24:46] [ServerActionsWorker] : Using MasterDB DNS ID: 2577432 .
[2007-08-21 16:24:46] [ServerActionsWorker] : Using SlaveDB DNS ID: 2577433 .
[2007-08-21 16:24:54] [ServerActionsWorker] : No slave argument given...assuming localhost
Using C interface for mysql, client version 5.0.22
Server doesn't appear to be logging binary logs, configuring and restarting server with binary logging
Locking slave (and enabling writes)
[2007-08-21 16:28:04] [ServerActionsWorker] : Process 7927 has the lock. terminating others.
Written read_only changes to new master conf file
Stopping master (if alive), noting position, making RO, stopping and unconfiguring replication
Previously connected master not reachable...
...Warning: assuming old master is dead and that the current contents of the Slave is the latest and best we can get.
Promoting slave...
Waiting until it catches up (if alive), stopping and unconfiguring replication, 
 unlocking tables and setting up replication privileges
Retrieved new master info...File: mysql-bin.000001 position: 98
Stopping slave and misconfiguring master
granting rep rights...
done with rights...
Unlocking tables
Demoting old master...
Changing Master DB DNS...
OK. Result: DNSID 2577432 set to this instance IP:
Mission accomplished.
[2007-08-21 16:28:04] [ServerActionsWorker] : Server action successully completed
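
For readers who want to see what a promotion like this boils down to on the MySQL side, here is a rough sketch. It is not the Manager for MySQL script (the post doesn't show it), just one plausible ordering of standard MySQL statements that matches the steps in the log. The function name `promotion_statements`, the replication account `repl`, and the host pattern are made up for illustration.

```python
def promotion_statements(repl_user="repl", repl_host="%"):
    """Return, in order, SQL that a manual slave promotion might run.

    Hypothetical sketch: the user/host values are placeholders and are
    not escaped, so only pass trusted values. Binary logging must also
    be enabled on the new master (log-bin in my.cnf plus a restart),
    which is a config change and not an SQL statement.
    """
    return [
        # Stop applying events from the (dead) old master.
        "STOP SLAVE",
        # Forget the old master's replication coordinates.
        "RESET SLAVE",
        # Note the binlog file and position that new slaves start from.
        "SHOW MASTER STATUS",
        # Let future slaves replicate from this server
        # (assumes the account already exists).
        f"GRANT REPLICATION SLAVE ON *.* TO '{repl_user}'@'{repl_host}'",
        # Finally, allow application writes on the new master.
        "SET GLOBAL read_only = 0",
    ]


if __name__ == "__main__":
    for statement in promotion_statements():
        print(statement + ";")
```

The `SHOW MASTER STATUS` step corresponds to the "Retrieved new master info" line in the log: the file and position it reports are what a fresh slave would later be pointed at. Repointing DNS, the last step in the log, happens outside MySQL and isn't covered here.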

Woot! The slave promoted to master just fine. At that point we had to bounce the Mongrel servers because, as far as we can tell, ActiveRecord just doesn’t switch to the new DNS entry for the DB in any reasonable amount of time. After verifying that the site was back up and fixing an ancillary server that wasn’t pointing to the proper database DNS entry, we launched a fresh slave with another button press.
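
The stale-DNS problem is not specific to Rails: any long-lived connection keeps talking to the IP it resolved when it first connected. One way around restarting app servers by hand is to re-resolve the database hostname periodically and reconnect when the answer changes. Below is a minimal sketch of that idea in Python. The class name `DnsChangeWatcher` and the hostname in the usage note are hypothetical, and this is not how ActiveRecord or Mongrel work internally.

```python
import socket


class DnsChangeWatcher:
    """Detect when a hostname starts resolving to a different IP.

    Hypothetical helper for illustration. The `resolve` argument is
    injectable so the logic can be exercised without real DNS lookups.
    """

    def __init__(self, hostname, resolve=socket.gethostbyname):
        self.hostname = hostname
        self._resolve = resolve
        # Remember the address seen when the watcher was created.
        self._last_ip = resolve(hostname)

    def changed(self):
        """Re-resolve the hostname; return True if the IP has changed."""
        current = self._resolve(self.hostname)
        if current != self._last_ip:
            self._last_ip = current
            return True
        return False
```

An application could create one watcher for its database hostname (say `DnsChangeWatcher("db-master.example.internal")`), call `changed()` every few seconds, and close and reopen its database connections whenever it returns `True`. Resolver caching in the OS and the DNS record's TTL still limit how quickly a change becomes visible.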

[Image: MySQL after failure]

Phew, all this within about a half hour, including initial reaction and troubleshooting time and follow-up cleanup work. Everything we put in place with Manager for MySQL worked like a charm!