(Judging by the posting gap, the author of this blog almost disappeared too! Time to lift the head from the day-to-day scramble and write the next entry!)
The fact that Amazon says up-front that computers fail seems to be the number one concern and criticism of EC2, especially from people who have not used it extensively. I don’t actually spend much time thinking about that because in our experience it’s not something to worry about. It is, however, essential to take into account when designing a system: whenever we set something up on a machine we immediately think “and what do we do when it fails?” That’s a good thing, not a bad thing, as anyone with production datacenter experience can attest.
Since it’s such a hot topic, I’ve been keeping a close eye on all the “my instance disappeared” threads on the EC2 forum, and it’s not easy to sort them out. I have no doubt that the vast majority have to do with operator error:
* trying SSH and forgetting to open port 22 in the security group (or similarly with other ports)
* having difficulties with the SSH keys, or forgetting to set up a key to begin with
* using/constructing an AMI that does not have SSH properly set up
* using/constructing an AMI that does not boot properly (network and/or sshd issues) and failing to look at the console output
* instance reboot failing, for example disk mounts failing due to mount point changes that were not reflected in the init scripts
* sshd killed by the kernel out-of-memory reaper, and failing to look at the console output for diagnosis
* … and many more
Some of these are beginners failing to read the getting started guide; others are more subtle and can happen even to veteran EC2 users. Then there are the emails from Amazon saying “we have noticed that one or more of your instances are running on a host degraded due to hardware failure”, and I wonder how many users don’t get these emails because their AWS account’s email address points into a bit bucket.
No doubt there are real failures as well, where a host dies and takes its instances with it, or one of the disks used by an instance gives up, which is the end of that instance. The question here is how frequent this is relative to the total number of instances running, and since Amazon is so secretive with their numbers, it’s really difficult to make even an educated guess. I tried to go back through our year of logs to see whether I could estimate the failure rate, but I don’t have enough data to distinguish failure from shutdown, sigh.
The failures that concern me the most are actually not instance failures but network failures. Anyone who has set up a large datacenter will know that network issues are the most difficult to get under control. The damn network just keeps changing, and as soon as you try to hold still your service providers change stuff. Some of the instance disappearances are really network issues that cause an instance to be unreachable, or unreachable from certain other instances. These are hard to troubleshoot, and on more than one occasion I’ve had to run tcpdump on both ends to see packets departing and never arriving. If I can get to the target instance at all to run tcpdump, that is… I hope Amazon gets a better handle on this type of failure and provides us with better troubleshooting tools. In the meantime, it’s important to flag issues to them so they can troubleshoot and eliminate the root causes.
The really good news is that the Amazon folks are very dedicated to figuring out what’s going wrong and fixing it. So if you have an issue, be sure to do the troubleshooting you can, then set the instance aside, launch a new one to take its place, and post all the details on the forum. Shut the instance down only after the issue has been looked at. Looking back, there were two big issues causing instance termination. One was the day when some EC2 front-end explicitly terminated a bunch of instances by mistake. Not good, but from what we saw it wasn’t a massive failure either. They clearly have done their best to ensure this doesn’t recur. The other was an instance reboot bug which caused many instances to die in the reboot process. We learned not to reboot ailing instances but instead to relaunch and rescue any data. This issue also seems to be fixed at this point.
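To make the “set aside, then relaunch” order of operations concrete, here is a minimal sketch in Python. It is purely illustrative: the `Fleet` class, its method names, and the fake instance IDs are all made up for this example and do not correspond to any EC2 or RightScale API. The only point is that the replacement is launched first and the ailing instance is kept around rather than terminated.

```python
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """In-memory stand-in for a group of instances (hypothetical, not a real API)."""
    active: list = field(default_factory=list)
    set_aside: list = field(default_factory=list)
    _counter: int = 0

    def launch(self):
        """Pretend to launch a fresh instance and return its made-up ID."""
        self._counter += 1
        instance_id = f"i-fake{self._counter:04d}"
        self.active.append(instance_id)
        return instance_id

    def relaunch(self, ailing_id):
        """Launch a replacement first, then move the ailing instance aside.

        The ailing instance is NOT terminated, so it stays available
        for troubleshooting and for rescuing data.
        """
        replacement = self.launch()
        self.active.remove(ailing_id)
        self.set_aside.append(ailing_id)
        return replacement


fleet = Fleet()
first = fleet.launch()
second = fleet.relaunch(first)
print("serving:", fleet.active, "awaiting diagnosis:", fleet.set_aside)
```

In a real deployment the `launch` step would call whatever you use to start instances, and the set-aside list is what you would clean up once the issue has been looked at.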
To summarize, if you can’t reach an instance, here is what you should do:
* try to SSH, check the security group
* distinguish SSH timeout from key issues (timeout vs. permission denied type of errors)
* use ping to test connectivity (enable ICMP if you have the bad habit of disabling it)
* check the console output (use the convenient button on the RightScale dashboard), note that it can take a few minutes for stuff to appear
* look at the RightScale monitoring to see whether the instance is still sending monitoring data
* hop onto an instance in the same security group and try connecting from there (launch an instance if you don’t have any)
* post details (instance ID, what you’ve tried, symptoms observed) on the forum and set the instance aside
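The first couple of checks above can be partly scripted. The following is a rough sketch in Python, not a definitive diagnostic: it only tries to open a TCP connection to the SSH port and classifies the outcome. The function names `probe_tcp` and `diagnose` and the hint texts are my own for this illustration, and the mapping from outcome to likely cause is a rule of thumb (for example, it assumes that a port not opened in the security group shows up as a timeout rather than a refusal).

```python
import socket


def probe_tcp(host, port=22, timeout=5.0):
    """Try a TCP connect and return 'open', 'refused', 'timeout', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Anything else: DNS failure, no route to host, etc.
        return "error"


# Rule-of-thumb hints (assumptions, not guarantees).
HINTS = {
    "open": "Port answers; if SSH still fails, suspect key issues "
            "(permission denied type of errors).",
    "refused": "Host is reachable but nothing is listening; check the "
               "console output for sshd or boot problems.",
    "timeout": "No answer at all; check the security group, then try "
               "from another instance in the same group.",
    "error": "Other network error; check the hostname and your own "
             "connectivity.",
}


def diagnose(host, port=22, timeout=5.0):
    """Return (outcome, hint) for a single connection attempt."""
    outcome = probe_tcp(host, port, timeout)
    return outcome, HINTS[outcome]
```

Calling `diagnose("ec2-host.example.com")` (a placeholder hostname) returns the outcome together with a hint on where to look next. It does not replace checking the console output or posting the details on the forum.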
All in all, the number one lesson is “relaunch!” There are thousands of instances waiting to be utilized, so use a fresh one if you see trouble with an existing one. If you master this step you can use it in so many situations: to scale up, to scale down, to handle instance failure, to handle software failure, to enable test set-ups, etc. If you use RightScale you will notice that this is also what we focus on: making it easier to launch fresh instances.