Posts older than this one are from the blog.voneicken.com and blog.rightscale.com pre-2009 archive. Someday I’ll start blogging again on my own, but for now my blogging energy will remain focused on RightScale.
Today we released version 1.5.0 of our RightAws gem at http://rubyforge.org/projects/rightaws/ ! This release not only provides an interface to EC2, S3, and SQS but also to SDB! The SDB interface has all the nice features of the other interfaces, such as persistent HTTP connections and error retries. Please give it a spin and let us know if you find a problem!
This release also has full support for European (EU) buckets in Amazon S3 and it works properly with objects larger than 2GB (it supports the Expect: 100-continue stuff). Note that there is a slight incompatibility with attachment_fu: if you’re using attachment_fu and RightAws you need to load the latter last, otherwise you’ll get an error message and breakage.
There are many definitions for cloud computing around, and depending on the time of day I’d probably recite a different one. But what’s really defining to me is the availability of quasi endless computers (instances in EC2 terminology). Pricing model, pay as you go, someone else deals with muck: all great stuff. But endless computers, that changes computing. Why? Because launching a new instance becomes the one cure to many problems.
Whatever happens that is uncool, the answer becomes: launch a fresh instance! (Or two while you’re at it.) Whether it’s an instance failing, a hard drive starting to give out, or excessive memory errors. Or the load going up and about to max out the existing servers you have. Or the installation of a new version of your application and you’d rather bring it up and test it before cutting production over. Or the network not behaving cleanly because of some failures or other issues. Or your latest app upgrade causing a slow-down ‘cause you overlooked a slow DB query. Etc, etc, etc. In all these cases, if you’ve prepared the ability to launch a new instance quickly and cheaply (labor-wise), you’re out ahead. That’s cloud computing!
So get comfortable and set up your tools to make the launching easy. Then everything else follows. This is of course what we’re enabling at RightScale, but even if you decide not to use RightScale it’s the way to go.
By now everyone has realized how game-changing a product Amazon has introduced with EC2. The business case is crystal clear for many organizations, small and large. What is not yet as well understood is the software gap that EC2 has exposed: it opens new opportunities at the hardware level, but most organizations need to close a software gap before they can actually take advantage of EC2. But let’s start at the beginning…
The business case typically has to do with the ability to scale up quickly on demand, for example due to a media event, or to scale up&down with daily use to
(Judging by the posting gap, the author of this blog almost disappeared too! Time to lift the head from the day-to-day scramble and write the next entry!)
The fact that Amazon says up-front that computers fail seems to be the number one concern and criticism of EC2, especially from people who have not used it extensively. I don’t actually spend much time thinking about that because in our experience it’s not something to worry about. It is, however, essential to take into account when designing a system: whenever we set something up on a machine we immediately think “and what do we do when it fails?” That’s a good thing, not a bad thing, as anyone with production datacenter experience can attest.
Since it’s such a hot topic, I’ve been keeping a close eye on all the “my instance disappeared” threads on the EC2 forum, and it’s not easy to sort them out. I have no doubt that the vast majority has to do with operator error:
* trying SSH and forgetting to open port 22 in the security group (or similarly with other ports)
* having difficulties with the SSH keys, or forgetting to set-up a key to begin with
* using/constructing an AMI that does not have SSH properly set-up
* using/constructing an AMI that does not boot properly (network and/or sshd issues) and failing to look into console output
* instance reboot failing, for example disk mounts failing due to mount point changes that were not reflected into init scripts
* sshd killed by kernel out-of-memory reaper, failing to look into console output for diagnosis
* … and many more
Some of these are beginners failing to read the getting started guide, some are more subtle and can happen even to veteran EC2 users. Then there are emails from Amazon saying “we have noticed that one or more of your instances are running on a host degraded due to hardware failure” and I wonder how many users don’t get these emails because their AWS account’s email address points into a bit bucket.
No doubt there are real failures as well where a host dies and takes the instances with it, or one of the disks used by an instance gives up which is the end of that instance. The question here is how frequent this is relative to the total number of instances running, and since Amazon is so secretive with their numbers it’s really difficult to make even an educated guess. I tried to go back into our year of logs to see whether I could estimate the failure rate, but I don’t have enough data to distinguish failure from shutdown, sigh.
The failures that concern me the most are actually not instance failures but network failures. Anyone who has set up a large datacenter knows that network issues are the most difficult to get under control. The damn network just keeps changing, and as soon as you try to hold still your service providers change stuff. Some of the instance disappearances are really network issues that cause an instance to be unreachable, or unreachable from certain other instances. These are hard to troubleshoot and on more than one occasion I’ve had to run tcpdump on both ends to see packets departing and never arriving. If I can get to the target instance at all to run tcpdump, that is… I hope Amazon gets a better handle on this type of failure and provides us with better troubleshooting tools. In the meantime, it’s important to flag issues to them so they can troubleshoot and eliminate the root causes.
The really good news is that the Amazon folks are very dedicated to figuring out what’s going wrong and fixing it. So if you have an issue, be sure to do the troubleshooting you can, then set the instance aside, launch a new one to take its place, and post all the details on the forum. Shut the instance down only after the issue has been looked at. Looking back, there were two big issues causing instance termination. One was the day when some EC2 front-end terminated a bunch of instances by mistake. Not good, but from what we saw it wasn’t a massive failure either. They clearly have done their best to ensure this doesn’t recur. The other was an instance reboot bug which caused many instances to die in the reboot process. We learned not to reboot ailing instances but instead to relaunch and rescue any data. This issue also seems to be fixed at this point.
To summarize, if you can’t reach an instance, here is what you should do:
* try to SSH, check the security group
* distinguish SSH timeout from key issues (timeout vs. permission denied type of errors)
* use ping to test connectivity (enable ICMP if you have the bad habit of disabling it)
* check the console output (use the convenient button on the RightScale dashboard), note that it can take a few minutes for stuff to appear
* look at the RightScale monitoring to see whether the instance is still sending monitoring data
* hop onto an instance in the same security group and try connecting from there (launch an instance if you don’t have any)
* post details (instance ID, what you’ve tried, symptoms observed) on the forum and set instance aside
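The "timeout vs. something else" distinction in the checklist can be checked before SSH keys even come into play. Here is a minimal Python sketch, not from the original troubleshooting session: the function name `probe_tcp`, the `HINTS` texts and the 5-second timeout are all assumptions made for illustration. It only tests whether a TCP connection to the SSH port can be opened; it does not attempt any key authentication.

```python
import socket

# Hypothetical hint texts mapping each probe outcome to a likely next step.
HINTS = {
    "open": "port reachable; remaining SSH failures are likely key or login issues",
    "refused": "host reachable but nothing listening; check sshd and console output",
    "timeout": "no answer; check the security group, then suspect network or instance",
    "unreachable": "name lookup or routing error; check the address you are using",
}


def probe_tcp(host, port=22, timeout=5.0):
    """Try to open a TCP connection and classify the outcome as a short string."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # Packets are silently dropped somewhere (firewall, dead host, network).
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on that port.
        return "refused"
    except OSError:
        # Everything else: DNS failure, no route to host, and so on.
        return "unreachable"
```

Calling `HINTS[probe_tcp("my-instance.example.com")]` (a placeholder host name) would then give a first guess at where to look; running the same probe from another instance in the same security group helps separate security-group problems from network ones.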
All in all, the number one lesson is “relaunch!” There are thousands of instances waiting to be utilized so use a fresh one if you see trouble with an existing one. If you master this step you can use it in so many situations: to scale up, to scale down, to handle instance failure, to handle software failure, to enable test set-ups, etc. If you use RightScale you will notice that that’s also what we focus on: making it easier to launch fresh instances.
RightScale teams with best-of-breed technology and service companies to help our customers realize the full potential of web-scale solutions. If you’re interested in partnering, contact us at firstname.lastname@example.org.
Amazon Web Services
Amazon Web Services provides a suite of solutions that enable organizations to leverage Amazon.com’s robust technology infrastructure and content via simple API calls. Using services such as Amazon EC2, Amazon S3, and Amazon SQS, organizations can cut fixed costs by letting Amazon do the “heavy lifting” of building and managing their infrastructure. Amazon Web Services helps organizations focus on their idea and develop web applications in a reliable, scalable, and cost-effective manner.
For application providers that want to accelerate license growth, expand into new markets, and reduce support and development costs, rPath’s platform transforms applications into virtual appliances. A virtual appliance is an application combined with just enough operating system (JeOS) for it to run optimally in any virtualized environment. Virtual appliances eliminate the hassles of installing, configuring and maintaining complex application environments. Only rPath’s technology produces appliances in multiple virtual machine formats, simplifies application distribution, and lowers the customer service costs of maintenance and management. For more information, see www.rpath.com
ELC Technologies is powering the hottest Ruby on Rails web applications. From NASCAR to TuneCore to Live Nation, ELC helps to bring Ruby on Rails to great brands and start-ups. The team has years of technical experience and more than 25 years of Rails experience, including as Rails Studio Alumni, Scale with Rails Alumni, Advanced Rails Studio Alumni, Enterprise Rails Studio Alumni, and a RailsConf Sponsor in the US and Europe. ELC offers a top Ruby on Rails team, tuned in to the latest web technologies and aesthetics, with a broad palette of agile processes and practices to help your project succeed.
Atlantic Dominion Solutions
Atlantic Dominion Solutions is a leading-edge web development firm that specializes in using Ruby on Rails to build advanced, scalable, database-backed applications for organizations of all sizes. Based on custom requirements, ADS creates forward-thinking and user-focused applications that can be deployed quickly, scaled easily, and maintained with minimal time or effort. ADS is a vanguard in enhancing Rails platforms with Amazon Web Services and, along with visionary web applications, ADS also provides data warehousing and mining, graphic and interactive design, business and technical consulting, Internet marketing and advertising, and search engine optimization services.
What is the expected network performance between Amazon EC2 instances? What is the available bandwidth between Amazon EC2 and Amazon S3? How about in and out of EC2?…These are common questions that we get very regularly. While we more or less know the answers to them, out of our own experience in the past 15 months, we haven’t really conducted a clean experiment to put some more precise numbers behind them. Since it’s about time, I’ve decided to do some ‘informal’ experiments to measure some of the available network performance around EC2.
Before we start, though, a few warnings (disclaimer): like in the drug commercials… results may vary! :) The results presented here use a couple of EC2 instances and therefore should only be interpreted as “typical/possible” results. The only claim we make here is that these are the results we got, and we expect they are an indication of the performance available at this point. Amazon can make significant hardware and architectural changes which could greatly alter the results (hopefully only making them better ;) )
Let’s start with some experiments to measure the performance from EC2 instance to instance.
Performance between Amazon EC2 instances
In this first experiment I boot a couple of EC2 large instances. I make one of them a ‘server’ by setting up apache and copying some large files into it, and use the other instance as a client by issuing http requests with curl. All file transfers are made out of memory cached blocks, so there’s virtually no disk I/O involved in them.
So, this experimental setup consists of:
* 2 Large EC2 instances:
  * Server: Apache (non-SSL) serving 1-2GB files (cached in memory)
  * Client: curl retrieving the large files from the server
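For readers who want to script a similar measurement, here is a rough Python sketch of a harness that runs several transfers in parallel and reports per-stream and aggregate throughput. It is not what was used for the numbers in this post (those came from plain curl processes); `measure_parallel` and `http_download` are hypothetical names, Python's `urllib` stands in for curl, and 1 MB is assumed to be 10^6 bytes.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_download(url, chunk=1 << 20):
    """Return a zero-argument function that downloads `url` and returns the byte count."""
    def transfer():
        total = 0
        with urllib.request.urlopen(url) as resp:
            while True:
                block = resp.read(chunk)
                if not block:
                    return total
                total += len(block)
    return transfer


def measure_parallel(transfer, n_streams):
    """Run `transfer` n_streams times concurrently.

    Returns (aggregate MB/s over the wall-clock time, list of per-stream MB/s).
    """
    def timed(_):
        start = time.monotonic()
        nbytes = transfer()
        # Guard against a zero elapsed time for trivially short transfers.
        return nbytes, max(time.monotonic() - start, 1e-9)

    wall_start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        results = list(pool.map(timed, range(n_streams)))
    wall = max(time.monotonic() - wall_start, 1e-9)

    per_stream = [nbytes / elapsed / 1e6 for nbytes, elapsed in results]
    aggregate = sum(nbytes for nbytes, _ in results) / wall / 1e6
    return aggregate, per_stream
```

A run would look like `measure_parallel(http_download("http://server.example/bigfile"), 3)`, where the URL is a placeholder for a large file on the server instance. Threads are good enough here because the work is almost entirely network I/O.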
These two instances seem to be actually separated by an intermediate router…so they don’t seem to be on the same host. This is the traceroute across them:
traceroute to 10.254.39.48 (10.254.39.48), 30 hops max, 40 byte packets
 1 dom0-10-254-40-170.compute-1.internal (10.254.40.170) 0.123 ms 0.102 ms 0.075 ms
 2 10.254.40.2 (10.254.40.2) 0.380 ms 0.255 ms 0.246 ms
 3 dom0-10-254-36-166.compute-1.internal (10.254.36.166) 0.278 ms 0.257 ms 0.231 ms
 4 domU-12-31-39-00-20-C2.compute-1.internal (10.254.39.48) 0.356 ms 0.331 ms 0.319 ms
So what are the results we got? Well, using 1 single curl file retrieval, we were able to get around 75MB/s consistently. And adding additional curls uncovered even more network bandwidth, reaching close to 100MB/s. Here are the results:
- 1 curl -> 75MB/s (cached, i.e., no I/O on the apache server)
- 2 curls -> 88MB/s (2x44MB/s) (cached)
- 3 curls -> 96MB/s (33+35+28 MB/s) (cached)
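To put these figures in link-speed terms, the per-stream numbers can be summed and converted from megabytes to megabits per second. The small sketch below does that for the 3-curl run; it assumes 1 MB means 10^6 bytes (the measurements do not say which convention curl's output was read in), and the helper name is made up for the example.

```python
def to_mbit_per_s(mbyte_per_s):
    """Convert MB/s to Mbit/s, assuming 1 MB = 10**6 bytes (an assumption)."""
    return mbyte_per_s * 8


streams = [33, 35, 28]        # per-curl MB/s from the 3-curl run above
aggregate = sum(streams)      # total MB/s across the three streams
print(aggregate, to_mbit_per_s(aggregate))  # prints: 96 768
```

Under that assumption the 3-curl run moves roughly 770 Mbit/s of payload, before counting HTTP, TCP and Ethernet overhead.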
I did not repeat the experiments using SSL. However, I did some additional tests transferring files using ‘scp’ across the same instances. Those tests seem to max out at around 30-40MB/s regardless of the amount of parallelism, as the CPU becomes the bottleneck.
This is really nice: basically we’re getting a full gigabit between the instances! Now, let’s take a look at what we get when EC2 instances talk to S3.
Performance between Amazon EC2 and Amazon S3
This experiment is similar to the previous one in the sense that I use curl to download or upload files from the server. The server, however, is s3.amazonaws.com. (Still using HTTP and HTTPS since S3 is a REST service).
So, this experimental setup consists of:
* 1 Large EC2 instance:
  * curl to retrieve or upload S3 objects to/from S3
* Amazon S3: i.e., s3.amazonaws.com
  * serving (or storing) 1GB files
The trace to the selected s3 server looks like:
traceroute to s3.amazonaws.com (184.108.40.206), 30 hops max, 40 byte packets
 1 dom0-10-252-24-163.compute-1.internal (10.252.24.163) 0.122 ms 0.150 ms 0.209 ms
 2 10.252.24.2 (10.252.24.2) 0.458 ms 0.348 ms 0.409 ms
 3 othr-216-182-224-9.usma1.compute.amazonaws.com (220.127.116.11) 0.384 ms 0.400 ms 0.440 ms
 4 othr-216-182-224-15.usma1.compute.amazonaws.com (18.104.22.168) 0.990 ms 1.115 ms 1.070 ms
 5 othr-216-182-224-90.usma1.compute.amazonaws.com (22.214.171.124) 0.807 ms 0.928 ms 0.902 ms
 6 othr-216-182-224-94.usma1.compute.amazonaws.com (126.96.36.199) 151.979 ms 152.001 ms 152.021 ms
 7 188.8.131.52 (184.108.40.206) 2.050 ms 2.029 ms 2.087 ms
 8 220.127.116.11 (18.104.22.168) 2.654 ms 2.629 ms 2.597 ms
 9 * * *
So, although the server itself doesn’t respond to ICMPs, the trace shows that there’s a significant path to be traversed.
Let’s start with downloads, more specifically with HTTPS downloads. The first thing that I noticed is that the performance of a single download stream is quite good (i.e., around 12.6MB/s). What is also interesting to note is that while download performance doesn’t scale linearly with the number of concurrent curls, it is possible for a large instance to reach higher download speeds when downloading several objects in parallel. The maximum performance seems to flatten out around 50MB/s. At that point the large instance is operating at a CPU usage of around 22% user plus 20% system, which given the SSL encryption going on is nice!
Here are the raw HTTPS numbers:
* 1 curl -> 12.6MB/s
* 2 curls -> 21.0MB/s (10.5+10.5 MB/s)
* 3 curls -> 31.3MB/s (10.2+10.0+11.1 MB/s)
* 4 curls -> 37.5MB/s (9.0+9.1+9.8+9.6 MB/s)
* 6 curls -> 46.6MB/s (8.0+7.8+7.6+7.9+7.8+7.5 MB/s)
* 8 curls -> 49.8MB/s (6.0+6.3+7.0+6.1+6.0+5.9+6.2+6.3 MB/s)
The SSL encryption uses RC4-MD5, so there is a fair amount of work for both S3 and the instance to do. So the next natural question is whether there’s more to gain when talking to S3 without SSL. Unfortunately, the answer is no. While the load on the client drops significantly (from 22% to 5% user and from 20% to 14% system when using 8 curls), the available bandwidth using non-SSL is basically the same (i.e., the differences fall within the margin of error). This leads me to believe that in either case the instance is not the bottleneck.
Here are the same data points for non-SSL (HTTP) downloads:
* 1 curl -> 10.2MB/s
* 2 curls -> 20.0MB/s (10.1+9.9 MB/s)
* 3 curls -> 29.6MB/s (10.0+9.7+9.9 MB/s)
* 4 curls -> 37.6MB/s (9.1+9.4+9.4+9.7 MB/s)
* 6 curls -> 46.5MB/s (7.8+7.8+7.6+7.9+7.8+7.6 MB/s)
* 8 curls -> 51.5MB/s (6.6+6.4+6.6+6.3+6.2+6.2+6.7+6.3 MB/s)
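One way to read the two download series is as scaling efficiency: the aggregate bandwidth divided by what N streams would deliver if each ran as fast as a single stream. The sketch below computes that from the totals listed above; the function name and the choice of the single-stream result as baseline are just one reasonable convention, not something taken from the measurements.

```python
# Aggregate download bandwidth in MB/s, keyed by number of parallel curls.
https_totals = {1: 12.6, 2: 21.0, 3: 31.3, 4: 37.5, 6: 46.6, 8: 49.8}
http_totals = {1: 10.2, 2: 20.0, 3: 29.6, 4: 37.6, 6: 46.5, 8: 51.5}


def scaling_efficiency(totals):
    """Aggregate bandwidth relative to perfect linear scaling of one stream."""
    base = totals[1]
    return {n: round(total / (n * base), 2) for n, total in totals.items()}


for name, totals in (("HTTPS", https_totals), ("HTTP", http_totals)):
    print(name, scaling_efficiency(totals))
```

With 8 streams both series land at roughly half to two thirds of linear scaling, while the absolute totals (about 50MB/s) are nearly identical, which fits the observation that the instance itself is not the limiting factor.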
Interestingly enough, a single non-SSL stream seems to get less performance than an SSL one (10.2MB/s vs. 12.6MB/s). I didn’t check whether the SSL stream uses compression, which may be one reason for this.
So how about uploads? I’ll use the same setup, but with curl uploading a 1GB file using a signed S3 URL. The first interesting thing to notice from the results is that a single upload stream gets roughly half the bandwidth that a download gets (i.e., 6.9MB/s vs. 12.6MB/s). However, the good news is that the upload bandwidth still scales when using multiple streams.
Here are the raw numbers for SSL uploads:
* 1 curl -> 6.9MB/s
* 2 curls -> 14.2MB/s
* 4 curls -> 23.6MB/s
* 6 curls -> 37.6MB/s
* 8 curls -> 48.0MB/s
* 12 curls -> 53.8MB/s
In other words: give me some data and I’ll fill-up S3 in a hurry :-). So what about using non-SSL uploads? Well, that turned out to be an interesting one… I’ve seen a single curl upload achieve the same performance as download (that is: 1 curl upload with no SSL, can achieve 12.6MB/s). But over quite a number of experiments I’ve seen non-SSL uploads exhibit a weird behavior where some of them mysteriously slow down and linger for a while almost idle (i.e., at a very low bandwidth). The end result is that the average bandwidth at the end of the run varies by a factor of almost 2x. I’m still investigating to see what happens.
The bottom line from these experiments is that Amazon is providing very high throughput around EC2 and S3. Results were readily reproducible (except for the problem described with the non-SSL uploads) and definitely support high-bandwidth, high-volume operation. Clearly if you put together a custom cluster in your own datacenter you can wire things up with more bandwidth, but for a general purpose system, this is a ton of bandwidth all around.
Up to now you could run any Linux distribution on Amazon EC2 but you could only run Amazon’s 2.6.16 kernel. Well strictly speaking there are two kernels: a 22.214.171.124 kernel for the 32-bit instances and a 126.96.36.199 kernel for the 64-bit instances. But recently RedHat announced support for RHEL5 on EC2 and today they made it available publicly. Now guess what: the kernel you get if you launch their paid AMI (machine image) is “Linux version 2.6.18-53.el5xen (email@example.com)”! Long story short, Amazon now has the capability of running multiple kernels on the instances, but alas this is not yet available to mere mortals (i.e. non-Redhat). But hopefully wider availability of new kernels isn’t too far off. It’s nice to see EC2 evolving steadily!
If you’re interested in all the gory technical details, see my post on the Amazon forum.
Spurred by Morgan Tocker I ran some sysbench MySQL performance benchmarks on EC2 instances. This is just the first round, more to follow…
On a small instance, I reformatted /mnt with LVM2 and created a 140GB xfs filesystem. The important InnoDB settings I chose in my.cnf are:
innodb_buffer_pool_size = 1G
innodb_additional_mem_pool_size = 24M
innodb_log_file_size = 64M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 2 # Write to log but don't flush on commit (it will be flushed every "second")
On a large instance, I created /mnt using LVM2 and striped across both drives to get a 200GB xfs filesystem. The my.cnf settings were:
innodb_buffer_pool_size = 4500M
innodb_additional_mem_pool_size = 200M
innodb_log_file_size = 64M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 2 # Write to log but don't flush on commit (it will be flushed every "second")
I then ran the sysbench OLTP test as follows:
mysqladmin -u root create sbtest
sysbench --test=oltp --oltp-table-size=1000000 --mysql-socket=/var/lib/mysql/mysql.sock --mysql-user=root prepare
sysbench --num-threads=16 --max-requests=100000 --test=oltp --oltp-table-size=1000000 \
  --mysql-socket=/var/lib/mysql/mysql.sock --mysql-user=root --oltp-read-only run
sysbench --num-threads=16 --max-requests=100000 --test=oltp --oltp-table-size=1000000 \
  --mysql-socket=/var/lib/mysql/mysql.sock --mysql-user=root run
Ok, now to the results, all numbers are transactions per second printed by sysbench.
Update: Morgan asked about the MySQL version and I realized I was using stone-aged-5.0.22. So I re-ran with 5.0.44 from the CentOS5-Testing repository. I also ran the benchmarks on an xlarge instance, with /mnt striped across 4 drives, once with my.cnf unchanged from large instance (4.5GB buffer pool) and once with 12GB buffer pool.
|Machine|MySQL version|Read-only (TPS)|Read-write (TPS)|
|---|---|---|---|
|EC2 small|5.0.22|227, 228, 230, 241|115, 116, 119|
|EC2 small|5.0.44|227, 229, 229|115, 115, 115|
|EC2 large|5.0.44|420, 428, 462|277, 310, 319|
|EC2 xlarge 4.5GB|5.0.44|620, 630, 637|463, 483, 495|
|EC2 xlarge 12GB|5.0.44|593, 598, 620|453, 481|
|AMD Sempron 64|5.0.22|383, 394|220, 225|

All numbers are transactions per second as printed by sysbench. Multiple values are from multiple benchmark runs.
The iMac is a dual-core, 2.16GHz, 2GB box with MySQL installed somehow and the machine was not 100% idle. The Sempron 64 is single core, “3400+” (2GHz), raid-1 7200rpm drives, 2GB RAM, not 100% idle (I really have gotten spoiled by EC2 and the ability to launch instances at a whim!). These tests are just meant as a ball-park point of comparison.
The benchmarks certainly confirm that the write performance on the small instances is, shall we say, lacking… I had expected a bigger improvement overall for the large instances. I guess for the read-only benchmark we’re seeing 2 disks vs. 1 disk, and on the read-write side we’re seeing 2 disks vs. “a problem”. With a real application load the large instance will often show a greater improvement over the small instance than shown here because the buffer pool increase can really make a huge difference.
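To make the comparison between instance sizes concrete, the per-run numbers can be averaged and expressed as a speedup over the small instance. The sketch below uses the 5.0.44 results and assumes that the first group of numbers for each machine is the read-only run and the second the read-write run (the order in which the sysbench commands are listed); the dictionary layout and the `speedup` helper are only for illustration.

```python
from statistics import mean

# Transactions per second from the MySQL 5.0.44 runs.
# Assumption: first group of values = read-only, second group = read-write.
read_only = {
    "small": [227, 229, 229],
    "large": [420, 428, 462],
    "xlarge 4.5GB": [620, 630, 637],
}
read_write = {
    "small": [115, 115, 115],
    "large": [277, 310, 319],
    "xlarge 4.5GB": [463, 483, 495],
}


def speedup(results, machine, baseline="small"):
    """Ratio of mean TPS on `machine` to mean TPS on the baseline machine."""
    return mean(results[machine]) / mean(results[baseline])


for machine in ("large", "xlarge 4.5GB"):
    print(machine,
          round(speedup(read_only, machine), 2),
          round(speedup(read_write, machine), 2))
```

Under that assumption the large instance comes out at a bit under 2x the small one for the read-only test and around 2.6x for read-write, so the gap is noticeably wider on the write side.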
Time to grab an x-large instance and try that…
NB: Morgan’s blog entry referenced at the top uses MyISAM tables while I used InnoDB tables.
We’ve been working on an Ubuntu RightImage for a while now and it just took longer than expected to iron out all the little wrinkles. But now it’s available as
RightImage Ubuntu7.04V1_14_2 with AMI ID ami-f3cc299a and location
This is our first Ubuntu public image so we hope we covered all bases, but we’re eager to hear whether everything works as expected. As always the script to create the image is also available. To run the script, launch an Ubuntu image with working bundling code and run through the steps…
A little background on RightImages for those of you not familiar with what we’re doing: to configure servers we use a base image and install software at boot time using our server templates and RightScripts. This is way more modular and maintainable than baking entire servers into AMIs. Please see our blog post on the rationale behind this. As a result, we produce base images that are small yet already have all the software utilities one typically needs in EC2 installed, from the Amazon EC2 tools to traceroute, curl, wget, etc. The second innovation we made with our images is that they are fully scripted and we publish the script. You can launch Amazon’s FC4 or FC6 image (well, for the Ubuntu RightImage you need to start with some Ubuntu instance), run our script, and out pops a clone of our RightImage. So if you want to see what we install or make some changes you can go right ahead.