TvE 2100

At 2100 feet above Santa Barbara

RightGrid: The Easy Way to Scale Your Background Jobs

We’ve found that many sites have background jobs that they want to scale, and the flexibility and pricing of Amazon EC2 are just irresistible. The best architecture for this “background processing” or “batch processing” type of problem is a combination of Amazon EC2, SQS, and S3. The web server or the batch processing job manager enqueues a work descriptor for each task on an SQS queue and stores the data to be processed in S3. Then a bunch of worker servers pull items off the queue and start working. This typically involves fetching the data from S3, processing it, storing the results back in S3, and enqueueing a result descriptor on another SQS queue. The web server or another central server can then pull items off the result queue and “do the right thing” with the results. Sometimes this means setting flags in the central database or kicking off additional work items.
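To make the flow concrete, here is a minimal sketch in Python. It is only an illustration: a `deque` stands in for each SQS queue and a `dict` stands in for the S3 bucket, and every name in it (`submit_task`, `run_worker`, `collect_results`, `process`, the key layout) is made up for the example rather than taken from any real API.

```python
import json
from collections import deque

# In-memory stand-ins (assumptions): a deque plays the role of an SQS queue,
# a dict plays the role of an S3 bucket.
work_queue = deque()
result_queue = deque()
bucket = {}


def submit_task(task_id, payload):
    """Producer side: store the input data, then enqueue a work descriptor."""
    input_key = f"input/{task_id}"
    bucket[input_key] = payload
    work_queue.append(json.dumps({"task_id": task_id, "input_key": input_key}))


def process(data):
    """Hypothetical processing step; a real worker does the heavy lifting here."""
    return data.upper()


def run_worker():
    """Worker side: drain the work queue, storing results and result descriptors."""
    while work_queue:
        descriptor = json.loads(work_queue.popleft())
        data = bucket[descriptor["input_key"]]
        output_key = f"output/{descriptor['task_id']}"
        bucket[output_key] = process(data)
        result_queue.append(
            json.dumps({"task_id": descriptor["task_id"], "output_key": output_key})
        )


def collect_results():
    """Central server side: pull result descriptors and fetch the results."""
    results = {}
    while result_queue:
        descriptor = json.loads(result_queue.popleft())
        results[descriptor["task_id"]] = bucket[descriptor["output_key"]]
    return results


submit_task("t1", "hello")
submit_task("t2", "world")
run_worker()
results = collect_results()
print(results)
```

The descriptors on the queues stay small because they only carry keys; the actual data travels through the bucket.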

Writing a prototype of this from scratch is not a whole lot of work. But if you want to put it into production, the scope of the work quickly balloons. You want to monitor the queue and the workers in a reasonable manner. How many items are in the queue over time? How many worker servers are running? How long does it take a work item to be processed? And so on. Then you notice that the load fluctuates a lot over time, and you want to take advantage of the “on-demand” nature of EC2: launch worker servers automatically when the queue grows too large, and terminate them again to save costs when the queue shrinks. Of course something goes wrong every now and then, and you want to know about it and troubleshoot it. But since servers “go away” when the load drops, you need to store log files centrally so you can actually check what happened and have a chance at fixing the problem. And how about some manual controls so you can override the automatic scaling, because some important items need to be processed ASAP, or because there is a lot of work queued up that you don’t care about being processed all that quickly…
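The scaling decision itself can be quite small. The sketch below shows one possible rule, sizing the worker pool from the queue length with an optional manual override. The function names, the 100-items-per-worker ratio, and the 20-worker cap are all assumptions picked for illustration, not RightGrid settings.

```python
import math


def desired_workers(queue_length, items_per_worker=100, min_workers=0, max_workers=20):
    """Size the pool from the backlog, clamped to the configured limits."""
    needed = math.ceil(queue_length / items_per_worker)
    return max(min_workers, min(max_workers, needed))


def scaling_action(queue_length, running, manual_target=None):
    """Return (action, count); a manual target overrides the automatic rule."""
    if manual_target is not None:
        target = manual_target
    else:
        target = desired_workers(queue_length)
    if target > running:
        return ("launch", target - running)
    if target < running:
        return ("terminate", running - target)
    return ("hold", 0)


print(scaling_action(950, running=3))                  # backlog grew
print(scaling_action(0, running=4))                    # queue is empty
print(scaling_action(0, running=0, manual_target=5))   # operator override
```

A production rule would probably also look at the age of the oldest item and add some damping so that servers are not launched and terminated in quick succession.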

At RightScale we think it makes no sense for everyone to reinvent the wheel and implement all the above features. Wouldn’t it be so much nicer if you could just drop the core of your worker application into a ready-made image and let us deal with all the stuff that makes it hum? That’s exactly what RightGrid is!

What RightGrid provides:

* A framework in the form of scripts that run on your worker instances and handle the following sequence of steps:
    1. pulling work items from SQS
    2. retrieving associated data from S3
    3. invoking your actual processing code on the data retrieved from S3 and stored in a local temp directory
    4. uploading result files produced by the processing to S3
    5. pushing a result queue entry onto an SQS result queue
    6. making a result HTTP POST to your web server (or other)
    7. pushing an audit entry with status and timing information for all the steps onto an audit queue
    8. pushing log file entries to S3 indexed so they can be retrieved for later troubleshooting
    9. terminating the instance if no work is left
* Monitoring of the work queue size/age and automatic launching of additional work instances when configurable load thresholds are exceeded
* Processing the audit queue to collect performance statistics and flag errors
* Ruby gems to interface your web server or other job producer to SQS and S3 with error handling and persistent connections
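As an illustration of what an audit entry with status and timing information might look like, here is a small Python sketch that runs a list of named steps in order and times each one. It is not RightGrid's code: `run_with_audit`, the step names, and the fields of the audit dictionary are all hypothetical.

```python
import time


def run_with_audit(task_id, steps):
    """Run (name, callable) steps in order and return an audit entry.

    Stops at the first failing step and records which step failed and why.
    """
    audit = {"task_id": task_id, "status": "ok", "timings": {}}
    for name, step in steps:
        started = time.perf_counter()
        try:
            step()
        except Exception as exc:
            audit["status"] = "error"
            audit["failed_step"] = name
            audit["error"] = str(exc)
            break
        finally:
            # Record the elapsed time whether the step succeeded or failed.
            audit["timings"][name] = time.perf_counter() - started
    return audit


def fail_upload():
    """Hypothetical step that simulates a failed upload."""
    raise IOError("upload failed")


steps = [
    ("fetch_input", lambda: None),
    ("process", lambda: None),
    ("upload_results", fail_upload),
    ("enqueue_result", lambda: None),
]
audit = run_with_audit("t1", steps)
print(audit["status"], audit.get("failed_step"))
```

In a real worker the returned dictionary would be serialized and pushed onto the audit queue, where a separate process can aggregate timings and flag the errors.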

The RightGrid architecture is used both as a back end to web servers and as a batch job system. In the web server case, individual work items are typically enqueued in response to user actions on the web site, whereas in batch job settings large sets of tasks are enqueued in rapid succession when a job is launched.

The benefit of RightGrid lies not only in letting your application scale up when it’s needed, but also in scaling it back down when it’s not, which saves costs. It’s one of the key components we have created at RightScale to help fulfill the promise of utility computing that Amazon Web Services offers.

If you’re interested in learning more about RightGrid, please contact us. RightGrid is available with RightScale’s premium licensed accounts.