Advice and tools for AWS AutoScaling
AutoScale is a really cool feature of AWS. It lets you define what infrastructure you would like to be running, and then it automatically creates and destroys instances to maintain that state.
This is really great for a few reasons:
- Scaling: If you get a surge of traffic, new capacity can come up automatically to handle it
- Availability: If a bunch of instances die at once in an emergency, AutoScale will automatically replace them for you
- Cost management: Automatically reduce capacity during off-peak load to save money
We started using it for a number of our services over the last few months and it’s been great. Some of our older services are not AutoScaled yet, but I’ll move those over at some point and, going forward, all new services will be architected for AutoScale from the start. This post discusses some of the considerations and tools we use for AutoScale here at Wish. We’re releasing one of them as open source; it makes it easier to set up and manage AutoScaled deployments.
Making Instances Ephemeral
To use AutoScale effectively, you need to architect your instances to be ephemeral. That doesn’t necessarily mean using ephemeral storage (although that’s usually a good idea), but you need to be able to survive any AutoScaled instance being terminated or created at any time in an unattended way. (As an aside, a great tool for making sure things continue working like that is Netflix’s Chaos Monkey.) Getting to that point isn’t trivial, but it lets you sleep a lot better at night knowing that even if an entire availability zone goes down or you get hammered with traffic, you can trivially add the capacity you need.
For stateless services, this probably isn’t too bad. If your frontends are behind a load balancer, you can automatically add and remove instances from it, and if your backends all read from a common queue, adding or removing servers there should be easy. It’s obviously trickier for more stateful things like a database, where you need to do non-trivial I/O to go from a fresh DB server to something useful, and we don’t AutoScale those right now for that reason.
Tools like Chef and Puppet help a lot here. The learning curve can be a bit steep, but you should check them out if you haven’t already and you’re thinking of doing something like this. In a nutshell, they give you a nice way to describe the state of a server so you can automatically go from a fresh OS install to a configured server.
One tool we built to help with this process takes a Chef “role” (i.e. a type of server to configure) and a base Ubuntu AMI then automatically creates a new AMI for that server type. This way, whenever we make a major infrastructure change, I can run a single script to get a new AMI with the change baked right in. That makes it easy to make changes across a lot of infrastructure.
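The baking flow described above can be sketched roughly like this. It is not our actual tool: the `Cloud` interface and its method names (`launch`, `run_chef_role`, `create_image`, `terminate`) are hypothetical stand-ins for whatever EC2 and Chef calls you use, and `create_image` is assumed to block until the image is ready.

```python
from typing import Protocol


class Cloud(Protocol):
    """Hypothetical interface; real code would wrap EC2 and Chef calls."""

    def launch(self, ami_id: str) -> str: ...
    def run_chef_role(self, instance_id: str, role: str) -> None: ...
    def create_image(self, instance_id: str, name: str) -> str: ...
    def terminate(self, instance_id: str) -> None: ...


def bake_ami(cloud: Cloud, base_ami: str, role: str, version: str) -> str:
    """Boot the base AMI, converge it to a Chef role, and snapshot the result."""
    instance_id = cloud.launch(base_ami)
    try:
        cloud.run_chef_role(instance_id, role)
        # Assumed to return only once the new image is usable.
        return cloud.create_image(instance_id, f"{role}-{version}")
    finally:
        # Always clean up the temporary builder instance, even on failure.
        cloud.terminate(instance_id)
```

Keeping the cloud calls behind a small interface makes the flow easy to exercise with a fake before pointing it at real infrastructure.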
Once you can configure the OS and basic services the way you want in a reliable way, you probably need to deploy your latest code to the box and do other init tasks when it comes up.
Our AMI creation tool leaves a script in the AMI that, using Ubuntu’s cloud-init, we run when the new instance is first booted.
This script does a few things:
- Register the new node with the Chef server
- Run Chef on the node to pick up any new changes since the AMI was created
- Deploy the latest production code
- Start our app
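As a rough illustration of those four steps, here is a minimal sketch of a first-boot script. Only `chef-client` is a real program here; the three script paths are hypothetical placeholders for your own registration, deploy, and start scripts, and this is not the script our AMI tool actually ships.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# Step name paired with the command to run. The /opt/... paths are
# hypothetical placeholders, not real tools.
BOOT_STEPS = [
    ("register node with the Chef server", ["/opt/bootstrap/register_node"]),
    ("run Chef to pick up recent changes", ["chef-client"]),
    ("deploy latest production code", ["/opt/bootstrap/pull_code"]),
    ("start the app", ["/opt/bootstrap/start_app"]),
]


def run_boot_steps(
    steps: Sequence[Tuple[str, List[str]]],
    run: Callable[[List[str]], object] = subprocess.check_call,
) -> List[str]:
    """Run each step in order and return the names of the completed steps."""
    done = []
    for name, command in steps:
        # check_call raises on a non-zero exit, which halts the sequence
        # so the app never starts on a half-configured box.
        run(command)
        done.append(name)
    return done
```

Calling `run_boot_steps(BOOT_STEPS)` from the cloud-init script would run the steps for real; passing a different `run` callable lets you dry-run the sequence.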
So far, that pattern of baking most infrastructure changes into the AMI and then having a script to grab the latest Chef changes and code has given us a good balance of low time-to-launch and ease of maintenance.
Before AutoScale, our deployment process basically involved a tool to scp the new code to each host and run commands to restart services. The problem with this push-based deployment is that if a new server comes up, you need an operator to push the right version of the code to it. This is fine if bringing up new capacity is always attended and happens rarely, but AutoScale could create a new instance in the middle of the night and needs to be able to get the latest code itself.
So, we changed to a pull-based deployment. In a nutshell, our new deploy tool:
- Pushes the code to S3
- Gets hostnames of running instances from Chef
- On each one, runs a script that pulls the code from S3 and restarts the app (same as steps 3 & 4 from above)
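A minimal sketch of that flow, assuming three hypothetical callables you would supply yourself: `upload` (push the build to S3), `list_hosts` (ask Chef for running hostnames), and `run_on_host` (trigger the pull-and-restart script on one host). None of these names come from our actual deploy tool.

```python
from typing import Callable, Dict, Iterable


def deploy(
    version: str,
    upload: Callable[[str], None],
    list_hosts: Callable[[], Iterable[str]],
    run_on_host: Callable[[str, str], None],
) -> Dict[str, bool]:
    """Publish a build, then ask every running host to pull it."""
    # Upload first, so an instance that boots mid-deploy can already
    # fetch the new build on its own.
    upload(version)
    results = {}
    for host in list_hosts():
        try:
            run_on_host(host, version)
            results[host] = True
        except Exception:
            # Sketch only: record the failure and keep going so one bad
            # host does not block the rest of the fleet.
            results[host] = False
    return results
```

The returned mapping gives the operator a per-host success report. Because the build lives in S3, a host that failed or was replaced can still fetch it the next time it boots.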
One little gotcha with our approach is that if you’re churning through instances constantly, your Chef or Puppet server will end up with a ton of nodes that have been terminated. Our solution was to run a script every few minutes that looks for EC2 instances in the “terminated” state and removes them from Chef. So far that’s worked pretty reliably.
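The core of such a cleanup script is just a set comparison. This sketch assumes each Chef node records the EC2 instance ID it runs on. The `chef_nodes` mapping and the function name are hypothetical, and fetching the data from Chef and EC2 is left out.

```python
from typing import Dict, Iterable, List


def stale_chef_nodes(
    chef_nodes: Dict[str, str], terminated_ids: Iterable[str]
) -> List[str]:
    """Return Chef node names whose EC2 instance has been terminated.

    chef_nodes maps Chef node name -> EC2 instance ID.
    """
    terminated = set(terminated_ids)
    return sorted(
        name for name, instance_id in chef_nodes.items()
        if instance_id in terminated
    )
```

The caller would then delete each returned node from the Chef server. Matching on instance ID rather than hostname avoids removing a live node that happens to reuse a name.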
The other hurdle to using AutoScale (especially when you’re first starting out) is that there is no UI to change or see your configuration. You can use the AWS command-line tools to manage it, but that can become error-prone and doesn’t scale well to many different engineers managing many different AutoScaling groups.
To manage this at Wish, we built a tool, which I’m releasing as open source, that drives this process from a config file. We chose the config file approach because it distills AutoScaling down to a simpler model that’s easy to work with, gives ops a complete picture of the current state, and can be tracked via source control.
You can find this tool, AutoScaleCTL, on GitHub.
- Download code from GitHub
- pip install -r pip-requirements (you can probably skip this if you have a fairly recent
- sudo python setup.py install (makes
/usr/local/bin, so you’ll probably need sudo here)
If you haven’t already, set up your boto config file at
    [Credentials]
    aws_access_key_id = <your access key>
    aws_secret_access_key = <your secret key>
The first step is to copy and edit the sample autoscale.yaml to fit your configuration. A lot of sections are optional so you can start simple to play around and then build out more complexity as you go. It doesn’t support every feature of AutoScale, but it supports a pretty good set. If you want to add more, feel free to send a pull request.
When that’s done, run
One important note is that it doesn’t support removing AutoScaling groups or alarms. So, if you delete a section from the config file, it won’t be deleted in AWS. You’ve gotta take care of that one manually.
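To make that behavior concrete, here is a sketch of how a config-driven tool might compare the file against what exists in AWS. This is an illustration of the idea, not AutoScaleCTL’s actual code; `plan_changes` and the dictionary shapes are assumptions.

```python
from typing import Dict, List


def plan_changes(
    desired: Dict[str, dict], current: Dict[str, dict]
) -> Dict[str, List[str]]:
    """Compare groups in the config file (desired) with groups in AWS (current)."""
    create = sorted(name for name in desired if name not in current)
    update = sorted(
        name for name in desired
        if name in current and desired[name] != current[name]
    )
    # Present in AWS but missing from the config. These are only reported,
    # never deleted, so they need manual cleanup.
    orphaned = sorted(name for name in current if name not in desired)
    return {"create": create, "update": update, "orphaned": orphaned}
```

Reporting orphaned groups instead of deleting them is the conservative choice: a typo in the config file can’t tear down running capacity.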
One big omission from this is support for spot instances. The trick with spot instance AutoScaling is that you need to make sure your total capacity needs are always met while trying to launch spot instances over on-demand instances when possible. There’s no explicit support for this in AutoScale itself (you can create an AutoScaling group that will buy spot instances up to your bid price, but the trick is figuring out how to calibrate alarms so that on-demand capacity kicks in when needed but doesn’t replace available spot capacity). If you have any experience trying this out that you want to share or want to try it out, I’d love to hear from you.
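The capacity goal itself is simple to state, even though getting alarms to enforce it is the hard part. As a sketch of the arithmetic only (the function and parameter names are made up for illustration, and nothing here is something AutoScale does for you):

```python
from typing import Tuple


def split_capacity(required: int, spot_available: int) -> Tuple[int, int]:
    """Return (spot, on_demand) instance counts that together meet `required`.

    Spot is preferred; on-demand only covers whatever spot cannot.
    """
    if required < 0 or spot_available < 0:
        raise ValueError("counts must be non-negative")
    spot = min(required, spot_available)
    on_demand = required - spot
    return spot, on_demand
```

For example, needing 10 instances with 7 obtainable as spot gives 7 spot and 3 on-demand. If spot capacity disappears entirely, the whole requirement falls back to on-demand.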
Another big area for future work here is around supporting a wider range of metrics. CPU, I/O, or memory stats can be a bit of a blunt measure of demand. Better support around custom metrics would make it easier to directly define scaling rules in terms of the things that matter (latency, queue sizes, request volume, etc).
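As an example of what a rule defined directly on a demand metric could look like, here is a sketch that sizes a worker group from queue length. The function, its parameters, and the idea of a fixed per-instance throughput are all assumptions made for illustration.

```python
import math


def desired_capacity(
    queue_length: int, jobs_per_instance: int, min_size: int, max_size: int
) -> int:
    """Size a group so each instance handles at most jobs_per_instance jobs."""
    if jobs_per_instance <= 0:
        raise ValueError("jobs_per_instance must be positive")
    wanted = math.ceil(queue_length / jobs_per_instance)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, wanted))
```

A rule like this responds to the backlog itself, instead of waiting for the backlog to show up indirectly as CPU load.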
There are also a bunch of fields that aren’t supported in the config file. I got the required ones (and the ones we use at Wish), but things like attaching instances to an ELB aren’t supported right now. It’s pretty easy to add those things if you find you need them (feel free to ping me if you’re not sure how) and I’ll keep the tool updated as I improve on it for Wish.