Thursday, August 2, 2012

AWS Load Spike o' Death

My company has been using Amazon Web Services for our production web architecture for a while now. We made the switch during the summer months, when our usage is a bit lighter, and slowly gained experience in how to make the best use of the services for our needs. It's been a fun ride, and I'm a convert.

But there are definitely some aspects of using AWS that are unique, and one of them killed my service one day. There's an issue with EC2 instances encountering a load spike every day at about the same time. And I'm not the only one seeing it. Check out this search on the AWS Developer Forum.

Some of us have been hunting this problem for more than a year now. We've been able to isolate some characteristics of the problem, but we still don't know what's really happening.

The problem appeared for us one day as our load climbed to our normal peak levels. We run mod_perl inside Apache, which has something of a RAM footprint, but nothing really noteworthy. Our system had been responding just fine at a load average of about 0.6 when suddenly the load spiked up into the 30s. The machine locked up completely, taking our site offline. With no idea of what was going on, I was afraid we were under a DoS attack of some kind. It took us two hours to get our system back online, and we then spent the rest of the day nervously monitoring performance and trying to figure out what had happened.

The next day, the same thing happened, at essentially the same time. Load spike into the 30s, complete meltdown. We were more prepared this time and got our service back online in a few minutes, although my business team was less than amused. Expecting this to be a pattern, we went into overkill mode, beefing up each of the links in our deployment architecture to handle ridiculous amounts of load. AWS makes this easy - Launch Instance - click! Too much of everything was just enough.

The next day, the load spike did not disable our service, and we began a long process of watching each day for the spike, and trying to instrument our setup to figure out what was happening. We eliminated at least one theory per day for weeks. Actual load? DoS? Throttling? Database job? Timeout? cron? iptables? After weeks (months) of experiments, we still have no idea, and we still see the spike in our cpu traces, virtually every day.

For a while, it seemed like we were seeing a kernel bug, something really deep in the swap handling. The problem definitely gets really bad if we allow Apache to start up as many handlers as it would like to, and we run out of memory. No surprise. But using a patched kernel didn't make the spike go away. Nor did changing Linux distros.

After that, for a while I thought it might be a disk bandwidth problem, maybe an issue with access to EBS during internal AWS cleanup or backup processes. The problem definitely shows up in vmstat as a spike in the io(bi) measure, followed by a longer period of elevated io(bo). But switching our servers off of EBS didn't change the spike behavior either.
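
For anyone who wants to watch for the same signature, a simple way to capture it is to timestamp vmstat output into a log and look at the bi/bo columns (under the io heading) around spike time; the log path here is just an example:

    # Sample system stats every 5 seconds, prefix each line with a UTC timestamp,
    # and append to a file we can dig through after the next spike.
    vmstat 5 | while read line; do
        echo "$(date -u '+%H:%M:%S') $line"
    done >> /var/log/vmstat-spike.log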

At this point, the load spike is a feature of my universe, a force of nature. Like the tide, it is predictable and reliable, but I have no knowledge of what hidden force is causing it. I can see its effect in the stream of questions on the AWS Developer forums but have no answers. We've learned to manage the spike with good system discipline and a couple minor architectural changes, and the system admin's traditional over-provisioning. My favorite conspiracy theory is that the load spike is a ploy by the Amazon folks to get everyone to upsize their EC2 instances, thus increasing revenue. But I don't actually think that.

For any wayward Googlers searching for relief from the same problem, I can recommend you try several configuration changes:

1. Most important: use top or smem to get some idea of how large your Apache processes are, and set MaxClients low enough that the total fits within your available RAM (see the config sketch after this list). That clips one horrible failure mode by preventing Apache from starting enough new servers to send your machine into swapping hell, effectively killing your service.

2. Set your MinSpareServers and MaxSpareServers relatively low. Ours are at 1 and 2, again to keep Apache from starting up gobs of new servers when the load spike hits. (It seems to take some CPU for the VM to clear the problem, and starting up new processes makes that task take longer.)

3. Spread out your load. Our workload has a strong seasonality component to it, so we bring up extra resources going into that period and take them down later. Even though the spike hits all of the servers at the same time, more cpu means faster clearing. If your loads are less predictable, you can always address the cpu spike problem by starting up an extra instance or two for the couple hours around "spike time" (22:00 - 00:00 GMT).

4. Try out a high-CPU EC2 instance type. Looking through the forums, you can see that the problem hits hardest for people running a small instance relatively close to its ceiling. That's what we were doing originally ("it's just Apache, why should we allocate a lot of resources to it?"), and the switch to c1.medium made a huge difference even though the cost difference is minor.

5. Take advantage of caching. This is sort of a no-brainer, but we were letting our Apache/mod_perl processes serve everything, including all of the static objects. That turned out to be a poor use of RAM: the mod_perl libraries take up a ton of memory and are completely unused when serving statics. Adding a caching tier drastically reduced the activity on our app tier, which in turn reduced the number of servers Apache thought it needed to run, which reduced RAM demand, and so on.
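
To make items 1 and 2 concrete, here's a sketch of the relevant prefork settings. The spare-server values are the ones we actually run; the MaxClients figure and the file path are purely illustrative (a Debian-style layout is assumed), so do the arithmetic against your own measured footprint:

    # /etc/apache2/apache2.conf (or your distro's equivalent) -- prefork MPM sizing.
    # Measure first, e.g.:  smem -t -k -P apache2   (the process name may be httpd)
    # Example arithmetic: ~120 MB per mod_perl child and ~3 GB set aside for Apache
    # gives 3000 / 120 = 25, so cap MaxClients at 25.
    <IfModule mpm_prefork_module>
        StartServers          2
        MinSpareServers       1
        MaxSpareServers       2
        MaxClients           25     # hard ceiling: never let children push the box into swap
        ServerLimit          25
        MaxRequestsPerChild 1000    # recycle children so per-process memory doesn't creep up
    </IfModule>

The exact numbers matter far less than the discipline: MaxClients times your per-child footprint should never exceed the RAM you can actually spare.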

And please let me know if you find any clues about this crazy thing.

28-May-2013 Update:

The last time we touched our server configs, I made one minor change to test out the guess that this was somehow related to EBS use: set APACHE_LOG_DIR to point to local ephemeral storage rather than the EBS volume. The idea stemmed from our observation that new instances didn't seem susceptible to the load spike until they had been through some significant log-writing, leading us to wonder if the spike itself somehow originated out of the EBS channel.

We have recently been through another of our load periods, and have seen the load spike appear in our logs again, with one significant difference: we only see the spike on our database machine. The various apache servers (caching and app tiers) show no sign of the spike even during periods where the additional load is very clear on the database tier.

Though we still have no clear cause for the issue, it would seem that moving your Apache logs to local ephemeral disk can help.
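
For anyone who wants to try the same change, here's roughly what it looks like on a Debian/Ubuntu-style Apache layout; the directory name is just an example, and /mnt is the usual mount point for the first ephemeral (instance-store) volume, so check your own instance:

    # /etc/apache2/envvars -- send Apache's logs to instance-store disk
    # instead of the EBS-backed volume.
    export APACHE_LOG_DIR=/mnt/apache2-logs

    # One-time setup (as root), then restart Apache:
    #   mkdir -p /mnt/apache2-logs && chown root:adm /mnt/apache2-logs
    #   service apache2 restart

One caveat: ephemeral storage does not survive an instance stop/start, so if you need to keep your logs, ship them somewhere durable as well.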
