Controlled Chaos

10 March 2021 - Articles

Major outages at the big public cloud providers such as Azure and AWS are rare, but they do happen. Today OVH had a major incident: “OVH datacenter burns down knocking major sites offline”. They’re not the only ones to experience these issues: Amazon had a major outage in November and Microsoft had one in September.

This prompted me to write up an article on Akimbo’s recent work building resilience into our platform. Today I’m going to talk a little bit about a couple of the AWS features that allow for significant resilience, and I’m going to do that by running you through my recent experiments on our platform, which can be roughly summarised as “Turning things off to see what breaks.”

Chaos Engineering

You see, there’s a discipline known as Chaos Engineering, which is just that: experimenting on production systems to build confidence in the system’s ability to withstand unexpected conditions such as hardware failure. This time I wasn’t just interested in what would happen; I was also interested in how long recovery would take.

As a baseline, I first looked at how quickly I could spin up a new server and manually configure it using our build scripts, to give the automated failover some context. I was able to spin up an EC2 instance, grab our codebase and run our build script to install all the prerequisites in…10 minutes 31 seconds. I was actually quite surprised at how quickly I could get a server up and running – but that’s mainly due to the benefits of Cloud Computing in general – and I’m not confident that I could do it that quickly under the duress of an outage.

Failing over with EC2

Our application is built in such a way that the web server and the database are decoupled entirely. That means it doesn’t matter which web server your request hits; it’ll be handled without a problem. This isn’t strictly required, but it’s nice that the web servers are effectively stateless: we don’t have to track or control which server you talk to, and it’s less impactful if a server suddenly becomes unavailable, as one of the others will handle the request for you.

Therefore, we just need to make sure there’s always a web server available and it makes no difference which one you talk to. We use a load balancer and “cloned” web servers. If a web server develops a problem, we can nuke it and spin up another from our known-good image. To achieve this we use an AWS Application Load Balancer, with EC2 Auto Scaling groups that automatically populate the Load Balancer’s target group.

This has two benefits: we use auto-scaling to scale horizontally when user demand goes up, and if a server becomes “unhealthy”, to use the term AWS uses, it will be replaced. The auto-scaling group is configured with a minimum number of hosts that must be healthy, so it will automatically spin up another if one fails, whether that’s because of a service failure or because I hit “Terminate” in the dashboard.
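To make that a bit more concrete, here is a rough sketch of what such a group could look like using boto3 (the AWS SDK for Python). This is not our actual configuration: the group name, launch template name, subnet IDs, target group ARN and the sizes are all placeholders I’ve made up for illustration. The parameters are built by a plain function so you can inspect them without touching AWS.

```python
def build_asg_params(name, launch_template, subnet_ids, target_group_arn,
                     min_size=2, max_size=6):
    """Build keyword arguments for create_auto_scaling_group.

    Every name and size here is an illustrative placeholder.
    """
    if min_size < 1 or max_size < min_size:
        raise ValueError("need 1 <= min_size <= max_size")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {"LaunchTemplateName": launch_template,
                           "Version": "$Latest"},
        "MinSize": min_size,
        "MaxSize": max_size,
        "DesiredCapacity": min_size,
        # Subnets in different Availability Zones, comma-separated.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # Register new instances with the load balancer's target group.
        "TargetGroupARNs": [target_group_arn],
        # Replace instances that fail the load balancer's health checks,
        # not only those that fail EC2 status checks.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 120,
    }


def create_group(params):
    """Create the group; needs boto3 installed and AWS credentials."""
    import boto3  # imported lazily so the builder works without it
    boto3.client("autoscaling").create_auto_scaling_group(**params)


params = build_asg_params(
    name="web-asg-example",
    launch_template="web-server-example",
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
    target_group_arn="arn:aws:elasticloadbalancing:eu-west-2:123456789012:"
                     "targetgroup/web-example/0123456789abcdef",
)
```

The important bits are the minimum size (so there is always spare capacity) and the “ELB” health check type, which is what lets a failed application health check trigger a replacement.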

If a web server goes down, then the load is automatically picked up by another available server and we retain enough capacity to have a server drop entirely without any unavailability – or at least that’s what’s supposed to happen. So let’s hit the Terminate button and see what actually happens!
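I used the dashboard for this, but if you wanted to script the same experiment, a minimal sketch with boto3 might look like the following. The group name is a placeholder and the sample response is made up and trimmed to the fields the code reads; treat it as a starting point rather than a finished tool.

```python
import random


def pick_victim(asg_description, rng=random):
    """Pick one in-service instance ID from a describe_auto_scaling_groups
    style response, or None if there is nothing to terminate."""
    candidates = [
        inst["InstanceId"]
        for group in asg_description["AutoScalingGroups"]
        for inst in group["Instances"]
        if inst["LifecycleState"] == "InService"
    ]
    return rng.choice(candidates) if candidates else None


def terminate_one(group_name):
    """Terminate a random in-service instance in the named group.

    Needs boto3 and AWS credentials; the group name is a placeholder.
    """
    import boto3
    client = boto3.client("autoscaling")
    description = client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name])
    victim = pick_victim(description)
    if victim is not None:
        # Leave desired capacity unchanged so the group launches a replacement.
        client.terminate_instance_in_auto_scaling_group(
            InstanceId=victim, ShouldDecrementDesiredCapacity=False)
    return victim


# Made-up response fragment, trimmed to the fields used above.
sample = {"AutoScalingGroups": [{"Instances": [
    {"InstanceId": "i-0aaa", "LifecycleState": "InService"},
    {"InstanceId": "i-0bbb", "LifecycleState": "Terminating"},
]}]}
print(pick_victim(sample))  # i-0aaa, the only in-service instance
```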

The instance terminates…and 1 minute 49 seconds later we’ve got a new instance serving requests. The Load Balancer’s health checks monitor the hosts in the target group; when the auto-scaling group sees that we’ve fallen below either the current demand or the minimum number of healthy instances (both are configurable), it automatically launches a replacement. Compared with the manual baseline, that is less than one-fifth of the time (109 seconds against 631), and the manual figure didn’t include the time it would take me to notice the outage, log in to our dashboard, get through the pesky multi-factor authentication and troubleshoot what was actually wrong.

1 minute 49 seconds is very fast, but the exact time will vary somewhat, in part because of the way an instance is determined to be “healthy”. I had the health checks set 15 seconds apart, with a node considered unhealthy after 2 failed checks. So that works out as up to 30 seconds to notice the failure and a little over a minute to respond to it and restore service. As before, you can tweak these settings to your needs.
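The arithmetic behind those numbers can be sketched in a few lines. The function name is my own, and the detection window is an approximate upper bound that ignores the per-check timeout.

```python
def detection_window(interval_s, unhealthy_threshold):
    """Approximate worst-case seconds before a dead instance is marked unhealthy."""
    return interval_s * unhealthy_threshold


manual_s = 10 * 60 + 31              # manual rebuild: 631 seconds
automated_s = 1 * 60 + 49            # automated replacement: 109 seconds

detect_s = detection_window(15, 2)   # checks 15 s apart, 2 failures needed
replace_s = automated_s - detect_s   # time left for launching the replacement
speedup = manual_s / automated_s     # how many times faster than manual

print(detect_s, replace_s, round(speedup, 1))  # 30 79 5.8
```

Because 30 seconds is the upper bound on detection, 79 seconds is the lower bound on how long the replacement itself took to launch and pass its health checks.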

Failover with RDS

We achieve the same thing with the databases, using Amazon RDS and auto-scaling. Specifically, we use Amazon Aurora with MySQL compatibility, which makes it possible to have multiple “Reader” instances and one “Writer” instance, allowing for load balancing across instances and failover.

For example, if I delete a Reader instance, that’s transparent to the end users, as the system automatically fails over to another Reader (or, if there are no other Readers left, reads from the Writer). If the Writer is deleted, a Reader is promoted and takes over the role. We’ve built the system in such a way that when an instance goes down, a new one spins up automatically. Again, this is achieved through auto-scaling.
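If it helps to see that behaviour written down, here is a toy model of it in Python. To be clear, this is only my simplification of the rules described above, not how Aurora is implemented: the class and instance names are invented, and the real service picks which Reader to promote based on failover priority rather than simply taking the first one.

```python
class ToyCluster:
    """Toy model of one Writer plus Readers; a simplification for illustration."""

    def __init__(self, writer, readers):
        self.writer = writer
        self.readers = list(readers)

    def read_target(self):
        # Prefer a Reader; fall back to the Writer if none remain.
        return self.readers[0] if self.readers else self.writer

    def delete(self, instance):
        if instance in self.readers:
            self.readers.remove(instance)
        elif instance == self.writer:
            if not self.readers:
                raise RuntimeError("no Reader left to promote")
            # Promote the first remaining Reader to Writer.
            self.writer = self.readers.pop(0)
        else:
            raise KeyError(instance)


cluster = ToyCluster("db-1", ["db-2", "db-3"])
cluster.delete("db-2")   # reads move to db-3
cluster.delete("db-1")   # db-3 is promoted to Writer
print(cluster.writer, cluster.read_target())  # db-3 db-3
```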

I’m not going to lie, terminating a production database instance, to verify that everything is working correctly, is a scary prospect – but validating that your failover is working correctly and gaining that confidence that systems are resilient is important.

That’s especially true given today’s news of the datacentre fire I mentioned earlier. You see, auto-scaling in AWS not only supports replacing failed systems, it also supports failing over into other Availability Zones, and it’s possible to fail over automatically into an entirely different region if there’s a major outage.

If you’re new to the terminology: a “Region” is a geographically distinct area such as “US East (Ohio)” or “Europe (London)”. Each Region has multiple isolated locations known as “Availability Zones”. So, London is eu-west-2 and has three Availability Zones: eu-west-2a, eu-west-2b and eu-west-2c. It’s worth noting that AWS maps these Availability Zone names per account, so eu-west-2c for you might not be the same physical location as my eu-west-2c.
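If you need to compare zones between accounts, the EC2 API also returns a zone ID alongside each zone name, and the ID refers to the same location in every account. Here is a small sketch of reading that mapping with boto3; the two sample responses are made up to show the idea and are trimmed to the two fields used.

```python
def zone_name_to_id(response):
    """Map account-specific zone names to account-independent zone IDs."""
    return {z["ZoneName"]: z["ZoneId"] for z in response["AvailabilityZones"]}


def fetch_zones(region="eu-west-2"):
    """Fetch the real mapping; needs boto3 and AWS credentials."""
    import boto3
    return boto3.client("ec2", region_name=region).describe_availability_zones()


# Made-up responses for two different accounts (illustrative values only).
account_a = {"AvailabilityZones": [
    {"ZoneName": "eu-west-2a", "ZoneId": "euw2-az2"},
    {"ZoneName": "eu-west-2c", "ZoneId": "euw2-az1"},
]}
account_b = {"AvailabilityZones": [
    {"ZoneName": "eu-west-2a", "ZoneId": "euw2-az1"},
    {"ZoneName": "eu-west-2c", "ZoneId": "euw2-az3"},
]}

mine = zone_name_to_id(account_a)
yours = zone_name_to_id(account_b)
# Same name, different location in this made-up example:
print(mine["eu-west-2c"] == yours["eu-west-2c"])  # False
```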

To account for that, whilst I run these checks I’m not only checking that failover is successful and testing how long the system takes to recover; I’m also testing that we fail over across AZs and that, if things get really bad, we can fail over to another region.
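For the “how long does recovery take” part, I timed things by hand, but a simple poller can do the same job. The sketch below is one way you might do it, using only the Python standard library; the function names and the health-check URL are placeholders, and the clock and sleep functions are injectable so the timing logic can be checked without waiting around.

```python
import time
import urllib.request


def http_probe(url, timeout_s=2.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, resets
        return False


def measure_recovery(probe, clock=time.monotonic, sleep=time.sleep,
                     poll_s=1.0, timeout_s=600.0):
    """Return seconds from the first failed probe until the next success.

    Returns 0.0 if the service never went down within the timeout, and
    None if it went down and did not recover within the timeout.
    """
    start = clock()
    down_since = None
    while clock() - start < timeout_s:
        ok = probe()
        now = clock()
        if not ok and down_since is None:
            down_since = now          # first observed failure
        elif ok and down_since is not None:
            return now - down_since   # first success after the failure
        sleep(poll_s)
    return 0.0 if down_since is None else None


# Example (placeholder URL, not run here):
# outage_s = measure_recovery(lambda: http_probe("https://example.com/health"))
```

The result is only as precise as the polling interval, so a one-second poll gives you a figure that is good to within about a second either side.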

Facebook’s motto used to be “Move Fast and Break Things”, but back in 2014 they changed it to “Move Fast With Stable Infra”. A big part of having stable infrastructure is building confidence in that stability – and Chaos Engineering is one way to build that confidence.
