A very interesting read on how Netflix weathered a failure that affected an entire AWS region and caused more than a few hours of downtime.
The importance of building cloud services to resist failure was demonstrated on Sunday when Amazon Web Services (AWS) suffered major disruption.
Some of the internet’s biggest sites and apps were intermittently unavailable after more than 20 services on the AWS platform began failing.
The outages affected AWS services run out of datacenters in Northern Virginia – which the company refers to as the US-EAST-1 region.
AWS is used by many major firms to support their online operations. As a result, users of Netflix, Tinder, Airbnb and IMDb reported problems accessing content during the six- to eight-hour period that Amazon’s cloud services were affected. The problems also hampered Amazon products such as the Echo, Amazon’s smart speaker that uses a cloud-based voice recognition system called Alexa.
The problems seem to have first appeared at 3am PDT on Sunday – when users began experiencing increased error rates on AWS’s NoSQL database DynamoDB.
However, increased errors and latency then began cropping up in about 22 other AWS services – including major offerings such as AWS Elastic Compute Cloud (EC2), the virtual desktop service AWS WorkSpaces and the event-driven compute service AWS Lambda.
Within a couple of hours AWS had identified the “root cause” of the problems with DynamoDB, pinpointing it as a “failure of an internal sub-service that manages table and partition information”.
Shortly after 9am PDT, AWS had resolved the problems with DynamoDB and reported it as operating normally. Most services were running normally by this time, although a few others, such as AWS Auto Scaling, were affected until close to 11.30am.
How Netflix fought disruption with chaos
One AWS customer that managed to avoid any “significant impact” from the outages, according to a spokesman, was the video streaming site Netflix.
The online media giant relies on Amazon Web Services to stream movies and TV shows to more than 50 million homes worldwide and was able to “quickly” restore the service to full operation, according to the spokesman.
Helping Netflix to weather the disruption was its practice of what it calls “chaos engineering”.
The engineering approach sees Netflix deploy its Simian Army, software that deliberately attempts to wreak havoc on its systems. Simian Army attacks Netflix infrastructure on many fronts – Chaos Monkey randomly disables production instances, Latency Monkey induces delays in client-server communications, and the big boy, Chaos Gorilla, simulates the outage of an entire Amazon availability zone.
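To make the Chaos Monkey idea concrete, here is a minimal Python sketch of random instance termination. It is an illustration only, not Netflix's code: the fleet is a plain in-memory dictionary, and the names run_chaos_round, fleet and probability are hypothetical. A real tool would call a cloud provider's API to terminate instances rather than flip a string.

```python
import random


def run_chaos_round(fleet, probability=0.2, rng=None):
    """Randomly mark running instances as terminated.

    fleet       -- dict mapping a (hypothetical) instance ID to its state
    probability -- chance that each running instance is picked this round
    rng         -- optional random.Random, so a round can be made repeatable
    Returns the list of instance IDs terminated in this round.
    """
    rng = rng or random.Random()
    victims = []
    for instance_id in sorted(fleet):
        if fleet[instance_id] != "running":
            continue  # skip instances that are already down
        if rng.random() < probability:
            fleet[instance_id] = "terminated"  # stand-in for a real API call
            victims.append(instance_id)
    return victims


if __name__ == "__main__":
    demo_fleet = {"i-%03d" % n: "running" for n in range(10)}
    killed = run_chaos_round(demo_fleet, probability=0.3)
    print("terminated this round:", killed)
```

The point of running something like this continuously is that services are forced to tolerate the loss of any single instance long before a real outage tests them.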
By constantly inducing failures in its systems, the firm is able to shore itself up against problems like those that affected AWS on Sunday.
In that instance, Netflix was able to rapidly redirect traffic from the impacted AWS region to datacenters in an unaffected area.
Netflix was able to do this because it practices what it refers to as multi-region, active-active replication – where all of the data needed for its services is replicated between different AWS regions in a way that allows rapid recovery from failures.
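The traffic-shifting side of an active-active setup can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of Netflix's actual routing: the function name route_weights, the health flags and the traffic shares are all hypothetical. It assumes every region already holds a full copy of the data, so traffic from a failed region can simply be spread across the healthy ones in proportion to their normal share.

```python
def route_weights(base_weights, healthy):
    """Redistribute traffic shares across healthy regions only.

    base_weights -- dict of region name -> normal share of traffic
    healthy      -- dict of region name -> True if the region is serving
    Returns a dict of region name -> adjusted share (unhealthy regions get 0.0).
    """
    live = {r: w for r, w in base_weights.items() if healthy.get(r, False)}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy region available to take traffic")
    return {r: (live[r] / total if r in live else 0.0) for r in base_weights}


if __name__ == "__main__":
    # Hypothetical shares; the region names follow AWS naming conventions.
    shares = {"us-east-1": 0.5, "us-west-2": 0.25, "eu-west-1": 0.25}
    status = {"us-east-1": False, "us-west-2": True, "eu-west-1": True}
    print(route_weights(shares, status))
    # {'us-east-1': 0.0, 'us-west-2': 0.5, 'eu-west-1': 0.5}
```

The design consequence is that each surviving region must have enough spare capacity and a complete data copy to absorb the extra load, which is where the duplication cost comes from.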
“Complete regional infrastructure outage is extremely unlikely, but our pace of change sometimes breaks critical services in a region, and we wanted to make Netflix resilient to any of the underlying dependencies,” Netflix said in a blog post outlining the practice.
Adrian Cockcroft, former chief architect for high performance technical computing at Netflix, said on Twitter that active-active replication adds about “25 percent” to costs and described the approach as an “insurance policy”.
“Most of the extra cost is the storage tier duplication at 100% scale both sides all the time,” he added.
Netflix uses Apache Cassandra, an open-source NoSQL distributed database. To maintain availability, the service has to run a “few thousand” Cassandra nodes in “all regions”, Cockcroft said.