On Sunday, September 20, an outage at Amazon Web Services (AWS) affected a number of other large services that rely on it. Performance was degraded for several hours before everything returned to normal. Below are some insights into how the outage was reported, along with details on its causes, based on information from the AWS status board.
AWS suffered a service disruption on Sunday, September 20, affecting many other services and websites that rely on it. Sites like The Register described it as a “monster outage”. The trouble seems to have started in Amazon's Northern Virginia region, US-EAST-1.
As of Monday, September 21, all services on the status page were reported to be “operating normally” again. The service disruption is still listed there, though: scrolling quite a bit down the page leads to the “Status history”, where a click on the red icon displays the explanation given for parts of the outage, including what the engineers believe caused it: “The root cause began with a portion of our metadata service within DynamoDB.” As a result, APIs were throttled, which lowered data throughput.
Below is what can be extracted from the AWS status board as to how this downtime started and how it was communicated:
[RESOLVED] Increased API error rates
3:00 AM PDT We are investigating increased error rates for API requests in the US-EAST-1 Region.
3:26 AM PDT We are continuing to see increased error rates for all API calls in DynamoDB in US-East-1. We are actively working on resolving the issue.
4:05 AM PDT We have identified the source of the issue. We are working on the recovery.
4:41 AM PDT We continue to work towards recovery of the issue causing increased error rates for the DynamoDB APIs in the US-EAST-1 Region.
4:52 AM PDT We want to give you more information about what is happening. The root cause began with a portion of our metadata service within DynamoDB. This is an internal sub-service which manages table and partition information. Our recovery efforts are now focused on restoring metadata operations. We will be throttling APIs as we work on recovery.
5:22 AM PDT We can confirm that we have now throttled APIs as we continue to work on recovery.
5:42 AM PDT We are seeing increasing stability in the metadata service and continue to work towards a point where we can begin removing throttles.
6:19 AM PDT The metadata service is now stable and we are actively working on removing throttles.
7:12 AM PDT We continue to work on removing throttles and restoring API availability but are proceeding cautiously.
7:22 AM PDT We are continuing to remove throttles and enable traffic progressively.
7:40 AM PDT We continue to remove throttles and are starting to see recovery.
7:50 AM PDT We continue to see recovery of read and write operations and continue to work on restoring all other operations.
8:16 AM PDT We are seeing significant recovery of read and write operations and continue to work on restoring all other operations.
9:12 AM PDT Between 2:13 AM and 8:15 AM PDT we experienced high error rates for API requests in the US-EAST-1 Region. The issue has been resolved and the service is operating normally. (Source: AWS Status Board)
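The timestamps in these updates allow a rough reckoning of how long the outage lasted and how quickly it was communicated. The following is a minimal sketch in Python, assuming that the 2:13 AM start named in the final update marks the beginning of customer impact and that all times fall on the same day; the helper name minutes_between is only illustrative.

```python
from datetime import datetime

# Clock format used in the status updates, e.g. "2:13 AM".
TIME_FORMAT = "%I:%M %p"


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end (same-day clock times)."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


# Impact window as stated in the 9:12 AM update.
impact = minutes_between("2:13 AM", "8:15 AM")
# Gap between the stated start of impact and the first public update.
first_post_lag = minutes_between("2:13 AM", "3:00 AM")
# Gap between the first public update and the first root-cause explanation.
root_cause_lag = minutes_between("3:00 AM", "4:52 AM")

print(f"Impact window: {impact // 60} h {impact % 60} min")
print(f"First update after: {first_post_lag} min")
print(f"Root cause explained after a further: {root_cause_lag} min")
```

By this reckoning the error rates were elevated for about six hours, and the first status update appeared roughly three quarters of an hour after the stated start.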
Another site, Downdetector.com, offers a bit more insight into the problems users reported, as well as the duration and severity of the outage. On this site, users can report “I have problems with AWS”; these reports are then added to a chart and timeline of the event.
While the fact that outages happen is not a surprise, it is still a bit surprising how far one failure somewhere can travel. One company, Fusion Interactive Group, was quick to publish a press release on Monday stating that its systems were not affected, thanks to some precautions. Sadly, the press release did not go into further detail on how the company had prepared for downtime of cloud services.
But the press release did contain some thinking that will presumably spread further:
“Systems need to be built with an understanding that the goal is to weather any storm, not to try to avoid every storm”, said a spokesperson from Fusion Interactive Group in the press release.
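One common way for a client to weather throttled or failing API calls, such as those described in the status updates above, is to retry with exponential backoff and random jitter. The sketch below is only an illustration of that general idea and not a description of what Fusion Interactive Group or AWS actually do; ThrottledError, call_with_backoff and all parameter values are hypothetical.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error standing in for a throttled or failed API call."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying on ThrottledError with backoff and jitter.

    The wait before each retry is drawn at random between zero and an
    exponentially growing cap, so many clients do not retry in lockstep.
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

A caller would wrap each remote request in call_with_backoff and fall back to a cache or a degraded mode if the final attempt still fails; the sleep parameter exists so that the waiting can be stubbed out in tests.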
Links:
http://status.aws.amazon.com/
https://downdetector.com/status/aws-amazon-web-services
http://www.theregister.co.uk/2015/09/20/aws_database_outage/



