Amazon’s cloud services outage traced to bad keystrokes

By Brian Fung

Washington Post

WASHINGTON — Amazon is back with an apology and an explanation for a high-profile malfunction that caused websites across the Internet to grind to a halt for hours on Tuesday.

The online retail giant, which runs a popular cloud computing platform for sites such as Airbnb, Netflix, Reddit, and Quora, is blaming the outage on a simple employee mistake.

A team member was doing maintenance on Amazon Web Services on Tuesday, trying to speed up the billing system, when he or she entered the wrong codes and inadvertently took more servers offline than the procedure called for, Amazon said in a statement Thursday. With a few mistaken keystrokes, the employee wound up knocking out systems that supported other systems the web services need to work properly.

The cascading failure meant that many websites could no longer make changes to the information stored on Amazon’s cloud platform. For everyday users, that meant being unable to load pages, transfer files, or take other actions on some of the sites they regularly use.

“In this instance, the tool used allowed too much capacity to be removed too quickly,” Amazon said. “We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.”
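For readers curious what such a safeguard might look like in practice, here is a minimal Python sketch of the two changes Amazon describes: removing capacity in small batches, and refusing any removal that would take a subsystem below its minimum. It is an illustration only, not Amazon's actual tool. The names `Subsystem`, `remove_capacity`, `min_required`, and `max_per_step` are all hypothetical.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal would breach a subsystem's minimum capacity."""


@dataclass
class Subsystem:
    # Hypothetical model of one subsystem; not Amazon's real data structure.
    name: str
    active_servers: int
    min_required: int  # assumed floor below which the subsystem cannot run


def remove_capacity(subsystem, count, max_per_step=2):
    """Take `count` servers offline in small batches.

    Refuses the whole request up front if it would drop the subsystem
    below its minimum. Returns the list of batch sizes removed.
    """
    if count < 0 or max_per_step < 1:
        raise ValueError("count must be >= 0 and max_per_step must be >= 1")

    # Safeguard: reject before touching anything.
    if subsystem.active_servers - count < subsystem.min_required:
        raise CapacityError(
            f"removing {count} servers would leave {subsystem.name} "
            f"below its minimum of {subsystem.min_required}"
        )

    batches = []
    remaining = count
    while remaining > 0:
        # Slow removal: never take more than max_per_step at once.
        step = min(max_per_step, remaining)
        subsystem.active_servers -= step
        remaining -= step
        batches.append(step)
        # A real tool would presumably pause and check health here.
    return batches
```

Under these assumptions, a mistyped request for far too many servers fails with an error before any server is touched, instead of cascading through dependent systems.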

Translation: Employees will no longer be able to unplug whole parts of the Internet by mistake.

Amazon said it was sorry for the outage’s effect on its customers and vowed to learn from the incident.