Sense API is down
Incident Report for Sense
Postmortem

Root Cause

During planned maintenance to remove an unused service instance, the AWS Auto Scaling Group (ASG) detected that our production messaging service instances no longer matched its desired configuration and terminated them. The loss of those instances caused upstream services that depend on messaging to fail in turn, resulting in a cascading failure.

Impact

Sense REST APIs were unavailable from approximately 4:24pm to 5:20pm EDT on March 22, disabling most client application functionality.

Because monitors buffer and retry data uploads, no data should have been lost during this period.
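The buffer-and-retry behavior described above can be sketched as follows. This is a minimal illustration, not the monitor firmware's actual code: the `flush_with_retry` function, its parameters, and the injected `upload` callable are all hypothetical. The key property is that a reading is only removed from the buffer after a successful upload, so an API outage delays data rather than losing it.

```python
import time
from collections import deque

def flush_with_retry(buffer, upload, max_attempts=5, base_delay=1.0):
    """Drain buffered readings, retrying failed uploads with exponential backoff.

    `buffer` is a deque of readings; `upload` is a callable that raises
    ConnectionError on failure. Items stay in the buffer until acknowledged.
    """
    while buffer:
        item = buffer[0]  # peek; only drop after a successful upload
        for attempt in range(max_attempts):
            try:
                upload(item)
                buffer.popleft()  # acknowledged: safe to discard locally
                break
            except ConnectionError:
                # Back off exponentially before the next attempt
                time.sleep(base_delay * 2 ** attempt)
        else:
            # Exhausted attempts: keep remaining data buffered for later
            return False
    return True
```

During an outage like this one, each flush returns `False` and the readings accumulate locally; once the API recovers, the next flush drains the backlog.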

Improvements

  1. Ensure that termination protection is enabled on critical instances.
  2. Improve the change-management process for ASG configuration changes.
  3. Improve resiliency of upstream services to unavailability of the messaging service.
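For improvement 1, AWS offers two distinct protections, and both are relevant here: ASG scale-in protection (which stops the ASG from selecting an instance for termination) and EC2 API termination protection. A sketch of enabling each with the AWS CLI, using placeholder instance and group names:

```
# Protect instances from ASG scale-in (placeholder IDs and group name)
aws autoscaling set-instance-protection \
    --instance-ids i-0123456789abcdef0 \
    --auto-scaling-group-name messaging-prod-asg \
    --protected-from-scale-in

# Separately, protect against accidental TerminateInstances API calls
aws ec2 modify-instance-attribute \
    --instance-id i-0123456789abcdef0 \
    --disable-api-termination
```

Note that EC2 API termination protection alone would not have prevented this incident, since the ASG terminates non-conforming instances through its own mechanism; scale-in protection is the relevant safeguard.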
Posted Mar 23, 2018 - 14:37 EDT

Resolved
All APIs are now back online. We will update with a more detailed incident report later.
Posted Mar 22, 2018 - 18:00 EDT
Identified
One of our service components is experiencing an outage. We are actively working on resolving it and will update when we have an ETA.
Posted Mar 22, 2018 - 17:03 EDT