Rack Authentication Postmortem
On Wednesday, August 16th and Thursday, August 17th, some users could not access their Racks from the CLI or through Console due to authorization problems. While there were no reports of application downtime during this time, this was a total control plane outage for some users.
We apologize for the inconvenience to affected users.
I’d like to explain what happened in more detail and describe some steps we will take to prevent it from happening again.
The biggest change we’d like to implement is the concept of ‘stable’ and ‘unstable’ release channels. See GitHub Issue #477, “Support Different Endpoints For convox rack update”, to participate in the design of this enhancement and to track progress.
Root Cause and Recovery
Recently we set out to make a security improvement to Rack to not display the API key in CloudFormation in plaintext, as reported in GitHub Issue #425.
Users who updated to this version and then modified Rack settings from the CLI with convox rack params set InstanceCount=4 or from the Console UI inadvertently updated the API key to the literal value ****. This invalidated API keys stored in ~/.convox and/or in Console, effectively taking the Rack API offline.
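One way to guard against this class of bug is to never send a displayed value back on an update. The sketch below is a minimal illustration, not Rack's actual code: it assumes the overwrite happened because a masked display value was resubmitted, and the function name build_update_parameters and the parameter name ApiKey are hypothetical. It builds a parameter list in the shape accepted by CloudFormation's UpdateStack call, sending new values only for parameters the user explicitly changed and marking everything else with UsePreviousValue.

```python
MASK = "****"  # placeholder shown to users instead of the real secret


def build_update_parameters(current_keys, changes):
    """Build a CloudFormation-style parameter list for a stack update.

    current_keys: names of every parameter on the existing stack.
    changes: mapping of the parameters the user explicitly set.
    """
    for key, value in changes.items():
        if key not in current_keys:
            raise KeyError(f"unknown parameter: {key}")
        if value == MASK:
            # A masked placeholder is never a legitimate new value.
            raise ValueError(f"refusing to write masked placeholder to {key}")

    params = []
    for key in sorted(current_keys):
        if key in changes:
            params.append({"ParameterKey": key, "ParameterValue": changes[key]})
        else:
            # Unchanged: ask CloudFormation to keep the value it already stores.
            params.append({"ParameterKey": key, "UsePreviousValue": True})
    return params


# Hypothetical stack with three parameters; only InstanceCount is changed.
params = build_update_parameters(
    current_keys={"ApiKey", "InstanceCount", "InstanceType"},
    changes={"InstanceCount": "4"},
)
```

With this approach the secret never travels through the update path unless the user sets it deliberately, so masking it for display cannot change what is stored.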
Once we understood the problem, we unpublished the version in question, notified unaffected users to roll back their Racks to the last good release, and worked with affected customers to restore connectivity.
Affected users had to perform a manual CloudFormation procedure to roll back to an earlier release and set a new API key, and then work directly with the Convox team to restore Console connectivity.
We recommend that everyone run a standard convox rack update to get the latest version.
Code refactors to simplify and test mission-critical paths like CloudFormation updates are already under way (see Pull Request 1084).
We are also looking at ways to improve integration testing and to implement a manual testing checklist around critical paths like API key management.
The clearest feedback from users is to move to a stable and unstable release management system. You can follow GitHub Issue #477 to track the design and implementation of this.
We will continue to improve how we:
- Use unit testing for fast assurance around how Rack interacts with AWS on every Pull Request
- Use integration testing to actually install, deploy, modify and uninstall Racks on every Release
- Use checklists to manually test paths that are not yet automated
- Offer a way to put staging racks on an unstable channel for maximum velocity
- Offer a way to put production racks on a stable channel for maximum reliability
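The channel idea in the last two items can be sketched as follows. The design is still open in GitHub Issue #477, so this is only an assumption about how it might work: the Release type, the latest_for_channel function, and the version strings are all made up for illustration. The stable channel only sees releases that have been promoted, while the unstable channel sees everything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    version: str  # hypothetical timestamp-style string; sorts lexicographically
    stable: bool  # True once the release has been promoted to stable


def latest_for_channel(releases, channel):
    """Return the newest version visible on the given channel, or None."""
    if channel not in ("stable", "unstable"):
        raise ValueError(f"unknown channel: {channel}")
    # Unstable sees every release; stable sees only promoted ones.
    candidates = [r for r in releases if r.stable or channel == "unstable"]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.version).version


# Made-up release history: the newest release has not been promoted yet.
releases = [
    Release("20170801000000", stable=True),
    Release("20170810000000", stable=True),
    Release("20170816000000", stable=False),
]
stable_target = latest_for_channel(releases, "stable")
unstable_target = latest_for_channel(releases, "unstable")
```

Under this scheme a production rack on the stable channel would not pick up a release until it had been promoted, while a staging rack on the unstable channel would receive it immediately.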