Software running in the cloud is unreliable. Failures are expected and they occur often for the following three reasons:
- The cloud uses commodity hardware
- Software running in the cloud is updated frequently
- Changes are not coordinated; multiple systems may be updated at the same time
These three challenges are amplified at scale as the number of instances and the number of dependencies increase.
The cloud is designed to be cost-effective. Using off-the-shelf components lowers cost. Instead of building redundancy into hardware, cloud providers rely on cheaper software redundancy. In the cloud, devices fail often and are replaced with new devices. This fail-and-replace cycle is a natural and expected part of running in the cloud.
Cloud software is updated frequently, driven by two needs. First is the desire to lower risk by pushing small updates early and often; large batch updates have a larger change set and carry a higher risk of failure. Second, updates need to be deployed uniformly to prevent version drift, and frequent deployments keep software current and minimize that drift.
Every running system has dependencies. A classic three-tier system has a datastore, an application layer, and a UI component. The cloud platform has no knowledge of these dependencies; it only understands fault domains at a service level. Updates can happen at any time without prior notice to the application owner. In addition, cloud updates may be initiated across multiple “tiers” and executed concurrently: in our three-tier example, both the datastore and the application layer may be updated at the same time.
At the level of a single Virtual Machine, updates queue behind each other, waiting for an opportunity to execute. Cloud-initiated changes such as an Operating System update will cause application updates to queue and wait. In Azure, a deployment will sometimes time out and need to be rolled back. These timeouts prevent new code from being deployed and are seen more often in large services with hundreds of Virtual Machines.
To ensure reliability, it is the responsibility of application developers to build their systems around these fault domains. The application developer needs to expect scenarios where part of the system is offline and unable to accept changes.
As the number of nodes and the number of interrelated systems increase, so does complexity. An update may cross multiple services, multiple datacenters, and multiple nodes. At scale, serial execution would take far too long; scale therefore drives the need to execute more tasks in parallel.
At scale it is very difficult to update all nodes at once, so developers need to expect v-previous and v-next versions of the software running side by side. This drives an explicit need for forward and backward compatibility.
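As a minimal sketch of what side-by-side compatibility can look like in practice (the field names and defaults below are hypothetical, not from the source): a handler supplies defaults for fields a v-previous sender omits, and ignores fields a v-next sender adds.

```python
# Sketch: one handler tolerating v-previous and v-next payloads.
# Field names and defaults are hypothetical illustrations.

KNOWN_FIELDS = {"user_id": None, "region": "us-east", "retries": 3}

def parse_request(payload: dict) -> dict:
    """Backward compatible: missing fields get defaults.
    Forward compatible: unknown fields from a newer sender are ignored."""
    return {field: payload.get(field, default)
            for field, default in KNOWN_FIELDS.items()}

old = parse_request({"user_id": "u1"})                    # v-previous sender
new = parse_request({"user_id": "u2", "trace_id": "t9"})  # v-next adds a field
```

The same idea applies whether the payload is JSON over HTTP or rows in a shared datastore: readers tolerate both shapes during the window when two versions run side by side.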
Increasingly, customers expect their software to be updated and improved frequently. Apps on iOS and Android update automatically without the user noticing. The most popular websites run daily experiments. These changes and updates require frequent deployments of code and configuration.
Done well, a good deployment system can be a differentiator:
- Rapid improvement cycle for customers
- Increased volume of changes to support multiple concurrent experiments
- Rapid response to vulnerabilities
- Effectively manage changes in government policy (e.g., user data retention)
Getting new features to users via deployment requires an inventory of the environment, an assessment of readiness, and a set of changes. The process needs to be robust enough to recover from failures. In addition, large scale creates the need to run the smallest possible changes in parallel, which requires coordination and orchestration.
Idempotent: ability to repeat deployment steps and get the same outcome regardless of environment
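A sketch of what an idempotent step can look like (the "ensure this config file has this content" step below is a hypothetical example): running it once or many times converges to the same outcome, regardless of the starting state.

```python
# Sketch of an idempotent deployment step (hypothetical example):
# "ensure this config file has this content" converges to the same
# state no matter how many times it runs or what was there before.
import tempfile
from pathlib import Path

def ensure_config(path: Path, content: str) -> bool:
    """Return True if a change was made, False if already converged."""
    if path.exists() and path.read_text() == content:
        return False          # already in the desired state; do nothing
    path.write_text(content)  # converge to the desired state
    return True

cfg = Path(tempfile.mkdtemp()) / "app.conf"
first = ensure_config(cfg, "port=8080\n")   # applies the change
second = ensure_config(cfg, "port=8080\n")  # repeat: no-op, same outcome
```

Because the step checks desired state rather than assuming a starting point, a failed or interrupted deployment can simply be rerun.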
Canary: ability to roll out a change to a defined portion of the cloud service and hold that version indefinitely
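One common way to pick a stable canary group (the hash-based rule below is a hypothetical illustration, not a prescribed mechanism) is to select nodes deterministically, so the same portion of the fleet stays on the new version for as long as the canary is held.

```python
# Sketch (hypothetical selection rule): a canary picks a fixed,
# deterministic portion of nodes and holds them on the new version.
import hashlib

def in_canary(node_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of nodes in the canary."""
    digest = hashlib.sha256(node_id.encode()).hexdigest()
    return int(digest, 16) % 100 < percent

canary = [n for n in ("node-%d" % i for i in range(100))
          if in_canary(n, 10)]
# The canary group is stable across runs, so it can be held on the new
# version indefinitely while the rest of the fleet stays on v-previous.
```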
Parallel execution: understanding of fault domains and the ability to deploy concurrently across different systems and different data centers while keeping enough capacity to run the system as a whole.
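A sketch of the capacity constraint behind this (the wave-planning scheme and fleet layout below are hypothetical): group fault domains into waves so that each wave can be updated concurrently without taking more than an allowed fraction of the fleet out of rotation.

```python
# Sketch (hypothetical scheduler): deploy fault domains in waves,
# never taking more than max_offline_fraction of the fleet offline.

def plan_waves(nodes_by_domain: dict, max_offline_fraction: float) -> list:
    """Group fault domains into waves; each wave's node count stays
    within the allowed offline fraction of the whole fleet."""
    total = sum(len(nodes) for nodes in nodes_by_domain.values())
    budget = max(1, int(total * max_offline_fraction))
    waves, current, count = [], [], 0
    for domain, nodes in nodes_by_domain.items():
        if current and count + len(nodes) > budget:
            waves.append(current)      # wave is full; start a new one
            current, count = [], 0
        current.append(domain)
        count += len(nodes)
    if current:
        waves.append(current)
    return waves

fleet = {"dc1-fd0": ["a", "b"], "dc1-fd1": ["c", "d"],
         "dc2-fd0": ["e", "f"], "dc2-fd1": ["g", "h"]}
waves = plan_waves(fleet, max_offline_fraction=0.5)
# All domains within a wave can then be updated concurrently while the
# remaining waves keep serving live traffic.
```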
Rollback: declarative command to get back to the last known good state. Should not have to manage dependencies or specify an explicit version.
Catch up: ability to update to v-current from an older version, either through a complete rebuild or by applying successive versions. Needed when old VMs come back into rotation. Ideally issued as a declarative command.
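A sketch of the successive-versions path (the migration table and version numbers below are hypothetical): an old VM returning to rotation walks an ordered chain of upgrades until it reaches v-current.

```python
# Sketch (hypothetical migration table): catching an old node up to
# v-current by applying successive version upgrades in order.
MIGRATIONS = {1: "upgrade_1_to_2", 2: "upgrade_2_to_3", 3: "upgrade_3_to_4"}
V_CURRENT = 4

def catch_up(node_version: int) -> list:
    """Return the ordered upgrade steps needed to reach V_CURRENT."""
    if node_version > V_CURRENT:
        raise ValueError("node is newer than v-current")
    return [MIGRATIONS[v] for v in range(node_version, V_CURRENT)]

steps = catch_up(1)  # an old VM back in rotation: three successive steps
none = catch_up(4)   # already current: nothing to do
```

The declarative shape matters: the operator asks for "v-current", and the system computes whatever chain of steps (or a full rebuild) gets the node there.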
Skip: ability to deploy software while skipping ineligible nodes. Nodes are often ineligible because system updates are taking place; ineligible nodes should be removed from taking live traffic.
Chained state transition: ability to progress to an outcome through a series of idempotent state changes. A failed run can be picked up from the last successful state transition. This minimizes the change set, which in turn decreases risk and speeds up deployments.
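A sketch of such a chain (the step names and checkpointing scheme below are hypothetical): progress is recorded after each idempotent step, so a retry resumes from the last successful transition instead of starting over.

```python
# Sketch (hypothetical steps): a chained state transition. Progress is
# checkpointed after each idempotent step, so a retry resumes from the
# last successful state instead of starting over.
STEPS = ["drain_traffic", "stop_service", "install_bits",
         "start_service", "restore_traffic"]

def run_chain(checkpoint: int, apply_step) -> int:
    """Run remaining steps from `checkpoint`; return the new checkpoint.
    On failure, the returned checkpoint marks where to resume."""
    for i in range(checkpoint, len(STEPS)):
        try:
            apply_step(STEPS[i])
        except Exception:
            return i          # resume here on the next attempt
        checkpoint = i + 1
    return checkpoint

# First attempt hits a transient failure at "install_bits";
# the retry picks up exactly there rather than re-running earlier steps.
fails_once = {"count": 0}
def flaky(step):
    if step == "install_bits" and fails_once["count"] == 0:
        fails_once["count"] += 1
        raise RuntimeError("transient failure")

cp = run_chain(0, flaky)   # stops at the failed step
cp = run_chain(cp, flaky)  # resumes and completes the chain
```

Because each step is idempotent, resuming at the checkpoint is safe even if the failed step partially executed.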
Transparency and visibility: a central repository to see environment data, work in progress, and current state. Scripts used to automate and orchestrate changes should be easily accessible and shared.