Reliable Systems

Cards (30)

    • Availability is the percentage of time a system is running and able to process requests. To achieve high availability, monitoring is vital. Health checks can detect when an application reports that it is healthy. More detailed monitoring of services, using clear box metrics to count traffic successes and failures, will help predict problems. Building in fault tolerance, for example by removing single points of failure, is also vital for improving availability, and backup systems play a key role as well.
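A minimal sketch of turning clear box success/failure counts into an availability figure; the counter values and the 99.9% target are illustrative assumptions, not values from the cards.

```python
# Hedged sketch: compute an availability percentage from success/failure counts.
def availability(successes: int, failures: int) -> float:
    """Fraction of requests served successfully over a measurement window."""
    total = successes + failures
    return 1.0 if total == 0 else successes / total

# Illustrative counts, e.g. scraped from a monitoring system (assumed values).
sli = availability(successes=999_523, failures=477)
print(f"availability: {sli:.4%}")           # -> availability: 99.9523%
print("meets 99.9% target:", sli >= 0.999)  # -> True
```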
    • Durability is a measure of how likely data is to be lost because of a hardware or system failure; a highly durable system makes that loss extremely unlikely. Ensuring that data is preserved and available is a mixture of replication and backup. Data can be replicated in multiple zones, and regular restores from backup should be performed to confirm that the process works as expected.
    • Scalability is the ability of a system to continue to work as user load and data grow. Monitoring and autoscaling should be used to respond to variations in load. The metrics for scaling can be the standard metrics like CPU or memory, or you can create custom metrics like “number of players on a game server.”
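As a sketch of scaling on a custom metric like "number of players on a game server": the target of 100 players per server and the minimum/maximum bounds below are assumptions for illustration, not settings from any particular autoscaler.

```python
import math

# Hedged sketch: decide how many instances an autoscaler should target when
# driven by a custom metric (players per game server). All thresholds are assumed.
def desired_instances(players: int,
                      target_players_per_server: int = 100,
                      min_instances: int = 2,
                      max_instances: int = 20) -> int:
    needed = math.ceil(players / target_players_per_server)
    return max(min_instances, min(max_instances, needed))

print(desired_instances(players=750))   # -> 8 servers for 750 players
```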
  • Avoid single points of failure by replicating data and creating multiple virtual machine instances.
    • It is important to define your unit of deployment and understand its capabilities. To avoid single points of failure, you should deploy two extra instances, or N + 2, to handle both failure and upgrades. These deployments should ideally be in different zones to mitigate zonal failures.
  • Avoid single points of failure:
    • Let me explain the upgrade consideration: consider 3 VMs that are load balanced to achieve N+2. If one is being upgraded and another fails, 50% of the available compute capacity is removed, which potentially doubles the load on the remaining instance and increases the chance of it failing too. This is where capacity planning and knowing the capability of your deployment unit are important. Also, for ease of scaling, it is a good practice to make the deployment units interchangeable stateless clones.
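A quick sketch of the N + 2 arithmetic: size the group for peak load, then add one instance for a failure and one for an upgrade. The per-instance capacity and peak load figures are assumptions for illustration.

```python
import math

# Hedged sketch of N + 2 capacity planning: N instances cover peak load,
# plus one spare for failure and one for upgrades.
def n_plus_2_size(peak_qps: float, qps_per_instance: float) -> int:
    n = math.ceil(peak_qps / qps_per_instance)   # instances needed at peak
    return n + 2                                 # headroom for failure + upgrade

print(n_plus_2_size(peak_qps=1800, qps_per_instance=1000))  # -> 4 instances
```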
  • Avoid Single Points of Failure:
  • It is also important to be aware of correlated failures.
    • These occur when related items fail at the same time. At the simplest level, if a single machine fails, all requests served by that machine fail. At a hardware level, if a top-of-rack switch fails, the complete rack fails. At the cloud level, if a zone or region is lost, all of its resources are unavailable. Servers running the same software suffer from the same issue: if there is a fault in the software, the servers may fail at around the same time.
  • Correlated failures can also apply to configuration data. If a global configuration system fails, and multiple systems depend on it, they potentially fail too. When we have a group of related items that could fail together, we refer to it as a failure or fault domain.
  • Avoid Correlated Failures:
    • It is useful to be aware of failure domains; then servers can be decoupled using microservices distributed among multiple failure domains. To achieve this, you can divide business logic into services based on failure domains and deploy to multiple zones and/or regions.
  • Avoid Correlated Failures:
    • At a finer level of granularity, it is good to split responsibilities into components and spread these over multiple processes. This way a failure in one component will not affect other components. If all responsibilities are in one component, a failure of one responsibility has a high likelihood of causing all responsibilities to fail.
  • Avoid Correlated Failures:
    • When you design microservices, your design should result in loosely coupled, independent but collaborating services. A failure in one service should not cause a failure in another service. It may cause a collaborating service to have reduced capacity or not be able to fully process its workflows, but the collaborating service remains in control and does not fail.
  • Cascading failures
    • Occur when one system fails, causing others to be overloaded and subsequently fail. For example, a message queue could be overloaded because a backend fails and it cannot process messages placed on the queue.
  • Cascading Failures:
    • The graphic on the left shows a Cloud Load Balancer distributing load across two backend servers. Each server can handle a maximum of 1000 queries per second. The load balancer is currently sending 600 queries per second to each instance. If server B now fails, all 1200 queries per second have to be sent to just server A, as shown on the right. This is much higher than the specified maximum and could lead to cascading failure.
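The arithmetic from this example as a small sketch; the 1000 queries per second capacity and 1200 queries per second of total traffic are the numbers from the card.

```python
# Hedged sketch of the load-shift arithmetic when a backend fails behind a
# load balancer: the surviving servers absorb the failed server's traffic.
def qps_per_survivor(total_qps: float, servers: int, failed: int) -> float:
    return total_qps / (servers - failed)

CAPACITY = 1000                                   # max qps each server can handle
load = qps_per_survivor(total_qps=1200, servers=2, failed=1)
print(f"{load:.0f} qps on the remaining server")  # -> 1200 qps
print("overloaded:", load > CAPACITY)             # -> True: risk of cascading failure
```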
  • Avoid Cascading Failures:
    • Cascading failures can be handled with support from the deployment platform. For example, you can use health checks in Compute Engine or readiness and liveness probes in GKE to enable the detection and repair of unhealthy instances. You want to ensure that new instances start fast and ideally do not rely on other backends/systems to start up before they are ready.
  • Avoid Cascading Failures:
    • The graphic on this slide illustrates a deployment with four servers behind a load balancer. Based on the current traffic, a server failure can be absorbed by the remaining three servers, as shown on the right-hand side. If the system uses Compute Engine with instance groups and autohealing, the failed server would be replaced with a new instance. As I just mentioned, it’s important for that new server to start up quickly so full capacity is restored as soon as possible. Also, this setup only works for stateless services.
  • Query of Death Overload:
    • You also want to plan against the query of death, where a request made to a service causes a failure in the service. It is called the query of death because the error manifests itself as overconsumption of resources, but in reality it is due to an error in the business logic itself. This can be difficult to diagnose and requires good monitoring, observability, and logging to determine the root cause of the problem. When the requests are made, latency, resource utilization, and error rates should be monitored to help identify the problem.
    • You should also plan against positive feedback cycle overload failure, where a problem is caused by trying to prevent problems! This happens when you try to make the system more reliable by adding retries in the event of a failure. Instead of fixing the failure, this creates the potential for overload. You may actually be adding more load to an already overloaded system. The solution is intelligent retries that make use of feedback from the service that is failing. Let me discuss two strategies to address this.
  • If a service fails, it is ok to try again. However, this must be done in a controlled manner. One way is to use the exponential backoff pattern. This performs a retry, but not immediately. You should wait between retry attempts, waiting a little longer each time a request fails, therefore giving the failing service time to recover. The number of retries should be limited to a maximum, and the length of time before giving up should also be limited.
  • Exponential Backoff:
    • As an example, consider a failed request to a service. Using exponential backoff, we may wait 1 second plus a random number of milliseconds and try again. If the request fails again, we wait 2 seconds plus a random number of milliseconds and try again. Fail again, then wait 4 seconds plus a random number of milliseconds before retrying, and continue until a maximum limit is reached.
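A minimal sketch of that retry loop with exponential backoff and jitter; the callable being retried, the five-attempt limit, and the 30-second cap are illustrative assumptions.

```python
import random
import time

# Hedged sketch: retry an operation with exponential backoff plus random jitter.
def call_with_backoff(operation, max_attempts: int = 5, max_wait_s: float = 30.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                               # give up after the final attempt
            wait = min(2 ** attempt, max_wait_s)    # 1 s, 2 s, 4 s, ... capped
            wait += random.uniform(0, 1)            # jitter so clients do not retry in lockstep
            time.sleep(wait)
```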
  • Circuit Breaker:
    • The circuit breaker pattern can also protect a service from too many retries. The pattern implements a solution for when a service is in a degraded state of operation. It is important because if a service is down or overloaded and all its clients are retrying, the extra requests actually make matters worse. The circuit breaker design pattern protects the service behind a proxy that monitors the service health. If the service is not deemed healthy by the circuit breaker, it will not forward requests to the service.
  • Circuit Breaker:
    • When the service becomes operational again, the circuit breaker will begin feeding requests to it again in a controlled manner.
    • If you are using GKE, the Istio service mesh automatically implements circuit breakers.
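This is a bare-bones illustration of the circuit breaker idea in Python, not how Istio implements it; the failure threshold, cooldown, and half-open behaviour are simplifying assumptions.

```python
import time

# Hedged sketch of a circuit breaker proxy: stop calling a failing service,
# then allow a trial request after a cooldown (half-open state).
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None                 # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: request rejected without calling the service")
            self.opened_at = None             # cooldown over: allow one trial request
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                     # success closes the circuit again
        return result
```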
  • Lazy deletion
    • Is a method that builds in the ability to reliably recover data when a user deletes the data by mistake. With lazy deletion, a deletion pipeline similar to that shown in this graphic is initiated, and the deletion progresses in phases. The first stage is that the user deletes the data but it can be restored within a predefined time period; in this example it is 30 days. This protects against mistakes by the user.
  • Lazy Deletion:
    • When the predefined period is over, the data is no longer visible to the user but moves to the soft deletion phase. Here the data can be restored by user support or administrators. This phase protects against mistakes in the application. After the soft deletion period of 15, 30, 45, or even 50 days, the data is deleted and is no longer available; the only way to restore it is from whatever backups or archives were made of the data.
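A small sketch of that timeline: data moves from user-recoverable to soft-deleted to permanently deleted. The 30-day and 45-day windows are example values consistent with these cards, not fixed product behaviour.

```python
from datetime import datetime, timedelta

# Hedged sketch of lazy deletion phases based on how long ago the user deleted the data.
USER_RECOVERY = timedelta(days=30)    # user can still restore ("trash")
SOFT_DELETE = timedelta(days=45)      # user support/administrators can still restore

def deletion_phase(deleted_at: datetime, now: datetime) -> str:
    age = now - deleted_at
    if age < USER_RECOVERY:
        return "user-recoverable"
    if age < USER_RECOVERY + SOFT_DELETE:
        return "soft-deleted"
    return "permanently deleted (restore only from backups/archives)"

print(deletion_phase(datetime(2024, 1, 1), datetime(2024, 2, 10)))  # -> soft-deleted
```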
    • I want the web UI to be highly available, so here I depict it as being deployed behind a global HTTP Load Balancer across multiple regions and multiple zones within each region.
    • I deploy the accounts and products services as backends to just the us-central1 region, but I’m using multiple zones (us-central1-a and us-central1-b) for high availability. I even have a failover Cloud SQL database. The Firestore database for the products service is multi-regional, so I don’t need to worry about a failover.
  • High availability:
    • Can be achieved by deploying to multiple zones in a region. When using Compute Engine, for higher availability you can use a regional instance group, which provides built-in functionality to keep instances running. Use autohealing with an application health check and load balancing to distribute load.
    • For data, the storage solution selected will affect what is needed to achieve high availability.
  • High Availability:
    • For Cloud SQL, the database can be configured for high availability, which provides data redundancy and a standby instance of the database server in another zone. This diagram shows a high availability configuration with a regional managed instance group for a web application that’s behind a load balancer. The master Cloud SQL instance is in us-central1-a, with a replica instance in us-central1-f.
    • Some data services such as Firestore or Spanner provide high availability by default.
  • Regional managed instance groups distribute VMs across zones. You can choose between single-zone and multiple-zone (or regional) configurations when creating your instance group, as you can see in this screenshot.
  • Google Kubernetes Engine:
    • Google Kubernetes Engine clusters can also be deployed to either a single or multiple zones, as shown in this screenshot. A cluster consists of a master controller and collections of node pools.
    • Regional clusters increase the availability of both a cluster's master and its nodes by replicating them across multiple zones of a region.
  • If you are using instance groups for your service, you should create a health check to enable autohealing.
    • The health check is a test endpoint in your service. It should indicate that your service is available and ready to accept requests, and not just that the server is running. A challenge with creating a good health check endpoint is that if you use other backend services, you need to check that they are available in order to give positive confirmation that your service is ready to run. If the services it depends on are not available, your service should report that it is not ready.
  • If a health check fails, the instance group will remove the failing instance and create a new one. Health checks can also be used by load balancers to determine which instances to send requests to.
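A minimal sketch of such a health check endpoint using only the Python standard library: it reports ready only when an assumed downstream dependency answers. The /health path, port, and backend URL are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import urllib.request

BACKEND_URL = "http://backend.internal/health"   # hypothetical downstream dependency

def backend_is_healthy() -> bool:
    """Positive confirmation that the dependency this service needs is reachable."""
    try:
        with urllib.request.urlopen(BACKEND_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health" and backend_is_healthy():
            self.send_response(200)   # ready: keep this instance in rotation
        else:
            self.send_response(503)   # not ready: load balancer / autohealing reacts
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```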