Maintenance & Monitoring

Cards (28)

  • A key benefit of a microservice architecture is the ability to deploy microservices independently. This means that the service API has to be protected. Versioning is required, and when new versions are deployed, care must be taken to ensure backward compatibility with the previous version. Some simple design rules can help, such as indicating the version in the URI and making sure you change the version when you make a backward-incompatible change. Deploying new versions of software always carries risk, so we want to make sure we test new versions effectively before going live.
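    • The version-in-the-URI rule can be sketched like this (the host and paths are hypothetical, for illustration only):

```shell
# Hypothetical endpoints: the major version lives in the URI, so a
# backward-incompatible change ships under a new version path while
# the old path keeps serving existing clients unchanged.
curl https://api.example.com/v1/orders/42   # existing clients, old schema
curl https://api.example.com/v2/orders/42   # new, incompatible schema
```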
  • Rolling Updates
    • Allow you to deploy new versions with no downtime. The typical configuration is to have multiple instances of a service behind a load balancer. A rolling update will then update one instance at a time. This strategy works fine if the API is unchanged or backward compatible, or if it is acceptable to have two versions of the same service running during the update.
  • Rolling Updates
    • If you are using instance groups, rolling updates are a built-in feature. You just define the rolling update strategy when you perform the update. For Kubernetes, rolling updates are the default; you just need to specify the replacement Docker image. Finally, for App Engine, rolling updates are completely automated.
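    • As a sketch (all resource names are placeholders), the two command-line variants look like this:

```shell
# Managed instance group: start a rolling update that replaces
# instances one at a time with a new instance template.
gcloud compute instance-groups managed rolling-action start-update my-mig \
  --version=template=my-template-v2 \
  --max-unavailable=1 \
  --zone=us-central1-a

# Kubernetes: changing a Deployment's image triggers its default
# RollingUpdate strategy automatically.
kubectl set image deployment/my-app my-container=gcr.io/my-project/my-app:v2
```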
  • Blue/Green Deployments:
    • Use a blue/green deployment when you don’t want multiple versions of a service to run simultaneously. Blue/green deployments use two full deployment environments. The blue deployment is running the current deployed production software, while the green deployment environment is available for deploying updated versions of the software.
  • Blue/Green Deployments:
    • When you want to test a new software version, you deploy it to the green environment. Once testing is complete, the workload is shifted from the current (blue) to the new (green) environment. This strategy mitigates the risk of a bad deployment by allowing the switch back to a previous deployment if something goes wrong.
  • Blue/Green Deployments:
    • For Compute Engine, you can use DNS to migrate requests, while in Kubernetes you can configure your service to route to new pods using labels, which is just a simple configuration change. App Engine allows you to split traffic between versions.
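    • A minimal Kubernetes sketch, assuming the blue and green Deployments are distinguished by a hypothetical "track" label: repointing the Service selector flips traffic to green, and flipping it back rolls the release back.

```shell
# Blue/green switch: change which pods the Service selects.
# Labels and names here are assumptions for illustration.
kubectl patch service my-service \
  -p '{"spec":{"selector":{"app":"my-app","track":"green"}}}'
```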
  • Canary Releases:
    • You can use canary releases prior to a rolling update to reduce risk. With a canary release, you make a new deployment with the current deployment still running. Then you send a small percentage of traffic to the new deployment and monitor it. Once you have confidence in your new deployment, you can route more traffic to the new deployment until 100% is routed this way.
  • Canary Releases:
    • In Compute Engine, you can create a new instance group and add it to the load balancer as an additional backend.
    • In Kubernetes, you can create a new pod with the same labels as the existing pods. The service will automatically divert a portion of the requests to the new pod.
    • In App Engine, you can again use the traffic splitting feature to drive a portion of traffic to the new version.
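    • The App Engine case can be sketched with placeholder service and version IDs:

```shell
# Canary: send 10% of requests to the new version, monitor it.
gcloud app services set-traffic my-service --splits=v1=0.9,v2=0.1

# Once confident, route all traffic to the new version.
gcloud app services set-traffic my-service --splits=v2=1
```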
  • Capacity Planning
    • Start with a forecast that estimates the capacity needed. Monitor and review this forecast. Then allocate by determining the resources required to meet the forecasted capacity. This allows you to estimate costs and balance them against risks and rewards. Once the design and costs are approved, deploy your design and monitor it to see how accurate your forecasts were. This feeds into the next forecast as the process repeats.
    • Treat capacity planning not as a one off task, but as a continuous, iterative cycle, as illustrated on this slide.
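    • The forecast-then-allocate step reduces to simple arithmetic. A minimal sketch, with all numbers assumed purely for illustration:

```shell
# Assumed inputs: current peak load, expected growth for the next
# period, per-instance capacity, and a headroom factor for safety.
current_qps=1200
growth_factor="1.25"
per_instance_qps=150
headroom="1.3"

# Instances to allocate = ceil(peak * growth * headroom / per-instance capacity)
needed=$(awk -v q="$current_qps" -v g="$growth_factor" \
             -v c="$per_instance_qps" -v h="$headroom" \
  'BEGIN { x = q * g * h / c; n = int(x); if (n < x) n++; print n }')
echo "$needed"   # 13 with the numbers above
```

The result feeds the cost estimate in the next iteration of the cycle.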
  • Compute Costs:
    • A good starting point for anybody working on cost optimization is to become familiar with the VM instance pricing. It is often beneficial to start with a couple of small machines that can scale out through autoscaling as demand grows. To optimize the cost of your virtual machines, consider using committed use discounts, as these can be significant. Also, if your workloads allow for preemptible instances, you can save up to 80% and use autohealing to recover when instances are preempted.
  • Compute Costs:
    • Compute Engine also provides sizing recommendations for your VM instances, as shown on the right. This is a really useful feature that can help you select the right size of VM for your workloads and optimize costs.
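    • The "up to 80%" preemptible saving is easy to sanity-check against an assumed list price (the dollar figure below is illustrative, not a real rate):

```shell
# Assumed list price; 80 is the maximum saving percentage cited above.
on_demand_monthly=100        # USD/month, illustrative only
max_preemptible_discount=80  # percent

preemptible_monthly=$(( on_demand_monthly * (100 - max_preemptible_discount) / 100 ))
echo "$preemptible_monthly"  # 20
```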
  • Optimizing Disk Costs:
    • A common mistake is to over-allocate disk space. This is not cost-efficient, but selecting a disk is not just about size. It is important to determine the performance characteristics your applications require: what are the I/O patterns? Do you have large reads and small writes, or vice versa? Is the data mainly read-only? This type of information will help you select the correct type of disk. As the table shows, SSD persistent disks are significantly more expensive than standard persistent disks. Understanding your I/O patterns can provide significant savings.
  • Networking Costs:
    • To optimize network costs, it is best practice to keep machines as close as possible to the data they need to access. This graphic shows the different types of egress: within the same zone, between zones in the same region, intercontinental egress, and internet egress. It is important to be aware of the egress charges. These are not all straightforward. Egress in the same zone is free. Egress to a different Google Cloud service within the same region using an external IP address or an internal IP address is free, except for some services such as Memorystore for Redis.
  • Networking Costs:
    • Egress between zones in the same region is charged and all internet egress is charged. One way to optimize your network costs is to keep your machines close to your data.
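    • A rough estimate makes the point concrete. The per-GiB rate below is an assumption for illustration; check current Google Cloud network pricing for real figures.

```shell
# Illustrative inter-zone egress estimate (rate is assumed, not current pricing).
monthly_gib=500
rate_cents_per_gib=1   # assumed $0.01/GiB between zones in one region

cost_cents=$(( monthly_gib * rate_cents_per_gib ))
echo "$cost_cents"   # 500 cents/month; the same traffic within one zone would be 0
```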
  • GKE Usage Metering:
    • Another way to optimize cost is to leverage GKE usage metering, which can prevent over-provisioning your Kubernetes clusters.
    • With GKE usage metering, an agent collects consumption metrics in addition to the resource requests by polling PodMetrics objects from the metrics server. The resource request records and resource consumption records are exported to two separate tables in a BigQuery dataset that you specify. Comparing requested with consumed resources makes it easy to spot waste and take corrective measures.
  • GKE Usage Metering:
    • This graphic shows a typical configuration where BigQuery is used for request-based metrics collected from the usage metering agent and, together with data obtained from billing export, it is analyzed in a Data Studio dashboard.
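    • A sketch of the comparison query, assuming usage metering was configured to export to a BigQuery dataset; the project, dataset, and table names below are assumptions and the exported schema should be checked against your own dataset:

```shell
# Sum requested CPU per namespace from the resource-request table, so it
# can be compared against the corresponding consumption table.
bq query --use_legacy_sql=false '
  SELECT namespace, SUM(usage.amount) AS requested_cpu_seconds
  FROM `my-project.gke_usage.gke_cluster_resource_usage`
  WHERE resource_name = "cpu"
  GROUP BY namespace
  ORDER BY requested_cpu_seconds DESC'
```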
  • It’s important to compare the costs of the different options as well as their characteristics. In other words, your storage and database service choice can make a significant difference to your bill.
  • Your architectural design can also help you optimize your costs.
    • For example, if you use Cloud CDN for static content or Memorystore as a cache, you can save money instead of allocating more resources. Similarly, instead of using a datastore to pass data between two applications, consider messaging/queuing with Pub/Sub to decouple communicating services and reduce storage needs.
  • Pricing Calculator
    • The pricing calculator should be your go-to resource for estimating costs. Your estimates should be based on your forecasting and capacity planning. The tool is great for comparing costs of different compute and storage services, and you will use it in the upcoming design activity.
  • Billing Reports:
    • To monitor the costs of your existing service, leverage the Cloud Billing Reports page as shown here. This report shows the changes in costs compared to the previous month, and you can use the filters to search for particular projects, products, and regions, as shown on the right. The sizing recommendations for your Compute Engine instances will also be in this report.
  • Google Data Studio
    • You can even visualize spend over time with Google Data Studio, which turns your data into informative dashboards and reports that are easy to read, easy to share, and fully customizable. The service data is displayed in a daily and monthly view, providing at-a-glance summaries that can also be drilled into for greater insights.
  • Budgets:
    • To help with project planning and controlling costs, you can set a budget. Setting a budget lets you track how your spend is growing toward that amount. This screenshot shows the budget creation interface:
    1. Set a budget name and specify which project this budget applies to.
    2. Set the budget at a specific amount or match it to the previous month's spend.
    3. Set the budget alerts. These alerts send emails to Billing Admins after spend exceeds a percent of the budget or a specified amount.
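    • The same three steps can be sketched from the CLI. The billing account and project IDs below are placeholders; the threshold rules correspond to the percentage alerts described above.

```shell
gcloud billing budgets create \
  --billing-account=000000-AAAAAA-BBBBBB \
  --display-name="my-project-budget" \
  --filter-projects=projects/my-project \
  --budget-amount=1000USD \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9 \
  --threshold-rule=percent=1.0
```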
  • Budgets:
    • In our case, it would send an email when spending reaches 50%, 90%, and 100% of the budget amount. You can even choose to send an alert when the spend is forecasted to exceed a percentage of the budget amount by the end of the budget period.
    • In addition to receiving an email, you can use Pub/Sub notifications to programmatically receive spend updates about this budget. You could even create a Cloud Function that listens to the Pub/Sub topic to automate cost management.
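    • Connecting a budget to Pub/Sub can be sketched like this (the budget ID, billing account, and topic are placeholders):

```shell
# Attach a Pub/Sub topic so spend updates for this budget can be
# consumed programmatically, e.g. by a Cloud Function subscriber.
gcloud billing budgets update BUDGET_ID \
  --billing-account=000000-AAAAAA-BBBBBB \
  --notifications-rule-pubsub-topic=projects/my-project/topics/budget-alerts
```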
  • Monitoring Dashboards:
    • Dashboards are one way for you to view and analyze metric data that is important to you. This includes your SLIs to ensure that you are meeting your SLAs. The Monitoring page of the Cloud Console automatically provides predefined dashboards for the resources and services that you use. It is important that you monitor the things you pay for to determine trends, bottlenecks, and potential cost savings.
  • Monitoring Dashboards:
    • Here is an example of some charts in a Monitoring dashboard. On the left you can see the CPU usage for different Compute Engine instances, and on the right is the ingress traffic for those instances. Charts like these provide valuable insights into usage patterns.
  • To help you get started, Cloud Monitoring creates default dashboards for your project resources, as shown in this screenshot. You can also create custom dashboards, which you can explore.
  • Uptime Checks:
    • It’s a good idea to monitor latency, because it can quickly highlight when problems are about to occur. As shown on this slide, you can easily create uptime checks to monitor the availability and latency of your services. So far there is 100% uptime with no outages. Latency is actually one of the four golden signals called out in Google’s Site Reliability Engineering book.
  • Alerts:
    • Your SLO will be stricter than your SLA, so it is important to be alerted when you are not meeting an SLO, because it’s an early warning that the SLA is under threat. Here is an example of what creating an alerting policy looks like. On the left, you can see an HTTP check condition on the summer01 instance. This will send an email that is customized with the content of the documentation section on the right.