GCP Observability

  • Role of Monitoring
    • Definition: Monitoring involves collecting, processing, aggregating, and displaying real-time quantitative data about a system.
    • Key Aspects: Query counts, error types, processing times, server lifetimes.
  • Product Reliability
    • Deployment: Even the best products need deployment into environments with sufficient capacity.
    • Testing: Thorough testing, automated testing, continuous integration, and continuous development are crucial.
    • Post Mortems: Essential for letting clients know why incidents happened and why they're unlikely to recur.
  • Four Golden Signals
    • Latency: Measures how long a part of a system takes to return a result, impacting user experience.
    • Traffic: Indicates current system demand and is used for capacity planning.
    • Saturation: Focuses on how full the service is, an indicator of degrading performance.
    • Errors: Measure system failures or issues, indicating configuration or capacity problems.
  • Observability Concept
    • Signals: Metric, logging, and trace data captured and integrated into Google products.
    • Flow: Signal data flows into Google Cloud Operations tools for visualization and analysis.
    • Tools: Cloud Monitoring, Cloud Logging, Error Reporting, Cloud Trace, and Cloud Profiler are key for operations roles.
  • Four Recurring User Needs for Observability
    1. Visibility into System Health
    • Clear environmental model for application workings on Google Cloud.
    • Reports on overall health, answering questions like system functionality and resource availability.
    2. Error Reporting and Alerting
    • Proactive alerting, anomaly detection, and guidance on issues.
    • Avoidance of manual dot connecting; services should provide meaningful direction.
  • Four Recurring User Needs for Observability
    3. Efficient Troubleshooting
    • System should proactively correlate relevant systems.
    • Easy search across different data sources like logs and metrics.
    • Opinions on potential causes and recommendations for investigation.
    4. Improved Performance
    • Retrospective analysis for intelligent planning.
    • Analysis of trends and understanding how changes affect system performance.
  • Introduction to Cloud Monitoring
    • Objective: Uncover the intricacies of Google Cloud Monitoring for a comprehensive understanding.
    • Context: Cloud Monitoring is pivotal for gaining insights into the health, performance, and uptime of applications powered by Google Cloud.
  • Data Collection
    • Comprehensive Scope:
    • Gathers data from diverse sources, including projects, logs, services, systems, agents, custom code, and common application components like Cassandra, Nginx, Apache Web Server, and Elasticsearch.
    • Data Types:
    • Metrics, events, and metadata are collected, providing a holistic view of the operational landscape.
  • Cloud Monitoring: Data Processing and Insights Generation
    • Ingestion Process:
    • Ingests the collected data and transforms it into actionable insights.
    • Output Channels:
    • Presents insights through visually appealing dashboards, Metrics Explorer charts, and automated alerts for proactive issue resolution.
  • Cloud Monitoring: Advanced Capabilities
    • Automatic Free Ingestion:
    • Covers a broad spectrum of 100+ monitored resources, ensuring comprehensive coverage without additional costs.
    • Metrics Availability:
    • Access to over 1,500 metrics without incurring extra expenses.
    • Open Source Standards:
    • Leverages industry-standard open-source tools like Prometheus and OpenTelemetry for collecting metrics across various compute workloads.
  • Cloud Monitoring: Contextual Visualization and Alerts
    • In-Context Visualization:
    • Provides the ability to view relevant telemetry data alongside workloads across Google Cloud, facilitating a contextual understanding of the operational environment.
    • Alert Mechanisms:
    • Generates automated alerts based on monitored data, enabling timely responses to potential issues and ensuring proactive management.
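    The alerting idea above can be illustrated with a minimal sketch. It assumes a simple threshold condition that must hold for several consecutive samples before an alert fires; `ThresholdCondition` and `should_alert` are hypothetical names for illustration, not part of the Cloud Monitoring API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ThresholdCondition:
    """Hypothetical alert condition (illustrative, not a Cloud Monitoring type)."""
    threshold: float      # value the metric must exceed
    duration_points: int  # consecutive samples that must violate (assumed >= 1)


def should_alert(samples: Sequence[float], cond: ThresholdCondition) -> bool:
    """Return True if the most recent `duration_points` samples all exceed the threshold."""
    if len(samples) < cond.duration_points:
        return False  # not enough data to evaluate the window
    window = samples[-cond.duration_points:]
    return all(value > cond.threshold for value in window)


cpu = [0.50, 0.90, 0.95, 0.92]
print(should_alert(cpu, ThresholdCondition(threshold=0.8, duration_points=3)))
```

    Requiring several consecutive violations is one common way to avoid alerting on a single noisy sample.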
  • Google Cloud Logging: In-Depth Analysis: Introduction to Cloud Logging
    • Objective: Explore the capabilities and features of Google's Cloud Logging service.
    • Overview: Cloud Logging enables users to collect, store, search, analyze, monitor, and alert on log entries and events.
    Automated Logging Integration
    • Integration Points:
    • Automated logging is integrated into various Google Cloud products, including:
    • App Engine
    • Cloud Run
    • Compute Engine VMs with the logging agent
    • Google Kubernetes Engine (GKE)
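    On platforms with automated logging such as Cloud Run and GKE, one common pattern is to write one JSON object per line to standard output so the entry is ingested as a structured log. The sketch below assumes the `severity` and `message` field names are recognized by the platform; `format_entry`, `log`, and the `order_id` field are hypothetical and used only for illustration.

```python
import json
import sys


def format_entry(severity: str, message: str, **fields) -> str:
    """Build a single-line JSON log entry with severity, message, and extra fields."""
    entry = {"severity": severity, "message": message, **fields}
    return json.dumps(entry)


def log(severity: str, message: str, **fields) -> None:
    """Write the entry to stdout, where the platform's logging integration can collect it."""
    print(format_entry(severity, message, **fields), file=sys.stdout, flush=True)


log("ERROR", "payment failed", order_id=1234)
```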
  • Cloud Logging: Features for Managing and Exploring Logs
    • Automatic Log Ingestion:
    • Simplifies log management with automatic and easy log ingestion.
    • Immediate Ingestion:
    • Instantly ingests logs from Google Cloud services across the entire stack for quick insights.
    • Exploration Tools:
    • Utilizes tools like Error Reporting, Logs Explorer, and Log Analytics for focused exploration within extensive log datasets.
  • Cloud Logging: Customization and Routing
    • Custom Routing and Storage:
    • Enables customization of log routing and storage.
    • Compliance and Business Benefits:
    • Route logs to specific regions or services for compliance and additional business benefits.
    Audit and App Logs for Compliance
    • Audit and App Logs Usage:
    • Leverages audit and app logs for compliance patterns and issue tracking.
  • Log Analysis and Export
    • Logs Analysis:
    • Initiates log analysis using integrated Logs Explorer.
    • Export Destinations:
    • Logs can be exported to various destinations, including Cloud Storage files, Pub/Sub messages, and BigQuery tables.
    • Real-Time Analysis:
    • Pub/Sub messages can be analyzed in near-real time using custom code or stream processing technologies like Dataflow.
    • BigQuery facilitates SQL queries for log examination.
  • Cloud Logging: Log-Based Metrics Integration
    • Metrics Creation:
    • Logs-based metrics can be created and integrated into Cloud Monitoring dashboards, alerts, and service-level objectives (SLOs).
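    Conceptually, a counter-style logs-based metric counts the log entries that match a filter, optionally broken down by a label. The sketch below models that idea in plain Python; `count_matching` and the entry fields (`severity`, `service`) are hypothetical, and this is not the Cloud Logging API.

```python
from collections import Counter
from typing import Callable, Iterable, Mapping


def count_matching(
    entries: Iterable[Mapping],
    predicate: Callable[[Mapping], bool],
    label_key: str,
) -> Counter:
    """Count entries that satisfy `predicate`, grouped by the value of `label_key`."""
    counts: Counter = Counter()
    for entry in entries:
        if predicate(entry):
            counts[entry.get(label_key, "unknown")] += 1
    return counts


entries = [
    {"severity": "ERROR", "service": "checkout"},
    {"severity": "INFO", "service": "checkout"},
    {"severity": "ERROR", "service": "search"},
    {"severity": "ERROR", "service": "checkout"},
]
errors_by_service = count_matching(entries, lambda e: e["severity"] == "ERROR", "service")
print(errors_by_service)
```

    The resulting counts are the kind of time-series value that could then be charted or alerted on.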
  • Cloud Logging: Retention Policies
    • Default Retention:
    • Default retention in Cloud Logging depends on log type.
    • Data access logs are retained for 30 days by default, configurable up to a maximum of 3,650 days.
    • Admin Activity audit logs are stored for 400 days by default.
    • Exporting logs to Google Cloud Storage or BigQuery extends retention options.
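    The retention rules above can be expressed as a small lookup. This is a sketch under stated assumptions: the 30, 400, and 3,650 day values come from the card, while the one-day minimum, the treatment of admin logs as a fixed period, and the names `effective_retention_days`, `"admin"`, and `"data_access"` are assumptions made for illustration.

```python
DEFAULT_RETENTION_DAYS = 30   # default for data access logs (from the card)
MAX_RETENTION_DAYS = 3650     # configurable maximum (from the card)
ADMIN_RETENTION_DAYS = 400    # admin logs (from the card)
MIN_RETENTION_DAYS = 1        # assumption: lower bound for a custom period


def effective_retention_days(log_kind: str, custom_days=None) -> int:
    """Return the retention period in days for a log kind under the rules above."""
    if log_kind == "admin":
        # Assumption: admin logs keep their fixed period regardless of custom settings.
        return ADMIN_RETENTION_DAYS
    if custom_days is None:
        return DEFAULT_RETENTION_DAYS
    if not MIN_RETENTION_DAYS <= custom_days <= MAX_RETENTION_DAYS:
        raise ValueError(f"retention must be {MIN_RETENTION_DAYS}-{MAX_RETENTION_DAYS} days")
    return custom_days


print(effective_retention_days("data_access"))
print(effective_retention_days("data_access", custom_days=365))
```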
  • Cloud Logging: Use Cases
    • Developer Use Case:
    • Quick start with out-of-the-box collection of system metrics and logs.
    • Real-time analysis, debugging, and troubleshooting facilitated with automatic mapping of stack traces to error types.
    • Security Operations (SecOps) Use Case:
    • Confirm that access is authorized and prevent unauthorized network access using audit logs, network telemetry, and streamlined log analysis.
    • Collect VPC, firewall, load balancer, and GKE logs, and more.
  • Google Cloud Error Reporting: In-Depth Analysis: Introduction to Error Reporting
    • Objective: Understand the capabilities of Google Cloud Error Reporting.
    • Overview: Error Reporting is designed to count, analyze, and aggregate crashes occurring in running cloud services.
  • Error Reporting: Advanced Functionalities for Smooth Application Operation
    • Real-Time Processing:
    • Application errors are processed and displayed in the interface within seconds.
    • Error Visibility:
    • A dedicated page offers detailed insights, including:
    • Error bar chart over time
    • List of affected versions
    • Request URL
    • Link to the request log
  • Error Reporting: Instant Notification
    • Proactive Alerting:
    • Error Reporting instantly alerts you when a new application error cannot be grouped with existing ones.
    • No need to wait for user reports.
    • Seamless Navigation:
    • Directly jump from a notification to the details of the new error for immediate investigation.
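    The "new error that cannot be grouped" behavior can be sketched as follows. This is a deliberately simplified model that groups errors by exception type and top stack frame; the real grouping logic is more sophisticated, and `ErrorGroups` and `record` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorGroups:
    """Tracks occurrence counts per error group (simplified grouping key)."""
    counts: dict = field(default_factory=dict)

    def record(self, exc_type: str, top_frame: str) -> bool:
        """Record one occurrence; return True if it starts a new group (notify)."""
        key = (exc_type, top_frame)
        is_new = key not in self.counts
        self.counts[key] = self.counts.get(key, 0) + 1
        return is_new


groups = ErrorGroups()
for exc_type, frame in [
    ("KeyError", "cart.py:42"),
    ("KeyError", "cart.py:42"),
    ("TimeoutError", "pay.py:10"),
]:
    if groups.record(exc_type, frame):
        print(f"new error group: {exc_type} at {frame}")
```

    Only the first occurrence of each group triggers a notification; later occurrences just increase the count.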
  • Error Reporting: Crash Management for Common Languages
    • Exception Handling:
    • Crashes in most common languages are exceptions that are not caught and handled by the code itself.
    • Management Interface:
    • User-friendly interface with sorting and filtering capabilities.
    • Dedicated view shows error details, time chart, occurrences, affected user accounts, first and last seen dates, and a cleaned exception stack trace.
  • Error Reporting: Error Analysis and Details
    • Error Details View:
    • In-depth view with comprehensive details:
    • Time chart
    • Occurrences
    • Affected user accounts
    • First and last seen dates
    • Cleaned exception stack trace
  • Alerting Capabilities
    • Custom Alerts:
    • Create alerts to receive notifications on new errors.
    • Enables proactive monitoring and response to emerging issues.
    Conclusion and Significance
    • Immediate Action:
    • Google Cloud Error Reporting ensures quick identification and understanding of application errors.
    • Proactive Monitoring:
    • Real-time processing and instant notifications empower proactive error management.
    • Comprehensive Analysis:
    • The management interface and error details view provide a comprehensive analysis of errors for efficient troubleshooting.
  • Google Cloud Profiler: Overview
    • Performance Analysis Tool: Offers a comprehensive picture of CPU and heap usage with minimal performance impact.
    • Platform Support: Compatible with Compute Engine VMs, App Engine, and Kubernetes.
    • Resource Analysis: Utilizes statistical techniques and provides an interactive flame graph for understanding resource consumption.
    • Code Behavior: Assists developers in understanding how their code is called and which parts consume the most resources.
  • Google Cloud Trace: Overview
    • Latency Data Collection: Captures latency data from distributed applications, displayed in Google Cloud Console.
    • Application Compatibility: Collects traces from App Engine, Compute Engine VMs, GKE containers, and Cloud Run, as well as from non-Google Cloud environments.
    • Real-Time Performance Insights: Provides near real-time performance insights.
    • Continuous Monitoring: Automatically identifies recent changes affecting application performance.
    • Latency Reports: Generates latency reports from collected traces to surface performance degradations.
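    To make the trace idea concrete, the sketch below models a trace as a list of timed spans and derives end-to-end latency and the slowest span. `Span`, `trace_latency_ms`, and `slowest_span` are hypothetical names for illustration and are not the Cloud Trace API; the span names and timings are made up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Span:
    """One timed operation within a trace (illustrative model)."""
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def trace_latency_ms(spans: Sequence[Span]) -> float:
    """End-to-end latency: earliest start to latest end across all spans."""
    return max(s.end_ms for s in spans) - min(s.start_ms for s in spans)


def slowest_span(spans: Sequence[Span]) -> Span:
    """Return the span with the longest duration."""
    return max(spans, key=lambda s: s.duration_ms)


spans = [
    Span("frontend", 0, 120),
    Span("auth", 5, 25),
    Span("db-query", 30, 110),
]
print(trace_latency_ms(spans), slowest_span(spans[1:]).name)
```

    Excluding the root span, as in the last line, points at the downstream call that contributes most to the request's latency.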
  • Overall Integration in Google Cloud Operations Suite
    • User-Focused Products: Focuses on SLO monitoring, uptime checks, and tracing for a holistic view.
    • Open and Flexible: Leverages open-source projects like Prometheus, OpenTelemetry, and Fluent Bit for flexibility.
    • Contextual Telemetry: Connects datasets in context across Google Cloud service views.
    • Powerful Analysis and Alerting: Provides meaningful analysis and alerting tools for both automated and human-led resolutions.
  • Google Cloud Profiler:
    • Broad platform support includes Compute Engine VMs, App Engine, and Kubernetes. Developers can analyze applications running anywhere, including Google Cloud, other cloud platforms, or on-premises, with support for Java, Go, Python, and Node.js.
    • Cloud Profiler presents the call hierarchy and resource consumption of the relevant function in an interactive flame graph that helps developers understand which parts consume the most resources and the different ways in which their code is actually called.
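    A flame graph is typically built by aggregating many sampled call stacks. The sketch below shows that aggregation step under simple assumptions: each sample is a list of function names from root to leaf, and `fold_stacks` and `inclusive_share` are hypothetical helpers, not part of Cloud Profiler.

```python
from collections import Counter
from typing import Sequence


def fold_stacks(samples: Sequence[Sequence[str]]) -> Counter:
    """Collapse stack samples (root first) into 'a;b;c' -> number of samples."""
    return Counter(";".join(stack) for stack in samples)


def inclusive_share(samples: Sequence[Sequence[str]], function: str) -> float:
    """Fraction of samples in which `function` appears anywhere on the stack."""
    hits = sum(1 for stack in samples if function in stack)
    return hits / len(samples)


samples = [
    ["main", "handle", "parse"],
    ["main", "handle", "parse"],
    ["main", "handle", "render"],
    ["main", "idle"],
]
folded = fold_stacks(samples)
print(folded.most_common(1), inclusive_share(samples, "handle"))
```

    Each folded stack corresponds to one path in the flame graph, and its sample count determines how wide that frame is drawn.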
  • Latency is important because:
    1. It directly affects the user experience.
    2. Changes in latency could indicate emerging issues.
    3. Its values may be tied to capacity demands.
    4. It can be used to measure system improvements.
    But how is it measured? Sample latency metrics include:
    • Page load latency
    • Number of requests waiting for a thread
    • Query duration
    • Service response time
    • Transaction duration
    • Time to first response
    • Time to complete data return
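    Latency metrics such as service response time are often summarized as percentiles rather than averages, because a few slow requests can dominate user experience. The sketch below uses the nearest-rank method; `percentile` is a hypothetical helper and the sample values are made up.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [120, 95, 110, 105, 100, 98, 102, 99, 101, 900]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))
```

    Here the median stays near 100 ms while the 99th percentile exposes the single 900 ms outlier, which an average would blur.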
  • Why Traffic is Important:
    1. It’s an indicator of current system demand.
    2. Its historical trends are used for capacity planning.
    3. It’s a core measure when calculating infrastructure spend.
    Sample Traffic Metrics:
    • HTTP requests per second
    • Requests for static vs. dynamic content
    • Network I/O
    • Concurrent sessions
    • Transactions per second
    • Retrievals per second
    • Active requests
    • Write ops
    • Read ops
    • Active connections
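    A traffic metric such as requests per second can be derived by bucketing request timestamps. This is a minimal sketch that assumes non-negative timestamps in seconds; `requests_per_second` and `peak_rps` are hypothetical helpers.

```python
from collections import Counter
from typing import Sequence


def requests_per_second(timestamps: Sequence[float]) -> Counter:
    """Map each whole second to the number of requests that arrived during it."""
    return Counter(int(t) for t in timestamps)


def peak_rps(timestamps: Sequence[float]) -> int:
    """Highest per-second request count, a simple input for capacity planning."""
    return max(requests_per_second(timestamps).values())


arrivals = [0.1, 0.5, 0.9, 1.2, 1.7, 3.0]
print(requests_per_second(arrivals), peak_rps(arrivals))
```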
  • Saturation - Importance and Metrics: The third signal is saturation, measuring how close to capacity a system is. Capacity is often subjective, depending on the underlying service or application.
    Importance of Saturation:
    1. It's an indicator of current system demand, reflecting how full the service is.
    2. It focuses on the most constrained resources.
    3. It's frequently tied to degrading performance as capacity is reached.
    Sample Capacity Metrics:
    • % memory utilization
    • % thread pool utilization
    • % cache utilization
    • % disk utilization
    • % CPU utilization
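    Because saturation focuses on the most constrained resource, a simple check is to find the resource with the highest utilization and flag those above a limit. The 80% limit and the names `most_constrained` and `saturated` are assumptions chosen for illustration.

```python
from typing import Mapping


def most_constrained(utilization: Mapping[str, float]):
    """Return (resource, percent) for the resource closest to capacity."""
    name = max(utilization, key=utilization.get)
    return name, utilization[name]


def saturated(utilization: Mapping[str, float], limit_pct: float = 80.0):
    """Return the sorted names of resources at or above the (assumed) limit."""
    return sorted(name for name, pct in utilization.items() if pct >= limit_pct)


usage = {"cpu": 62.0, "memory": 91.5, "disk": 80.0, "thread_pool": 40.0}
print(most_constrained(usage), saturated(usage))
```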
  • Errors - Indications and Sample Metrics: The fourth signal is errors, measuring system failures or other issues. Errors are raised when flaws, failures, or faults in a computer program or system cause it to produce incorrect or unexpected results, or behave in unintended ways.
    Indications that Errors Might Provide:
    1. Configuration or capacity issues
    2. Service level objective violations
    3. Signaling the need to emit an alert
    Sample Error Metrics:
    • Wrong answers or incorrect content
    • 4xx/5xx HTTP status codes
    • Failed requests
    • Exceptions
    • Stack traces
    • Servers that fail liveness checks
    • Dropped connections
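    An error metric based on HTTP status codes can be computed as the share of responses in the 4xx and 5xx classes. In this sketch, `error_ratio` is a hypothetical helper and the choice of which classes count as errors is an assumption that depends on the service.

```python
from typing import Sequence


def error_ratio(status_codes: Sequence[int], classes: Sequence[int] = (4, 5)) -> float:
    """Fraction of responses whose status class (hundreds digit) is in `classes`."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code // 100 in classes)
    return errors / len(status_codes)


codes = [200, 200, 404, 500, 200, 503, 200, 200]
print(error_ratio(codes), error_ratio(codes, classes=(5,)))
```

    Counting only the 5xx class isolates server-side failures, which is often what a service-level objective tracks.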
  • Cloud Monitoring provides visibility into the performance, uptime, and overall health of cloud-powered applications. It collects metrics, events, and metadata from projects, logs, services, systems, agents, custom code, and various common application components, including Cassandra, Nginx, Apache Web Server, Elasticsearch, and many others. Monitoring ingests that data and generates insights via dashboards, Metrics Explorer charts, and automated alerts.
  • Cloud Monitoring Advanced Capabilities:
    1. Automatic, Free Ingestion:
    • On 100+ monitored resources.
    • Over 1,500 metrics are immediately available with no cost.
    2. Open Source Standards:
    • Leverage Prometheus and OpenTelemetry to collect metrics across compute workloads.
    3. Customization for Key Workloads:
    • Cloud Monitoring offers custom visualization capabilities for:
    • GKE through Google Cloud Managed Service for Prometheus.
    • Google Compute Engine through Ops Agent.
    4. In-Context Visualizations & Alerts:
    • View relevant telemetry data alongside your workloads across Google Cloud.
  • Google's Cloud Logging allows users to collect, store, search, analyze, monitor, and alert on log entries and events. Automated logging is integrated into Google Cloud products like App Engine, Cloud Run, Compute Engine VMs running the logging agent, and GKE.
  • Cloud Logging Features: Cloud Logging provides a range of features to make managing and exploring logs easier, including:
    1. Automatic, Easy Log Ingestion:
    • Immediate ingestion from Google Cloud services across your stack.
    2. Gain Insight Quickly:
    • Tools like Error Reporting, Logs Explorer, and Log Analytics enable quick focus from large sets of data.
    3. Customize Routing & Storage:
    • Route your logs to the region or service of your choice for additional compliance or business benefits.
    4. Compliance Insights:
    • Leverage audit and app logs for compliance patterns and issues.
  • Google's Cloud Logging allows users to collect, store, search, analyze, monitor, and alert on log entries and events. Automated logging is integrated into Google Cloud products like App Engine, Cloud Run, Compute Engine VMs that run the logging agent, and GKE. You can aggregate and centralize logs at the organization, folder, or project level based on your needs.
    • Most log analysis starts with Google Cloud's integrated Logs Explorer. Log entries can also be exported to several destinations for alternative or further analysis.
    • Export log data as files to Google Cloud Storage, or as messages through Pub/Sub, or into BigQuery tables. Pub/Sub messages can be analyzed in near-real time using custom code or stream processing technologies like Dataflow. BigQuery allows analysts to examine logging data through SQL queries. And archived log files in Cloud Storage can be analyzed with several tools and techniques. Logs-based metrics may be created and integrated into Cloud Monitoring dashboards, alerts, and service SLOs.