
Service level indicator

A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is provided, typically from the perspective of the end user, such as request success rate, latency, or throughput. In the context of site reliability engineering (SRE), SLIs serve as the foundational metrics for assessing service health and reliability, often calculated as a ratio of "good" events (e.g., successful requests) to total events over a specified time window. SLIs are integral to the SRE framework, where they underpin Service Level Objectives (SLOs)—target reliability levels derived from SLIs, such as serving 99% of requests in under 100 milliseconds—and Service Level Agreements (SLAs), which are contractual commitments with potential consequences for non-compliance. Common SLIs include availability, latency, error rate, throughput, and data freshness.

When implementing SLIs, practitioners emphasize starting with simple, user-focused metrics derived from existing data sources such as server logs or client instrumentation, while avoiding coarse aggregates like averages in favor of percentiles for accuracy. These indicators enable real-time monitoring, alerting on deviations (e.g., via tools like Prometheus), and informed decisions on error budgets to balance reliability with release velocity. By focusing on the "four golden signals" of monitoring—latency, traffic, errors, and saturation—SLIs align operational practices with business outcomes in large-scale distributed systems.

Definition and Fundamentals

Definition

A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is provided. In the context of site reliability engineering (SRE), an SLI focuses on user-centric metrics, such as request latency or availability, to assess performance from the end-user's perspective, distinguishing it from internal system metrics that primarily track component health without necessarily correlating to user experience. SLIs originated within Google's SRE practices and were formally introduced in the 2016 book Site Reliability Engineering: How Google Runs Production Systems, which prioritizes end-user experience as the core basis for reliability measurements over isolated operational indicators.

The calculation of an SLI typically follows the formula

\text{SLI} = \left( \frac{\text{number of good events}}{\text{total valid events}} \right) \times 100\%

where "good" events are defined by service-specific thresholds, such as requests served within 100 milliseconds for latency-sensitive services. This approach provides a foundational quantitative basis for understanding service reliability by evaluating performance against predefined criteria.
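As a minimal sketch of this ratio, the example below computes a latency SLI as the percentage of requests served successfully within 100 milliseconds; the Request record and the threshold value are assumptions made for illustration, not part of any standard API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    latency_ms: float   # time taken to serve the request
    success: bool       # whether the request completed without error

def latency_sli(requests: List[Request], threshold_ms: float = 100.0) -> float:
    """SLI = (good events / total valid events) * 100%.

    A 'good' event here is a successful request served within the latency
    threshold; every completed request counts as a valid event.
    """
    if not requests:
        return 100.0  # no traffic: nothing has violated the target
    good = sum(1 for r in requests if r.success and r.latency_ms <= threshold_ms)
    return 100.0 * good / len(requests)

# Three fast successes and one slow request -> 75.0% SLI
print(latency_sli([Request(42, True), Request(87, True),
                   Request(95, True), Request(300, True)]))
```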

Key Characteristics

Service level indicators (SLIs) are fundamentally user-centric, designed to capture the aspects of service performance that directly impact end-user experience rather than internal system metrics. For instance, success rates are evaluated from the client's perspective, ensuring that measurements align with how users perceive reliability, such as whether a request completes as expected from their viewpoint. This approach prioritizes observable outcomes that matter to customers, avoiding proxies that might overlook real-world interactions.

SLIs must be quantifiable and objectively measurable to enable consistent tracking and analysis. They are typically expressed as ratios (e.g., successful events divided by total events), percentages, or percentiles, allowing for precise evaluation over defined time windows, such as 28-day rolling periods, to assess long-term health. This format facilitates automated collection and aggregation from monitoring systems, providing a clear basis for reliability assessments without subjective interpretation.

A key property of effective SLIs is their specificity, focusing on a single, well-defined aspect of service performance to avoid dilution or ambiguity in measurements. For example, an SLI might target error rate as the proportion of failed requests, rather than a composite that blends multiple factors, ensuring targeted insights into particular reliability dimensions. This narrow scope promotes clarity in troubleshooting and prioritization of issues.

Finally, SLIs are engineered to be actionable, serving as the foundation for alerting mechanisms and improvement initiatives when performance deviates from expected norms. By establishing thresholds tied to these indicators, teams can promptly detect breaches—such as spikes in errors—and initiate responses like capacity adjustments or rollbacks, thereby maintaining service reliability proactively. This design integrates SLIs into operational workflows, transforming raw data into drivers for engineering decisions.
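As a sketch of how such a ratio might be aggregated over a rolling window, the example below keeps per-day good/total counts and reports a traffic-weighted percentage; the manual bookkeeping is illustrative only, since production systems pull these counts from monitoring pipelines.

```python
from collections import deque

class RollingRatioSLI:
    """Success-ratio SLI over a rolling window of daily (good, total) counts."""

    def __init__(self, window_days: int = 28):
        self.days = deque(maxlen=window_days)  # oldest day drops off automatically

    def record_day(self, good_events: int, total_events: int) -> None:
        self.days.append((good_events, total_events))

    def value(self) -> float:
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return 100.0 * good / total if total else 100.0

sli = RollingRatioSLI()
sli.record_day(good_events=999_000, total_events=1_000_000)  # a 99.9% day
sli.record_day(good_events=490_000, total_events=500_000)    # a 98.0% day, half the traffic
print(f"{sli.value():.2f}%")  # 99.27% -- traffic-weighted, not an average of daily percentages
```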

Types of Service Level Indicators

The Four Golden Signals

The four golden signals—latency, traffic, errors, and saturation—provide a minimal yet comprehensive framework for assessing the health of user-facing services in site reliability engineering (SRE). Introduced in Google's 2016 SRE book as a practical set of metrics, these signals focus on symptoms of user-perceived issues and imminent problems, enabling teams to prioritize monitoring efforts without overwhelming complexity. By concentrating on these four, SRE practitioners can achieve effective coverage for most distributed systems, as they capture the essential dimensions of service performance and reliability from the end-user perspective.

Latency measures the time taken to service a request, emphasizing the distribution of response times rather than simple averages to account for variability. It is crucial to distinguish between latency for successful requests and failed ones, such as fast HTTP 500 errors, which should be tracked separately to avoid masking performance issues. For instance, engineers often target the 99th percentile latency to remain below 200 milliseconds for web services, while also tracking tail effects that impact a small but critical portion of users.

Traffic quantifies the overall demand placed on the system, serving as a baseline for capacity planning and load analysis. Common metrics include requests per second for HTTP-based services or concurrent sessions for streaming applications, helping differentiate between healthy increases in usage and signs of overload. This signal enables proactive scaling, as anomalies in traffic can reveal underlying bottlenecks before other issues arise.

Errors track the rate of failed or degraded requests, providing insight into reliability from the user's viewpoint. These include explicit failures like HTTP 5xx server errors at the load balancer level, as well as implicit ones such as incorrect content delivery or policy violations detected through end-to-end tests. Distinguishing between total failures (e.g., outright rejections) and partial ones (e.g., timeouts) is essential, with error rates calculated as the proportion of erroneous requests to total traffic.

Saturation gauges how close a service is to its capacity limits, indicating potential for future degradation. Metrics might include CPU utilization exceeding 80%, high memory usage, or I/O queue depths, which signal the need for scaling to prevent cascading failures. For example, predictive alerts could warn if disk space will fill within four hours, allowing time for remediation. This signal complements the others by focusing on internal constraints that indirectly affect user experience.

Together, these signals suffice for most services because they address the primary axes of user impact—speed, volume, failures, and capacity—while remaining user-focused and actionable for alerting and troubleshooting in SRE practices.
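A rough sketch of deriving the four signals from one observation window follows; the Sample record, the window length, and the CPU figure standing in for saturation are assumptions made for the example, and the percentile uses a simple nearest-rank approximation.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Sample:
    latency_ms: float
    status_code: int

def golden_signals(samples: List[Sample], window_seconds: float,
                   cpu_utilization: float) -> Dict[str, float]:
    """Derive rough values for the four golden signals from one window of requests."""
    latencies = sorted(s.latency_ms for s in samples if s.status_code < 500)
    errors = sum(1 for s in samples if s.status_code >= 500)
    # nearest-rank approximation of the 99th percentile over successful requests only
    p99 = latencies[int(0.99 * (len(latencies) - 1))] if latencies else 0.0
    return {
        "latency_p99_ms": p99,                                    # latency
        "traffic_rps": len(samples) / window_seconds,             # traffic
        "error_rate": errors / len(samples) if samples else 0.0,  # errors
        "saturation_cpu": cpu_utilization,                        # saturation
    }
```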

Other Common SLIs

Beyond the four golden signals, service level indicators (SLIs) in specialized domains like data processing, content delivery, and non-web applications focus on aspects such as operational uptime, processing efficiency, data timeliness, and feature reach. These metrics ensure reliability in scenarios where user experience depends on consistent data handling or system stability rather than just request performance.

Availability, or uptime, quantifies the proportion of time a service remains operational and responsive to user requests, typically measured as the percentage of successful probes or responses over a defined period. For always-on services like cloud infrastructure, this SLI is calculated as \left( \frac{\text{successful probes}}{\text{total probes}} \right) \times 100\%, where probes simulate user interactions to verify service usability. Many cloud providers, for instance, target availability levels such as 99.95% to support mission-critical workloads.

Throughput measures the rate at which a system processes requests or transactions, essential for batch jobs or streaming pipelines where processing volume impacts viability. In data-intensive environments, it is often expressed as the proportion of time units during which the processing rate exceeds a minimum threshold, such as transactions per minute or bytes per second. For example, a streaming pipeline might aim for sustained throughput above 1,000 events per second to keep analytics current.

Freshness evaluates the timeliness of data in services like caches or analytics platforms, defined as the proportion of valid data elements updated within a specified time threshold since the last refresh. This SLI is critical for applications requiring current information, such as recommendation engines, and can be computed as \frac{\text{updated data elements}}{\text{total valid data elements}}, with targets like less than 5 minutes for cache validity to prevent stale content delivery. Google Cloud's monitoring tools support freshness SLIs by tracking the age of the oldest data element against such thresholds.

Coverage assesses the extent to which a service delivers expected content or processes intended data, particularly in content distribution networks or A/B testing frameworks, measured as the percentage of users or records receiving the targeted features or updates. For content delivery, this might track the success rate of a feature rollout, such as 99% of users accessing a new UI variant without fallback to defaults. In data processing contexts, it is the proportion of valid input successfully handled, ensuring comprehensive system operation.

In non-web contexts, SLIs adapt to domain-specific reliability needs. For databases, query correctness serves as a key indicator, representing the proportion of queries yielding accurate results against known benchmarks or curated test data. This is vital for storage systems, where even an available service fails its users if outputs are erroneous, often verified through periodic audits of read/write operations. For mobile applications, the crash-free sessions metric measures reliability as the percentage of user sessions that complete without termination due to errors, calculated as \left( 1 - \frac{\text{crashed sessions}}{\text{total sessions}} \right) \times 100\%. Crashlytics recommends targets above 99% to sustain user trust in high-engagement apps such as finance tools.
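For example, a freshness SLI of the kind described above could be approximated as in the sketch below; the in-memory list of refresh timestamps and the 5-minute threshold are assumptions for illustration, since real systems would read these from cache or pipeline metadata.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

def freshness_sli(last_refresh: List[datetime],
                  threshold: timedelta = timedelta(minutes=5),
                  now: Optional[datetime] = None) -> float:
    """Percentage of valid data elements refreshed within the threshold."""
    now = now or datetime.now(timezone.utc)
    if not last_refresh:
        return 100.0
    fresh = sum(1 for ts in last_refresh if now - ts <= threshold)
    return 100.0 * fresh / len(last_refresh)

now = datetime.now(timezone.utc)
ages = [now - timedelta(minutes=m) for m in (1, 2, 3, 12)]  # one stale element
print(freshness_sli(ages, now=now))  # 75.0
```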

Relationship to SLOs and SLAs

Service Level Objectives (SLOs)

Service level objectives (SLOs) are specific, measurable targets set for service level indicators (SLIs), defining the desired level of reliability for a service over a given time period, such as achieving 99.9% availability for requests over a month. These objectives serve as internal reliability budgets for engineering teams, guiding decisions on when to prioritize stability versus new feature development.

A key concept associated with SLOs is the error budget, which represents the allowable amount of unreliability or downtime permitted before violating the objective, calculated as 100% minus the SLO target—for instance, a 99.9% SLO allows a 0.1% error budget. This budget, often tracked over weekly or monthly windows, enables teams to balance innovation and stability by permitting controlled risks, such as rapid releases, as long as the overall reliability target is met; exceeding the budget shifts focus to remediation efforts.

SLOs are established by analyzing historical SLI data, assessing customer expectations for performance, and evaluating the business impact of potential failures, with targets typically ranging from 99.0% to 99.99% for critical services to align with user tolerance for disruptions. For example, a service might set an SLO of 99% of requests completing under 100 milliseconds, derived from user surveys and revenue loss models indicating that higher latencies affect satisfaction. For services with multiple SLIs, separate SLOs are often defined for each to reflect overall service health, such as targets for both latency and error rates.
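A small worked example of the error-budget arithmetic for the 99.9% case above:

```python
def error_budget_requests(slo_target: float, total_requests: int) -> int:
    """Failed requests the error budget tolerates in a window.

    A 99.9% SLO over 1,000,000 requests leaves a budget of 1,000 failures.
    """
    return round((1.0 - slo_target) * total_requests)

def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative once exhausted)."""
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad if allowed_bad else 0.0

print(error_budget_requests(0.999, 1_000_000))                  # 1000
print(budget_remaining(0.999, good=999_400, total=1_000_000))   # ~0.4 -> about 40% of budget left
```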

Service Level Agreements (SLAs)

Service level agreements (SLAs) represent formal, contractual commitments made by service providers to their customers, specifying guaranteed levels of service reliability and performance, typically measured against service level indicators (SLIs) and derived from internal service level objectives (SLOs). These agreements outline explicit targets, such as 99.5% uptime over a monthly period, and are designed to be more conservative than internal SLOs to buffer against natural variability in service delivery. Unlike SLOs, which serve as nuanced, internal targets for guiding engineering decisions without direct repercussions, SLAs are external-facing and operate on a met-or-not-met basis, triggering predefined consequences when breached. This distinction ensures that SLAs focus on customer accountability rather than operational flexibility, with SLOs forming the foundational targets from which SLA thresholds are conservatively set.

When an SLA is violated—determined through ongoing SLI measurements tied to the agreed SLOs—providers must enact remedies, which commonly include financial credits proportional to the breach (e.g., service credits equaling a percentage of monthly fees) or escalated support priorities to restore service. These penalties incentivize reliability while providing customer recourse, and are often negotiated by business and legal teams in consultation with reliability engineers.

The concept of SLAs traces its origins to traditional IT service management frameworks like ITIL, where they were formalized as key components of service level management processes starting in the early 2000s to align IT services with business needs. In the post-2010s era, the adoption of site reliability engineering (SRE) principles—popularized by Google's practices—has evolved SLAs into more integrated tools within modern cloud and DevOps ecosystems, emphasizing measurable reliability commitments alongside agile development.
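As a purely hypothetical illustration of how a breach translates into remedies, the sketch below maps measured monthly uptime to a service credit; the tiers and percentages are invented for the example and do not reflect any specific provider's terms.

```python
def service_credit_percent(measured_uptime: float, sla_target: float = 0.995) -> float:
    """Map measured monthly uptime to a credit, expressed as a percent of monthly fees."""
    if measured_uptime >= sla_target:
        return 0.0    # SLA met: no remedy owed
    if measured_uptime >= 0.99:
        return 10.0   # minor breach (hypothetical tier)
    if measured_uptime >= 0.95:
        return 25.0   # larger breach (hypothetical tier)
    return 100.0      # severe breach (hypothetical tier)

print(service_credit_percent(0.9987))  # 0.0  -> commitment met
print(service_credit_percent(0.992))   # 10.0 -> credit triggered by the breach
```

The met-or-not-met structure is the point: unlike an SLO, which guides engineering trade-offs, crossing the SLA threshold mechanically triggers a contractual consequence.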

Implementation and Best Practices

Defining Effective SLIs

Defining effective service level indicators (SLIs) involves a structured process that ensures these metrics accurately reflect user experience and service reliability. By aligning SLIs with business objectives, organizations can prioritize improvements that matter most to users, avoiding irrelevant or overly complex measurements. This approach draws from established site reliability engineering (SRE) practices, emphasizing simplicity and iteration to build robust indicators.

The first step is to identify critical user journeys, which represent the key paths users take to achieve their goals with the service. These journeys, such as logging in, searching for products, or completing a checkout in an application, serve as the foundation for selecting relevant SLIs. Focusing on these paths ensures that indicators capture what users perceive as the service's core functionality, rather than internal system metrics alone. For instance, in an e-commerce service, critical journeys might include product search and purchase completion.

Next, select metrics that proxy for user happiness, often drawing from established types like the four golden signals—latency, traffic, errors, and saturation—or other relevant measures such as availability or freshness. The chosen metrics should be quantifiable and directly tied to user journeys, forming the basis for SLIs expressed as ratios of good events to total events. For example, request latency can be selected as an SLI for a search service, where it measures the time users wait for results. Prioritize a small set of metrics that cover the service comprehensively without redundancy.

Then, define what constitutes "good" versus "bad" events using clear thresholds to distinguish acceptable performance from failures. A good event might be a request completing in under 500 milliseconds, while anything exceeding that is bad, capturing both typical and tail experiences through percentiles like the 99th. This allows SLIs to be calculated as success ratios, such as the proportion of requests succeeding within the threshold. For such a service, good events could include responses without errors and latencies below a 450-millisecond threshold.

Subsequently, choose appropriate time windows and sampling methods to aggregate data reliably while minimizing noise from transient issues. Common windows include a 30-day rolling period for overall reliability assessment, with shorter intervals like one week for more frequent reviews, ensuring SLIs reflect sustained performance. Sampling should be consistent and representative, such as evaluating every request or using stratified samples from logs to avoid bias. For example, latency SLIs might aggregate over four-week windows using server-side metrics sampled every 10 seconds.

Best practices recommend starting with simple SLIs and iterating based on real-world data and user feedback, refining thresholds and metrics as the service evolves. Aim for 3 to 5 SLIs per service to maintain focus and avoid over-engineering, which can lead to alert fatigue or misprioritization. This iterative approach, combined with documentation for stakeholders, ensures SLIs remain aligned with evolving user needs.
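Bringing these steps together, an SLI definition can be recorded as a simple, tool-agnostic specification; the structure and field names below are hypothetical, intended only to show how a journey, metric, good-event threshold, and window fit together.

```python
from dataclasses import dataclass

@dataclass
class SliSpec:
    """Hypothetical record of an SLI definition (not tied to any tool's schema)."""
    user_journey: str    # critical user journey the SLI protects
    metric: str          # measurement chosen to proxy user happiness
    good_event: str      # what counts as a "good" event
    valid_event: str     # which events are counted at all
    window_days: int     # aggregation window for evaluation

checkout_latency = SliSpec(
    user_journey="complete checkout",
    metric="request latency",
    good_event="HTTP 2xx response served in under 450 ms",
    valid_event="all checkout requests reaching the load balancer",
    window_days=28,
)
print(checkout_latency)
```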

Monitoring and Measurement

Monitoring and measurement of service level indicators (SLIs) involve collecting data from diverse sources to ensure accurate representation of service reliability. Primary data sources include client-side instrumentation, which captures end-user experiences such as browser page-load or response times; server logs, which record internal events like HTTP error rates or request durations; and synthetic probes that simulate user interactions to test external availability and performance. Tools like Prometheus facilitate metric collection through scraping endpoints, while Datadog supports client-side instrumentation and synthetic testing via agent integrations and global probe networks. These sources enable continuous SLI tracking, with effective SLI definitions serving as a prerequisite for reliable measurement.

Aggregation methods transform raw SLI data into actionable insights over defined time periods. Rolling windows, such as 1-minute or 28-day intervals, smooth fluctuations by averaging or summing metrics like request success rates, preventing short-term anomalies from skewing long-term assessments. For distribution-based SLIs like latency, percentiles—particularly the p99 (99th percentile)—are used to quantify performance, ensuring that 99% of requests fall below a threshold (e.g., 500 ms) while inherently handling outliers by excluding the slowest 1%. Platforms like Nobl9 apply these aggregations dynamically, selecting min/max or percentile operators based on threshold directions to maintain precision without overemphasizing extremes.

Alerting mechanisms notify teams when SLIs deviate from service level objectives (SLOs), enabling proactive remediation. Threshold-based notifications trigger when an SLI, such as error rate, exceeds an SLO target (e.g., >0.1% over 10 minutes for a 99.9% SLO), often using multi-burn-rate alerts to detect rapid error budget consumption. Integration with incident management tools like PagerDuty allows these alerts to escalate via paging or email, routing notifications to on-call responders based on severity. This approach prioritizes user-impacting issues, reducing noise from minor fluctuations.

Automation embeds SLI measurement into development workflows for seamless reliability validation. In CI/CD pipelines, continuous SLI checks—such as latency or error-rate gates—evaluate deployments against SLOs before promotion, using tools like Buildkite or Keptn with Prometheus data to halt faulty releases. Error budgets guide these processes, permitting deployments when budget remains but enforcing rollbacks if SLIs indicate violations, thereby balancing release velocity and reliability.

Challenges in SLI accuracy often stem from measurement bias, where measurements favor "golden users" or proxy metrics (e.g., server latency over end-to-end client experience), leading to optimistic views of reliability. This bias can be mitigated through diverse synthetic probes distributed across global locations and user scenarios, as implemented in Datadog's synthetic monitoring, to better approximate real-world variability and reduce discrepancies between internal logs and actual user telemetry.
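A sketch of the multi-window, multi-burn-rate pattern mentioned above follows; the 14.4 threshold (a commonly cited fast-burn value for hour-scale windows against a 30-day budget) and the example error rates are illustrative, not a prescribed standard.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Rate of error-budget consumption relative to plan: 1.0 spends the budget
    exactly over the SLO window; higher values spend it proportionally faster."""
    budget = 1.0 - slo_target
    return error_rate / budget if budget else float("inf")

def should_page(short_window_error_rate: float, long_window_error_rate: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both a short and a longer window show fast budget burn,
    which filters out brief blips while still catching sustained incidents."""
    return (burn_rate(short_window_error_rate, slo_target) >= threshold and
            burn_rate(long_window_error_rate, slo_target) >= threshold)

print(should_page(short_window_error_rate=0.02, long_window_error_rate=0.018))   # True
print(should_page(short_window_error_rate=0.02, long_window_error_rate=0.0005))  # False
```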

Examples and Applications

Real-World Examples

In web services, Netflix applies latency and error rate as primary service level indicators to maintain streaming reliability. These indicators align with the four golden signals framework, allowing engineers to balance user demand with system capacity through proactive load shedding and request prioritization.

In cloud infrastructure, Amazon Web Services (AWS) defines availability as a core service level indicator for its S3 object storage service, aiming for 99.99% availability over a monthly period, achieved through replication across multiple availability zones and measured via error rates from regional synthetic probes that simulate user requests. This approach ensures high durability and accessibility for stored objects, with the indicator directly informing service credits if thresholds are not met.

For e-commerce platforms, Amazon monitors throughput as a service level indicator during peak events like Prime Day, sustaining order processing rates exceeding 12,000 orders per second without performance degradation by scaling services such as DynamoDB to handle over 200 million requests per second. This metric captures the system's ability to manage traffic surges, preventing bottlenecks in checkout and fulfillment processes.

Beyond technology sectors, healthcare systems employ freshness as a service level indicator for patient monitoring, emphasizing real-time data updates to enable timely clinical decisions, such as in remote patient tracking where analytics process streaming inputs from wearable devices. Such indicators support proactive interventions by ensuring data recency in electronic health records and monitoring platforms.

In the 2020s, Google evolved its internal indicators, building on the golden signals to accommodate emerging workloads. As of 2025, recent developments include integrating SLIs with AI-driven observability tooling, as outlined in updated SRE practices, to measure latency in hybrid environments.

Challenges and Solutions

One significant challenge in employing service level indicators (SLIs) is metric gaming, where teams optimize specifically for the chosen metrics at the potential expense of overall user experience, a phenomenon akin to Goodhart's law in practice. For instance, prioritizing request success rates might lead to de-emphasizing critical user journeys, such as new user onboarding, while favoring less impactful operations like background tasks. To mitigate this, organizations conduct regular audits to evaluate SLIs against real user impacts and adopt multi-metric approaches that balance availability, latency, and other dimensions for a more comprehensive reliability assessment.

In microservices architectures, scalability poses another hurdle, as the proliferation of services can result in an explosion of individual SLIs, complicating management and aggregation across distributed systems. This distributed nature increases complexity, with challenges in correlating metrics from interdependent components to reflect end-to-end user journeys. An effective solution involves hierarchical aggregation, distinguishing service-level SLIs from team- or system-level composites, often using critical user journeys to synthesize data without overwhelming oversight.

SLIs can also become outdated as services evolve, failing to capture shifts in user expectations or technical landscapes, which undermines their value as reliability measures. To address this, teams implement quarterly reviews integrated with user feedback loops, adjusting indicators based on observed usage, stakeholder input, and changing priorities to maintain relevance.

Organizational resistance from development teams, often stemming from perceived added overhead or conflicts with velocity goals, further complicates SLI adoption. SRE-led workshops, such as those introducing SLIs through practical exercises on user-focused metrics and error budgets, foster buy-in by demonstrating tangible benefits like balanced risk and improved decision-making.

As of 2025, evolving trends include integrating machine learning for predictive SLI adjustments, where models analyze historical patterns to proactively tune indicators and anticipate violations, enhancing resilience in dynamic environments. Additionally, post-pandemic demands for robust remote services have amplified the need for SLIs that account for hybrid work patterns, such as variable latency in distributed access, prompting adaptations in SLI definitions to ensure consistent performance amid increased remote reliance.
