SLI, SLO, SLA and Error Budget

In DevOps, the goal is to deliver projects in the best possible way and create the best experience for the user. But to know whether we are achieving this goal, we need to measure.

What we think is good is not always actually good.

SLI, SLO, SLA, and Error Budget help align expectations with your user. They are the pillars for measuring the delivery and quality of any IT service.

To be successful, you need to align these service levels with your team. They are used to measure and set goals for the quality of the application and its infrastructure, and to agree with the user on the performance and availability of the product we are going to deliver.

Efficiency and reliability are key pieces of the DevOps world. They are the difference between a good service and an excellent one.

Working with these concepts is part of continuously improving an application.

SLI (Service Level Indicator)

The service level indicator (SLI) is a metric that shows whether the system is working properly. Which metrics matter depends on the type of system, so put yourself in the user's shoes: what would a well-functioning system look like? Below are some metrics that could be used to measure performance. You don't need to use all of them; they are only ideas, since every system is different.

Page load time. A wait of more than 3 seconds can make a customer give up on a purchase. We could create a metric to measure the percentage of times the page loads in less than 2 seconds. What could we do to improve this percentage?
- Reduce image size?
- Improve server infrastructure?
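
The page load metric described above can be sketched in a few lines. This is only an illustration: the function name `page_load_sli` and the sample load times are hypothetical, and a real system would read the timings from its monitoring tool.

```python
def page_load_sli(load_times_s, threshold_s=2.0):
    """Return the percentage of page loads faster than threshold_s seconds."""
    if not load_times_s:
        raise ValueError("no page loads recorded")
    fast = sum(1 for t in load_times_s if t < threshold_s)
    return 100.0 * fast / len(load_times_s)


# Hypothetical sample of measured load times, in seconds.
samples = [0.8, 1.2, 1.9, 2.4, 3.1, 1.1, 0.9, 1.7, 2.0, 1.5]

print(f"{page_load_sli(samples):.1f}% of loads finished in under 2 seconds")
```

Tracking this percentage before and after a change (smaller images, a bigger server) shows whether the change actually helped the user.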

Examples:

Availability: Percentage of time the service was available (uptime).
- Example metric value: 99.9%.
Latency: API response time (in milliseconds), often tracked at a percentile such as p95 rather than as an average.
- Example metric value: p95 ≤ 200ms.
Error/Success Rate: Percentage of successful requests.
- Example metric: HTTP 2xx responses / total requests.
Failure Rate/Error Percentage: Percentage of failed requests (HTTP 4xx and 5xx) in relation to total requests.
- Example: Errors ≤ 1%.
Throughput: Number of requests processed per second (RPS).
- Example metric value: 500 RPS.
Capacity: Average resource utilization, such as CPU, memory, or database.
- Example: CPU usage ≤ 75%.
Saturation: Percentage of queue or simultaneous connection usage.
- Example: ≤ 90% of simultaneous connections.
Retention: Percentage of users who return to the service.
- Example metric value: ≥ 70% of users return within 30 days.
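
As a rough sketch of how a few of these indicators could be computed from a request log: the `Request` record, the function names, and the sample data are all assumptions made for illustration, and the functions expect a non-empty log. The p95 uses the simple nearest-rank method; monitoring tools may use a different percentile definition.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # response time in milliseconds


def success_rate(requests):
    """Percentage of requests answered with a 2xx status."""
    ok = sum(1 for r in requests if 200 <= r.status < 300)
    return 100.0 * ok / len(requests)


def error_rate(requests):
    """Percentage of requests answered with a 4xx or 5xx status."""
    failed = sum(1 for r in requests if 400 <= r.status < 600)
    return 100.0 * failed / len(requests)


def p95_latency(requests):
    """95th percentile latency, nearest-rank method."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = (95 * len(ordered) + 99) // 100  # integer form of ceil(0.95 * n)
    return ordered[rank - 1]


# Hypothetical log: 18 successes, one 404, one 500, latencies from 50 to 240 ms.
statuses = [200] * 18 + [404, 500]
log = [Request(status=s, latency_ms=50 + 10 * i) for i, s in enumerate(statuses)]

print(f"success rate: {success_rate(log):.1f}%")
print(f"error rate:   {error_rate(log):.1f}%")
print(f"p95 latency:  {p95_latency(log)} ms")
```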

SLO (Service Level Objective)

The SLO is the target we want an SLI to reach. For example, we could set the SLO for the main page load SLI at 90%, meaning that 90% of page loads finish in less than 2 seconds.

100% doesn't exist; it's a utopia! Traffic spikes, hardware problems, and attacks all happen. Let's be realistic: not even AWS guarantees 100%, let alone us.

SLO Examples:

Availability: The system must be available at least 99.95% of the time in a month.
Latency: 95% of requests must have latency less than or equal to 200ms.
Error/Success Rate: At least 99.9% of requests must return 2xx codes.
Failure Rate: The percentage of 5xx errors must be less than or equal to 0.5% per week.
Throughput: The system must process at least 10,000 RPS during peak usage.
Capacity: CPU/memory utilization must not exceed 80% for 99% of the time.
Saturation: Message queues must not exceed 90% utilization for more than 5 consecutive minutes.
Retention: At least 75% of users must return to use the service within 30 days.
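
One possible way to check objectives like these against measured SLIs is sketched below. The `SLO` class, the metric names, and the measured values are hypothetical; the targets are taken from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float
    higher_is_better: bool = True  # False for metrics such as latency

    def is_met(self, measured):
        """Compare a measured SLI value against the target."""
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


slos = [
    SLO("availability_pct", 99.95),
    SLO("p95_latency_ms", 200, higher_is_better=False),
    SLO("success_rate_pct", 99.9),
]

# Hypothetical SLI values measured over the last month.
measured = {
    "availability_pct": 99.97,
    "p95_latency_ms": 240.0,
    "success_rate_pct": 99.92,
}


def evaluate(slos, measured):
    """Return a mapping of SLO name to True (met) or False (missed)."""
    return {slo.name: slo.is_met(measured[slo.name]) for slo in slos}


for name, ok in evaluate(slos, measured).items():
    print(f"{name}: {'met' if ok else 'MISSED'}")
```

Keeping the direction of each metric explicit avoids a common mistake: treating a latency of 240 ms as "above target, therefore good".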

SLA (Service Level Agreement)

SLAs are service level agreements. They are formal contracts between service providers and their customers that detail the entire promised service, performance standards, and consequences if the promise is not fulfilled.

When the agreed level is not met, the consequences usually include penalty payments, invoice discounts, or even contract termination.

The SLA is commonly set at a less strict value than the internal SLO, so that the team has room to work and a margin of error before the contract is breached.
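
To illustrate the margin between the two values, here is a small sketch. The function `classify_availability` and the default values (an internal SLO of 99.95% and a contractual SLA of 99.9%) are assumptions chosen for the example, not recommendations.

```python
def classify_availability(measured_pct, slo_pct=99.95, sla_pct=99.9):
    """Place a measured availability relative to the internal SLO and the SLA."""
    if sla_pct > slo_pct:
        raise ValueError("this sketch assumes the SLA is no stricter than the SLO")
    if measured_pct >= slo_pct:
        return "SLO met"
    if measured_pct >= sla_pct:
        return "SLO missed, SLA still honoured"
    return "SLA breached"


# Hypothetical monthly availability figures.
for value in (99.97, 99.92, 99.80):
    print(f"{value}% -> {classify_availability(value)}")
```

The middle case is the margin at work: the team has an internal warning to act on before any penalty applies.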

Error Budget

The Error Budget is the amount of failure or unavailability that a system can accumulate in a given period of time without violating the SLO.

For example:

Availability SLO: 99.9%.
Total time in a month: 30 days = 43,200 minutes.
Allowed downtime (Error Budget): 43,200 x (1 - 0.999) = 43.2 minutes.

In this case, the system can be unavailable for up to 43.2 minutes in the month without breaking the SLO. In other words, it's the slack we have.
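
The same arithmetic can be wrapped in a small helper to compare different SLOs. The function name `error_budget_minutes` is hypothetical, and the 30-day month follows the example above.

```python
def error_budget_minutes(slo_pct, period_days=30):
    """Minutes of downtime allowed in the period without breaking the SLO."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_pct) / 100


for slo in (99.0, 99.9, 99.95):
    budget = error_budget_minutes(slo)
    print(f"SLO {slo}% -> {budget:.1f} minutes per 30 days")
```

Each extra "nine" cuts the budget roughly tenfold, which is why very high targets are so expensive to keep.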

The error budget encourages a balance between innovation and stability. While budget remains, you can prioritize delivering new features even if that increases the risk of instability; once it is spent, the priority shifts to stability. It helps the team make data-driven decisions and define operational limits.
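
One way to turn that balance into a simple rule is sketched below. The function `release_decision`, its messages, and the 25% caution threshold are assumptions made for illustration; each team would agree on its own policy.

```python
def release_decision(budget_min, downtime_min, caution_below=0.25):
    """Suggest a release posture from the share of error budget still unspent."""
    if budget_min <= 0:
        raise ValueError("budget_min must be positive")
    remaining_fraction = (budget_min - downtime_min) / budget_min
    if remaining_fraction <= 0:
        return "budget exhausted: focus on reliability work"
    if remaining_fraction < caution_below:
        return "budget nearly spent: ship only low-risk changes"
    return "budget available: keep shipping features"


# Hypothetical downtime totals against a 43.2 minute monthly budget.
for downtime in (10.0, 40.0, 50.0):
    print(f"{downtime} min used -> {release_decision(43.2, downtime)}")
```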

Knowing how to define SLIs and SLOs is important in the DevOps role and can be a great differentiator.

There is no point in starting with many SLIs and SLOs. Define a few, choosing the ones that matter most to the project, stakeholders, and customers. Over time, as you get to know the team's capacity better, the targets should become increasingly challenging, and meeting them should be EVERYONE's responsibility.

The tip is to start with more modest SLOs and adjust them over time, and never to promise an SLA that you cannot fulfill.
