Observability Concepts

Observability represents a shift in how we monitor systems: it lets us understand a system from the outside, through the signals it emits. Unlike traditional monitoring, observability lets us ask questions about the system without necessarily knowing its internal workings in detail. This capability is especially important for identifying and resolving not only known problems but also "unknown unknowns", the problems we couldn't anticipate.

To achieve effective observability, systems need to be properly instrumented. Instrumentation is the code that makes an application emit signals describing its behavior. OpenTelemetry (OTel) initially focuses on three fundamental types of signals:

Traces: Allow visualizing the path of a request through different services
Metrics: Provide quantitative data about system performance
Logs: Record specific system events

It's important to note that although OpenTelemetry starts with these three pillars, its architecture was designed to expand and incorporate other types of telemetry as technology evolves.

Modern Computing Context

OpenTelemetry emerged as a response to the specific needs of the cloud-native era, where applications are characterized by:

Distributed architectures
Serverless components
Container execution
Intense communication between different services

In this context, distributed tracing gains special importance, as it allows visualizing and understanding data flow through multiple services and components.

Observability effectiveness can be directly measured by its impact on business indicators, especially:

Mean Time To Detection (MTTD): Average time to detect a problem
Mean Time To Resolution (MTTR): Average time to resolve a problem
Service availability, tracked against Service Level Objectives (SLOs)
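
As a rough illustration of how these indicators are computed, the sketch below derives MTTD, MTTR, and availability from a list of incidents. The incident records, the 30-day window, and the 99.9% SLO target are all made-up assumptions, and MTTR is measured here from the start of the incident (some teams measure it from detection instead).

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents within one 30-day window (values are made up).
incidents = [
    {
        "started": datetime(2024, 1, 14, 10, 0),
        "detected": datetime(2024, 1, 14, 10, 12),
        "resolved": datetime(2024, 1, 14, 11, 0),
    },
    {
        "started": datetime(2024, 1, 20, 8, 30),
        "detected": datetime(2024, 1, 20, 8, 34),
        "resolved": datetime(2024, 1, 20, 9, 0),
    },
]


def minutes_between(record, start_key, end_key):
    """Elapsed minutes between two timestamps of one incident record."""
    return (record[end_key] - record[start_key]).total_seconds() / 60


# Mean time from the start of an incident to its detection / resolution.
mttd = mean(minutes_between(i, "started", "detected") for i in incidents)
mttr = mean(minutes_between(i, "started", "resolved") for i in incidents)

# Availability over the window, counting each whole incident as downtime.
window_minutes = 30 * 24 * 60
downtime_minutes = sum(minutes_between(i, "started", "resolved") for i in incidents)
availability = 1 - downtime_minutes / window_minutes

slo_target = 0.999  # hypothetical availability objective
slo_met = availability >= slo_target

print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
print(f"Availability: {availability:.4%} (SLO met: {slo_met})")
```

With these sample values, MTTD is 8 minutes and MTTR is 45 minutes; the 90 minutes of downtime exceed what a 99.9% objective allows over 30 days, so the SLO is not met.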

Monitoring Evolution

To contextualize OpenTelemetry's importance, it's useful to compare with the traditional scenario:

Monolithic Applications:

Single process execution
Direct database communication
Centralized diagnostics
Simplified chronological analysis

Modern Applications:

Multiple distributed services
Asynchronous communication
Complex dependencies
Need for correlation between different signals

This evolution highlights why traditional monitoring tools are no longer sufficient for modern systems, and why we need a more sophisticated approach like OpenTelemetry.

Difficulties Finding the Problem

In a scenario where a monolith loses communication with the database, we would usually go straight to the logs, because they record everything that happened in chronological order and in one place.


The problem starts to appear when one system calls another system, which in turn calls another. Given the following structure, where could the problem be?

[Figure: Applications A, B, and C and the database]

Application A cannot communicate with the database.
Application B cannot communicate with Application A, which surfaces as a database failure.
Application C cannot communicate with Application A, which surfaces as a database failure.

And this gets increasingly complex as more services are added.

Suppose we also knew that requests from Application B through Application A always succeed. Then we would know that B and A work, that there is no problem between them, and that communication with the DB is up. We would therefore already know the problem is in Application C.

This is what traces help with: understanding the story across different components.

Logs tell us the history of a single process: what happened inside that application.
Metrics help us see the application's health.
Traces provide context: the path a request takes across the different distributed components.

Together, they make it possible to understand how components communicate with each other.
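
The reasoning above can be sketched in a few lines. The example assumes, purely for illustration, that every service attaches a shared trace ID to its log records; the record layout, the service names, and the "error" message prefix are hypothetical.

```python
from collections import defaultdict

# Hypothetical log records from three services. Each one carries the ID of
# the request (trace) it belongs to; "ts" is a simplified timestamp.
records = [
    {"ts": 3, "service": "app-a", "trace_id": "t1", "msg": "db query ok"},
    {"ts": 1, "service": "app-b", "trace_id": "t1", "msg": "calling app-a"},
    {"ts": 2, "service": "app-c", "trace_id": "t2", "msg": "calling app-a"},
    {"ts": 4, "service": "app-b", "trace_id": "t1", "msg": "request done"},
    {"ts": 5, "service": "app-c", "trace_id": "t2", "msg": "error: connection refused"},
]


def group_by_trace(records):
    """Rebuild the per-request story: records grouped by trace ID, in time order."""
    traces = defaultdict(list)
    for record in sorted(records, key=lambda r: r["ts"]):
        traces[record["trace_id"]].append(record)
    return dict(traces)


def failing_services(traces):
    """Services that logged an error in any trace."""
    return {
        record["service"]
        for trace in traces.values()
        for record in trace
        if record["msg"].startswith("error")
    }


traces = group_by_trace(records)
for trace_id, trace in traces.items():
    print(trace_id, " > ".join(f"{r['service']}: {r['msg']}" for r in trace))
print("failing:", failing_services(traces))
```

Read per service, these records are just unrelated lines. Grouped by trace ID, they show that the request from app-b through app-a completed, while the request from app-c failed before reaching app-a.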

Logs

A log is a temporal record of an event that occurred in a system. It's like a "digital note" that contains:

A timestamp (date and time of the event)
A message describing what happened
Usually, a severity level (such as ERROR, WARN, INFO, or DEBUG)

For example, when you log into a website, the system might generate a log like this:

2024-01-14 10:30:15 [INFO] User 'john123' logged in successfully

Logs are independent: they don't necessarily need to be connected to a specific user action or transaction. They can record any type of system event, from errors to routine operations.

An important characteristic of logs is that they are like "snapshots" of specific moments - they record what happened at that instant, but don't automatically show the path that led to that event or what happened afterwards.
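
A minimal sketch of producing a log line in the format shown above, using Python's standard logging module. The logger name "auth-demo" and the in-memory buffer are assumptions made so the example is self-contained; a real application would usually write to stderr or a file.

```python
import io
import logging

# Write to an in-memory buffer so the result can be inspected afterwards.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(
    logging.Formatter(
        "%(asctime)s [%(levelname)s] %(message)s",  # timestamp, severity, message
        datefmt="%Y-%m-%d %H:%M:%S",
    )
)

logger = logging.getLogger("auth-demo")  # hypothetical logger name
logger.setLevel(logging.INFO)            # records below INFO are dropped
logger.addHandler(handler)
logger.propagate = False                 # keep the record out of the root logger

logger.info("User '%s' logged in successfully", "john123")
logger.debug("Session cookie issued")    # filtered out by the INFO level

log_output = buffer.getvalue()
print(log_output, end="")
```

The emitted line carries the three parts listed earlier: a timestamp, a severity level, and a message. The DEBUG record never appears because the logger's level is set to INFO.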

Metrics

Metrics are numerical data representing the state or behavior of a system over time. They are quantitative measurements that, when collected and analyzed, help understand the performance, health, and resource utilization of an application or infrastructure.

Common Types of Metrics:

Counters: Numbers that only increase (e.g.: total requests received)
Gauges: Values that can increase or decrease (e.g.: active users)
Histograms: Distribution of values in intervals (e.g.: response time)
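
To make the difference between the three types concrete, here is a simplified sketch of each one. These classes are illustrative only and are not the API of OpenTelemetry or any other metrics library; the bucket boundaries and sample values are assumptions.

```python
from bisect import bisect_left


class Counter:
    """A value that only increases, e.g. total requests received."""

    def __init__(self):
        self.value = 0

    def add(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """A value that can go up or down, e.g. active users."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per interval, e.g. response times in ms."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        # One bucket per upper bound, plus a final overflow bucket.
        self.counts = [0] * (len(self.bounds) + 1)

    def record(self, value):
        # bisect_left picks the first bucket whose upper bound is >= value.
        self.counts[bisect_left(self.bounds, value)] += 1


requests_total = Counter()
active_users = Gauge()
response_time_ms = Histogram(bounds=[100, 250, 500])

for duration in [42, 100, 180, 510]:
    requests_total.add()
    response_time_ms.record(duration)

active_users.set(12)
active_users.set(9)  # a gauge may decrease

print(requests_total.value, active_users.value, response_time_ms.counts)
```

Here the counter ends at 4, the gauge at 9, and the histogram holds two observations of at most 100 ms, one between 100 and 250 ms, none between 250 and 500 ms, and one above 500 ms.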

Practical Examples:

Error rate per minute
CPU and memory usage
Average request response time
Number of connected users
Transactions per second
Available disk space
Processing queue size

It's also worth reading about observability service levels to understand where these numbers can help us.

Traces: Distributed Tracing

A trace is like a "story" that shows the complete journey of a request through a distributed system. Imagine a detailed timeline that records each step a request goes through, from the moment it enters the system until its completion.

Fundamental Concepts

A Trace is the complete journey of a request, showing the end-to-end flow through different services. It is composed of a series of connected spans and is identified by a unique trace ID.

A Span represents a single operation within the trace, such as an HTTP call, a database query, or internal processing. It contains crucial information:

Operation name
Timestamps (start and end)
Duration
Operation status
Attributes (metadata)
Important events during execution

Spans are organized in a parent-child structure. A parent span can have multiple child spans. Each child span represents a sub-operation of the parent span. For example, when you log into a website, a trace might show:

Receiving the login request
Credential validation
Database query
Access token generation
Response return

This structure allows quickly identifying where performance bottlenecks are or where errors occur in distributed systems.
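
The parent-child structure can be sketched as a small data model. This is an illustration of the concept, not the OpenTelemetry SDK: the Span class, its field names, the IDs, and the timings for the login example are all hypothetical, and the self-time calculation assumes that child spans do not overlap.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                    # shared by every span of the same trace
    span_id: str
    name: str                        # operation name
    start_ms: int
    end_ms: int
    parent_id: Optional[str] = None  # None marks the root span
    status: str = "OK"
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


# Hypothetical spans for the login example; times are offsets in milliseconds.
spans = [
    Span("t1", "s1", "POST /login", 0, 120),
    Span("t1", "s2", "validate credentials", 5, 80, parent_id="s1"),
    Span("t1", "s3", "database query", 10, 70, parent_id="s2"),
    Span("t1", "s4", "generate access token", 82, 110, parent_id="s1"),
]


def children_of(spans, parent_id):
    """Direct children of a span, ordered by start time."""
    return sorted((s for s in spans if s.parent_id == parent_id), key=lambda s: s.start_ms)


def render(spans, parent_id=None, depth=0):
    """Return the trace as indented lines, one per span."""
    lines = []
    for span in children_of(spans, parent_id):
        lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
        lines.extend(render(spans, span.span_id, depth + 1))
    return lines


def self_time_ms(spans, span):
    """Time spent in the span itself, excluding its (non-overlapping) children."""
    return span.duration_ms - sum(c.duration_ms for c in children_of(spans, span.span_id))


tree = render(spans)
print("\n".join(tree))

bottleneck = max(spans, key=lambda s: self_time_ms(spans, s))
print("bottleneck:", bottleneck.name)
```

Ranking spans by self time rather than total duration avoids always blaming the root span, whose duration includes everything beneath it. With these sample timings the database query stands out as the bottleneck.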

Let's look at a trace visually using Jaeger-UI. We won't go into details now, but the view below shows a call tree with parent-child relationships indicating who called whom.

[Figure: trace of a call to /purchase-order in Jaeger-UI]

Notice that the trace tells the story of a call to /purchase-order visually, as a tree. Alongside it, a timeline helps us understand when each step was processed.

What we're seeing here is what we call Context: everything that happened as a result of the main call, the total time it took to complete, and the time of each step.

Only after the call to /user/verify was completed in 67ms was the call to /update-stock made.

So we have something like /purchase-order > /user/verify > /update-stock.

With traces we can see what was executed sequentially and what was executed in parallel.
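
Whether two sibling spans ran sequentially or in parallel can be read from their timestamps: they ran in parallel if their time intervals overlap. The sketch below applies that check. Only the 67 ms duration of /user/verify comes from the trace above; the other timings and the /notify span are hypothetical.

```python
def overlaps(a, b):
    """True if the two spans were in progress at the same time."""
    return a["start_ms"] < b["end_ms"] and b["start_ms"] < a["end_ms"]


def relation(a, b):
    return "parallel" if overlaps(a, b) else "sequential"


# Offsets in milliseconds from the start of the /purchase-order call.
user_verify = {"name": "/user/verify", "start_ms": 5, "end_ms": 72}     # 67 ms
update_stock = {"name": "/update-stock", "start_ms": 72, "end_ms": 110}
notify = {"name": "/notify", "start_ms": 75, "end_ms": 100}             # hypothetical

print(relation(user_verify, update_stock))
print(relation(update_stock, notify))
```

Because /update-stock starts exactly when /user/verify ends, the first pair is reported as sequential. The hypothetical /notify span runs entirely within /update-stock, so the second pair is reported as parallel.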
