The Engineer’s Guide to Observability: Making Metrics, Logs, and Traces Work for You

How to Measure, Monitor, and Master Your Systems

Byte-Sized Design
Jan 19, 2025


🚀 TL;DR 🚀


Metrics, logs, and traces are the holy trinity of observability, but combining them into a coherent, actionable system is where most teams falter. Metrics tell you what’s happening, logs explain why, and traces reveal how it all connects. Crafting an observability strategy that scales with your system is key to reducing cognitive load and helping engineers focus on solving problems instead of chasing them.


What Will We Dive Into Today? 📖

  1. Why “Monitoring” Alone Isn’t Enough

  2. Metrics: The First Line of Defense

  3. Logging: Telling the Full Story

  4. Traces

  5. How to Assess Whether Your Metrics, Logging, Traces, and Alerts Are Good Enough (Paid)

  6. Specific Metrics To Monitor and Alarm on (Paid)

    • API Services

    • Front End Websites

    • Databases


Why “Monitoring” Alone Isn’t Enough

Monitoring is like checking your vitals at the doctor. It tells you if your heartbeat is irregular or your blood pressure is high. But when something’s wrong, you need more: X-rays, MRIs, or blood tests to figure out the root cause. Observability is the full diagnostic toolkit, enabling you to dissect, understand, and fix the problem.

In distributed systems, a single request can be scattered across microservices, databases, and queues, and traditional monitoring often can’t keep up. Metrics alone might show a latency spike, but without logs and traces, you’re left guessing:

  • Is it a network issue?

  • A database lock?

  • A slow external API call?


Metrics

Metrics are time-series data points that show how your system behaves over time. They’re lightweight, real-time, and perfect for spotting anomalies. But metrics without context are like breadcrumbs without a trail.

Going Deeper

Raw metrics are a starting point. The real value lies in how you analyze and interpret them. For instance:

  • Rate, Count, and Distribution: Instead of just knowing there are 1,000 requests, track their rate per second and their distribution (e.g., 95th percentile latency).

  • Derived Metrics: Combine base metrics for insights, like error rate = (errors / total requests) × 100.
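
To make the derived-metrics idea concrete, here is a minimal sketch in Python that turns a window of raw request samples into a request rate, a 95th-percentile latency, and an error rate. The `RequestSample` type and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestSample:
    latency_ms: float
    is_error: bool

def summarize(samples: list[RequestSample], window_seconds: float) -> dict:
    """Turn raw request samples into rate, p95 latency, and error rate."""
    total = len(samples)
    errors = sum(1 for s in samples if s.is_error)
    latencies = sorted(s.latency_ms for s in samples)
    # quantiles(n=100) yields the 1st..99th percentiles; index 94 is the 95th.
    p95 = quantiles(latencies, n=100)[94] if total >= 2 else (latencies[0] if latencies else 0.0)
    return {
        "requests_per_second": total / window_seconds,
        "p95_latency_ms": p95,
        "error_rate_percent": (errors / total) * 100 if total else 0.0,
    }

# Example: 1,000 requests observed over a 60-second window, 10 of them errors.
samples = [RequestSample(latency_ms=120.0, is_error=False)] * 990 + \
          [RequestSample(latency_ms=2400.0, is_error=True)] * 10
print(summarize(samples, window_seconds=60.0))
```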

The Challenges of Metrics

Metrics are lightweight, but they’re also shallow. A spike in error rates might tell you there’s a problem but not why. That’s where logs come in.


Logging

Logs are the textual records of events happening inside your system. If metrics tell you that errors are increasing, logs tell you what those errors are.

Structured Logging Is Non-Negotiable

Unstructured logs are like a messy filing cabinet: all the information is there, but good luck finding it. Structured logging, on the other hand, formats logs in a machine-readable way (e.g., JSON). This makes searching and correlating logs across services much easier.
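
As a minimal sketch of what structured logging can look like, the snippet below wires a JSON formatter into Python’s standard logging module. The field names (request_id, user_id, resource) are illustrative choices, not a prescribed schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single machine-readable JSON line."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry along structured context passed via the `extra=` argument.
        for key in ("request_id", "user_id", "resource"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment authorized", extra={"request_id": "req-123", "user_id": "u-42"})
```

Because every line is a self-contained JSON object, a log aggregator can filter on `request_id` or `level` directly instead of grepping free text.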

What to Log

  1. Errors and Exceptions: The “what went wrong” moments.

    • Include stack traces, timestamps, and request IDs.

  2. Key Actions: Record significant events like user logins, database queries, or cache misses.

    • Add context: Which user? Which resource?

  3. State Changes: Track when something transitions—service restarts, feature toggles, etc.

Common Pitfalls in Logging

  1. Overlogging: Logging every little detail clutters your system and drives up storage costs. Use log levels (INFO, WARN, ERROR) wisely.

  2. Underlogging: Missing critical context can render logs useless. Always include request IDs or correlation IDs to tie events together.
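
One common way to guard against underlogging, sketched below with Python’s contextvars and a logging filter, is to set a correlation ID once at the edge of a request and have it attached to every record automatically. The names `correlation_id` and `handle_request` are hypothetical.

```python
import logging
import uuid
from contextvars import ContextVar

# Holds the correlation ID for the request currently being handled.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every record the handler emits."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter("%(levelname)s %(correlation_id)s %(message)s"))
logging.basicConfig(level=logging.INFO, handlers=[handler])
log = logging.getLogger("orders")

def handle_request() -> None:
    # Set once at the edge of the system; every log line in this request carries it.
    correlation_id.set(str(uuid.uuid4()))
    log.info("order received")
    log.warning("inventory running low")

handle_request()
```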

Logs tell a richer story than metrics, but they’re still missing the bigger picture: How does this error fit into the broader system?


Traces

Distributed tracing is the unsung hero of observability. Traces follow a single request as it travels through your system, showing every service, every database query, and every queue it touches.

How Traces Work

  1. Trace IDs: Each request gets a unique ID, passed along as it traverses the system.

  2. Spans: Each operation (e.g., a database query, API call) within the request is logged as a “span.”

  3. Visualization: Traces stitch spans together into a timeline, showing where bottlenecks occur.
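
Here is a rough sketch of those three pieces using the OpenTelemetry Python SDK: a parent span for the request and a child span for the database query, sharing one trace ID. It assumes the opentelemetry-sdk package is installed and uses a console exporter purely for demonstration; a real service would export to a collector or tracing backend instead.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints finished spans to the console (demo only).
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

def handle_checkout(order_id: str) -> None:
    # Parent span: one per incoming request. The trace ID is generated here
    # and shared by every child span (and, via headers, by downstream services).
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        load_order(order_id)

def load_order(order_id: str) -> None:
    # Child span: the database query shows up as its own segment in the timeline.
    with tracer.start_as_current_span("db.load_order") as span:
        span.set_attribute("db.statement", "SELECT * FROM orders WHERE id = ?")
        # ... execute the query ...

handle_checkout("order-123")
```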

When Tracing Shines

Imagine your metrics show that latency has doubled. Your logs reveal that the database is taking longer to respond. But your traces? They pinpoint which query in the database is slow, which microservice is making that query, and which user action triggered it.


Best Practices for Lead Engineers to Assess Observability Effectiveness

Keeping your system’s observability "good enough" takes continuous evaluation and refinement. As a lead engineer, your goal is to validate whether your metrics, logging, and alerts provide the right level of visibility to detect and resolve issues efficiently. The following best practices will help you assess and improve observability across your systems.
