Is OpenTelemetry Excessive?

Sun, 27 Nov 2022 08:00:00 +0000

This article is a brief account of my experience setting up, operating, and using OpenTelemetry on a very small software development project. I reach the surprising conclusion that it's probably worthwhile much earlier and at much smaller scales than you might expect.

The project in question was the back end for a proof-of-concept mobile app that I worked on as part of my day job. This wasn't even a Minimum Viable Product, more of an experiment to demonstrate what an MVP might look like. When I adopted OpenTelemetry I was worried that it might be adding needless complexity and overhead to a very basic app, but to my surprise and delight it paid for itself several times over.

OpenTelemetry

OpenTelemetry describes itself as:

“High-quality, ubiquitous, and portable telemetry to enable effective observability”

It's pitched as a tool for tackling enterprise-grade-highly-distributed-microservice-enabled complexity: the sort of thing that Charity, Liz, and Jessica talk about on the O11ycast.

Concretely, it's a set of standards for

adding diagnostic events to an application (called “instrumenting”)
filtering, transforming, and delivering those events to a variety of back ends

as well as

open-source libraries implementing those standards for various programming languages and runtimes, databases, etc.
open-source and proprietary tools for collecting and analyzing the diagnostic events your application is producing

Once you've set it up, you can turn on “auto-instrumentation” for common software components, which ended up being very valuable.

What I put into it

Unfortunately, it's not all good news: setting up OpenTelemetry was more work than I was expecting. The Node.js libraries are complex (and seem to be in a state of flux?). There's a lot of configuration and setup. The library's interface is also more complicated (and quite a bit more powerful) than console.(log|info|error|debug), which is what I would usually be using. This all took work and precious time to learn.

I ended up sending logs to stdout as nicely formatted JSON. More sophisticated setups are available, but this 12-factor sort of approach served me well in development (Docker Compose, where I could inspect the logs with docker-compose logs) and in production (systemd services on EC2, where I used journalctl).
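As a rough illustration of the stdout approach, here is a minimal sketch of a logger that writes one JSON object per line. The helper names (formatLogLine, log) and the record fields are hypothetical, chosen for this example; they don't come from the project or from any OpenTelemetry package.

```typescript
// Hypothetical helpers for illustration; not part of any OpenTelemetry package.
type Level = "debug" | "info" | "warn" | "error";

interface LogRecord {
  timestamp: string;
  level: Level;
  message: string;
  attributes: Record<string, unknown>;
}

// Build one log record as a single line of JSON.
// `now` is a parameter so the timestamp can be fixed in tests.
function formatLogLine(
  level: Level,
  message: string,
  attributes: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  const record: LogRecord = {
    timestamp: now.toISOString(),
    level,
    message,
    attributes,
  };
  return JSON.stringify(record);
}

// Write the record to stdout and let the process supervisor
// (Docker Compose, systemd) collect it.
function log(
  level: Level,
  message: string,
  attributes: Record<string, unknown> = {},
): void {
  console.log(formatLogLine(level, message, attributes));
}

log("info", "server started", { port: 3000 });
```

Because each record is a single line of JSON, the output of docker-compose logs or journalctl can be piped through ordinary line-oriented tools for filtering.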

What I got out of it

Once I got the SDK configured properly and wrapped my head around how to use it, I was able to instrument my own code, which was valuable as expected. What I wasn't expecting was the comprehensive auto-instrumentation for things like Node.js's HTTP stack and the PostgreSQL client.

This let me inspect the details of:

every HTTP request that came in to my app
every HTTP request it sent to third-party services
the content, parameters, and timing of every database query
uncaught exceptions
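To make the kind of data in that list concrete, here is a hand-rolled sketch of a span-like record: a name, a duration, some attributes, and any exception that was thrown. This is not the OpenTelemetry API, and the names withSpan and SpanRecord are hypothetical; the real libraries record considerably more (trace and span IDs, parent/child relationships, and so on) and do so without wrapping each call by hand.

```typescript
// Hand-rolled illustration; names are hypothetical, not the OpenTelemetry API.
interface SpanRecord {
  name: string;
  durationMs: number;
  status: "ok" | "error";
  attributes: Record<string, unknown>;
  exception?: string;
}

// Run `fn`, time it, capture any thrown error, and pass the record to `emit`.
// By default the record is printed to stdout as one line of JSON.
async function withSpan<T>(
  name: string,
  attributes: Record<string, unknown>,
  fn: () => Promise<T>,
  emit: (record: SpanRecord) => void = (r) => console.log(JSON.stringify(r)),
): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    emit({ name, durationMs: Date.now() - start, status: "ok", attributes });
    return result;
  } catch (err) {
    emit({
      name,
      durationMs: Date.now() - start,
      status: "error",
      attributes,
      exception: err instanceof Error ? err.message : String(err),
    });
    throw err; // re-throw so callers still see the failure
  }
}

async function main(): Promise<void> {
  const rows = await withSpan(
    "db.query",
    { statement: "SELECT * FROM users WHERE id = $1", parameters: [42] },
    async () => [{ id: 42 }], // stand-in for a real database call
  );
  console.log(rows.length);
}

main();
```

Auto-instrumentation amounts to having a library apply this sort of wrapping to the HTTP stack and database client for you, so that the query text, parameters, timing, and failures are recorded without changes to application code.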

That level of detail helped me catch and fix:

several minor-but-subtle bugs and misconfigurations in my own code
request parameter mismatches coming from the mobile app
a catastrophic bug in my auth middleware
problems in the SDKs for third-party services (I have no idea how I would have caught these without detailed HTTP tracing)

These were bugs that slipped past a decent test suite and TypeScript annotations, and I diagnosed them without modifying my app. That's the promise of observability: you can't predict what you should be recording, but if you're disciplined and systematic about instrumenting your code, you'll be able to figure everything out when you discover what you need.

This seemed like common sense for big complicated distributed systems, but I might be starting to believe it for small straightforward greenfield projects as well.

observability — Nat Knight
