Ilan Peleg serves as Lightrun’s CEO and is one of the company’s cofounders.
In software engineering, observability is the key to handling the complexities of modern architectures. With the growth of microservices and cloud-native components, we are witnessing a boom of new features and services being added every day. Also, as more organizations come online, it is becoming more critical than ever to mitigate the risks of production outages and respond to them more quickly.
Unfortunately, observability is hard to do correctly. Traditional tools focus on logging, metrics and tracing to capture the state of the system and alert the on-call engineer when things go awry. This can work when all of this data is available to the engineer, but oftentimes, some of it is missing for various reasons. For one, observability is often overlooked in the development process in favor of delivering features. Next, engineers may not know how much logging, metrics and tracing they need to add to their code. Finally, some of this telemetry may be suppressed due to cost.
In this article, I’ll look at some events where traditional observability tooling may fall short. Then, as founder and CEO of an observability provider, I’ll dive into the alternative approach using dynamic observability as part of the modern platform engineering discipline.
Case #1: Production Disasters
During a production disaster, no company wants to lose revenue due to downtime or lose its customers’ trust. Most engineering teams now have robust logging and metrics in place to catch these events. However, the trickier part is triaging the issue and doing a root cause analysis on the fly.
Let’s say that you notice HTTP 500 errors from your load balancer and isolate the issue to a set of backend services. You may have logging and even tracing in place to pinpoint the function that is not performing well. However, you may find that the function is not emitting enough information at the current logging level, or you may want to take a granular snapshot to understand the state of the system further.
In this case, you often have three options.
• Get breakglass roles and SSH into the machine to collect data manually.
• Add more logging, metrics or tracing instrumentation to the code and redeploy.
• Change the log level to debug or trace.
None of those options is ideal. For one, making changes manually via breakglass exposes the system to human error or unwanted side effects. Also, if you have a slow CI/CD pipeline, redeploying may take longer than you can afford during an incident. Finally, changing the log level is only helpful if more detailed logs already exist at those levels.
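To make the last point concrete, here is a minimal sketch using Python's standard logging module. The function names (charge, charge_with_debug), the logger name and the order ID are hypothetical and exist only for illustration. It shows that lowering the level to DEBUG changes nothing for a function that was never written with debug statements.

```python
import io
import logging


def make_logger(stream, level):
    """Build an isolated logger that writes to an in-memory stream."""
    logger = logging.getLogger("checkout_demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


def charge(logger, order_id, amount):
    # Only an INFO statement was written at development time.
    logger.info("charging order %s", order_id)
    return amount > 0


def charge_with_debug(logger, order_id, amount):
    # Same logic, but the developer anticipated the need for more detail.
    logger.debug("order %s amount=%r", order_id, amount)
    logger.info("charging order %s", order_id)
    return amount > 0


def capture(func, level):
    """Run func once at the given log level and return the emitted lines."""
    stream = io.StringIO()
    logger = make_logger(stream, level)
    func(logger, "A-17", 0)
    return stream.getvalue().splitlines()


if __name__ == "__main__":
    print(capture(charge, logging.INFO))
    print(capture(charge, logging.DEBUG))             # same output as above
    print(capture(charge_with_debug, logging.DEBUG))  # one extra DEBUG line
```

In this sketch, switching charge to DEBUG yields the same single line as INFO. The extra detail only appears for charge_with_debug, which is exactly the code that would have required a redeploy to add.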
Some mature engineering teams may have circuit-breakers or automatic load-shedding mechanisms in place to allow you to debug without impacting the overall system. Still, to understand more about the system, redeploying the code is often required to gather data.
Case #2: Rollout Of Complex Or Risky Features
With progressive delivery strategies, including feature flags and canary releases, we now have ways to mitigate widespread failures. However, in order for progressive delivery to be successful, engineering teams still need ways to quickly detect failures and roll back or minimize the impact. In other words, developers need robust observability.
The thing with complex or risky features is that it is hard to predict what may go wrong. Even with lots of testing, some things fail only in production. Catching race conditions or obscure issues at scale may only be possible (or economically feasible) in higher environments. Knowing this, it’s almost impossible to have all the logs and metrics in place for all scenarios.
So what ends up happening is that teams will roll out to a smaller subset of users, detect errors, then roll back to add new logs, redeploy and repeat until the issue is fixed. To the end user, the error might not be seen, but to dev teams, this is a cumbersome process to iterate just to collect a bit more data.
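The rollout loop described here can be sketched in a few lines of Python. This is an illustrative sketch, not any particular feature-flag product's API: the function names, the 5% error threshold and the minimum request count are all assumptions chosen for the example.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the canary group.

    Hashing the user ID gives a stable bucket from 0 to 99, so the same
    user stays in (or out of) the canary as the percentage grows.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def should_roll_back(errors: int, requests: int,
                     max_error_rate: float = 0.05,
                     min_requests: int = 20) -> bool:
    """Decide whether the canary's error rate justifies a rollback.

    The threshold and minimum sample size are assumed values; a real team
    would tune them to its own traffic and risk tolerance.
    """
    if requests < min_requests:
        return False  # too little traffic to judge
    return errors / requests > max_error_rate


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    canary = [u for u in users if in_canary(u, 10)]
    print(f"{len(canary)} of {len(users)} users see the new feature")
    print("roll back?", should_roll_back(errors=3, requests=20))
```

Note what the sketch does not do: it can tell you that the canary is failing, but not why. That gap is where the add-logs-and-redeploy cycle begins.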
Case #3: Major Changes To The Architecture
Much like a risky feature rollout, a major migration or change of architecture depends heavily on observability. This could be a team migrating from on-prem to the cloud or rewriting a monolith as a set of microservices to make their application scalable.
In these use cases, observability tooling often works very differently. The team may be using a different deployment model or even a different language in the rewrite. So during the migration process, logs and metrics may be missed or may not work the same way as they did before. Again, this requires the team to backfill logs and metrics as gaps are discovered during testing or canary releases. This is a slow and cumbersome process.
Alternative Approach With Dynamic Observability
The common theme amongst all of these scenarios is that when a piece of observability is missing, developers have to add it and redeploy, or replicate the issue in another environment, wasting a lot of time in the process. Ideally, developers should be able to add it in real time and skip that wait.
To use dynamic observability to your advantage, here are a few considerations.
Adopting dynamic observability usually makes sense when developers and R&D managers feel they cannot resolve production and pipeline issues in their complex apps within the expected timeline. This often happens for one of three reasons: developers lack proper access to the remote apps, they cannot easily and effectively replicate the target environments where the issues occur, or the available workarounds raise security and scalability concerns that are extremely challenging for the organization to manage. For example, allowing port-forwarding to remote systems is a security concern, but without dynamic observability, organizations may have to consider it so that developers can debug remotely.
Using dynamic observability requires buy-in from R&D leadership and synchronization with the IT/Ops organization leaders. It works best, and delivers the benefits described here, when there is full cooperation. When one side is not fully on board or does not understand the core benefits, problems can occur. These are best solved through communication. Explain what you are solving, why you need it and how it will help.
Dynamic observability addresses this gap. It allows developers to inspect and understand the state of a running system in a read-only manner, without having to redeploy the application or replicate the issue elsewhere. This can translate into greater product quality, faster resolution of customer-critical issues and greater developer productivity.
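As a toy illustration of the idea, the Python sketch below attaches a read-only snapshot to a function while the process is running and detaches it afterward. It is a simplified assumption of how such a hook could look, not how any commercial dynamic observability product is implemented; production tools typically rely on agents and add safeguards this sketch omits. The names attach_snapshot, service and unit_price are hypothetical.

```python
import functools
import types


def attach_snapshot(owner, func_name, sink):
    """Wrap owner.func_name at runtime so each call is recorded in sink.

    The wrapper only observes arguments, results and errors; it never
    alters them. Returns a detach() function that restores the original.
    """
    original = getattr(owner, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        record = {"function": func_name, "args": args, "kwargs": kwargs}
        try:
            record["result"] = original(*args, **kwargs)
            return record["result"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise  # the caller still sees the original exception
        finally:
            sink.append(record)

    setattr(owner, func_name, wrapper)

    def detach():
        setattr(owner, func_name, original)

    return detach


def unit_price(total, qty):
    # Hypothetical business function that was deployed with no logging.
    return total / qty


# Stand-in for a module or service object in a running application.
service = types.SimpleNamespace(unit_price=unit_price)

if __name__ == "__main__":
    snapshots = []
    detach = attach_snapshot(service, "unit_price", snapshots)
    service.unit_price(10, qty=4)
    try:
        service.unit_price(10, qty=0)
    except ZeroDivisionError:
        pass
    detach()
    for snap in snapshots:
        print(snap)
```

The point of the sketch is the workflow rather than the mechanism: the failing inputs are captured from the live process, and the instrumentation is removed again, without a code change or a redeploy.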