Technical debt is the most frequently used buzzword in every engineering organization I’ve had the pleasure to work with in the last fifteen years. A lot of effort goes into resolving it, often with little to show to non-engineers. While the term is easy to understand (thank you, Ward Cunningham!), it is quite challenging to grasp what it means for your code or systems and, consequently, nearly impossible to say when the work on it will be complete. Reducing technical debt successfully requires framing, scoping, and a clear focus – just as we now expect of every product topic. This article aims to give you some ideas of what matters when you work on technical debt, to help you separate the more important topics from the less important ones, and, as a result, to make it easier for you to gain the management support you need to resolve it.
The challenge: 100 million lines of code
Bonial was about ten years old when I joined as CTO, and it still has many traits of a startup, like highly engaged people who collaborate closely across functions and react quickly to market needs. It’s part of our success, but it also means that we have collected a decent amount of half-baked solutions over the years – not only technical ones but also feature- and business-related ones. We’ve already spent quite a lot of time getting rid of them and have even made progress, like moving entirely to the cloud. However, it still feels like we’re fighting a lot of unnecessary complexity, and we’re spending a lot of time maintaining things that should not be there in the first place. Somehow, there always seem to be more battles than we can handle.
This raises some obvious questions: Do we need to put more effort into resolving technical debt, for example by finally shutting down the database from the early days of the company that is still used to manage users? Which legacy deprecation projects should we prioritize? And how will we know that we’re successful, in the sense that we’ve created value for the company?
Let’s not forget that many of the systems we’re trying to get rid of once helped us get where we are today and are still heavily used. Shutting them down is not an option, and re-building them typically takes the same amount of time that went into building them in the first place. So why should we work on something risky and costly at best?
Scores of engineers have struggled to explain this to their managers and ended up spending their time fixing what they thought needed to be corrected instead. There are two problems with this approach: the more obvious one is that it undermines trust in engineering; what’s less obvious is that it limits what you can achieve if you only work on technical debt from an engineering perspective. Technical debt often goes hand in hand with feature debt or business debt – features or processes that are outdated or were hacks in the first place and can be replaced or simplified. Working with product managers and stakeholders will make your legacy removal project much easier and more effective, but it also requires convincing them of the value of removing technical debt.
The value of removing technical debt
The value of re-engineering legacy systems comes mainly from three perspectives: the time spent on maintenance and operations, the risk of not fixing them, and the lead time it takes to change them.
All software that is used in production requires maintenance – patching software to fix or prevent security issues, updating a library that is no longer maintained, changing little things to meet legal requirements, you name it. What makes quickly built legacy systems hard to modify is the missing knowledge of how they work and the lack of automation, tests, and documentation. In other words: the shortcuts taken while building a system will cost you each time you make a change – technical debt drives maintenance cost.
It’s good practice to track the effort spent on maintenance, and the data is easy to gather. There’s no universal answer to what is correct, but I consider teams spending more than 20% of their time operating and maintaining their systems as a sign to dig deeper. Engineering time is expensive, and if you can save 10% of your team’s time by reducing technical debt, that amounts to significant value, especially since you can use that time to build new, great products!
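As a minimal sketch of that measurement, assuming your time tracking can be exported as hours per category: the category names, the sample numbers, and the helper names below are hypothetical, and the 20% threshold is only the rule of thumb mentioned above.

```python
def maintenance_share(hours_by_category,
                      maintenance_categories=("maintenance", "operations")):
    """Return the fraction of logged hours spent on maintenance and operations.

    The category names are assumptions; use whatever your tracking tool exports.
    """
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    maintenance = sum(hours_by_category.get(c, 0) for c in maintenance_categories)
    return maintenance / total


def needs_closer_look(share, threshold=0.20):
    """Flag a team whose maintenance share exceeds the rule-of-thumb threshold."""
    return share > threshold


# Hypothetical hours logged by one team over a few sprints.
team_hours = {"features": 210, "maintenance": 60, "operations": 30}

share = maintenance_share(team_hours)
print(f"Maintenance share: {share:.0%}")        # prints "Maintenance share: 30%"
print("Dig deeper:", needs_closer_look(share))  # prints "Dig deeper: True"
```

The absolute number matters less than the trend: tracked per team and per quarter, it shows whether work on technical debt actually frees up time.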
Some of our legacy systems still carry a considerable part of our business, and if they failed, we would quickly lose users and revenue. We need to keep them alive, but changing them requires a delicate touch, and there are only a few people left in the company who know how they work in detail. This is measurable as well: look at the number of committers you have for your production systems and critical parts of your infrastructure (assuming that your infrastructure is managed as code). If only one or two people have made commits in the last 12 months, you had better take good care of them.
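One way to count active committers, sketched under the assumption that you export the history of each repository with something like `git log --format='%ae|%aI'` (author email and ISO 8601 author date, one line per commit). The function name and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def active_committers(log_lines, now, window_days=365):
    """Return the set of author emails with at least one commit in the window.

    Each line is expected to look like 'email|ISO-8601 date with UTC offset'.
    """
    cutoff = now - timedelta(days=window_days)
    authors = set()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        email, _, stamp = line.partition("|")
        if datetime.fromisoformat(stamp) >= cutoff:
            authors.add(email.lower())
    return authors


# Hypothetical export for one legacy repository.
log = [
    "alice@example.com|2024-05-02T09:15:00+02:00",
    "alice@example.com|2024-01-10T14:00:00+01:00",
    "bob@example.com|2022-11-20T08:30:00+01:00",
]
now = datetime(2024, 6, 1, tzinfo=timezone.utc)

committers = active_committers(log, now)
print(len(committers), "active committer(s):", sorted(committers))
```

Counting by email is a simplification: people who commit under several addresses are counted more than once, and a single commit does not mean someone understands the system in depth.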
Commercially speaking, you are consciously taking the risk of losing the business that depends on that legacy system. The likelihood that this risk materializes depends significantly on the number of people who can keep the system running. Unless you find a non-engineering way to mitigate this risk, you can add the full value of the business running on that system to your value calculation.
Finally, there’s the lead time necessary to change a system. How relevant lead time is as a factor depends on the number of changes, or rather the number of changes you would implement if it were feasible to do so. We all know that a clean architecture, a simple design, tests covering the prominent use cases, and an automated build and deployment pipeline have a tremendous impact on lead time and allow us to implement new features much faster. How much is hard to tell, but the State of DevOps reports demonstrated a direct link between organizational performance and software delivery performance through statistical analysis. They found that only four key metrics differentiate between low, medium, and high performing companies: lead time, deployment frequency, mean time to restore (MTTR), and change fail percentage. As a proxy for the lead time for building features, we can use the time it takes to go from code commit to code successfully running in production. Less than an hour is the goal for us. Everything beyond a day harms our ability to adapt to client needs and the market fast enough.
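The commit-to-production proxy can be computed from pairs of timestamps, for instance taken from your deployment pipeline. The sketch below is illustrative only: the function names, the labels, and the sample data are assumptions, while the one-hour goal and the one-day limit are the ones named above. The median is used so that a single stuck deployment does not dominate the result.

```python
from datetime import datetime, timedelta
from statistics import median


def median_lead_time(deployments):
    """Median time from commit to running in production.

    `deployments` is a non-empty list of (committed_at, deployed_at) pairs.
    """
    seconds = median((deployed - committed).total_seconds()
                     for committed, deployed in deployments)
    return timedelta(seconds=seconds)


def classify(lead_time, goal=timedelta(hours=1), limit=timedelta(days=1)):
    """Compare a lead time with the goal (one hour) and the limit (one day)."""
    if lead_time < goal:
        return "on target"
    if lead_time <= limit:
        return "above goal"
    return "too slow"


# Hypothetical deployments of one system: (commit time, production time).
deployments = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 9, 40)),   # 40 minutes
    (datetime(2024, 3, 5, 10, 0), datetime(2024, 3, 5, 15, 0)),  # 5 hours
    (datetime(2024, 3, 6, 8, 0), datetime(2024, 3, 7, 14, 0)),   # 30 hours
]

typical = median_lead_time(deployments)
print(typical, "->", classify(typical))  # prints "5:00:00 -> above goal"
```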
To answer the initial questions: we should focus our attention on legacy systems where only a few people can reliably make changes, that take significant effort to maintain, and that take more than a few hours to change. We can measure these things by looking at the number of active committers, the maintenance effort, and the lead time (using the time from code commit to the working system in production as a proxy). These metrics help us prioritize what to work on. They also help us understand the value that removing technical debt creates for the company, and ultimately judge whether we are succeeding at it.
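Putting the three metrics side by side gives a rough first ordering of candidates. The following is a sketch under stated assumptions, not a scoring model: the system names and numbers are invented, the committer and maintenance thresholds follow the rules of thumb used in this article, and the 24-hour lead-time threshold is an assumed default you should adjust.

```python
from dataclasses import dataclass


@dataclass
class SystemMetrics:
    name: str
    active_committers: int    # distinct committers in the last 12 months
    maintenance_share: float  # fraction of team time spent on maintenance
    lead_time_hours: float    # commit to production


def warning_signs(m, max_committers=2, max_share=0.20, max_lead_time_hours=24.0):
    """List which of the three criteria a system violates."""
    signs = []
    if m.active_committers <= max_committers:
        signs.append("few committers")
    if m.maintenance_share > max_share:
        signs.append("high maintenance effort")
    if m.lead_time_hours > max_lead_time_hours:
        signs.append("long lead time")
    return signs


def prioritize(systems):
    """Order systems by number of warning signs, most affected first."""
    return sorted(systems, key=lambda m: len(warning_signs(m)), reverse=True)


# Hypothetical systems and measurements.
systems = [
    SystemMetrics("search-api", 9, 0.10, 0.5),
    SystemMetrics("user-db", 1, 0.35, 72.0),
    SystemMetrics("billing", 2, 0.15, 6.0),
]

for m in prioritize(systems):
    print(m.name, warning_signs(m))
```

Such a list is only a starting point for the conversation with product management; the business value running on each system still has to be weighed separately.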
There are always many product topics waiting in the backlog to be built, and business success might well depend on them. Even if you consider your production systems healthy, it’s good practice to invest around 20% of your time in keeping your tools and knowledge sharp. Getting to that healthy state might demand considerably more investment. How much more is a discussion that needs to happen with management, based on the expected value of the work. As we have seen, maintenance cost (or, put more strongly, opportunity cost), lead time, and business risk are strong arguments to show that removing technical debt has, in fact, enormous value to the business.
So get some data, do the value calculation, and get your product management on board to tackle the next legacy system!