There are many analogies for technical debt, but the basic one is the clearest: the more technical debt you have, and the longer you leave it alone, the harder it is to pay off. And a little technical debt is okay, just like a manageable amount of financial debt is okay if it means you can get more done faster.
But why is there a distinction between technical debt – thought to be bad code or partially implemented systems – and the rest of your code? Is the rest of your code bug-free? Consider instead the idea that all code adds risk. The tradeoffs of implementing features become clearer because you can ask the question: how risky will this code be?
Investors and financial institutions are very concerned with risk. They have a good incentive to study it, as they'll lose all their money if they don't. There are two types of financial risk: systematic and unsystematic (or diversifiable) risk. Systematic risk is the risk of the underlying market, the risk that you can't make go away. Unsystematic risk is controlled by you. If you put all your investments in the stock of one company, you could lose your money because you are not diversified. If, instead, you diversify, your total risk begins to approach the baseline systematic risk.
Products are the same way. There is a baseline risk – after all, you are creating a new technology – and there are risks you assume by your choices. One goal I have is to reduce my unsystematic risk so that the business and the products I make are stable and reliable. So how do you reduce your unsystematic risk?
Delete code
If code is risk, deleting code should be a goal. Remove dead code. Don't keep it around because you might need it someday; that's what source control is for. If you think all code is risky, refactoring becomes more important because it focuses the code base and removes wasteful code.
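Finding candidates for deletion can itself be automated. Below is a minimal, hypothetical sketch using Python's ast module to flag functions that are defined but never referenced. Real tools (vulture, coverage data from production) do this far more thoroughly; this naive version misses references made via getattr, decorators, or module exports.

```python
import ast

# A toy module: `unused` is defined but never called anywhere.
SOURCE = """
def used():
    return 1

def unused():
    return 2

print(used())
"""

def unused_functions(source: str) -> set:
    """Return names of functions that are defined but never referenced."""
    tree = ast.parse(source)
    defined = {node.name for node in ast.walk(tree)
               if isinstance(node, ast.FunctionDef)}
    referenced = {node.id for node in ast.walk(tree)
                  if isinstance(node, ast.Name)}
    return defined - referenced

print(unused_functions(SOURCE))  # -> {'unused'}
```

Treat the output as a list of leads to investigate, not a list of safe deletions.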
Choose libraries wisely
With modern dependency management tools, it's trivial to add new dependencies. Remember: if you're using someone else's code, you're responsible for its use in your own software. Libraries you use should be non-trivial, actively developed, well tested, and well documented. They should also be as small as is reasonable: less code, less risk, and a codebase you can realistically read in its entirety if you need to. Large frameworks can increase risk.
When using other people's code, it is critical to understand what their invariants are, and what their guaranteed output is. Code contracts are a critical part of making quality, reliable software. Understand others' contracts and write clear ones for your own software. The risk of misusing code and having your code misused will be much reduced.
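As a sketch of what an explicit contract looks like in code: the hypothetical `chunk` function below documents its precondition and postconditions in the docstring, and enforces the precondition rather than silently misbehaving on bad input.

```python
def chunk(items: list, size: int) -> list:
    """Split items into consecutive chunks.

    Contract:
      - Precondition: `size` must be a positive integer (enforced).
      - Postconditions: chunks preserve order, and every chunk except
        possibly the last has exactly `size` elements.
    """
    if size <= 0:
        raise ValueError(f"size must be positive, got {size}")
    return [items[i:i + size] for i in range(0, len(items), size)]

print(chunk([1, 2, 3, 4, 5], 2))  # -> [[1, 2], [3, 4], [5]]
```

A caller misusing the function finds out immediately with a clear error, instead of getting a subtly wrong result downstream.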
Write your core code
If there is one thing your product does that's important, you should have people on staff who have either written that code, or have studied the third party library that implements it. Not knowing how a critical part of your system works is a ticking time bomb.
Minimize new technology
You shouldn't use the new database with a new programming language on the new web framework on the new AWS cloud replacement using the brand new automation toolkit. Pick the smallest number of new, risky technologies that will make your product better, and be explicit about that choice. Everything else should be boring.
Open source your code
If you open source code that is likely to be of use to others, they will help test and improve it. The mere process of open sourcing code can improve the quality of the code and reduce risk because you don't want the code to embarrass you or your company.
Say no to complicated features
Some product features are just complex. They want a business process that does all the things and reads in a PDF and then outputs a spreadsheet with some nice graphics automatically summarizing the–. No. Not unless that is a huge win. Complicated processes make for bad software. Bad software is risky.
Ensure correctness in many situations
Tests and type checking reduce risk. Make sure they cover error paths in addition to the successful paths that are normally tested. What happens when the database stops responding, or the authentication server is down? Does your system keep working, or does it fail catastrophically?
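An error-path test can simulate the database being down with a mock and assert that the system degrades instead of crashing. The `ProfileService` below is a hypothetical example; the point is that the failure branch gets exercised, not just the happy path.

```python
from unittest import mock

class ProfileService:
    """Hypothetical service that degrades gracefully when the DB is down."""

    def __init__(self, db, cache):
        self.db = db
        self.cache = cache

    def get_profile(self, user_id):
        try:
            return self.db.fetch(user_id)
        except ConnectionError:
            # Error path: fall back to possibly-stale cached data
            # instead of failing catastrophically.
            return self.cache.get(user_id, {"id": user_id, "stale": True})

# Simulate the database being unreachable.
db = mock.Mock()
db.fetch.side_effect = ConnectionError("database unreachable")
service = ProfileService(db, cache={1: {"id": 1, "name": "cached"}})

assert service.get_profile(1) == {"id": 1, "name": "cached"}
assert service.get_profile(2) == {"id": 2, "stale": True}
print("error paths covered")
```

Without tests like this, the first time the fallback runs is in production.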
Product continuity plan
Financial institutions aren't just obsessed with the risk of their financial instruments. They study the risk involved with their computer systems, their employees, and even the companies they buy services from. They ask vendors for a business continuity plan so they know what happens when your datacenter goes offline or your company goes bankrupt. You can do the same thing for your product: what happens if S3 goes offline for an hour? If your Azure availability sets go offline? If someone accidentally drops your database? A little preparation goes a long way toward mitigating risk.
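One concrete mitigation such a plan can prescribe is bounded retrying when a dependency, say object storage, is briefly offline. A sketch, with illustrative attempt counts and delays rather than recommended values:

```python
import random
import time

def with_retries(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `operation`, retrying with exponential backoff and jitter.

    The attempt count and delays here are illustrative, not a prescription.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the failure
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Simulate a store that is offline for the first two calls, then recovers.
calls = {"count": 0}
def flaky_upload():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("storage temporarily unavailable")
    return "uploaded"

print(with_retries(flaky_upload, sleep=lambda _: None))  # -> uploaded
```

The key design point is the bounded budget: after it runs out, the failure is surfaced so the rest of the continuity plan (alerting, failover) can take over.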
High bus factor
If a few people are the only go-to experts on certain systems, you're in trouble if they cannot work or if they change jobs suddenly. Code reviews are a good way to spread knowledge; pair programming is even better. Improving your software's walk score (how easily a newcomer can find their way around the code base) raises your bus factor.
Automation and DevOps
Despite DevOps being almost a buzzword at this point, the core idea is still sound: have an experienced operations team write and manage the automation, deployment, and monitoring of your software. And remember: no silos. Ops should work in tandem with the development team and report to the same people. Huge ops organizations that are distinct from R&D are a recipe for trouble.
Have a process
There are two reasonable release processes: fast and continuous (or nearly so), or slow with a long QA cycle. Both are acceptable in different circumstances, but it's important not to fall in the middle. Large releases containing many changes increase risk, so either release more frequently or test those big releases carefully.
Some risk is acceptable, but it's important to make that a conscious choice. By choosing your risk carefully, you can plan around it; if you are unaware of the risk in the choices you made, it is much harder to plan for failure. The goal: when a system fails, your first thought should be "it's okay, we have failure countermeasures in place."
Failure should not be a surprise. Be explicit about the risks you have and what you're doing about them.