This engineering manager is a smart guy, and very experienced. He has a good team, and they've shipped a working product to many customers. He's working harder than ever, and so is his team. Yet, they feel like they are falling further and further behind with every release. The more features they ship, the more things that can go wrong. Systems seem to randomly fail, and as soon as they are fixed, a new one falls over.
Even worse, when it comes time to "fix it right" the team gets pushback from the business leaders, who want more features. If engineers want more time to spend making their old code more pretty, they are invited to do so on the weekends.
Unfortunately, the weekends are already taken, because features that used to ship on a friday now routinely cause collateral damage somewhere else on the platform, and so the team is having to work nights and weekends cleaning up after each launch.
Every few months, the situation comes to a head, where the engineers finally scream "enough!" and force the whole company to accept a rewrite of some key system. The idea is that once we move to the new system (or coding standard, or API, or ...) then all the problems will be solved. The current code is spaghetti, but the new code will be elegant. Sometimes the engineers win the argument, and sometimes they are overruled. But it doesn't seem to matter. The rewrite seldom works, and even when it does, a few months later they are back in the same dilemna, finding their now-old code falling apart. It's become "legacy code" and part of the problem.
It's tempting to blame the business leaders ("MBA-types") for this mess. And in a mess this big, there is certainly blame to go around. But they are pushing for the things that matter to customers - features. And they are cognizant that their funding is limited, and if they don't find out which features are absolutely critical for their customers soon, they won't be able to survive. So they are legitimately suspicious when, instead of working on adding monetization to their product, some engineer wants to take a few weeks off to go polish code that is supposed to be already done.
What's wrong with this picture? And why is the engineering manager suffering so badly? I can relate to his experience all too well. When I was working my first programming jobs, I was introduced to the following maxim: "time, quality, money - pick two." That was the watchword of our profession, and I was taught to treat with disdain those "suits" who were constantly asking for all three. We treated them like they had some kind of brain defect. If they wanted a high-quality product done fast, why didn't they want to pay for it? And if money was so tight, why were they surprised when we cut corners to get the product out fast? Or went past the deadline to get it done right?
I really believed this mantra, for a time. But it started to smell bad. In one company, we had a QA team as large as our engineering team, dozens of people who worked all day every day to find and report bugs in our prodcut. And this was a huge product, which took years to develop. It was constantly slipping, because we had a hard time adding new features while trying to fix all the bugs that the QA team kept finding. And yet, it was incredibly expensive to have all these QA testers on staff, too. I couldn't see that we were managing to pick even one. Other, more veteran programmers told me they had seen this in many companies too. They just assumed it was the way software companies worked.
Suffice to say, I no longer believe this.
In teams that follow the "pick two" agenda, which two has to be resolved via a power play. In companies with a strong engineering culture, the engineers pick quality. It's their professional pride on the line, after all. So they insist on having the final say on when a feature is "done" enough to show to customers. Business people may want to speed things up by spending more money, but enough people have read the Mythical Man-Month to know that doesn't work.
In teams that have a business culture, the MBA's pick time. After all, our startup is on a fixed budget. They set deadlines, schedules, and launch plans, and expect the engineering team to do what it takes to hit them. If quality suffers, that's just the way it is. Or, if they care a lot about quality, they will replace anyone who ships without quality. Unfortunately, threats work a lot better at incentivizing people to CYA than getting them to write quality software.
A situation where one faction "wins" at another's expense is seldom conducive to business success. As I evolved my thinking, I started to frame the problem this way: How can we devise a product development process that allows the business leaders to take responsibility for the outcome by making conscious trade-offs?
When I first encountered agile software techniques, in the form of extreme programming, I thought I had found the answer. I explained it to people this way: agile lets you make the trade-offs visible to whole company, so that they can make informed choices. How? By shipping software early, you give them continuous feedback about how it well it's working. They can use the software themselves, since every iteration produces working (if incomplete) code. And if they want to invest in higher quality, they can. But, if they want to invest in more experiments (features), they can do that too. But in neither case should they be surprised by the result. Sound good?
It didn't work. The business leaders I've run this system with ran into the same traps as I had in previous jobs. I had just passed the burden on to them. But of course they didn't feel reponsible for the outcome - that was my job. So I wound up with the worst of both worlds: handing the steering wheel over to someone else, but then still being blamed for the bad results.
Even worse, agile wasn't really helping me ship higher quality software. We were using it to get features to market faster, and that was working well. But we were cutting corners in the development methodology as well as in the code, in the name of increased speed. But because we had to spend more and more time fixing things, we started slowing down, even as we tried to speed up. That's the same pain the engineering manager I met with was experiencing. As the situation deteriorates, he's got to work harder and harder just to keep the product from regressing.
It was my own failure to ship quality software in the early days of IMVU that really got me thinking about this problem in a new way. I now believe that the "pick two" concept is fundamentally flawed, and that lean startups can achieve all three simultaneously: quickly bring high-quality software to market at low cost. Here's why.
First of all, it's a myth that cutting corners saves time. When we ship software with defects, we wind up having to waste time later dealing with those defects. If each new feature contains a few recurring problems, then over time we'll become swamped with the overhead of fixing and won't be able to ship any new features.
So far, that sounds like just another argument for "doing things right" the first time, no matter how long they take. But that's problematic too. The biggest form of waste is building software nobody wants. The second biggest form of waste is fixing bugs in software nobody wants. If we defer fixing bugs in order to bring a product to market sooner, and this allows us to find out if we're on the right track or not, then it was worthwhile to ship with those bugs.
Here's how I've resolved the paradox in my own thinking. There are two kinds of bugs:
- One kind are what I call defects: situations where the software doesn't behave in a predictable way. Examples: intermittently failing code, obfuscated code that's hard to use properly, code that's not under test coverage (and so you don't know what it does), bad failure handling, etc.
- The second kind of bugs are the type your traditional QA tester will find: situations where the software doesn't do what the customer expects. For example, you click on a button labeled "Quit" and in response the software erases your hard drive. That's a bug. But if the software reliably erases your hard drive every time, that's not a defect.
Defects are what make refactoring difficult. When you improve code, you always test it at least a little bit. If what you see is what will ultimately make it into production, it's pretty easy to make sure you did a good job. But code that is riddled with defects is a cameleon - one moment it works, the next it doesn't anymore. That leads to fear of refactoring, which leads to spaghetti code.
By shipping code without defects, the team actually speeds up over time. Why? Because we never have to revisit old code to see why it stopped working. We can add new team members, and they can get started right away (as an aside, new engineers at IMVU were always required to ship something to customers on their first or second day). And the whole team is gettng smarter and learning, so our effectiveness increases. Plus, we get the benefit of code reuse, and all the great libraries and tools we've built. Every iteration, we get a little more done.
So how can I help the engineering manager in pain? Here's my diagnosis of his problem:
- He has some automated tests, but his team doesn't have a continuous integration server or practice TDD. Hence, the tests tend to go stale, or are themselves intermittent.
- No amount of fixing is making any difference, because the fixes aren't pinned in place by tests, so they get dwarfed by the new defects being introduced with new features. It's a treadmill situation - they have to run faster and faster just to stay at the level of quality/features they're at today.
- The team can't get permission from the business leaders to get "extra time" for fixing. This is because the are constantly telling them that features are done as soon as they can see them in the product. Because there are no tests for new features (or operational alerts for the production code), the code that supports those new features could go bad at any moment. If the business leaders were told "this feature is done, but only for an indeterminate amount of time, after which it may stop working suddenly" they would not be so eager to move on to the next new feature.
- Get started with continuous integration. Start with just one test, a good one that runs reliably, and make sure it gets run on every checkin.
- Tie the continuous integration server in with source control, so that nobody can check in while the tests are failing.
- Practice five why's to get to the root cause of future problems. Use those opportunities to add tests or alerts that would have prevented that problem. Make the investment proportional to the problem caused, so that everyone (even the business leaders) feels good about it.
Good luck, engineering manager. May your team, one day soon, refactor with pride.