Tip #1 – Go get a copy of Michael Feathers’ “Working Effectively with Legacy Code” book. If you take away one thing from this post, that’s it. The book is full of good ideas for decoupling legacy code and retrofitting tests, but there’s more to the legacy code story.
For the last year and a half my career has largely revolved around various attempts to deal with legacy code. Last year our organization did a partial rewrite of a large, mission-critical legacy application. Two of us built a new rules engine component to replace a big bag of spaghetti code written in VB6 that had simply become too difficult and risky to modify. Not surprisingly, the rewrite effort was rough. Our requirements were more or less the existing code, and it was bad in that peculiar way that only copy-and-paste VB6 code can be. Last fall my team expanded and took over responsibility for the larger legacy application, and once again the code and the application infrastructure as a whole were in bad shape. I thought that our new team was demonstrably productive when we were working with new code, but we slowed to a crawl when we had to work with the legacy code. We’ve started to turn that problem around. Here are some of the lessons we’ve learned over the past year and a half about wrangling legacy code into shape.
The Legacy Code Trap
Why do I call it a “legacy” codebase? It’s not simply an issue of technology or age. While we do have problems with older VB6 and ASP Classic code, much of the code I’m labeling as “legacy” code here is written in C# and most of it is less than 18 months old, but it’s definitely legacy code. In his book, Michael Feathers defines legacy code as code without automated tests. I do think that’s valid, but I’m going to broaden my personal definition a little bit. Legacy code is code that you’re afraid of, but is too valuable or big to toss away.
So exactly why are we afraid of the legacy code?
- It doesn’t have adequate automated tests. The lack of tests is a problem in itself, but it probably implies that the system is not testable as well, i.e. tightly coupled, ball of mud, spaghetti code, you name it.
- We don’t understand the code.
- It’s hard to debug. Why, oh why doesn’t the code work?
- It’s hard to deploy. We don’t understand all of the environmental dependencies and the application isn’t easy to troubleshoot. Maybe deploying the application just takes too long.
- The feedback cycles are too slow. We can’t even run the code.
Master your fear. I think we’ve done more harm to ourselves by trying so hard not to change certain areas of the code. “That code is too scary to change, let’s go around it” has led us into trouble on a couple of occasions. We’ve created duplication in the system rather than clean up the existing code that is the natural place for a new feature. The one thing you most certainly don’t want to do is make things worse than they already are. As an example, instead of doing a full rewrite of an ASP website we took a minimalist approach that makes the legacy ASP pages forward to ASP.Net pages and vice versa. That effort to reduce the amount of change has generated more work and technical debt in overly complex security and state management code. The single best advice I can give is to concentrate on exactly what you’re afraid of and remove the reason for the fear. Learn about the legacy code, add more automated tests, improve the build scripts, whatever it takes to be able to work with the legacy code with confidence. Things didn’t turn around for us until we developed an ability to change the legacy code without fear.
Cut down the feedback loop. Over the years I’ve noticed that productivity is often a function of how fast your feedback cycle is. In other words, what is the elapsed time between writing a line of code and knowing that the line of code works, and works completely? Ideally I want to be able to immediately execute a small unit test with a right-click or just hit F5 to see screen changes. Our first problem has been simply getting the legacy code to work on our development workstations. If you can’t truly test any new code until you’ve physically deployed that code to a virtual machine or a testing server you’re going to slow to a crawl. We’ve mostly tamed this legacy beast with better build automation.
Once we could run the code on our workstations the next hurdle was being able to write small unit tests. It’s difficult to test code that is tightly coupled with little or no layering. Writing any tests became a marathon of database setup just to exercise a little bit of new code. We have been applying the decoupling techniques from the Feathers book to create new seams to make test writing easier. The important lesson there is that you can improve your productivity with the legacy code by refactoring it. Leaving the code alone and coding around the mess doesn’t fix the problems. Sometimes you will have to do some refactoring before you have an automated test safety net. Lean hard on tools like ReSharper to make that refactoring safer.
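As a rough illustration of what a seam buys you, here is a minimal sketch in Python rather than the C# of the codebase discussed here. Every name in it (`Order`, `CustomerGateway`, `DiscountRule`, the discount rule itself) is hypothetical and invented for the example. The idea is that the business rule asks for a narrow interface instead of talking to the database directly, so a unit test can hand it an in-memory fake.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Order:
    customer_id: int
    total: float


class CustomerGateway(Protocol):
    """The seam: the only thing the rule needs from the data layer."""

    def is_preferred(self, customer_id: int) -> bool: ...


class DiscountRule:
    """Business logic that no longer reaches into the database itself."""

    def __init__(self, customers: CustomerGateway) -> None:
        self._customers = customers

    def discount_for(self, order: Order) -> float:
        # Hypothetical rule: preferred customers get 10% off orders over 100.
        if order.total > 100 and self._customers.is_preferred(order.customer_id):
            return round(order.total * 0.10, 2)
        return 0.0


class FakeCustomerGateway:
    """In-memory stand-in used by unit tests instead of a real database."""

    def __init__(self, preferred_ids: set) -> None:
        self._preferred_ids = preferred_ids

    def is_preferred(self, customer_id: int) -> bool:
        return customer_id in self._preferred_ids


# No database setup needed to exercise the rule.
rule = DiscountRule(FakeCustomerGateway({42}))
print(rule.discount_for(Order(customer_id=42, total=200.0)))
print(rule.discount_for(Order(customer_id=7, total=200.0)))
```

In production the same `DiscountRule` would receive a gateway backed by the real data access code. The rule itself does not change, which is the whole point of the seam.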
Automate anything that moves. I took a poll in the office yesterday and we all agreed that the single most beneficial thing we’ve done to our legacy code is to streamline our build automation. Time and time again I’ve struggled just to make legacy code run on my workstation before I could work on the code. It took a *lot* of effort, but we can now walk over to a fresh new machine, download the code tree from Subversion, and run a comprehensive NAnt build script to get the application up and running quickly. Besides the obvious benefit of getting the code running, adding the complete environmental setup to the NAnt build forced us to learn about all of the environmental dependencies and the physical architecture of the application. It’s my opinion that a complete automated build script is the single best architectural document of a system (it’s certainly the most accurate).
The other thing we did with the build was to optimize the build times. I spent a not entirely pleasant two weeks reorganizing a VS.Net solution to cut down the number of projects, optimizing slow running tests, and reducing the number of files that needed to be copied around. That investment took the build time from 20 minutes to under 2 minutes. Simply executing a single unit test went from about 4 minutes to 30 seconds. I believe that the investment has more than paid for itself in time saved over the mere four months we’ve been using the streamlined codebase. CruiseControl.Net has a report to show NAnt timings (I would assume there is an equivalent for MSBuild) that we used to great effect to pinpoint the bottlenecks in the build.
Streamline deployment. One of the benefits of Continuous Integration is that it smooths out system deployment, both by giving you more practice deploying and by exposing deployment difficulties. We’ve benefited from some minor architectural changes to reduce the number of moving parts in favor of “XCOPY” deployment wherever possible. We’ve eliminated registry keys in favor of configuration files, eliminated ALL dependencies on absolute paths, and pulled some nonvolatile configuration into embedded resource files. Our deployment practice had been to create MSIs with WiX, then manually copy the MSI files to both the web and application tiers for deployment. It was taking two hours every single time we made a testing push and we screwed up too many times. We dumped the MSIs and went to a much simpler NAnt deployment script. The next step was to put CruiseControl.Net on our development and testing servers to run the NAnt deployment scripts remotely. We cut the two hour code push cycle down to two minutes while cutting down deployment errors by eliminating manual steps. Production rollouts are much easier now. We’re optimistic that the improved deployment processes will enable us to do more frequent releases to keep up with business demands.
Why are you sick? One of the worst things initially with our legacy code was that it swallowed exceptions and obfuscated environmental issues. The only indication of system failure was a boolean return code that denoted failure or a generic “system is too busy” message on the screen. If you had tribal knowledge of the system you could read the tea leaves and start checking registry values or database entries or check whether COM (must die) objects were correctly registered. We’ve had some success in making the system tell us more useful information by putting in more descriptive exceptions for common failure conditions. I found the places in the code where exceptions were being swallowed and moved that code behind an error processing abstraction that is attached via StructureMap. In production it continues to hide the stack trace from users, but on our boxes and the test server it uses a different implementation that simply throws the exception back up for quicker and easier trouble resolution. One of the most useful things we’ve done is to add comprehensive environment tests to find configuration problems fast at deployment or build time.
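The error processing abstraction described above might look something like the following sketch. It is in Python, not C#, and it does not use StructureMap; the `handler_for` function is only a stand-in for whatever the container configuration does, and the class names, environment names, and user message are all assumptions made up for the example.

```python
import logging
from typing import Callable, Protocol


class ErrorHandler(Protocol):
    def handle(self, error: Exception) -> str: ...


class ProductionErrorHandler:
    """Logs the details and shows users a generic message."""

    def handle(self, error: Exception) -> str:
        logging.getLogger("app").error("Unhandled error", exc_info=error)
        return "The system could not complete your request."


class DeveloperErrorHandler:
    """Re-raises so the full stack trace is visible immediately."""

    def handle(self, error: Exception) -> str:
        raise error


def handler_for(environment: str) -> ErrorHandler:
    # Stand-in for container configuration; the environment names are assumptions.
    if environment == "production":
        return ProductionErrorHandler()
    return DeveloperErrorHandler()


def run_guarded(action: Callable[[], str], handler: ErrorHandler) -> str:
    """Replaces scattered catch-and-swallow blocks with a single policy."""
    try:
        return action()
    except Exception as error:
        return handler.handle(error)
```

The calling code stays identical in every environment. Only the handler that gets plugged in decides whether a failure is hidden behind a friendly message or thrown straight back at the developer.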
Jim Shore in his article Fail Fast describes this as a quality of good design.
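An environment test in that fail fast spirit can be very small. The sketch below is Python and purely illustrative: the two kinds of checks and all of the function names are assumptions, not the checks we actually run. Each check returns a descriptive message instead of a bare true or false, and the runner collects every failure so a single run reports everything that is wrong with a machine.

```python
import os
from pathlib import Path
from typing import Callable, List, Optional

# Each check returns None on success or a descriptive message on failure.
Check = Callable[[], Optional[str]]


def directory_exists(path: str) -> Check:
    def check() -> Optional[str]:
        return None if Path(path).is_dir() else f"Missing directory: {path}"
    return check


def env_var_set(name: str) -> Check:
    def check() -> Optional[str]:
        return None if os.environ.get(name) else f"Environment variable not set: {name}"
    return check


def run_environment_checks(checks: List[Check]) -> List[str]:
    """Runs every check and returns all failure messages, not just the first."""
    failures = []
    for check in checks:
        message = check()
        if message is not None:
            failures.append(message)
    return failures


def verify_environment(checks: List[Check]) -> None:
    """Call at build or deployment time; stops loudly if anything is wrong."""
    failures = run_environment_checks(checks)
    if failures:
        raise RuntimeError("Environment problems:\n" + "\n".join(failures))
```

Calling `verify_environment` as the first step of a build or deployment script turns a vague “system is too busy” mystery into a specific list of things to fix.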
Characterization tests are a double bladed lightsaber – If you’re replacing all or part of an existing system you probably need to know exactly what it does. If you’re going to extend legacy code without tests you need a way to ensure that you don’t break any existing functionality. Characterization tests are automated tests you create by “recording” the output of existing code from a series of inputs. When you write the new code you simply make sure that the same inputs lead to the same outputs. These tests definitely had value, but we did not fully recoup our very large time investment (we think that the effort comes to about 6 man-months for a team of 4-6 developers over a year’s time). If you’re going to sink a substantial effort into characterization tests, focus on creating tests that are human readable. The characterization tests we wrote were all thrown away because they were too much effort to maintain (the code stunk and we couldn’t effectively debug failures), leaving us without a test automation infrastructure. We are accreting a FitNesse suite of tests now that is human readable, but it would have been nice if we’d gone down that road to begin with. If we could go back in time I think we would have diverted more of the characterization test effort into fewer, higher quality automated tests that would have continued to provide value over time. I would probably recommend not trying to completely automate the characterization request recording. I think some manual test writing might have led to cleaner tests that stayed useful.
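To make the record and replay idea concrete, here is a minimal sketch in Python. The `legacy_shipping_cost` function is a hypothetical stand-in for existing code, and `record` and `verify` are names invented for the example; none of this is the tooling we built. The recording is kept as indented JSON specifically so a human can read and edit it, and each mismatch is reported with its inputs so a failure can actually be debugged.

```python
import json
from typing import Any, Callable, List


def legacy_shipping_cost(weight_kg: float, express: bool) -> float:
    """Hypothetical stand-in for the legacy routine whose behavior we pin down."""
    cost = 5.0 + 1.5 * weight_kg
    return cost * 2 if express else cost


def record(func: Callable[..., Any], cases: List[dict]) -> str:
    """Runs the existing code over the inputs and saves what it actually does."""
    recorded = [{"inputs": case, "expected": func(**case)} for case in cases]
    return json.dumps(recorded, indent=2)


def verify(func: Callable[..., Any], recording: str) -> List[str]:
    """Replays the recording against new code and describes each mismatch."""
    failures = []
    for entry in json.loads(recording):
        actual = func(**entry["inputs"])
        if actual != entry["expected"]:
            failures.append(
                f"{entry['inputs']}: expected {entry['expected']}, got {actual}"
            )
    return failures


cases = [
    {"weight_kg": 2.0, "express": False},
    {"weight_kg": 2.0, "express": True},
]
recording = record(legacy_shipping_cost, cases)
print(verify(legacy_shipping_cost, recording))
```

A handful of hand-picked cases like these is much cheaper to maintain than thousands of machine-captured ones, which is the tradeoff discussed above.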
Haven’t you always thought that a lot of Jedi Knights probably ended up amputees from lightsaber accidents, or is that just me?
Perfect is the enemy of the good. You aren’t going to fix all the problems at once. It’s too much change at one time and you’ve got new functionality to create anyway. One thing we picked up from the Feathers book is that even partial test coverage is an improvement over no coverage. Just picking an example we’ve talked about at the Austin Agile meetings: you aren’t going to be able to rewrite your user interface to incorporate a Model View Presenter architecture to get decoupled unit tests, so you might have to live with coarse grained integration tests instead for a while. You certainly won’t be able to stop new work long enough to do all the refactoring and test creation that you want to do. In my case it means overlooking the fact that some of the code I’m working with is crap.
Queue up ideas for technical debt reduction. You’ll never reach your destination if you don’t know what the destination is. Constantly think of ways to improve the codebase to improve your productivity and refine your thinking about the legacy system. Talk over the ideas with the rest of the team to socialize a technical vision so you can at least start moving in a better direction. To me, the hardest problem with improving legacy code is finding ways to do it incrementally. Having some architectural strategies in mind has helped us target small refactorings as we go to nudge us a little closer to the end goals without overextending ourselves at any point.
On most of the XP projects I’ve been on we’ve used an “Idea Wall,” just a visible place to write down or post technical improvements. Anytime we have some slack time we start pulling down tasks from the idea wall. Occasionally we’re able to outpace either testing or requirements analysis and we aren’t really able to push any new code. Whenever that happens we immediately pull things off of the idea wall. One way to judge if your technical debt is piling up is to watch how crowded the idea wall is getting. On the other hand, if something stays on the idea wall for a long time, it might not be that important after all.
Design never stops, not even for an older codebase.
Connect technical improvements to new business functionality. Since you can’t do everything at once, concentrate on making the technical improvements that will help you immediately with new business features. In other words, don’t run out and create a large automated test suite for code that you aren’t going to modify in the near future. If you already have some refactoring ideas, put in play the ideas that will enable the next set of features. Some of the refactoring ideas we have floating around are geared towards enabling some possible changes in product direction that our business partners are considering. These refactorings will be triggered into the active story list almost automatically if we do take these new business directions. It certainly helps if you have a stable project roadmap, but even vague product plans can help you target improvements to the existing code.
Communicate the Technical Debt problem to management. Management has to understand the impact of the existing technical debt. Technical debt can dramatically reduce your productivity. On one hand, management needs to understand that a team’s efficiency is negatively impacted by the technical debt (CYA). On the other hand, management will probably have to approve any effort to pay down technical debt that will detract from creating new functionality. To get this approval, you have to frame the technical debt problem in terms they can understand. I like to compare technical debt to an opportunity cost. Directly connect technical improvements to opportunities to deliver more business value at less cost. Build your case over time by repeatedly communicating the technical debt issue so that it doesn’t get blown off as an occasional excuse for poor performance.
Protect your credibility with management at all costs. Building tools that don’t demonstrate any value or architectural “improvements” that don’t improve anything will make management even more reluctant to approve technical debt reduction efforts. I know we’ve suffered a bit in our relationship with one of our senior managers because of some failed attempts to improve code infrastructure in the past. I walked into a case like that several years ago with a large legacy VB6 application. The system was having massive performance and stability problems. After taking a look at the code, one of my suggestions was to rip out and replace a very poor implementation of CSLA business objects* that I thought (and still do) was adding a lot of overhead and inefficiency to the data access and business logic. As it turned out the very thing I was suggesting be thrown out and rewritten had been the previous tech lead’s highly visible scheme to fix the performance and stability problems. My suggestions were shot down pretty quickly.
I’ve wrestled with this issue before in this post: Balancing Technical Improvements versus New Business Features. Since I wrote that post I think our management has become much more responsive to the issue. In the end, I think our credibility on this issue has improved because we have shown increased productivity with some of our efforts to pay down technical debt (more automated test coverage, refactoring, faster builds, more reliable deployment scripts and procedures). When you can, try to partially link the success of a project to the efforts you made to reduce technical debt. The most important lesson we’ve learned in regards to negotiating technical improvement with management is…
Just do it. If there is some kind of improvement that you can make that will pay off within a short time frame, just do it. Don’t agonize over it, don’t negotiate with skeptical management, just quietly do it. We started taking this attitude in January and we’ve seen our productivity improve substantially. There are other factors at work too, but the proactive attitude of constant technical improvement has consistently reduced friction in our day to day work.
- The Strangler Strategy – We’re trying to do this for some of our components.
- We are retrofitting a whole new Domain Model approach backed up by NHibernate to replace the Transaction Script style coding. I think we’re going to be able to collapse a disturbing degree of duplicated business rules and data access into the new domain model classes and mappings, not to mention a very great improvement in the testability of the application code. As a first step we’re using the new domain classes inside the FitNesse fixtures to help generate test data and validate outcomes as a prelude to using the domain classes in the main functions. More on this in a couple of months.
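For readers who haven’t run into the strangler idea in the first bullet, the core of it is a single entry point that routes each operation to new code once a replacement exists and falls back to the legacy code otherwise. The sketch below is a generic Python illustration, not our actual components; `StranglerFacade`, the operation names, and the handlers are all hypothetical.

```python
from typing import Callable, Dict

LegacyHandler = Callable[[str, dict], str]
NewHandler = Callable[[dict], str]


class StranglerFacade:
    """Single entry point that prefers new code wherever it exists."""

    def __init__(self, legacy: LegacyHandler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, NewHandler] = {}

    def migrate(self, operation: str, handler: NewHandler) -> None:
        """Registers a replacement; callers of the facade never notice."""
        self._migrated[operation] = handler

    def handle(self, operation: str, request: dict) -> str:
        handler = self._migrated.get(operation)
        if handler is not None:
            return handler(request)
        # Anything not yet migrated still goes through the old code path.
        return self._legacy(operation, request)


def legacy_system(operation: str, request: dict) -> str:
    return f"legacy:{operation}"


def new_pricing(request: dict) -> str:
    return "new:pricing"


facade = StranglerFacade(legacy_system)
facade.migrate("pricing", new_pricing)
print(facade.handle("pricing", {}))
print(facade.handle("billing", {}))
```

Each migrated operation shrinks the surface of the legacy code a little, which fits the incremental approach argued for throughout this post.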
*People occasionally ask me my opinion about the CSLA framework. It’s probably not fair, but I’m largely dismissive of CSLA because of this experience.