Occasionally (OK, often) I'm gently mocked for the length of my posts. I start with good intentions of writing short, pithy Jason Yip-style posts, then think of something else I want to say, and ten pages later I finally manage to hit the publish button. This is one of those that got away from me.
I'm writing this post for a reason. Specifically, I'm seeing a lot of harm being done on my project by focusing narrowly on performance optimizations and neglecting maintainability. This is an ongoing discussion and debate for me at work, and I suspect it's going to continue to be an issue as long as I work in the financial world. I think it's important to win this argument, or at least gain some concessions, so I would be very happy to hear everyone else's thoughts on this subject, especially from folks who disagree with me. In the course of this post I'm going to criticize the decisions and code of some of my teammates (and myself) on my current project. It isn't that I think they're doing a bad job, just that there is a very real opportunity to do better.
When you're architecting a software system, you must understand what the needs of the system really are and act accordingly. It's important to understand the desired qualities of a system because software design often involves making compromises between opposing qualities. For example, performance and scalability are often very much at odds: you can generally only optimize for one or the other. The point I'm trying to make is that you absolutely cannot focus completely on one quality of your software without considering the consequences to the other "ility" qualities of your system.
In the end, you need to be optimizing for the qualities that genuinely create business value, and I believe that the single most important quality for delivering business value on most projects is maintainability. Any deviation from a more maintainable solution in favor of performance or security or scalability or whatever is dead wrong, at least until proven otherwise. Even if you do have stringent performance or scalability targets, I'm going to argue in this post that focusing on maintainability first will get you to those very same performance or scalability goals more efficiently in terms of development time.
Early Performance Optimization Did Not Work
To use a concrete example that I'm dealing with on my current project, the back end developers are consciously coding to minimize the number of IL instructions in an attempt to improve performance. They're very concerned about issues like auto-boxing and the number of objects being created. In fact, it seems to be their main measure of code quality.
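To make that kind of micro-optimization concrete, here's a minimal, hypothetical sketch (not our actual code) of the auto-boxing concern: adding a value type to a non-generic collection boxes each value into a heap object, while a generic collection stores the values directly.

```csharp
using System.Collections;
using System.Collections.Generic;

public class BoxingExample
{
    public static void Main()
    {
        // Non-generic collection: each int is boxed into an object,
        // costing a small heap allocation per element.
        var boxed = new ArrayList();
        for (int i = 0; i < 1000; i++)
        {
            boxed.Add(i); // boxing happens here
        }

        // Generic collection: the ints are stored directly, no boxing.
        var unboxed = new List<int>();
        for (int i = 0; i < 1000; i++)
        {
            unboxed.Add(i); // no per-element allocation
        }
    }
}
```

Nothing wrong with preferring the second form, of course; the question is how much it actually buys you relative to everything else going on in the system.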
So our code is fast, right? Well, no, but we're working on it and making strides. When we started integrating the client and server we found that marshalling data from the server to the client was extremely sluggish. A little bit of profiling from one of my team members showed that we had a fairly severe bottleneck in our transport layer that totally dwarfed the run time of the rest of the system. The IL instruction optimization in the server side code didn't particularly achieve anything. What makes me angriest at myself is that we flirted with a more common approach to integration that I think would be more maintainable in the long run, but went with our current strategy in no small part because we thought it would be faster (sic). Actually, I change my mind: what I'm really livid with myself about is not forcing the backend guys to benchmark the more maintainable approach first before we committed to this path*.
Neglecting Maintainability is an Opportunity Cost
The server side developers spent time making optimizations that might, or might not, have made some minor improvements in performance. We collectively made some wide-ranging architectural choices for performance that have not, in my opinion, added any value whatsoever.
I've got a major problem with the previous statement. You only have a finite amount of time and resources to throw at your project. Yeah, you can crank up your hours for short bursts, but there's always a cost for doing that. To be truly successful, you should strive to spend those finite resources on the things that add the most value. From Wikipedia, opportunity cost is "the cost of something in terms of opportunity forgone." The time spent on the server side architecture for performance bothers me a little bit, but the opportunity cost of not writing maintainable code or automated tests on the server side has been far more significant.
The end result? While gaining essentially nothing in performance, we cost ourselves the opportunity to work more efficiently with that code, in terms of both developer and project time, by neglecting test automation and well-factored code. What we have is code that is genuinely hard to follow and hard to check for errors by inspection, because the methods are too long, with deeply nested if/then and looping constructs for good measure. The existing code is much harder to change than orthogonal code backed up by unit tests would have been. Surprise! That server side code had to be changed in the very next iteration to add new features, with additional changes looming for later iterations. If I hadn't spent a couple of days slicing that code up with IntelliJ's automated refactoring support, we could have very easily ended up with code duplication in areas that have a high potential to change in the future. Nothing makes extension harder than having to code the same rules in multiple places (a. more work and b. a greater chance of screwing it up).
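For a sense of what that slicing looks like, here's a minimal, entirely hypothetical sketch (not our actual domain code) of refactoring a nested conditional into small, intention-revealing methods that can each be read, changed, and tested on their own:

```csharp
public class Order
{
    public bool IsInternational { get; set; }
    public decimal Total { get; set; }
}

public class FeeCalculator
{
    // Before: one method, nested conditionals, hard to test in isolation.
    public decimal CalculateFeeNested(Order order)
    {
        decimal fee = 0m;
        if (order != null)
        {
            if (order.IsInternational)
            {
                if (order.Total > 10000m)
                    fee = order.Total * 0.01m;
                else
                    fee = order.Total * 0.02m;
            }
            else
            {
                fee = order.Total * 0.005m;
            }
        }
        return fee;
    }

    // After: the same rules, sliced into small methods with obvious names.
    public decimal CalculateFee(Order order)
    {
        if (order == null) return 0m;
        return order.IsInternational
            ? InternationalFee(order.Total)
            : DomesticFee(order.Total);
    }

    private static decimal InternationalFee(decimal total)
    {
        return total * (total > 10000m ? 0.01m : 0.02m);
    }

    private static decimal DomesticFee(decimal total)
    {
        return total * 0.005m;
    }
}
```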
Even worse are the poor coupling and cohesion properties that have defeated our attempts at writing isolated unit tests. I'm all for integrated FIT-style tests, but that shouldn't have to be the most granular testing that you can do on a system. One of the lessons I've learned by dealing with so much legacy code in the last 2 1/2 years is that coding throughput is very much affected by the granularity and speed of the feedback loop. Writing small, granular unit tests that execute quickly leads to better productivity than the much slower feedback cycle of coarser-grained integration tests. Having to fire up the UI to test something by hand is slower yet. I will very confidently claim that debugging time goes up geometrically with the coarseness of the testing, and that's significant because debugging is a major drain on developer productivity.
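To show what I mean by a tight feedback loop, here's a hedged sketch of a granular NUnit test against the hypothetical FeeCalculator above; it runs in milliseconds and points straight at the broken rule when it fails, with no UI, no server, and no socket in the way:

```csharp
using NUnit.Framework;

[TestFixture]
public class FeeCalculatorTests
{
    [Test]
    public void International_orders_over_the_threshold_get_the_lower_rate()
    {
        var order = new Order { IsInternational = true, Total = 20000m };

        decimal fee = new FeeCalculator().CalculateFee(order);

        // 1% of 20,000
        Assert.AreEqual(200m, fee);
    }
}
```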
Just to beat this horse into the ground, the system we're building doesn't even have any realistic need for high performance. The data sets are small, and the transaction complexity is fairly mild. What we *do* need is reliability, but the stateful socket connection integration scheme we adopted in the name of performance has added complexity to the way we deal with server connectivity. I think a stateless connection model, while arguably slower, would have provided more business value. And while the proprietary binary formats we use for communication surely improve raw performance, they come with the opportunity cost of decreased interoperability, and hence a very real reduction in business opportunities.
OK, smart guy, now my code isn't fast enough!
Back to performance again. So you concentrated on producing the correct business functionality first, with maintainability in mind, and it turns out that your architecture isn't fast enough, or maybe can't handle the volume, or the user interface just isn't responsive enough. I'm not really addressing performance optimization and profiling in depth here, but take a quick read through Jeff Atwood's post Why aren't my optimizations optimizing? and you'll see that performance tuning is a tricky business. There are too many conflicting variables to solve the problem through pure deduction alone. Dollars to donuts, I bet you that some of the performance optimizations made by my colleagues ended up hurting performance instead. The point being that you almost certainly need empirical measurements of a range of trial solutions to arrive at better performance.
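Even a crude harness is enough to start measuring instead of guessing. Here's a minimal sketch (the names in the usage comment are made up for illustration) that times a candidate implementation with System.Diagnostics.Stopwatch; the numbers are noisy, so treat them as a rough comparison rather than a verdict:

```csharp
using System;
using System.Diagnostics;

public static class CrudeBenchmark
{
    // Runs a candidate implementation many times and reports elapsed time.
    public static TimeSpan Time(string label, int iterations, Action candidate)
    {
        candidate(); // warm-up run so JIT compilation doesn't skew the result

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            candidate();
        }
        watch.Stop();

        Console.WriteLine("{0}: {1} ms for {2} iterations",
            label, watch.ElapsedMilliseconds, iterations);
        return watch.Elapsed;
    }
}

// Usage: compare two candidate marshalling strategies side by side, e.g.
// CrudeBenchmark.Time("binary transport", 1000, () => SendWithBinaryFormat(payload));
// CrudeBenchmark.Time("xml transport", 1000, () => SendWithXmlFormat(payload));
```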
To drive this point home, let's say that your performance bottleneck is in the communication between physically distributed subsystems. Forgetting for a minute about the cost of making changes to your code base, what can you do to make your system faster?
- Use lazy fetching when reading parent/child aggregate data structures, so you avoid fetching the child details when you don't need them (see the sketch a little further down)
- Use eager fetching when reading parent/child aggregate data structures, so you make fewer network round trips
- Minimize the number of network round trips, which often makes the system faster, by compressing the data sent over the wire or gathering data into a more coarse-grained Data Transfer Object
- Maybe the fancy compression or transformation of the data is eating up resources; change that to something else
- Use more background threads
- Eliminate thread swapping by cutting down the number of threads
- Cache shared resources
- Eliminate thread synchronization slowdowns caused by shared resources
- a time to cast away stones, and a time to gather stones together; a time to embrace, and a time to refrain from embracing… (sorry, couldn't help myself)
Wait, some of these changes contradict each other. Which one is right? Are you sure? It could easily take you several attempts to find the right recipe for performance (or scalability or usability).
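To make the contradiction concrete, take the first two items from the list: whether lazy or eager fetching is faster depends entirely on how the data is actually used, which is exactly the kind of thing you have to measure. Here's a minimal, hypothetical sketch of the two shapes (none of these names come from our real system):

```csharp
using System;
using System.Collections.Generic;

public class LineItem { }

public class Invoice
{
    internal Func<List<LineItem>> LineItemLoader;
    private List<LineItem> _lineItems;

    public List<LineItem> LineItems
    {
        get { return _lineItems ?? (_lineItems = LineItemLoader()); }
        set { _lineItems = value; }
    }
}

public class InvoiceRepository
{
    // Eager: one round trip that always pays the cost of loading line items,
    // even when the caller only wanted the header.
    public Invoice FindWithLineItems(int id)
    {
        Invoice invoice = LoadInvoiceHeader(id);
        invoice.LineItems = LoadLineItems(id);
        return invoice;
    }

    // Lazy: a cheap first call, but a second round trip later if the line
    // items turn out to be needed after all.
    public Invoice Find(int id)
    {
        Invoice invoice = LoadInvoiceHeader(id);
        invoice.LineItemLoader = () => LoadLineItems(id);
        return invoice;
    }

    private Invoice LoadInvoiceHeader(int id) { return new Invoice(); }           // data access elided
    private List<LineItem> LoadLineItems(int id) { return new List<LineItem>(); } // data access elided
}
```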
You definitely need to make changes to improve performance, but those changes cannot break the functionality of the application. Fast, buggy code isn't an improvement on slow, functional code. If you've written maintainable code that exhibits orthogonality, you should be able to contain the changes to isolated modules without them spilling into the rest of the code. If you've built a maintainable software ecosystem of full build automation and solid test automation, you can drastically reduce the overhead of staging new code to the performance testing environment, with less risk of breaking working code. In other words, the things you do for maintainability should have a direct impact on your ability to efficiently make empirical performance improvements.
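As a hedged illustration of what "containing the changes" can look like (all of these names are hypothetical, not our real code), put the transport behind a small abstraction so that performance experiments only touch the implementation behind it:

```csharp
public class QuoteMessage { }

public interface IMessageTransport
{
    void Send(QuoteMessage message);
}

// The current, faster-on-paper path.
public class BinarySocketTransport : IMessageTransport
{
    public void Send(QuoteMessage message) { /* proprietary binary marshalling elided */ }
}

// A simpler, more interoperable alternative you might want to benchmark against it.
public class XmlHttpTransport : IMessageTransport
{
    public void Send(QuoteMessage message) { /* HTTP/XML marshalling elided */ }
}

public class QuotePublisher
{
    private readonly IMessageTransport _transport;

    // The rest of the system depends only on the abstraction, so swapping
    // transports to test a performance hypothesis never touches this code.
    public QuotePublisher(IMessageTransport transport)
    {
        _transport = transport;
    }

    public void Publish(QuoteMessage message)
    {
        _transport.Send(message);
    }
}
```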
Conclusion
There are two general themes I wanted to explore in this post. The first theme is just yet another cry for YAGNI. Try not to invest time or effort into something that isn't warranted. Make any piece of complexity earn its existence first. A lot of this thinking is based on the assumption that it's easier to add complexity to a simple solution when it's warranted than it is to work around unnecessary complexity. I'm also making a large assumption that you can make optimizations later if you've taken steps to flatten the change curve. The second theme is that I think a deliberate focus on maintainable code structure and solid project infrastructure is a more reliable path to optimizing the qualities you need than early optimization. If your code and project infrastructure facilitate change, you can always adapt to improve your other "ilities", assuming, of course, that you're paying attention as you work and make those adaptations in a timely manner.
And, by the way, maintainability is still the most important code quality. Your system may not have to be blindingly fast, or scale like eBay, but it will change. By all means, go learn about Big O notation and delve into the inner workings of the CLR (I'm finally reading the Jeffrey Richter book this week myself). But there's a very important point I'm trying to make for anybody engaged in building software: focusing on maintainability first is very often the most reliable means to get to exactly the other qualities that you need. And if you write code that can't be maintained or changed, you're probably on a path to failure.
Wait, there's going to be more. The next post in the "Maintainable" software series is going to be about the DRY principle and the Wormhole Antipattern. First I'm going to give StructureMap a serious DRY'ing out for a while, then I'm going to come back and tell you how it went.
*I believe that the decision that ultimately led to our performance and tight coupling problems was based far too much on a "Sunk Cost." More on that someday.