I am glad to announce that
we just released a new version of NDepend where the analysis phase
duration has been divided by 4 and the memory consumption has been divided by
2.
An interesting question is: Does such a massive performance gain means that it was badly coded the first time?
Previous slower versions met several thousands of functional
requirement needed to analyze properly any .NET application. In this
sense, it was nicely coded. But still, there
were room for improvement and my believe is that there is always room for better performance. Even on the current much faster
version we identified several significant optimizations to be done. The downside
is that this will require a lot of work. So the first lesson learned
is that there is always room for better
performance. We can complete this rule with the fact that the amount of work to make a program run
faster grows exponentially with the gain expected.
We quantified each
performance gain source and we obtained this chart. As said we have a 75% gain obtained
thanks to algorithm optimization, micro-optimization and parallelization. The
current analysis duration (25% of the original analysis duration) can be split between 17% of the time spent in our
code, and 8% of the time spent in tier code. In our case, tier code is the code of
Mono.Cecil,
used to build some object models of the assemblies analyzed.
The bulk of performance
was obtained with algorithm improvement and this is not a surprise. IMHO the
surprise comes from the relatively small gain obtained from parallelization. Indeed,
we did some massive refactoring to parallelized the analysis with the parallel monologue idea in mind. Now each
assembly is analyzed as a task that doesn’t have any synchronization need with any
other task. In other words each task uses its own states that are not visible
from the other tasks. Unfortunately this requirement is not enough to ensure
proper scaling on the number of processor. While we get the 15% gain from
between 1 and 2 processors, the gain is almost zero between 2 and 4 processors.
We identified some potential IO contentions and memory issues
that will require more attention in the future. This leads to another lesson: Don’t expect that scaling on many processors
will be something easy, even if you don’t share states and don’t use
synchronization.
When it comes to
enhancing performance there is only one way to do things right: measure and focus your work on the part
of the code that really takes the
bulk of time, not the one that you think
takes the bulk of time. I have a quick real-world illustration. Personally I
have always been amazed by the performance of the C# and VB.NET compilers. Imagine, it can compile in a few
seconds thousands of source files that took dozens or even hundreds of man-year
to be written. At analysis time, NDepend needs also to parse sources files but
it does it partially to only obtain comment and Cyclomatic Complexity metrics. By measuring file loading time
we had a good surprise: loading source file in memory is much cheaper than
expected. For example, to load the 1728 C# sources files of NDepend it takes
less than 0.2 seconds on the 13 seconds needed for a complete analysis of its own code. Now, knowing
that it is almost free to load the source files in memory I am a bit less
impressed by the performance of the C# and VB.NET compilers. And it leads to an
important lesson: NEVER anticipate the
performance cost, ALWAYS measure it. And the only professional way to
measure is to rely on performance profiler tools. Actually, we use both
dotTrace from JetBrain and Red-Gate ANTS Profiler and are happy with it.
Something important
concerning the measure of code performance is the percentage of time spent
inside tier code, things like DB or network access. As we explained in the previous chart, in our case study
35% of the analysis time is spent inside the Cecil code. We can infer from this
number that if we want to have an analysis duration divided by 2 in the future,
our code will need to run 4.3 faster! ((100 – 35) / (50 -35) = 4,3). This
clearly shows the importance of another lesson: assess your limits by measuring the percentage of time spent in tier code.
In the dissection of
performance gain, an interesting point is the 15% gain obtained from
micro-optimization. I mean things like using the right kind of loop, using the right kind of
collection, preferring to deal with primitive
CLR objects as int and string (POCO idea), choosing carefully between structures and classes, buffering
properties getter results into local variables or even exposing
non-private fields. I could summarized
the lesson learned by the fact that micro-optimizations
are worth it while premature optimizations are the root of evil (as Donald Knuth said a long time ago). The
difference between micro-optimizations and premature optimizations is that the
first ones are driven by measurement while the second ones are driven by
guesses, intuitions and hunches.
Something not quantified
in our performance gain is the fact that memory consumption has been roughly
divided by 2. Less memory doesn’t necessarily means less managed objects but in
our particular case we do allocate
less objects. The nice consequence is that less objects means less time spent in the
GC. Less memory also means less virtual memory page fault, which is IMHO the
worst thing when it comes to polish performances of a program.
Finally, we are now experiencing a
great advantage of optimization: automatic tests naturally ran
much faster and it is a good motivation to run them more often!
I end up with an illustration of the tenet there is always room
for better performance: the Amiga demo scene in which I
participated in the early nineties. Every demos ran on a constant
hardware.
It was the perfect incentive for
demo developers to produce code more and more optimized. During several
years, every months
some record were beaten, like the number of 3D polygons, the number of
sprites
or the number of dots displayed simultaneously at the rate of 50 frames
per
seconds. As far as I can estimate, the performance factor obtained in a
few years was something around x50! Imagine what
it means, running in one second a computation that took initially a
minute. And as far as I know this massive gain was the result of both
better algorithms (with many pre-computations and delegations to
sub-chip) and micro-optimizations at assembly language level (better
use of the chip registers, better use of the set of instructions…).