Lessons learned from a real-world focus on performance


I am glad to announce that
we just released a new version of NDepend 
where the analysis phase
duration has been divided by 4 and the memory consumption has been divided by


An interesting question is: Does such a massive performance gain means that it was badly coded the first time?
Previous slower versions met several thousands of functional
requirement needed to analyze properly any .NET application. In this
sense, it was nicely coded. But still, there
were room for improvement and my believe is that there is always room for better performance. Even on the current much faster
version we identified several significant optimizations to be done. The downside
is that this will require a lot of work. So the first lesson learned
is that there is always room for better
We can complete this rule with the fact that the amount of work to make a program run
faster grows exponentially with the gain expected


We quantified each
performance gain source and we obtained this chart. As said we have a 75% gain obtained
thanks to algorithm optimization, micro-optimization and parallelization. The
current analysis duration (25% of the original analysis duration) can be split between 17% of the time spent in our
code, and 8% of the time spent in tier code. In our case, tier code is the code of
used to build some object models of the assemblies analyzed.





The bulk of performance
was obtained with algorithm improvement and this is not a surprise. IMHO the
surprise comes from the relatively small gain obtained from parallelization. Indeed,
we did some massive refactoring to parallelized the analysis with the parallel monologue
idea in mind. Now each
assembly is analyzed as a task that doesn’t have any synchronization need with any
other task. In other words each task uses its own states that are not visible
from the other tasks. Unfortunately this requirement is not enough to ensure
proper scaling on the number of processor. While we get the 15% gain from
between 1 and 2 processors, the gain is almost zero between 2 and 4 processors.
We identified some potential IO contentions and memory issues
that will require more attention in the future. This leads to another lesson: Don’t expect that scaling on many processors
will be something easy, even if you don’t share states and don’t use


When it comes to
enhancing performance there is only one way to do things right: measure and focus your work on the part
of the code that really takes the
bulk of time, not the one that you think
takes the bulk of time. I have a quick real-world illustration. Personally I
have always been amazed by the performance of the C# and VB.NET compilers
. Imagine, it can compile in a few
seconds thousands of source files that took dozens or even hundreds of man-year
to be written. At analysis time, NDepend needs also to parse sources files but
it does it partially to only obtain comment
and Cyclomatic Complexity  metrics. By measuring file loading time
we had a good surprise: loading source file in memory is much cheaper than
expected. For example, to load the 1728 C# sources files of NDepend it takes
less than 0.2 seconds on the 13 seconds needed for a complete analysis of its own code. Now, knowing
that it is almost free to load the source files in memory I am a bit less
impressed by the performance of the C# and VB.NET compilers. And it leads to an
important lesson: NEVER anticipate the
performance cost, ALWAYS measure it
. And the only professional way to
measure is to rely on performance profiler tools. Actually, we use both
dotTrace from JetBrain
and Red-Gate ANTS Profiler and are happy with it.


Something important
concerning the measure of code performance is the percentage of time spent
inside tier code, things like DB or network access.
As we explained in the previous chart, in our case study
35% of the analysis time is spent inside the Cecil code. We can infer from this
number that if we want to have an analysis duration divided by 2 in the future,
our code will need to run 4.3 faster! ((100 – 35) / (50 -35) = 4,3). This
clearly shows the importance of another lesson: assess your limits by measuring the percentage of time spent in tier code.


In the dissection of
performance gain, an interesting point is the 15% gain obtained from
micro-optimization. I mean things like using the right kind of loop
, using the right kind of
, preferring to deal with primitive
CLR objects as int and string (POCO idea), choosing carefully between structures and classes, buffering
properties getter results into local variables or even exposing
non-private fields. I could summarized
the lesson learned by the fact that micro-optimizations
are worth it while premature optimizations are the root of evil
(as Donald Knuth said a long time ago). The
difference between micro-optimizations and premature optimizations is that the
first ones are driven by measurement while the second ones are driven by
guesses, intuitions and hunches.

Something not quantified
in our performance gain is the fact that memory consumption has been roughly
divided by 2. Less memory doesn’t necessarily means less managed objects but in
our particular case we do allocate
less objects. The nice consequence is that less objects means less time spent in the
GC. Less memory also means less virtual memory page fault, which is IMHO the
worst thing when it comes to polish performances of a program.


Finally, we are now experiencing a
great advantage of optimization: automatic tests naturally ran
much faster and it is a good motivation to run them more often!




I end up with an illustration of the tenet there is always room
for better performance
: the Amiga demo scene in which I
participated in the early nineties
. Every demos ran on a constant
It was the perfect incentive for
demo developers to produce code more and more optimized. During several
years, every months
some record were beaten, like the number of 3D polygons, the number of
or the number of dots displayed simultaneously at the rate of 50 frames
seconds. As far as I can estimate, the performance factor obtained in a
few years was something around x50! Imagine what
it means, running in one second a computation that took initially a
minute. And as far as I know this massive gain was the result of both
better algorithms (with many pre-computations and delegations to
sub-chip) and micro-optimizations at assembly language level (better
use of the chip registers, better use of the set of instructions…).



This entry was posted in Uncategorized. Bookmark the permalink. Follow any comments here with the RSS feed for this post.
  • Me

    This pingback stuff is annoying the hell out of me. Get rid of it. Why bother showing this crap to your readers?

  • Steve Nuchia

    “Tier” code isn’t a performance floor until you’ve minimized the amount of it you’re calling. The comments about tweaking the open-source “tier” in this case miss the point, kinda.

    You have to look at how you’re using it. I know nothing about the specifics of this example (came here from sutter’s mill) but, generically, once I’m down to “library” and “system” code I want to look at the call graph and make sure I’m using good batch sizes, evaulate using streaming versions, whatever is appropriate to the situation. Usually I can cut the “tier” time down substantially by changing the way I’m using it.

  • http://www.NDepend.com Patrick Smacchia

    Will, indeed on a cold start files are taking more time to be loaded and it seems the analysis is 25% slower all in all (not only because src file loading). But cold start is just one on N, especially for a developer that works all day with the same set of source files.

  • Will Dean

    Your file loading time sounds a bit unlikely unless those files (and the bits of the filesystem which refer to them) are all in cache already.

    Can you really load 1700 files (presumably in multiple directories) in 200ms from a cold start?

  • http://www.syterra.com/MikeStockdale.html Mike Stockdale

    Good post.

    Performance tuning is so simple in concept: measure where it’s slow and make it faster, but often so difficult in practice!

  • http://www.danvy.tv Alex Danvy

    Oh, I miss the Copper and Blitter chips. What a nice plateform was the Amiga.
    I really agree with the idea of mesure first then spend time to optimize.

  • Bob Dobalina

    Great post – I heartily agree with measuring, and at least several times, before cutting.

    …but… one small spelling error gave me many giggles – I believe you meant “asses->s< - your limits…” The way it is now, it’s plural for a donkey. :D

  • Sebastien Pouliot

    Well the example was about Cecil* but that was not the spirit of the lesson ;-)

    Today anyone with a binary-thinking about “his” code versus “other” code is making a {pick your size} mistake. There is a middle ground, which can be ignored or embraced, when you use open source inside your software (be it open source or not). Your (good) choice of such tools makes crossing into the “other” code a real possibility, i.e. something to consider and not a limitation (that’s the lesson ;-).

    Now you don’t have to do this for “other” benefits since it will likely (and positively) affect your own code/software. In fact you should not, at least while doing performance work, make any difference between your code and open source code. Clearly you want your software (!= your code) to get the maximum benefits out of your next coding hour/day/week – so why should you care where this next/best opportunity is ?

    p.s. filling bug is great, but you can (generally) do this on any product (even if IMHO you’ll get better/faster result in an open source project ;-), but that’s padawan, not jedi stuff ;-)

    * that being said I concur that JB is doing an incredible job on Cecil and its future. Part of it is accepting changes (features, bug fixes and optimization) from anyone using Cecil. Clearly there’s no need to wait for the next big release to enjoy enhancements :-)

  • http://www.NDepend.com Patrick Smacchia

    Sebastien, I know that Jb is working hard for dramatic performance gain in the Cecil future. I omited this info to keep it for further post, when the performance gain will be effectively available (as I didn’t talk about NDepend improvements before they are available). This is all about avoiding frustrations and noise, blogging about facts, not expectations.

    And yes, we can put the hand in, as we already do by helping making Cecil a more robust fx by reporting every single problem NDepend users have because of Cecil (I would say around 15 issues reported during 2008).

  • Sebastien Pouliot

    Many lessons you learned young padawan, but still much to learn*.

    asses your limits by measuring the percentage of time spent in tier code.

    Measuring tier code is *not* measuring your limits (but it can be measuring your old decisions ;-). This is even less true when you lucky enough to depend on open-source tier code. You made a good decision in choosing Cecil (not that there were a lot of alternatives ;-) and you can/should capitalize on it.

    Take this lesson from someone who depends very much on Cecil performance *and* invested a bit time into making it faster and less memory hungry: You too can make Cecil better, get the advantages inside your software and contribute to the community around Cecil. If you have reached 35% of your time inside Cecil then it does look like a prime candidate to make NDepend better, performance wise.

    BTW since last week Cecil gained new Has{x} properties that can help you reduce your memory usage quite a bit (since NDepend probably touch every properties and creates empty collection everywhere possible).

    * or as Yoda used to say “use the source Luke”!

  • http://www.NDepend.com Patrick Smacchia

    I am not sure it is that easy Jon, I used to do “a lot” of C++ before .NET.

    For sure you can get a non-neglectable gain in performance with C++ but for sure also you’ll need much much more time to achieve the same set of features.

  • http://jon@jon.com jon

    you forgot
    Migrate to C++ gain 99%