An Original Algorithm to Find .NET Code Duplicate

The upcoming Visual Studio 2012 version comes with tooling to find code duplicate.

However, today, I’d like to talk about another code duplicate tool for .NET code. This tool comes as a NDepend Power-Tool. Power-Tools are a set of open-source tools based on NDepend.API. The source code of Power-Tools can be found in $NDependInstallPath$\ NDepend.PowerTools.SourceCode\ NDepend.PowerTools.sln.

The algorithm underlying this code duplicate Power-Tool is simple, yet powerful in practice. It consists in defining sets of methods that are using the same members, i.e calling the same methods, reading the same fields, writing the same fields. We call these sets, suspect-sets. Suspect-sets are sorted by the number of same members used.

Let’s run this Power-Tool on the NUnit v2.6.0 code base. The screenshot below shows the 3 steps of the algorithm executed, and the first suspect-set. This first suspect-set is, as often from my testing experience, the entry points of the application. Very often an application has several entry points that does almost the same initialization things. Sometime it is pure duplicate, sometime it is slightly changed. Most of the time this initialization code can be factorized in one or several parametrized method.

Notice the option o for Open callers methods decls. The declarations are opened in Visual Studio actually.

The second suspect-set is the following one…

bingo, we have an exact code duplicate!

The third suspect-set is the following one…

…by looking at the code below, this suspect-set can eventually be considered as a false positive. Personally, I still consider such code as code duplicated. These two methods implement a unique layout algorithm, that could be parametrized. So here I’d vote for a factorization.

I could continue on and on, but I invite you to download the 14-days full featured evaluation of NDepend to see what this algorithm will report on your own code.

Hopefully duplicate results found by this algorithm are pretty relevant. Its main advantage over others algorithms is that it is not disturbed by dirty copy-paste where the pasted code is then sightly modified. Another advantage is that this algorithm can be run on IL code only, source code is not required. Notice that what this current blog post doesn’t show, is that a suspect-set can have more than 2 suspect methods.

On the cons side, sometime a suspect-set can be humanly considered as a false positive. At least it is almost always possible to refactor found duplicates to factorize them into one or several parametrized methods (as explained for the 2 layout methods). The fact is that two or several methods don’t use the same set of members by coincidence.

Also this algorithm is very fast to run, from 10 to 100 times faster than the VS2012 one. Concretely it takes few seconds to be executed on a real-world large code base. This speed is the consequence of the fact that NDepend.API is optimized to browse dependencies very quickly.

Concerning the algorithm itself, it is open-source so if you are interested in browsing and tweaking it you can. This open-source + NDepend.API form makes it particularly suited to be integrated into a CI process.

The algorithm consists in 3 steps:

  1. Investigate each method (including third-party ones) to see if their callers could be considered as suspect (methods called very often like the ones from types in System.Collection.Generics or System.Xml are discarded to limit false positive).
  2. Merge suspect sets obtained from step 1).
  3. Sort suspect-sets from a weigh computed on number of members called.

Hopefully this algorithm can be optimized further, and your feedback is welcomed. Enjoy!

This entry was posted in Uncategorized. Bookmark the permalink. Follow any comments here with the RSS feed for this post.
  • Marcel Valdez

    One of the best features built-in to the NDepend API, is the IsUsedBy and Uses extensions, magnificent for code analysis, saves alot of lines.

  • Marcel Valdez

    Yes, I did a small Proof of Concept for counting lines of code (method and class definitions + logical lines of code NDepend gives) of Visual C++, VB.Net, C# and Visual J, it gave consistent results on all 4 languages. I also tested it with JScript.Net, but it seems the jsc compiler adds alot of stuff to each class, and NDepend counts this as Logical Lines of Code.

    I also wrote some Proof of Concept stuff to get all methods and classes in those 4 languages, that worked too.

  • Patrick Smacchia

    Glad that you find this material useful Marcel, did you dig into the PowerTool source code and attempted to create your own one?

  • Guest

    Glad you find this material useful Marcel, did you dig into the PowerTool source code and attempted to create your own one?

  • Marcel Valdez

    What would be even more interesting is a combination of this and Simian for a CI report. It’s a complicated setup, but interesting results could come out.

  • Marcel Valdez

    I just started reading about the PowerTools, and I’m already impressed by this example.

  • Leyu Sisay

    Just gave it a try against my code-base, and got a bunch of duplicates.

  • Gregoryyoung1

    like the way this works!

  • Christian Bjerre

    Thanks for the interesting post, Patrick! I will try it out as soon as I have v4 of NDepend on my box.

    Interesting to compare with Simian on the same codebase…

  • Marcos Meli

    Genious !!!

    Great post and great idea to provide NDepend.Api

    Thanks !!