Detecting code duplications in Perl applications

December 08, 2017

Code Duplication Detection

How do we find code duplications in Kritika? There are several ways of doing it and here is ours.

Code duplication is a term for identical (copy&pasted) or similar source code appearing in different parts of a system. It is considered a bad practice. Some argue it is the worst bad practice in programming; others say it is close to the worst, right after the wrong abstraction. But why is it bad anyway?

Software development rarely stops once the program is written. In most cases it is rewritten, modified or adjusted because something is missing, a new feature is needed or a bug is found. Reading and editing software usually takes way more time than writing it. If during development (more often while adding new features or tests) some parts of the code are copied and slightly modified, then the next time a bug is fixed the developer has to make sure it is fixed in every place where the duplicated code is used.

Let's look at an example. Imagine new developers are assigned the task of adding a new button to the system. They find code where this is already solved, and since it has been in the system for quite a while it should work fine without any problems. So copying it looks like the safest way. They copy a construct/function/class, test it and everything is fine. Later on a bug is found in the button implementation (or a new color is needed) and a different developer goes in and fixes the problem. But since he is not the one who copied the code, he doesn't know that the duplication exists. If he doesn't trust his team members he will go ahead and try to find code similar to the one just fixed (and this is generally a good practice), but if he doesn't have the time, or skips it for whatever other reason, the bug will still be there, just in a different place.

These situations occur a lot more often than you think. If you are a team lead or a project manager and pay enough attention to the code everybody submits (we hope you do code reviews), you will be able to spot duplication long before it is in production. If not, it will create a lot of problems in the future. Occasionally, similar code can also reveal problems in architecture and abstractions. Detecting code duplication is therefore a very useful practice that makes code more maintainable and readable.

Ways of detecting code duplication

  • String-based
  • Token-based
  • Tree-based
  • Semantics-based
  • Kritika.IO-based


String-based

A program is treated as text and duplications are found accordingly. Algorithms like Longest Common Substring can be used here (similar to what the diff program uses). If you apply LCSS recursively, finding all common substrings rather than just the longest one, you can find all duplicated sequences in a file.
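As an illustration only (not Kritika's implementation), here is a minimal Python sketch using difflib's matching blocks, a close relative of the recursive LCSS idea; the Perl snippets and the min_len threshold are invented for the example:

```python
import difflib

def common_chunks(a, b, min_len=10):
    """Find common substrings of at least min_len characters between
    two source texts. get_matching_blocks() repeatedly takes the
    longest matching block and recurses on both sides of it, which is
    essentially the 'apply LCSS recursively' idea."""
    sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(m.a, m.b, a[m.a:m.a + m.size])
            for m in sm.get_matching_blocks() if m.size >= min_len]

file1 = "my $total = 0; for my $x (@items) { $total += $x; }"
file2 = "sub sum { my $t = 0; for my $x (@items) { $t += $x; } }"

chunks = common_chunks(file1, file2)
# The shared loop shows up as a common chunk.
assert any("for my $x (@items)" in text for _, _, text in chunks)
```

Note how the renamed variable ($total vs $t) cuts the match short: comparison at the character level is fragile to even tiny edits.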

Unfortunately, since programming languages tend to have lots of structure tokens like {}, (), [] etc and keywords that are used everywhere (if, function, return etc), there will be lots of false positives. Also, only exact duplications can be found. Even if you try to ignore spaces (which is pretty hard without parsing the language, since you have to distinguish between spacing and literal strings, special constructs etc), a slight identifier rename will make the code sequence unique.

Moreover, programs can have millions of lines of code, which makes this algorithm rather slow for quick day-to-day analysis.


Pros:

  • easy to implement


Cons:

  • not suitable for programming languages
  • very slow on large files
  • cannot focus on or ignore specific functions or code structures
  • detects only exact copy&paste


Token-based

A program is tokenized by a lexical parser, producing token sequences. A technique similar to the previous one can be used, but instead of comparing actual characters we compare tokens, making the comparison less fragile.
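A rough Python sketch of the same comparison at the token level; the toy regex lexer and its token classes are invented for the example (a real tool would use a proper language parser, e.g. PPI for Perl):

```python
import re

# Toy lexer for illustration only.
TOKEN_RE = re.compile(r"""
      (?P<ID>    [$@%]?[A-Za-z_]\w*)   # identifiers, incl. Perl sigils
    | (?P<NUM>   \d+)                  # numeric literals
    | (?P<PUNCT> [{}()\[\];,=+*/<>-])  # structure tokens and operators
    | (?P<SKIP>  \s+)                  # whitespace, thrown away
""", re.VERBOSE)

def tokenize(src):
    """Turn source text into (kind, value) tokens, dropping
    whitespace so formatting differences no longer matter."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(src)
            if m.lastgroup != "SKIP"]

# Same code, different formatting: the token streams are identical.
assert tokenize("my $total = 0;") == tokenize("my  $total=0 ;")

# A renamed variable changes token values but not token kinds,
# which is why token-level detection is less fragile.
kinds = lambda toks: [k for k, _ in toks]
assert tokenize("my $sum = 0;") != tokenize("my $total = 0;")
assert kinds(tokenize("my $sum = 0;")) == kinds(tokenize("my $total = 0;"))
```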


Pros:

  • easy to implement, but you have to have a language parser
  • more robust than string-based


Cons:

  • still slow on large files
  • cannot focus on or ignore specific functions or code structures
  • detects only almost identical copy&paste


Tree-based

A program is parsed into an AST (Abstract Syntax Tree) fully representing the program structure. It is possible to extract the subtrees of interest and compare them. It is also possible to create subtree fingerprints (identifiers, code complexity), greatly speeding up the search for similar structures.

The downside is that many common structures can be used for legitimately different purposes, creating a lot of false positives. This includes generic constructors, getters/setters, factory methods etc.
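A sketch of the idea using Python's own ast module for illustration (Kritika parses Perl; hashing only the node types, as done here, is a deliberately naive fingerprint):

```python
import ast
import hashlib

def subtree_fingerprint(node):
    """Hash the *shape* of an AST subtree (node types only), so that
    renamed variables still produce the same fingerprint."""
    shape = [type(n).__name__ for n in ast.walk(node)]
    return hashlib.sha1(",".join(shape).encode()).hexdigest()

def duplicate_functions(src):
    """Group function definitions by structural fingerprint and
    return the groups that contain more than one function."""
    groups = {}
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.FunctionDef):
            groups.setdefault(subtree_fingerprint(node), []).append(node.name)
    return [names for names in groups.values() if len(names) > 1]

src = """
def total(items):
    s = 0
    for x in items:
        s += x
    return s

def suma(values):
    t = 0
    for v in values:
        t += v
    return t
"""
# The two functions differ only in names, so they share a fingerprint.
assert duplicate_functions(src) == [["total", "suma"]]
```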


Pros:

  • detects similar code
  • relatively fast on large files
  • can focus on or ignore specific functions or code structures


Cons:

  • difficult to implement
  • many false positives for short subtrees


Semantics-based

Instead of focusing on how a program does something, we focus on what it is doing. This involves high-level analysis, such as dependencies on other code and similar design patterns, and requires much more sophisticated fingerprinting algorithms than tree-based detection.

Because of the complex calculations involved, this approach does not scale well to large projects.


Pros:

  • detects semantically similar code
  • can focus on or ignore specific functions or code structures


Cons:

  • hard to implement
  • slow on large codebases

What do we do?

As usual in real life, the best approach is to combine all the methods. Here is what we do:

  1. Parse the source code to AST.


    This stage is pretty straightforward. Depending on the language we're parsing, a different tree is produced. We ignore whitespace; comments and documentation are preserved for other analyses.

  2. Extract subtrees.

    During this stage, depending on the subtree types we are interested in (functions, conditional statements etc) and the threshold settings (minimum subtree length), the appropriate subtrees are extracted, normalized and saved for later analysis.

  3. Replace literal tokens, function/method calls and classes with their content hash.

    During this stage, to make subtrees less generic, we replace these tokens with a hash of their content. This makes subtrees different when they use different method names or classes, for example.

  4. Shrink subtree token names.

    During this stage we shrink the subtrees to make them a lot shorter and speed up the comparison process: for example, statement becomes st, identifier becomes id and so on.


  5. Compare the subtrees one-to-one.

    Depending on whether we want to search for similar or exact ASTs, we use the appropriate comparison operator.

    Additional checks are made to ensure that the subtrees serve the same purpose. We calculate the code complexity as a subtree fingerprint. Comparing the fingerprints (allowing some percentage of deviation) lets us discard totally irrelevant strings early on.

  6. Do LCSS comparison of the final duplications.

    For duplication ranking we run recursive LCSS to calculate the percentage of common data. This affects the sorting in the duplications tab. Highlighting the common code also helps to analyze results visually, as in the first picture of this article.
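The normalization in steps 3 to 5 can be sketched roughly as follows; the flattened token-stream format and the abbreviation table are invented for the illustration and are not Kritika's internal representation:

```python
import hashlib

# Assumed abbreviation table (step 4); the real mapping is internal.
SHRINK = {"statement": "st", "identifier": "id", "literal": "lit", "call": "c"}

def hash8(text):
    """Short, stable content hash (step 3)."""
    return hashlib.sha1(text.encode()).hexdigest()[:8]

def normalize(tokens):
    """Apply steps 3 and 4 to a flattened subtree: hash the content of
    literals and calls, then shrink the token kind names."""
    out = []
    for kind, value in tokens:
        if kind in ("literal", "call"):
            value = hash8(value)                         # step 3
        out.append(f"{SHRINK.get(kind, kind)}:{value}")  # step 4
    return out

a = [("statement", "my"), ("identifier", "$x"), ("literal", "42")]
b = [("statement", "my"), ("identifier", "$x"), ("literal", "42")]
c = [("statement", "my"), ("identifier", "$x"), ("literal", "43")]

# Step 5: an exact comparison of the normalized streams.
assert normalize(a) == normalize(b)
assert normalize(a) != normalize(c)  # different literal, different hash
```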

How to try it out?

Right now code duplication detection is in beta stage. You have to explicitly enable it in your repository settings:

Enable Code Duplication Detection

Sign Up and try it yourself!