Blog

Validating code from command line

January 29, 2018

Ever wanted to validate your code on kritika.io before committing or pushing? Now you can, using our open source command line tool kritika. For example:

kritika src/MyFile.js
kritika src/AnotherFile.pm

Or you can just be lazy and pass all the modified files:

git diff --name-only | kritika

You can also hide all the old violations and check only whether you have introduced new ones by using the --diff-* options:

# compare to the upstream branch
kritika --diff-branch master src/MyFile.js

# compare to a specific snapshot
kritika --diff-snapshot 56 src/MyFile.js

Installing the kritika app

Linux:

$ mkdir -p ~/bin
$ curl https://raw.githubusercontent.com/kritikaio/app-kritika/master/kritika.fatpack -o ~/bin/kritika
$ chmod +x ~/bin/kritika

Configuring the kritika app

For kritika to know the target repository, create a file called .kritikarc inside your working directory with the following information:

token=YOUR_TOKEN_FROM_KRITIKA_IO_WEBSITE

Treat this token as a password! We suggest you also run chmod 600 .kritikarc on a multiuser computer. Do NOT commit this file unencrypted.

And that's it!

Automating validation

Here is a simple example of how to integrate the kritika app into your workflow as a git pre-push hook (save it as .git/hooks/pre-push and make it executable):

#!/bin/sh

# Pre-push hook: validate the files being pushed with kritika before
# the push proceeds.

remote="$1"
url="$2"

# 40 zeroes: the object id git uses for a non-existent ref
z40=0000000000000000000000000000000000000000

while read local_ref local_sha remote_ref remote_sha
do
    if [ "$local_sha" = "$z40" ]
    then
        # Deleting a remote branch: nothing to validate
        :
    else
        if [ "$remote_sha" = "$z40" ]
        then
            # New branch: no remote counterpart to diff against
            range="$local_sha"
        else
            # Existing branch: examine only the commits being pushed
            range="$remote_sha..$local_sha"
        fi

        branch="$(git rev-parse --abbrev-ref HEAD)"

        # Validate only the changed files; abort the push on failure
        git diff --name-only $range | kritika --diff-branch "$branch" || exit 1
    fi
done

exit 0

Give it a try! And, as usual, your feedback is very welcome.

Non-invasive code violation resolutions

December 19, 2017

When running static analysis there is a good chance of hitting a false positive, or a violation that is simply acceptable in a particular case. Most tools have a way to ignore violations by placing a special annotation next to the code, like this:

## no critic
$foo = '';

The problem with this approach is that you have to change your code (even if it's just a comment) in order to "fix" it. Also, sometimes it is desirable to ignore a violation just for some time, until the company has the resources to fix it properly (to pay down its technical debt).

In Kritika it is now possible to "resolve" a violation without modifying your code. Next to every violation there is a special "Resolve" link; clicking it presents a form where you can specify the scope, the precision, an optional expiration, and a comment explaining why it is "resolved".

Violation Resolution Form

Resolution scope

The "resolution" of the violation can be done within the project or within this current file. Of course if you want to ignore the violation for the whole project it is better to just disable the violation in your Profile. But if you want to make this "resolution" temporary then go ahead and resolve it.

Resolution precision

The "resolution" of the violation can be content-oriented which means it will resolve the violations which are caused by exactly the same code. This code can be moved anywhere and it will be still "resolved". So the file change or code "tiding" will not affect the "resolution".

Resolution expiration

Most of the time you want to ignore a violation only temporarily, without forgetting to fix it later when time allows. Specifying an expiration date achieves exactly that.

Sign Up and try it yourself!

Detecting code duplications in Perl applications

December 08, 2017

Code Duplication Detection

How do we find code duplications in Kritika? There are several ways of doing it and here is ours.

Code duplication is a term for identical (copy&paste) or similar source code appearing in different parts of a system. It is considered a bad practice. Some argue that it's the worst bad practice in programming; others say it is second only to the wrong abstraction. But why is it bad, anyway?

Software development rarely stops after the program is written. In most cases it is rewritten, modified, or adjusted because something is missing, a new feature is needed, or a bug is found. Reading and editing software usually takes far more time than writing it. If during development (or, more often, while adding new features or tests) some parts of the code are copied and slightly modified, then the next time a bug is fixed the developer has to make sure it is fixed in all the places where the duplicated code is used.

Let's look at an example. Imagine new developers are assigned the task of adding a new button to the system. They find code where this is already solved, and since it's been in the system for quite a while, it should work fine without any problems. So copying it looks like the safest way. They copy a construct/function/class, test it, and everything is fine. Later on, a bug is found in the button implementation (or a new color is needed) and a different developer goes in and fixes the problem. But since he is not the one who copied the code previously, he doesn't know that the duplication exists. If he doesn't trust his team members, he will go ahead and try to find code similar to the one just fixed (and this is generally a good practice), but if he doesn't have the time, or for whatever other reason, the bug will still be there, just in a different place.

These situations occur a lot more often than you might think. If you're a team lead or a project manager and pay enough attention to the code everybody is submitting (we hope you do code reviews), you will be able to spot it long before it's in production. But if not, it will create a lot of problems in the future. Occasionally, similar code can also reveal problems in architecture and abstractions. Thus detecting code duplication is a very useful practice that makes code more maintainable and readable.

Ways of code duplication detection

  • String-based
  • Token-based
  • Tree-based
  • Semantics-based
  • Kritika.IO-based

String-based

A program is treated as plain text, and duplications are found accordingly. Algorithms like Longest Common Substring (similar to what the diff program uses) can be applied here. If you apply LCSS recursively, finding all common substrings rather than just the longest one, you can find all duplicated sequences in a file.
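For illustration, here is a minimal LCSS sketch in Perl, using the classic dynamic-programming approach (this is a generic textbook algorithm, not our production code):

#!/usr/bin/env perl
use strict;
use warnings;

# Longest common substring via dynamic programming: O(n*m) time,
# O(m) memory. Returns the longest run of characters shared by
# both strings.
sub longest_common_substring {
    my ($x, $y) = @_;
    my @x = split //, $x;
    my @y = split //, $y;
    my ($best_len, $best_end) = (0, 0);
    my @prev = (0) x (@y + 1);
    for my $i (1 .. @x) {
        my @cur = (0) x (@y + 1);
        for my $j (1 .. @y) {
            if ($x[$i - 1] eq $y[$j - 1]) {
                $cur[$j] = $prev[$j - 1] + 1;
                ($best_len, $best_end) = ($cur[$j], $i)
                    if $cur[$j] > $best_len;
            }
        }
        @prev = @cur;
    }
    return substr $x, $best_end - $best_len, $best_len;
}

# Prints " = $price * $qty": the longest fragment the lines share.
print longest_common_substring(
    'my $total = $price * $qty;',
    'my $sum = $price * $qty + $tax;'
), "\n";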

Unfortunately, since programming languages tend to have lots of structural tokens like {}, (), [], etc., and ubiquitous keywords like if, function, return, there will be lots of false positives. Also, only exact duplications can be found. Even if you try to ignore whitespace (which is pretty hard without parsing the language, since you have to distinguish between spacing, literal strings, special structures, etc.), a slight identifier rename will make the code sequence unique.

Moreover, programs can have millions of lines of code, which makes this algorithm rather slow for quick day-to-day analysis.

Pros:

  • easy to implement

Cons:

  • not suitable for programming languages
  • very slow on large files
  • cannot focus on or ignore specific functions or code structures
  • detects only exact copy&paste

Token-based

A program is tokenized by a lexical parser, producing token sequences. A technique similar to the previous one can then be used, but instead of comparing actual characters we compare tokens, making the comparison less fragile.
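A toy Perl sketch of the idea (the tokenizer below is deliberately naive and only an illustration; a real lexer must also handle strings, comments, heredocs, and so on):

#!/usr/bin/env perl
use strict;
use warnings;

# Toy tokenizer: identifiers, numbers, and single punctuation
# characters become tokens; whitespace is dropped entirely.
sub tokenize {
    my ($code) = @_;
    my @tokens;
    while ($code =~ /\G\s*([A-Za-z_]\w*|\d+|\S)/gc) {
        push @tokens, $1;
    }
    return @tokens;
}

# Two fragments are token-equal if their token sequences match,
# regardless of spacing or line breaks.
sub same_tokens {
    my ($x, $y) = @_;
    return join("\0", tokenize($x)) eq join("\0", tokenize($y));
}

# Prints "duplicate": the fragments differ only in formatting.
print same_tokens(
    'if ($x) { return 1; }',
    "if ( \$x )\n{\n    return 1;\n}"
) ? "duplicate\n" : "different\n";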

Pros:

  • easy to implement, though you need a lexer for the language
  • more robust than string-based

Cons:

  • still slow on large files
  • cannot focus on or ignore specific functions or code structures
  • detects only almost identical copy&paste

Tree-based

A program is parsed into an AST (Abstract Syntax Tree) fully representing the program's structure. It is possible to extract the needed subtrees and compare them. It is also possible to create subtree fingerprints (identifiers, code complexity), dramatically increasing the speed of finding similar structures.
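For Perl, the extraction stage could be sketched with the CPAN parser PPI. The fingerprint below is just an MD5 of whitespace-normalized block text, a crude stand-in for real structural fingerprints:

#!/usr/bin/env perl
use strict;
use warnings;
use PPI;                         # CPAN Perl parser
use Digest::MD5 qw(md5_hex);

my $code = do { local $/; <DATA> };
my $doc  = PPI::Document->new(\$code) or die "parse failed";

# Extract one kind of subtree (named subroutines), normalize the
# whitespace in each body, and fingerprint it; equal fingerprints
# mark candidates for a closer comparison.
my %seen;
for my $sub (@{ $doc->find('PPI::Statement::Sub') || [] }) {
    my $block = $sub->block or next;   # skip forward declarations
    my $text  = $block->content;
    $text =~ s/\s+/ /g;                # crude normalization
    push @{ $seen{ md5_hex($text) } }, $sub->name;
}

# Prints "possible duplicates: add_order add_items"
for my $names (values %seen) {
    print "possible duplicates: @$names\n" if @$names > 1;
}

__DATA__
sub add_order { my ($x, $y) = @_; return $x + $y }
sub add_items { my ($x, $y) = @_; return $x + $y }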

The downsides are that many common structures can be used for legitimately different purposes, thus creating a lot of false positives. This includes generic constructors, getters/setters, factory methods, etc.

Pros:

  • detects similar code
  • relatively fast on large files
  • can focus on or ignore specific functions or code structures

Cons:

  • difficult to implement
  • many false positives for short subtrees

Semantics-based

Instead of focusing on how a program does something, we focus on what it is doing. This involves high-level analysis such as dependencies on other code and similar design patterns, and requires much more sophisticated fingerprinting algorithms than tree-based detection.

Because of the complex calculations, this approach does not scale well for large projects.
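As a toy illustration of the idea, one could use the set of functions a fragment calls as a crude "semantic fingerprint" and compare such sets with Jaccard similarity (the helper names here are invented for the example):

#!/usr/bin/env perl
use strict;
use warnings;

# Toy "semantic fingerprint": the set of function names a fragment
# calls. Fragments calling the same functions become candidates even
# when their token sequences differ.
sub called_functions {
    my ($code) = @_;
    return { map { $_ => 1 } $code =~ /\b(\w+)\s*\(/g };
}

# Jaccard similarity: intersection size divided by union size.
sub similarity {
    my ($s1, $s2) = @_;
    my %union = (%$s1, %$s2);
    my $inter = grep { $s1->{$_} && $s2->{$_} } keys %union;
    my $total = keys %union;
    return $total ? $inter / $total : 0;
}

my $x = 'open_db(); my $r = fetch_rows($query); close_db();';
my $y = 'open_db(); my @rows = fetch_rows($sql); close_db();';

# Prints 1.00: both fragments depend on the same three functions.
printf "similarity: %.2f\n",
    similarity(called_functions($x), called_functions($y));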

Pros:

  • detects semantically similar code
  • can focus on or ignore specific functions or code structures

Cons:

  • hard to implement
  • slow in large codebases

What do we do?

As usual in real life, the best way is to combine all of the methods. Here is what we do:

  1. Parse the source code into an AST.

    AST

    This stage is pretty straightforward. Depending on the language being parsed, a different tree is produced. Whitespace and comments are ignored, while documentation is preserved for other analysis.

  2. Extract subtrees.

    During this stage, depending on the subtree types we are interested in (functions, conditional statements, etc.) and the threshold settings (minimum subtree length), the appropriate subtrees are extracted, normalized, and saved for later analysis.

  3. Replace literal tokens, function/method calls, and class names with a hash of their content.

    During this stage, to make subtrees less generic, we replace these tokens with a hash of their content. This keeps subtrees distinct when, for example, different method names or classes are used (see the toy sketch after this list).

  4. Shrink subtree token names.

    During this stage we shrink the subtrees, making them a lot shorter to speed up the comparison process: statement becomes st, identifier becomes id, and so on. For example:

    stsutowotowostblstvatowotosytooptowotoststvatowostlistex[...]

  5. Compare the subtrees one-to-one.

    Depending on whether we want to search for similar or exact ASTs, we use the appropriate comparison operator.

    Additional checks are made to ensure that the subtrees serve the same purpose. We calculate the code complexity as a subtree fingerprint; comparing the fingerprints (allowing some percentage of deviation) lets us discard totally irrelevant strings early on.

  6. Run an LCSS comparison of the final duplications.

    For duplication ranking we run recursive LCSS to calculate the percentage of common data. This affects the sorting in the duplications tab. Highlighting the common code also helps to analyze results visually, as in the first picture of this article.
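To make steps 3 and 4 more concrete, here is a toy Perl sketch of the normalization. The token names, abbreviations, and hashing scheme are invented for the example and are not our exact internal format:

#!/usr/bin/env perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Toy version of step 3: content-bearing tokens (identifiers,
# literals) are replaced with a short hash of their text, so two
# subtrees stay identical only if they use the same names and values.
# Toy version of step 4: structural token names are shrunk to two
# letters to keep the serialized subtree short.
my %shrink = (
    statement  => 'st',
    identifier => 'id',
    operator   => 'op',
    literal    => 'li',
);

# A subtree serialized as (type, text) pairs; in reality this would
# come from the AST produced in step 1.
my @subtree = (
    [ statement  => 'call' ],
    [ identifier => 'render_button' ],
    [ operator   => '->' ],
    [ literal    => '"blue"' ],
);

my $normalized = join '', map {
    my ($type, $text) = @$_;
    $type eq 'identifier' || $type eq 'literal'
        ? $shrink{$type} . substr(md5_hex($text), 0, 6)
        : $shrink{$type};
} @subtree;

# Prints the shrunken types, with identifiers and literals carrying
# a 6-character hash of their content.
print "$normalized\n";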

How to try it out?

Right now code duplication detection is in beta. You have to enable it explicitly in your repository settings:

Enable Code Duplication Detection

Sign Up and try it yourself!