Simple insights into source code bases

Does the code base scream the domain?

I was wondering whether it would be possible to use parts of PageRank (https://en.wikipedia.org/wiki/PageRank) to gain insights into a code base. If PageRank works on web pages to ascertain what the contents of a page relate to, then a similar approach could likely be devised for source code.

The simplest thing that could possibly work?
I chose the n-gram (https://en.wikipedia.org/wiki/N-gram) approach, unigrams to be specific. While bi- and tri-grams are better for text, I’m not so sure they are for code bases. Nevertheless, it could be tested.

The simple process

Find all files of a specific language inside the project structure. It would likely be prudent to examine source and test code independently.

Remove all forms of newlines.

Tokenize on non-alphanumeric characters.

Build a histogram of these tokens.

Removing comments and possibly strings would likely be a good idea, but that would require parsing and not just bash.
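
The steps above can be sketched in a few lines. This is a minimal Python sketch rather than the bash I actually used; the names tokenize and histogram and the default ".java" suffix are assumptions made for illustration. Splitting on non-alphanumeric characters also takes care of the newlines, and it treats underscores as separators.

```python
import re
from collections import Counter
from pathlib import Path

# One or more characters that are not ASCII letters or digits
# (this includes newlines, dots, underscores and all punctuation).
TOKEN_SPLIT = re.compile(r"[^0-9A-Za-z]+")


def tokenize(text):
    """Split text on non-alphanumeric runs, dropping empty tokens."""
    return [token for token in TOKEN_SPLIT.split(text) if token]


def histogram(root, suffix=".java"):
    """Count tokens across all files below root whose name ends in suffix."""
    counts = Counter()
    for path in Path(root).rglob(f"*{suffix}"):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="ignore")
            counts.update(tokenize(text))
    return counts
```

Calling histogram("path/to/project").most_common(40) would then give a list comparable to the one below, with comments and strings still included.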

Looking at gerrit’s word frequency, we get something along these lines:

27560 import
25092 the
21758 com
21385 google
16615 License
14676 public
14544 gerrit
13553 String
12309 final
11823 return
10431 private
10191 new
9809 if
8940 this
8196 0
7665 in
7225 void
7163 under
6809 a
6590 null
6389 server
6234 client
6185 static
6125 for
6024 2
5965 org
5953 to
5384 or
5212 class
4972 may
4963 Override
4934 name
4923 get
4752 distributed
4666 of
4602 java
4492 throws
4392 n
4164 is
3705 e

Reading it in order, “import the com google License public gerrit String final return private new if this 0 in void under a null server client static for 2 org to or class may Override name get distributed of java throws n is e” doesn’t quite make sense. Clearly the license header and the “com.google” namespace weigh heavily.

Removing the Java keywords and a few language-level names (String, Override, java) we get: “the com google License gerrit 0 in under a server client 2 org to or may name get distributed of n is e”
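
The filtering step is just a set lookup. A small Python sketch, where the REMOVED set is reconstructed by comparing the two token lists quoted above and is therefore an assumption about what counts as noise, not a complete list of Java keywords:

```python
# The 40 most frequent tokens, in the order listed above.
TOP_TOKENS = (
    "import the com google License public gerrit String final return "
    "private new if this 0 in void under a null server client static for "
    "2 org to or class may Override name get distributed of java throws "
    "n is e"
).split()

# Tokens that disappear between the two quoted lists (reconstructed).
REMOVED = frozenset({
    "import", "public", "String", "final", "return", "private", "new",
    "if", "this", "void", "null", "static", "for", "class", "Override",
    "java", "throws",
})


def drop_noise(tokens, noise=REMOVED):
    """Keep the tokens, in order, that are not in the noise set."""
    return [token for token in tokens if token not in noise]


print(" ".join(drop_noise(TOP_TOKENS)))
```

A fuller version would start from the language's actual keyword list and add the license words, which this sketch leaves in.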

The source code does not really scream what Gerrit is about. Reconstructing it Chinese-whispers style, I get something about a “client server with name distribution”, which is not quite the “Gerrit provides web based code review and repository management for the Git version control system” tagline.

The frequency count drops rapidly - let’s pull the data into R to see if there are some patterns.

This seems to be a power law distribution, but with a lot of outliers above 7 on the logarithmic axis (corresponding to a count of around 1100) and an anomaly just short of 8 (corresponding to a count of 2374, to be exact). The anomaly is quite likely the License template.
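
For readers without R at hand, the numbers can be checked in Python. This sketch assumes the plot used natural logarithms, which fits the figures quoted here since e^7 is roughly 1100. The helper log_log_slope is a hypothetical name; it fits a straight line to ln(count) against ln(rank), which is a rough way to eyeball a power law and not a rigorous test.

```python
import math


def log_log_slope(counts):
    """Least-squares slope of ln(count) against ln(rank).

    Counts are ranked from most to least frequent. Needs at least two
    counts, all positive. A power law shows up as a straight line.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


print(round(math.log(1100), 2))  # 7.0
print(round(math.log(2374), 2))  # 7.77
```

Feeding the full list of token counts to log_log_slope gives the exponent of the fitted line; large deviations from that line are the outliers mentioned above.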

Independence check

With the previous information, and the notion of a vector representation, I thought about the possibility of checking for independence.

If two vectors are independent, then they should be orthogonal. If two code bases are independent, then they should be orthogonal in their domain vectors. To test this, we can try to plot the words used in the code bases. Naturally, we would need to strip away the language keywords, but as we will see, this is not quite as necessary as expected. We can even gain other insights by looking at the keyword uses.
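
One way to put a number on this is the cosine of the angle between the two frequency vectors: 0 means orthogonal (no shared tokens), 1 means the same direction. This is a sketch of that idea in Python, not what I did for the comparison below; the function name and the counts in the usage lines are made up for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two token-count vectors.

    a and b map token -> count. Tokens missing from one side count as 0.
    Returns 0.0 if either vector is empty.
    """
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up counts, only to show the call.
project_a = Counter({"err": 40, "data": 35, "invoice": 12})
project_b = Counter({"err": 50, "data": 30, "sensor": 9})
print(cosine_similarity(project_a, project_b))
```

Language keywords push every pair of projects in the same language towards 1, so the value is most telling after they have been stripped.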

So, as above, I created word frequency files for two JavaScript projects.

The second-to-last entry in this list has been censored, but it does indicate that the projects aren’t quite independent. The tokens error, err, and data are so common and nondescript that it is somewhat okay to find them in this area, though I’d rather have fewer callback functions and better names in general.

Again, this can be explained by a lot of callbacks, which are often of the form:

function (err, data) {
  if (err) {
    // handle the error
  } else {
    // use data
  }
}

Another explanation could be lots of anonymous functions, though these are usually callbacks as well.

Conclusion

Removing comments and imports should provide a better picture of the code base. Even so, it does not exactly seem to scream the domain or architecture.

Bi-grams could be another improvement.
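
Counting bi-grams only takes pairing each token with its successor. A minimal, untested-on-Gerrit Python sketch, using the same split on non-alphanumeric characters as for the unigrams; the function name is an assumption.

```python
import re
from collections import Counter


def bigram_counts(text):
    """Count adjacent token pairs after splitting on non-alphanumerics."""
    tokens = [token for token in re.split(r"[^0-9A-Za-z]+", text) if token]
    # zip pairs every token with the one that follows it.
    return Counter(zip(tokens, tokens[1:]))


counts = bigram_counts("if (err) { return err; }")
print(counts.most_common())
```

Pairs such as ("err", "data") would then stand out as a unit instead of as two unrelated common words.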

An independence check of supposedly independent projects may reveal that they aren’t, or that the code is skewed towards an unwanted design.

It is far from perfect, but as always it brings a different way of looking at the code base, and it is relatively quick to do.

Comparing large code bases somewhat defeats the purpose, as regression to the mean tells us nothing much of interest. Taking Gerrit as an example, the most used token is “import”, which is used 27560 times, and as we saw above, the interesting parts reveal themselves around 1100 uses, which is just under 4% of that.

Comparing Gerrit and an old repo I had of dotCMS, we find that the most used keywords including entities in java.lang are:

import
String
public
return
if
new
this
null
private
void
static

This could indicate a lot of String constants and conditional logic (with return statements instead of else clauses), and possibly Primitive Obsession. Then again, the web does call for a lot of String use.

This entry was posted on Friday, April 8th, 2016 at 14:34 and is filed under Software development, programming.