The source{d} engine combines data retrieval and language analysis tools for scalable pipelines that process any number of Git repositories for source code analysis.

engine

engine is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.

It is written in Scala and built on top of Apache Spark to enable rapid construction of custom analysis pipelines and the processing of large numbers of Git repositories stored in HDFS in the Siva file format. It is accessible through both the Scala and Python Spark APIs, and can run on large-scale distributed clusters.

If you run engine in a UNIX-like environment, you should set the LANG variable properly:

export LANG="en_US.UTF-8"

The rationale behind this is that UNIX file systems don't store the encoding of each file name; names are just plain bytes, so the Java file system API looks at the LANG environment variable to decide which encoding to apply.

If the LANG variable is not set to a UTF-8 encoding, or is not set at all (in which case encoding is handled in the C locale), you may get an exception during engine execution similar to java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters.
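As a rough illustration of the check described above, the following Python sketch tests whether LANG names a UTF-8 locale before a job is started. The helper name lang_is_utf8 is hypothetical and not part of engine; it only inspects LANG, as the text does, and ignores other locale variables such as LC_ALL that may also affect the JVM.

```python
import os


def lang_is_utf8(env=None):
    """Return True if LANG names a UTF-8 locale such as en_US.UTF-8.

    Hypothetical helper for illustration; not part of engine.
    """
    env = os.environ if env is None else env
    lang = env.get("LANG", "")
    # A locale name has the shape language[_territory][.codeset][@modifier].
    _, _, codeset = lang.partition(".")
    codeset = codeset.split("@", 1)[0]
    # Accept the common spellings "UTF-8", "utf8", "UTF8".
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    if not lang_is_utf8():
        print('LANG is not a UTF-8 locale; consider: export LANG="en_US.UTF-8"')
```

Running such a check at the start of a driver script can surface the misconfiguration early, rather than as an InvalidPathException in the middle of a job.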

Supported repository formats

As you might have seen, you need to provide the repository format you will be reading when you create the Engine instance. Although the documentation always uses the siva format, there are more repository formats available.

These are all the supported formats at the moment:

siva: rooted repositories packed in a single .siva file.

standard: regular Git repositories with a .git folder, each in a folder of its own under the given repository path.

bare: bare Git repositories, each in a folder of its own under the given repository path.
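To make the three layouts concrete, here is a minimal Python sketch that guesses which format name applies to a given path on the local file system. The function guess_format is hypothetical and is not how engine itself detects anything (engine expects you to pass the format explicitly); the bare-repository test relies only on the usual Git layout of a HEAD file plus objects and refs directories.

```python
import os


def guess_format(path):
    """Guess the repository format name for a local path.

    Hypothetical helper for illustration; returns "siva", "standard",
    "bare", or None when the path matches none of the layouts.
    """
    # siva: a single packed file with the .siva extension.
    if os.path.isfile(path) and path.endswith(".siva"):
        return "siva"
    # standard: a working copy that contains a .git folder.
    if os.path.isdir(os.path.join(path, ".git")):
        return "standard"
    # bare: the Git metadata sits directly in the folder.
    if (
        os.path.isfile(os.path.join(path, "HEAD"))
        and os.path.isdir(os.path.join(path, "objects"))
        and os.path.isdir(os.path.join(path, "refs"))
    ):
        return "bare"
    return None
```

Applied to each entry under the repository path, a helper like this can serve as a sanity check that the directory contents match the format you intend to pass when creating the Engine instance.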

Processing local repositories with the engine

Some design decisions may surprise you when processing local repositories instead of siva files. Take the following into account when doing so:

All local branches will belong to a repository whose id is file://$REPOSITORY_PATH. So, if you clone https://github.com/foo/bar.git at /home/foo/bar, you will see two repositories, file:///home/foo/bar and github.com/foo/bar, even though you only have one.

Remote branches are transformed from refs/remotes/$REMOTE_NAME/$BRANCH_NAME to refs/heads/$BRANCH_NAME, as they will only belong to the repository id of their corresponding remote. So refs/remotes/origin/HEAD becomes refs/heads/HEAD.
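The two rules above can be sketched in a few lines of Python. Both function names are hypothetical and this is not engine's own code; the sketch assumes Git's standard remote-tracking namespace, refs/remotes/, and simply mirrors the mapping described in the text.

```python
def local_repository_id(repository_path):
    """Build the id assigned to the local branches of a repository."""
    return "file://" + repository_path


def rewrite_remote_ref(ref_name):
    """Map refs/remotes/<remote>/<branch> to refs/heads/<branch>.

    Any other reference name is returned unchanged.
    """
    prefix = "refs/remotes/"
    if not ref_name.startswith(prefix):
        return ref_name
    # Drop the remote name; the branch name may itself contain slashes.
    _remote, sep, branch = ref_name[len(prefix):].partition("/")
    if not sep or not branch:
        return ref_name
    return "refs/heads/" + branch


print(local_repository_id("/home/foo/bar"))            # file:///home/foo/bar
print(rewrite_remote_ref("refs/remotes/origin/HEAD"))  # refs/heads/HEAD
```

Because the remote name is dropped from the reference, the association with the remote is carried by the repository id instead, which is why a single clone shows up as two repositories.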

Playing around with engine on Jupyter

You can launch our Docker container, which contains some example notebooks, by running: