Source Code Search Engine

Concept

Programmers spend 50% of their time just looking at source code.
Nearly half of that (25% of the total) goes to searching and navigating
code (reference). To understand how a system is organized, they often must look at, and across,
the many files that make up the system.

It is difficult to find code in large software systems of thousands of files coded
in multiple programming languages. Often programmers use string search tools such as
Unix grep or an IDE search command. Searches with grep are slow across thousands of files,
and give no easy way to view a hit in its surrounding source text. IDE searches are limited,
at best, to the current project rather than the entire source code base.

The Search Engine provides an interactive
interface for searching a large source code base quickly.
It uses the language structure of each of the languages, giving far more
precise answers than simple string searches can produce. For any query,
the Search Engine offers a list of matches with surrounding context;
the user can select a specific match and immediately inspect the source file.

See a screen shot of a query and results over 560,000 lines of code including both IBM Enterprise COBOL and Java.
This example shows a search for occurrences of any variable, either COBOL or Java, whose name begins with the prefix "temp".
Matches are found in both Java and COBOL code, and one such match from COBOL is selected for display in the code browser frame.
Here the search is over all files included in the project that constructs the search index, but a file focus can narrow
the search space within the project, for example to all the "*.java" files.

Metrics

The Search Engine computes Cyclomatic and Halstead Complexity metrics, as well as Source Line, Code Line, Comment Line
and Blank Line counts for each of the files indexed.
This gives users an easy way to determine the relative complexity of system modules of interest. You can see an
example metrics result file.
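
As a rough illustration of what the line-count metrics mean (not of how the Search Engine computes them), here is a minimal Python sketch. The function name `count_lines` is hypothetical, and the sketch assumes C-style `//` and `/* ... */` comments only. It simplifies in ways a real language-accurate scanner would not: a line with code followed by a trailing comment counts as a code line, and every line inside a block comment counts as a comment line.

```python
from collections import Counter


def count_lines(source: str) -> Counter:
    """Classify each line as blank, comment, or code (hypothetical helper).

    Assumes C-style comments; comment markers inside string literals
    are not recognized, which a real scanner would handle.
    """
    counts = Counter(source=0, code=0, comment=0, blank=0)
    in_block = False  # True while inside an unterminated /* ... */ comment
    for line in source.splitlines():
        counts["source"] += 1
        text = line.strip()
        if in_block:
            counts["comment"] += 1
            if "*/" in text:
                in_block = False
            continue
        if not text:
            counts["blank"] += 1
        elif text.startswith("//"):
            counts["comment"] += 1
        elif text.startswith("/*"):
            counts["comment"] += 1
            if "*/" not in text[2:]:
                in_block = True
        else:
            counts["code"] += 1
    return counts
```

Calling `count_lines` on the text of each indexed file would give one row of counts per file, which is the general shape of a per-file metrics report.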

Finding an identifier starting with Interrupt in the Linux 2.6.19.2 source tree takes
the Search Engine 2.8 seconds. It finds 229 hits, all of them in identifiers
(because that is what was asked), and it looks only at .c, .h, and .S files.
Using the UI, you can scroll forwards and backwards through the short list of hits easily to select one.
You can click on a hit to instantly see it in the context of the full source text file with the hit highlighted.

56.6 Seconds: grep
Using cygwin grep for the same task:

grep Interrupt -R C:\work\linux-2.6.19.2

takes 56.6 seconds and produces 5297 hits (most of them in comments or in the middle of identifiers we didn't want).
Looking at 5297 hits is frankly crazy. After deciding what the right hit is,
you still have to type the file name into your editor to see the full source text around the hit.
With considerable thought you might write a grep regular expression that weakly approximates
what the Source Code Search Engine does more carefully
(consider ignoring hits in strings and comments).
But that will take you much longer than a minute.
grep climbed through some additional 2000 files in Linux directories that aren't .c, .h, or .S files,
adding to its cost. You can also write a more complex find and grep command that will filter out the unwanted files.
But that requires thought and more typing.
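
To make the "more complex find and grep" idea concrete, here is a minimal Python sketch of that kind of filtered search. The helper name `find_identifier_hits` is hypothetical. It restricts the walk to .c, .h, and .S files and uses a word-boundary regular expression so that the prefix is not matched in the middle of an identifier. It is still only a weak approximation: it has no notion of lexemes, so it reports hits inside comments and strings too.

```python
import os
import re


def find_identifier_hits(root, prefix, extensions=(".c", ".h", ".S")):
    """Return (path, line number, matched word) for words starting with prefix.

    Hypothetical helper: a regex approximation, not a language-aware search.
    """
    # \b keeps the prefix from matching in the middle of a longer identifier.
    pattern = re.compile(r"\b" + re.escape(prefix) + r"\w*")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue  # skip files outside the requested file types
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    for match in pattern.finditer(line):
                        hits.append((path, lineno, match.group()))
    return hits
```

A call such as `find_identifier_hits(r"C:\work\linux-2.6.19.2", "Interrupt")` would scan the same tree as the grep command above, reading every selected file on every query, which is the cost that a prebuilt index avoids.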

Difference in productivity: 20x or better on just the search part (56.6 seconds versus 2.8 seconds). Since the Search Engine
also shows you the full source text with a single click, you can examine a lot of hits in context very quickly.

Technology

Computer languages are typically structured from a set of allowed elements ("lexemes"), such as identifiers, strings, numbers, operators and punctuation, as well as various kinds of text blocks, such as blanks and comments, which are ignored by language processors. The Search Engine uses a language-specific
scanner to scan each source file and break it into lexemes according to the precise rules for that language.
These scanners are derived from the language definitions
used by DMS Software Reengineering Toolkit, which is used for language-accurate analysis and transformation.
Lexemes with variable content (identifiers, strings, comments, numbers) are converted from their source code format to a normal form in which character escapes and radix differences are removed, making searches much easier to specify across languages. Scanned lexemes are then indexed to enable fast searches.
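
The scan, normalize, and index steps described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the DMS-derived scanners: the token rules cover only a small C-like subset, keywords are treated as identifiers, only decimal and hexadecimal numbers are handled, and all names (`scan`, `normalize`, `build_index`, `find_identifiers`) are hypothetical.

```python
import re
from collections import defaultdict

# Toy lexical rules for a C-like language (assumption: not a real grammar).
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<string>"(?:\\.|[^"\\])*")
  | (?P<number>0[xX][0-9a-fA-F]+|\d+)
  | (?P<identifier>[A-Za-z_]\w*)
  | (?P<space>\s+)
  | (?P<operator>.)
""", re.VERBOSE | re.DOTALL)

ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def normalize(kind, text):
    """Convert a lexeme to a normal form independent of its source spelling."""
    if kind == "number":  # remove radix differences: 0x1F and 31 both become "31"
        base = 16 if text[:2].lower() == "0x" else 10
        return str(int(text, base))
    if kind == "string":  # drop the quotes and resolve character escapes
        return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(1)),
                      text[1:-1], flags=re.DOTALL)
    if kind == "comment":  # drop the comment markers
        body = text[2:] if text.startswith("//") else text[2:-2]
        return body.strip()
    return text


def scan(source):
    """Yield (kind, normalized value, line number) for each indexed lexeme."""
    line = 1
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind not in ("space", "operator"):
            yield kind, normalize(kind, text), line
        line += text.count("\n")


def build_index(files):
    """Map (kind, value) to the list of (path, line) places where it occurs."""
    index = defaultdict(list)
    for path, source in files.items():
        for kind, value, line in scan(source):
            index[(kind, value)].append((path, line))
    return index


def find_identifiers(index, prefix):
    """Return sorted (name, path, line) hits for identifiers starting with prefix."""
    return sorted((value, path, line)
                  for (kind, value), places in index.items()
                  if kind == "identifier" and value.startswith(prefix)
                  for path, line in places)
```

Because each lexeme is stored with its kind, a query for identifiers beginning with "temp" consults only identifier entries and never sees the word "temp" in a comment or string. Because numbers are stored in normal form, a search for 31 also finds 0x1F.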

It is expected that the complete set of source files of interest is collected, scanned and indexed on a periodic basis, such as daily or weekly. The collected sources are available to the Search Engine for display.

The Search Engine is presently available on Windows 2000, XP, Vista and Windows 7.

Available Lexical Scanners for Search Engine

SD offers a family of lexical scanners derived from DMS. Presently available are:

AdHoc Text (allows scanning of a "generic" programming language, and/or documents containing English text as phrases or paragraphs of sentences, such as email.)

Semantic Designs- Our Goal

To enable our customers to produce and maintain timely, robust and economical software by providing world-class Software Engineering tools using deep language and problem knowledge with high degrees of automation.