He argues that representing code in flat files does not allow it to be structured in the most appropriate way. Both the order of functions or classes within a file and the folder organization of a program depend on arbitrary choices made by programmers, and reflect their own ideas about how to structure code and express its meaning. However, “no two programmers think identically alike”, and as soon as several contributors work on the source code, the structure risks being modified and thus losing coherence at each separate level – code structure within files and folder structure of the program – and between the two.

Even though solutions exist to reduce these risks – e.g. splitting code into as many files as possible or marking out regions of code – Rick Minerich believes that they offer only a partial response to the issues he raises, because “they are anchored to flat files”.

Moreover, in some cases it may be useful to have “a different ordering/meaning […] for a particular task”, but it is hardly conceivable to reorganize code stored in flat files for each separate task.

To respond to these issues, Rick advocates for a different approach to code representation:

If you can treat the reflected code from a programming language like an abstract data structure, why can’t you just keep the source itself in a similarly abstracted data structure? Isn’t the structure of a program more similar to a graph, than a list?

[…]

If we kept our code in queryable data structures it would be easy to lay our environment in any way we chose. […] You could also, for instance, show a method and everything which references it. The possibilities for code visualization are limitless.

[…]

The real boon of moving on is the power and understanding we will gain from being able to visualize the structure of our programs in any way we choose.
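To make the "show a method and everything which references it" idea concrete, here is a minimal Python sketch. It is not something proposed in the post: it assumes Python's standard `ast` module as the "reflected" representation, and the sample functions (`load`, `parse`, `main`) and the helper `build_reference_index` are hypothetical names chosen for illustration. It only tracks direct calls by plain name within top-level functions.

```python
import ast
from collections import defaultdict

# Hypothetical sample program, kept as a string so the sketch is self-contained.
SOURCE = '''
def load(path):
    return open(path).read()

def parse(path):
    return load(path).split()

def main():
    words = parse("input.txt")
    print(len(words))
'''

def build_reference_index(source):
    """Map each called name to the set of top-level functions that call it."""
    tree = ast.parse(source)
    referenced_by = defaultdict(set)
    for func in tree.body:
        if not isinstance(func, ast.FunctionDef):
            continue
        # Walk the function body and record every call made by plain name.
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                referenced_by[node.func.id].add(func.name)
    return referenced_by

index = build_reference_index(SOURCE)
callers_of_load = sorted(index["load"])    # functions that reference load
callers_of_parse = sorted(index["parse"])  # functions that reference parse
```

The index is a small graph rather than a list: once it exists, "who references this?" becomes a lookup instead of a text search across files.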

It strikes me that the process of figuring out which variables you're touching when you're compiling a line of code is really a database query. Scoping and the semantics of scoping are part of the query (as well as how the database has been built).

Further, the actual link of a completed compile (whether or not it's being done at build time or run time), is another query.

The process of compilation should really be the process of building up a database.
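The analogy between name resolution and a database query can be sketched as follows. This is only an illustration under stated assumptions: the `scopes` and `symbols` tables, their columns and the sample rows are hypothetical, and the lookup models simple lexical scoping (innermost enclosing scope wins) using SQLite through Python's standard `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scopes  (id INTEGER PRIMARY KEY, parent_id INTEGER, label TEXT);
CREATE TABLE symbols (scope_id INTEGER, name TEXT, kind TEXT);

-- Hypothetical program: a module containing function f, containing a block.
INSERT INTO scopes  VALUES (1, NULL, 'module'), (2, 1, 'function f'), (3, 2, 'inner block');
INSERT INTO symbols VALUES (1, 'x', 'global'), (1, 'z', 'global'),
                           (2, 'x', 'local'),  (2, 'y', 'local');
""")

# Walk from the current scope up through its parents, then pick the
# nearest scope that declares the requested name.
RESOLVE = """
WITH RECURSIVE chain(id, parent_id, depth) AS (
    SELECT id, parent_id, 0 FROM scopes WHERE id = :scope
    UNION ALL
    SELECT s.id, s.parent_id, chain.depth + 1
    FROM scopes s JOIN chain ON s.id = chain.parent_id
)
SELECT scopes.label, symbols.kind
FROM chain
JOIN symbols ON symbols.scope_id = chain.id
JOIN scopes  ON scopes.id = chain.id
WHERE symbols.name = :name
ORDER BY chain.depth
LIMIT 1
"""

def resolve(name, scope):
    """Return (declaring scope label, kind) for name as seen from scope, or None."""
    return conn.execute(RESOLVE, {"name": name, "scope": scope}).fetchone()

x_binding = resolve("x", 3)        # shadowed: found in function f before the module
z_binding = resolve("z", 3)        # only declared at module level
missing_binding = resolve("w", 3)  # not declared anywhere
```

Here the scoping rule lives in the query itself (the `ORDER BY chain.depth`), which matches the comment's point that scoping semantics are part of the query and of how the database is built.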

Several commentators, however, pointed out that such an approach to code already exists. Keith Braithwaite argues, for instance, that “the logical conclusion of what [Rick Minerich is] talking about is an image-based environment” which exists in Smalltalk and Lisp. Along with Smalltalk, another commentator cites the VisualAge suite, where “all code source is stored in a code database […], and you can query it any which way you want”.

However, Steve Hawley, along with other bloggers, stresses that the advantages of flat files should not be dismissed. They allow efficient navigation through code since humans “are very aware of space and spatial layout of things and this translates naturally into flat files” so that people “develop a familiarity with the layout of a file and can navigate very efficiently to the right location within it via muscle memory”.

Things like the number of spaces between operators can be used for nice stuff like laying out bits of consecutive lines that have parallel meaning so that they line up. Ordering of functions can be chosen so as to tell a narrative. People have grown quite creative in using the tools they have to write expressive code. If you're going to take this away, I expect to see a good reason to believe that it can be replaced by something equally effective.

The issue of tooling was also raised by Rick Minerich himself, who argues that one of the reasons flat files are still in use is that all the tools have been built for code structured as flat files. Almost all compilers, for instance, require a complete program. He believes that “a language which is not tied to traditional compiling and linking would be ideal for research into keeping code in abstracted data structures” and suggests a first step towards supporting query-based code:

A good first step would be an IDE/Editor that can manage all of the code in a database and allow the programmer to dynamically construct queries to build views and otherwise manipulate the code. The environment could then generate flat files in order to be compatible with current compilers.
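A rough sketch of that first step, under several assumptions: code units are stored one function per row in an in-memory SQLite database, the `functions` table and its `topic` column are hypothetical, and the "current compiler" is stood in for by Python's built-in `compile`. The point is only that task-specific views and a conventional flat file can both be derived from the same store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE functions (name TEXT PRIMARY KEY, topic TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO functions VALUES (?, ?, ?)",
    [
        ("area", "geometry", "def area(w, h):\n    return w * h\n"),
        ("greet", "text", "def greet(name):\n    return 'hello ' + name\n"),
        ("perimeter", "geometry", "def perimeter(w, h):\n    return 2 * (w + h)\n"),
    ],
)

def view(topic=None):
    """Return function bodies for one topic, or all of them grouped by topic."""
    if topic is None:
        rows = conn.execute("SELECT body FROM functions ORDER BY topic, name")
    else:
        rows = conn.execute(
            "SELECT body FROM functions WHERE topic = ? ORDER BY name", (topic,)
        )
    return [body for (body,) in rows]

def export_flat_file(bodies):
    """Concatenate code units into text that a file-based tool can consume."""
    return "\n\n".join(bodies)

# A task-specific view: only the geometry functions.
geometry_view = view("geometry")

# A generated flat file containing everything, handed to the ordinary compiler.
flat_source = export_flat_file(view())
namespace = {}
exec(compile(flat_source, "<generated>", "exec"), namespace)
```

In this arrangement the flat file is an export format rather than the source of truth, so its ordering can change per task without anyone reorganizing the stored code.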

I think source control would be the biggest problem. I've dealt with a non-flat file programming language and even though source control was supported with a merging plugin there were numerous performance issues and bugs which made it very difficult to use. The files weren't human readable so if there was a bug in the tool you were screwed. I witnessed a 30-minute bug fix turn into a week-long ordeal partly because it was impossible (or it was believed to be impossible) to merge the change back in and every time the change was redone someone had created a newer version.

Do we really need queryable data structures for code? I remember reading about a project that allows developers to run queries over their Java code, .QL from Semmle (semmle.com/documentation/semmlecode/tutorials/o...), and they have made it work with our plain-text-based code.

I don't agree with the term 'flat files'... source code tools use AST representations of code - the T in AST stands for Tree. Trees ain't flat. Modern IDEs like Eclipse make it easy to access the AST of source or compilation units, and make queries of these units fast by indexing the code. With that in mind, a source file on your HD is a tree structure... a lazy tree structure; if you want to access it as a tree, a parser has to turn it from its flat representation into an in-memory tree representation. Storing an AST in a binary representation on the disk is simply an optimization... i.e. caching the output of a parser.

With this point of view, Eclipse's workspaces are actually image based - except that on the disk, you only see "flat" source files; the image part is done by a) indexing and b) caching the result of the indexing process on disk.

Actually, you can approach this from the other direction too, as a group at IBM is now doing: they make Smalltalk code inside an image available in source file form by treating the Smalltalk image as a kind of file server. Actually, before I botch the description, here's a podcast with one of the members of the team: www.cincomsmalltalk.com/blog/blogView?showComme...

Like I said, (reading the plain-text code and) storing the code into a DB is easy enough that every programmer can DIY. Then we can query it freely.

PS. Suggested table structures:

classes(id, name)
imports(class_id, package_id)
methods(class_id, name, parameters, body)
packages(parent_id, name)
...

Let's do it.
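The table layout suggested in this comment can be tried out in a few lines. The sketch below uses SQLite through Python's standard `sqlite3` module; the `id` column on `packages` is an added assumption (the comment lists only `parent_id` and `name`, but `imports` needs something to refer to), and all rows, including the Java-like method bodies, are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE packages (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
CREATE TABLE classes  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE imports  (class_id INTEGER, package_id INTEGER);
CREATE TABLE methods  (class_id INTEGER, name TEXT, parameters TEXT, body TEXT);

-- Invented sample data for illustration only.
INSERT INTO packages VALUES (1, NULL, 'java'), (2, 1, 'util');
INSERT INTO classes  VALUES (1, 'Invoice'), (2, 'Customer');
INSERT INTO imports  VALUES (1, 2);
INSERT INTO methods  VALUES
    (1, 'total',   '()',       'return lines.sum();'),
    (1, 'addLine', '(Line l)', 'lines.add(l);'),
    (2, 'name',    '()',       'return name;');
""")

def methods_of(class_name):
    """List the method names of a class, sorted by name."""
    rows = conn.execute(
        "SELECT m.name FROM methods m "
        "JOIN classes c ON c.id = m.class_id "
        "WHERE c.name = ? ORDER BY m.name",
        (class_name,),
    )
    return [name for (name,) in rows]

def classes_importing(package_name):
    """List the classes that import a given package."""
    rows = conn.execute(
        "SELECT c.name FROM classes c "
        "JOIN imports i ON i.class_id = c.id "
        "JOIN packages p ON p.id = i.package_id "
        "WHERE p.name = ?",
        (package_name,),
    )
    return [name for (name,) in rows]

invoice_methods = methods_of("Invoice")
util_importers = classes_importing("util")
```

Note that method bodies are still opaque text in this schema, so queries stop at the method boundary; finer-grained questions would need the bodies to be parsed as well.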

a source file on your HD is a tree structure... a lazy tree structure; if you want to access it as a tree, a parser has to turn it from its flat representation into an in-memory tree representation.

Hmmm, this doesn't count as a tree structure to me, any more than a bag of flour, a couple of eggs and some milk and baking powder count as breakfast. Sure, they can become breakfast, but they're in an intermediate form that requires some processing before they're in their final form.

That said, I think that Herr Schuster has a good point – why not start with creating toolsets that handle source code in a pre-parsed form, and just front them with a data source that knows how to convert flat files? If someone then wants to store the in-memory representation as a BLOB in a database somewhere, then it's just a different data source to retrieve it.

I think it's an interesting idea, and I don't understand the potential problems deriving from source control. It seems to me that source control could be more robust and might even be able to make sense of interrelated changes, such as code refactoring.

The real consideration I see is newness. Of course existing source control tools would have issues. Just as existing text editors would. But I don't really consider these problems. When considering the value of this idea I prefer to consider the ideal implementation of it - in which it is accepted practice - since I can then consider whether it is something worth thinking about.

Thanks for bringing this up. Indeed SemmleCode allows you to query any aspect of your code, whether it is to gain insight, to audit code quality, to guide refactoring or to enforce policies. Everything (yes, even JavaDoc and XML config files) is searchable. Semmle has just released SemmleCode Professional Edition, at semmle.com.
