Hobbyist blog about all things software. Any correspondence of this blog's name to names of individuals, organizations, companies, products or other blogs is pure coincidence.

Friday, April 22, 2011

Remove superfluous include files

After the introduction to dependency analysis using prolog in a previous post, we can discuss some more advanced applications of the prolog database that is generated by the tool.

Remove superfluous include files

Approaches to include file pruning

Before we dive into the code it's important to understand the limitations of what we are about to do. In order to remove #include statements from a C/C++ program, I'm aware of three techniques:

full parsing and semantic analysis (I'm not aware of any tool that implements this for C++ code). This would be the best approach (but it's very hard to implement. Edited to add: check out the CLang based include-what-you-use project!). It would be able to point out things like include files that can be removed because none of their contents are being referenced, or because a forward declaration suffices.

systematically comment out #include statements one at a time, and recompile the code. If compilation fails, keep the #include statement. If compilation succeeds remove the #include statement. This is implemented in the open source tool deheader. It is less general than approach 1. in that it cannot say which include files can be replaced with forward declarations. It will allow to remove include files whose contents are not referenced. Recompiling the code after removing an #include file can be time consuming. Note that it is possible to come up with situations in which removing an #include file can lead to subtle bugs without causing compiler errors (like selecting a different overloaded method).

remove #include statements that are implied by other #include statements. Start from file F1.h. Let F1.h include a file F2.h. And let F2.h include F3.h. There's not much point in having file F1.h also include F3.h because its inclusion is implied by including F2.h already. It is this technique that we can implement using analysis of the include file dependencies. It can serve as a preprocessing step to technique 2, to reduce the number of compilations that have to take place.

Prolog code

Feel free to point out bugs and suggest improvements :)
The following won't work correctly in the presence of cycles, i.e. include files that via other files eventually include themselves again (#ifdef guards then prevents the actual double inclusion). This happens in real code bases, but it is considered bad practice. Detecting include file cycles probably will be the subject of another dependency analysis blog post.
Note also that the tool currently ignores #ifdef and friends.

The starting point is minimize_includes. minimize_includes expects a File as input, and produces three lists: the first list called OringinalIncludes are the files it currently directly includes; the second list called MinimizedIncludes are the files it should include; the third list ToBeRemoved are the #include files which can be removed from file File. If nothing can be removed, minimize_includes will fail.

minimize_includes delegates the hard work to remove_implied_includes. remove_implied_includes iterates over the includes files of file File twice: once from front to end, removing each file that is already included by one of the files that follow in the list. This yields an IntermediateResult. Then this list is reversed, and we iterate again over the list, again removing each file that is already included by one of the files that follow in the list. Note that often multiple solutions are possible, but we stop searching after finding one solution.

I should also mention that on code bases with precompiled header files (e.g. stdafx.h on windows systems), the precompiled header files can end up being considered "removable". In practice the compiler expects to find them and they should not be removed from the code base.