
Automatic Module Control Revisited


Back in 1988, DDJ published Stewart Nutter's article on a technique for automatically documenting C programs. In this article, Ron updates Stewart's program in order to maintain more than 117,000 lines of C source code. Not to be outdone, Kevin Poole updates Ron's program for use with VAX VMS and UNIX.

Ron is a faithful Dr. Dobb's subscriber, and has been a microcomputer aficionado for many years. He is a software engineer at SPEX in Edison, N.J. You can contact him at P.O. Box 4143, Metuchen, N.J. 08840.

This article is a follow up to Stewart Nutter's article "An Aid To Documenting C," DDJ, August 1988. In that article, Stewart presented a printer program he called cp (for C printer), which allowed programmers to document C source-code modules. When I first saw the program, I realized that it addressed my needs in a general way, but as presented the program was just not sufficient.

After entering the source as printed (and adding a few missing ++s and fixing a few other minor typos) I got cp to run. Two things struck me immediately: Lowercase ls as variables are amazingly similar to 1s in small type, and the program crashed when I gave my target program as input! By this time my curiosity was sufficiently aroused, so I decided I had better understand the program before changing it too much, or it might never work.

What appeared to be a simple task became instead a small research project into the C language. It is amazing how much C code you can write without really understanding why C is the way it is. The process of correcting the code so that it would follow the rules of a C compiler and linker helped me understand the role of functions, and brought me a deeper understanding of computer languages and an appreciation for what they do. Nevertheless, you usually only go far enough to get the job done. In the case of cp, I realized that if I went much further, the program would start to become a compiler! That's a thought I try not to consider too seriously.

Rewriting Without Rewriting

When I began rewriting Stewart's original program, all of my prejudices and preferences in style came to bear upon the code in a fit of global search and replacements. My own C style is "pascalean" in indenting and braces, with highly descriptive functions and variable names, especially globals.

Every function in the original program is in my own version of cp, but the names have been changed to make them a little clearer (at least to me). Some new functions have been added. Function declarations are included, and two new files were added: cpbuild.c and cpinput.c. cpbuild.c contains all the code needed by the original function xref( ). The second file, cpinput.c, handles the prompting and parameter parsing that the program does at start-up.

The make file now uses the /packcode directive in the linker command. This allows me to make all the functions "near," even in different files in large model programs, as long as the code size is less than a segment. The effect is to have the speed and size of a small model program with respect to the code. The arguments to the program are much the same as they were in the original program. Some new parameters are: n for normal (as opposed to IBM) character graphics, f for size of called function array, l for library call statistics, q for quiet mode, d to show declarations and definitions on the console as they are found, h to show more help than is shown when cp is executed with no arguments, and x to show some technical information. The program now can take uppercase or lowercase parameters with either the "-" or "/" switch character.
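As an illustration of accepting either switch character in either case, here is a minimal sketch. It is not the code from cpinput.c; the helper name option_letter() is hypothetical, and it assumes an ANSI-conforming tolower() that leaves non-uppercase characters alone.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not taken from cp.
 * Returns the lowercase option letter when arg looks like "-x" or "/X",
 * or 0 when arg is not a switch (for example, a file name). */
int option_letter(const char *arg)
{
    if (arg == NULL)
        return 0;
    if (arg[0] != '-' && arg[0] != '/')
        return 0;                      /* no switch character */
    if (arg[1] == '\0')
        return 0;                      /* a lone "-" or "/" */
    return tolower((unsigned char)arg[1]);
}
```

With this sketch, "-Q", "-q", "/Q", and "/q" all yield 'q', so the caller needs only one switch statement on lowercase letters.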

Figure 1 shows a sample printout of the cp program; Figure 2 shows a portion of a typical report the program produces.

Repairing the Algorithms

A minor deficiency of the original code is that it incorrectly assumes that a function definition ends with a newline. A more correct algorithm scans forward from possible identifiers; if an open parenthesis is found, it then scans for the matching close parenthesis and checks the next non-whitespace character. As in "Find That Function!" by Marvin Hymowech (also in August 1988), the key is to note that if this next character is a comma or semicolon, a declaration or prototype has been found. In either ANSI style or standard C, definitions have either an open brace or variable declaration(s) following the close parenthesis.
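The rule just described can be sketched as follows. This is a simplified illustration, not the code in cp: the names paren_kind and classify_after_identifier() are hypothetical, the caller is assumed to be outside any function body, and comments or string literals between the parentheses are not handled.

```c
#include <ctype.h>

enum paren_kind { NOT_A_FUNCTION, DECLARATION, DEFINITION };

/* Hypothetical sketch: s points at the first character of an identifier.
 * Skip the name, find the matching ')', then classify by the next
 * non-whitespace character. */
enum paren_kind classify_after_identifier(const char *s)
{
    int depth = 0;

    while (isalnum((unsigned char)*s) || *s == '_')
        s++;                                   /* skip the identifier */
    while (isspace((unsigned char)*s))
        s++;
    if (*s != '(')
        return NOT_A_FUNCTION;

    do {                                       /* scan to the matching ')' */
        if (*s == '\0')
            return NOT_A_FUNCTION;             /* unbalanced input */
        if (*s == '(')
            depth++;
        else if (*s == ')')
            depth--;
        s++;
    } while (depth > 0);

    while (isspace((unsigned char)*s))
        s++;
    if (*s == ',' || *s == ';')
        return DECLARATION;                    /* prototype or declaration */
    if (*s == '{' || *s == '_' || isalpha((unsigned char)*s))
        return DEFINITION;                     /* brace or old-style parameter declarations */
    return NOT_A_FUNCTION;
}
```

Because the scan counts nested parentheses, a parameter such as int (*cb)(void) does not end the search early.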

I had to overcome two major deficiencies in the original code. The first was the lack of treatment of static functions. The second was the way leading comments were stored. In the original code, the return value of strdup( ) at the end of the function xref( ) was not checked for failure, so cp crashed on my large program. I also made a minor change that dressed up the form of the tree-structured output. In the original, the connecting lines did not stop when there was nothing underneath to connect them to. Finally, I added a toggle for IBM character graphics to draw the tree in fine style.

I discovered that you must really understand what constitutes a correct C program before you can parse it for function definitions and function calls. The scanner, called getnext( ) in the original, incorrectly scanned quoted (") strings and pound sign (#) statements. Both allow the line-continuation form of a backslash followed immediately by a newline. Also, comments are considered white space in a # statement. The function is now called get_to_next_possible_token( ), and it handles these situations correctly.
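To make the two rules for # statements concrete, here is a minimal sketch of skipping one directive. It is not get_to_next_possible_token( ); the name skip_directive() is hypothetical, and the sketch assumes the whole file is already in memory as a NUL-terminated string.

```c
/* Hypothetical sketch: p points at the '#' of a preprocessor statement.
 * Returns a pointer just past the newline that really ends it. */
const char *skip_directive(const char *p)
{
    while (*p != '\0') {
        if (p[0] == '\\' && p[1] == '\n') {
            p += 2;                    /* continuation: statement goes on */
        } else if (p[0] == '/' && p[1] == '*') {
            p += 2;                    /* a comment is white space, even */
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/'))
                p++;                   /* if it spans several lines */
            if (*p != '\0')
                p += 2;
        } else if (*p == '\n') {
            return p + 1;              /* unescaped newline ends it */
        } else {
            p++;
        }
    }
    return p;                          /* end of input */
}
```

A quoted string needs the same backslash-newline test, plus a check for an escaped quote character, before the closing quote is accepted.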

I corrected the treatment of static functions. First, you have to recognize them, and then mark them when they are entered into the data base. It is also necessary to mark them in the called-function list with their file name, if they are called at any time in the file under analysis. In C, you can have many functions with the same name in a program, as long as only one of them is not static and each is in a different file. This requires a change to the way the defined function call count is done and the way the output tree is checked. Basically, the binary search of the sorted data base must be called with the understanding that there might be more than one defined function with the same name. If the called function is in a different file from the defined function and the defined function is static, it should not match. You must search adjacent names in the ASCII-sorted list for a defined function that is (first) static and in the same file as the call, or (second) not static and in any file; otherwise it must be a library function.

This also affects the recursion check. A function is recursive if it calls a function with the same name (perhaps through other functions), and the test now checks that the file name associated with the called function is the same as the file name of the calling function. This avoids the trap of function x(), say in file 1, calling function y() in file 2, and y() calling static function x( ) in file 2. The original code would say incorrectly that function x() was recursive. All of these issues are addressed in the new version of the program.
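The lookup order can be sketched as below. This is only an illustration under stated assumptions: the structure defined_fn and the function resolve_call() are hypothetical names, not cp's data structures, and the array is assumed to be sorted by name with strcmp() ordering.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one defined function. */
struct defined_fn {
    const char *name;
    const char *file;
    int is_static;
};

/* Returns the definition a call to callee made from caller_file refers to,
 * or NULL when no definition matches (treated as a library function). */
const struct defined_fn *
resolve_call(const struct defined_fn *defs, size_t n,
             const char *callee, const char *caller_file)
{
    size_t lo = 0, hi = n, i;
    const struct defined_fn *global = NULL;

    while (lo < hi) {                  /* find first entry named callee */
        size_t mid = lo + (hi - lo) / 2;
        if (strcmp(defs[mid].name, callee) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    for (i = lo; i < n && strcmp(defs[i].name, callee) == 0; i++) {
        if (defs[i].is_static) {
            if (strcmp(defs[i].file, caller_file) == 0)
                return &defs[i];       /* first choice: static, same file */
        } else if (global == NULL) {
            global = &defs[i];         /* second choice: the non-static one */
        }
    }
    return global;
}
```

In the x() and y() example, a call to x() from file 2 resolves to the static x() in file 2, while a call from any other file resolves to the non-static x() in file 1, so the recursion test can compare file names as well as function names.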

A Few Extras

The point of making the called function array programmable is to allow you to shrink its storage requirements and thus allow the program to run in less memory. This allows the program to be launched with impunity from within editors, TSRs, Windows, DESQview, and so on.

As stated in the code comments, this program will not see the relationship between functions when they are called indirectly via pointers to functions. Also, functions in the body of a #define will be missed. Code in both #if and #else will also be scanned, possibly noting more function calls and perhaps even getting out of sync with respect to opening and closing braces. This may be annoying but as long as you are aware of it, I don't think it's too serious. If needed, you could pass the source through the preprocessor first before processing it with cp.

The code now catches all of the initial comments in a file. I am not sure what it did in the original code. The temporary buffer for it is 3K. Compile it bigger if you wish. The look-ahead buffer that scans between matching parentheses in a function declaration or definition is sized by the manifest constant c_line_length, which is set to 512. If you are even more verbose in your style than I am, you may want to make this larger (a co-worker of mine is, so his is 2048!). The called function array size is the number of unique functions per function definition for all function definitions in the program, so the number of function calls can exceed this number. These arrays are malloc()'d so that if they need to exceed a full segment, one only has to change the malloc()s to halloc()s and the associated free()s to hfree()s. Then comes the really tedious part: chasing all associated global and local variables (usually pointers to these arrays) and changing all their definitions to huge, such as some_type *some_type_pointer to some_type huge *some_type_pointer. After that, a large model recompile is all that is required. This is all Microsoft C 5.x specific talk, though I assume similar constructs exist in other Cs on the IBM PC/XT/AT platform. I would expect these huge arrays are not needed until one must analyze a really large C program, perhaps something on the order of 1-2-3 release 3!

The Loose Ends

The parser still worries me. It catches all the stuff I have thrown at it, but because I am still not sure how it works (!), I feel that it may still have some black holes. I can't follow its execution by looking at it; I just go on faith. The next version will probably return a space for any white space character. That is, put the white space testing into get_to_next_possible_token(). This should clean up build_the_data_base() a little.

I would like to contemplate the data structures a little and perhaps come up with something a little less ugly than what exists now, especially the structure elements that mark statics. There must be a neater way of doing this. You may want to add yet another input toggle to plot unused functions. This is useful in a C program that uses pointers to functions. You may not be able to see who calls the function, but you can plot it anyway.

For those of you with really big programs, you could go to halloc() for the arrays. After qualifying the declarations of pointers with the huge attribute, a large model recompile is all that is required to allow arrays larger than the 64K limit imposed by malloc().

Two more items complete the wish list. The first is hooking in the page linker to the library calls in doprint() so that a cross reference to library calls may be generated. The last is to sort the defined functions on entry rather than at the end. This would free up run-time dynamic memory and would not create a significant time penalty; in fact, it might actually speed things up!

Late Additions

At the request of a co-worker (yes, the same one!) the input buffer for the input file name list was extended from 20 bytes to 128 bytes. Apparently his sources were really scattered about his directory tree, and so some of his pathnames were quite long, hence the change to the buffer size. A small change was also made to allow an optional string following each input file pathname. This string is atoi()'d, and if it is not 0, it is added to the tree structure defining box and added as a column in the sorted function list. Its intended usage is the overlay number of the function. Overlays are the trick that allows you to write code and .exe files that may wildly exceed 640K or whatever your memory space is. This feature is controlled by yet another argument to the program; its default is off. The reason for it is to let you study the program flow in the tree diagram and check for one overlay calling another, in order to prevent thrashing in a loop, speed up execution by combining them into one overlay, and so on.
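A minimal sketch of reading one entry of the input file list with its optional trailing number follows. The exact list-file layout is an assumption here (a pathname, white space, then an optional string), and the names LIST_PATH_LEN and parse_list_line() are hypothetical, not taken from cp.

```c
#include <stdio.h>
#include <stdlib.h>

#define LIST_PATH_LEN 128   /* matches the enlarged 128-byte buffer */

/* Hypothetical helper. Returns 1 and fills path and *overlay when the line
 * holds a pathname, 0 when the line is empty. A missing or non-numeric
 * trailing string gives overlay 0, which means "no overlay number". */
int parse_list_line(const char *line, char path[LIST_PATH_LEN], int *overlay)
{
    char extra[32] = "";
    int fields = sscanf(line, "%127s %31s", path, extra);

    if (fields < 1)
        return 0;
    *overlay = (fields == 2) ? atoi(extra) : 0;
    return 1;
}
```

The field widths in the format string keep an overlong pathname from overrunning the buffer, which is the failure the 20-byte buffer invited.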

I imagine other uses could be found for this extra information; the interpretation is up to the user. The same (!) co-worker also asked that the unused function list be sorted by filename. Yes, he had a lot of them! Because the sorting routine was already in place for the used functions, it was simple enough to clone it into the unused list display function.

Last Words

I trust that more C wizards will pop up and carry this ball a little further. I only carried it as far as I needed for my purposes. The C program that cp is now maintaining is over 4.2 million bytes, over 117,000 lines, 198 files, uses 993 defined functions called 4600 times, with 192 library functions called 3118 times! As you can imagine, this program does a wonderful job at wearing out printers! Thanks to Stewart for the great idea and for doing all the initial dirty work.

C Printer for VMS and Unix

Kevin E. Poole

Kevin is a software design engineer on the Boeing automated software engineering (BASE) project. The BASE project increases quality and productivity in the development of embedded computer software. The capabilities presented in the article are part of the BASE documentation production system. Kevin can be contacted at 14424 34th Ave. S., #2, Seattle, WA 98168.

I modified Ron Winter's PC/MS-DOS C Printer Utility (CP), presented in the accompanying article, to run on two additional operating systems: VAX VMS and VAX Unix. The modifications are divided into four sets of steps: The first set involves creating the makefiles, the second set involves modifying the C source code, the third involves compiling and running the new CP, and the fourth set provides two optional enhancements to CP. The listings in my version were developed using DEC VAX VMS 5.1-1 and DEC VAX Unix BSD 4.3.

Creating the makefiles

Step 1: An MMS utility description file was created for the VMS operating system, as shown in Listing One (page 86).

Step 2: The make utility makefile shown in Listing Two (page 86) was created for the Unix operating system.

Step 3: CP uses command-line options, so the following line should be added to the VMS login.com file to create a foreign command to run CP:

CP == "$DEVICE:[YOUR_ACCOUNT.CP]CP.EXE"

Step 4: Because the C Printer executable is called "cp," it was necessary to change it to something else so that it would not be confused with the Unix cp command. CP (printed in bold in my Listing Two) was changed to mycp. You can name it whatever you like.

Modifying the C Code

Because of space constraints, the entire source code for the VMS and Unix versions of the cp program is not included with this article. Instead, I've provided code that should be inserted into Ron's program, as well as code that should replace portions of his listings. Ron's listings will be prefaced by the letters "RW" (RW-Listing One, for example) so as not to be confused with the listings in this article. However, the complete system is available on the DDJ listings disk, the DDJ Forum on CompuServe, and on the DDJ Listing Service.

Step 1: Define one constant per operating system using C preprocessor #define commands. These constants are used in conjunction with the #if and #endif structures to surround code that is to be conditionally compiled. The lines in Listing Three (page 86) should be inserted into RW-Listing One between lines 3 and 4.

Step 2: The Microsoft C function declarations contain the reserved word "near," which is not supported by VMS or Unix. The function declaration blocks at the top of each source file ending in ".c" must be duplicated. Surround one block with the #if MSDOS and #endif pair. Surround the other block with the #if VMS and #endif pair. From the VMS block remove all instances of the "near" reserved word. The lines in Listing Four (page 86) should replace lines 27 through 52 in RW-Listing Two. The other source files must also be modified in this way. The Unix compiler produces syntax errors on function declarations, so they were omitted.

The Microsoft C function definitions also contain the reserved word "near." Function definitions containing the "near" reserved word must be duplicated. Remove the "near" reserved word from the duplicated definitions to support VMS and Unix. Surround the original and duplicate definitions with the #if MSDOS, #else, and #endif structure as in Listing Five (page 86), which replaces line 56 in RW-Listing Two. All function definitions in source files ending in .c should be changed in this way.
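The pattern in Steps 1 and 2 looks roughly like the sketch below. It is not a copy of Listings Three through Five: the constant values assume a Unix build, the UNIX constant name is a guess, and count_newlines() is a hypothetical function used only to show the duplicated declaration and definition headers.

```c
/* Step 1 (sketch): one constant per operating system; set exactly one to 1.
 * The values here assume a Unix build. */
#define MSDOS 0
#define VMS   0
#define UNIX  1

/* Step 2 (sketch): the declaration is duplicated, with and without "near". */
#if MSDOS
int near count_newlines(const char *text);
#else
int count_newlines(const char *text);
#endif

/* Only the definition header is duplicated; the body is shared. */
#if MSDOS
int near count_newlines(const char *text)
#else
int count_newlines(const char *text)
#endif
{
    int n = 0;

    for (; *text != '\0'; text++)
        if (*text == '\n')
            n++;
    return n;
}
```

Because the preprocessor discards the branch that is not selected, the VMS and Unix compilers never see the "near" keyword.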

Step 3: Each compiler supplies a different set of include files. Where the include file names are the same, the contents most often differ. The lines in Listing Six (page 86) replace lines 5 through 9 in RW-Listing One and the lines in Listing Seven (page 86) replace line 25 in RW-Listing Two to make the necessary include-file modifications.

Step 4: A few library functions in the MS-DOS version were not found in either of the VAX C libraries. The strdup( ) function must be added to the code for VMS and Unix to duplicate the functionality of the missing library function. The function in Listing Eight (page 86) should be inserted between lines 824 and 825 in RW-Listing Two.
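A replacement for the missing function can be as small as the following sketch. It is not Listing Eight; the name my_strdup() is hypothetical, chosen so that it cannot clash with a library that does supply strdup().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for strdup(): returns a freshly malloc()'d copy
 * of s, or NULL if memory is exhausted. The caller must check for NULL
 * and must free() the copy when done. */
char *my_strdup(const char *s)
{
    size_t len = strlen(s) + 1;        /* include the terminating NUL */
    char *copy = malloc(len);

    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}
```

Checking the result against NULL at every call site avoids the kind of crash described in the accompanying article, where an unchecked strdup( ) failed on a large program.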

The CP report header contains the current date and time that is provided by the MS-DOS _strdate() and _strtime() functions. Although not in the same format, this information is provided by the time() and localtime() functions in VMS and Unix. A new function was created for VMS and Unix using the appropriate library calls. The MS-DOS code was moved into a function body and replaced by a call to the new function. The function call in Listing Nine (page 86) replaces lines 233 through 256 of RW-Listing Two and the functions in Listing Ten (page 86) should be inserted between lines 210 and 211 in RW-Listing Two.
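One way to build the date and time strings from time() and localtime() is sketched below. This is not Listing Ten: the function names are hypothetical, the mm/dd/yy and hh:mm:ss layouts are chosen to resemble the MS-DOS functions, and snprintf() is a modern convenience that the compilers of the day did not necessarily provide.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: format a broken-down time into two 9-byte buffers. */
void format_stamp(const struct tm *t, char date_buf[9], char time_buf[9])
{
    snprintf(date_buf, 9, "%02d/%02d/%02d",
             t->tm_mon + 1, t->tm_mday, t->tm_year % 100);
    snprintf(time_buf, 9, "%02d:%02d:%02d",
             t->tm_hour, t->tm_min, t->tm_sec);
}

/* Hypothetical helper: fill both buffers with the current local time,
 * or with empty strings if the time cannot be converted. */
void current_stamp(char date_buf[9], char time_buf[9])
{
    time_t now = time(NULL);
    struct tm *t = localtime(&now);

    if (t != NULL) {
        format_stamp(t, date_buf, time_buf);
    } else {
        date_buf[0] = '\0';
        time_buf[0] = '\0';
    }
}
```

Keeping the formatting separate from the clock call makes the header code the same on every system; only current_stamp() would need a per-system version.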

Step 5: The MS-DOS and VMS tolower() library functions return the lowercase equivalent of an uppercase character, or the same character if it is already lowercase. In Unix the function works the same if the character is uppercase, but it makes a lowercase character even lower, returning some character that is not a valid command-line option. Replace line 93 in RW-Listing Three with the code in Listing Eleven (page 88), which uses the islower() function to check the case of a character and passes the character to the tolower() function only if it is uppercase.
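The guard can be sketched as follows. This is not Listing Eleven: the article's listing uses islower(), while this sketch tests isupper() instead, which has the same effect for letters and also leaves digits and punctuation untouched. The name safe_tolower() is hypothetical.

```c
#include <ctype.h>

/* Hypothetical wrapper: convert only characters that really are uppercase,
 * so a tolower() that blindly adds an offset cannot damage other input. */
int safe_tolower(int c)
{
    if (isupper((unsigned char)c))
        return tolower((unsigned char)c);
    return c;
}
```

On a library whose tolower() already behaves this way the wrapper costs one extra test per character, which is negligible for command-line parsing.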

Step 6: CP directs output to the CRT by specifying that the outfile be "CON." On PC/MS-DOS, CON is a reserved system device and is therefore automatically assigned. On VMS and Unix the code in Listing Twelve (page 88) must replace line 850 in RW-Listing Two to facilitate this option.
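One plausible way to honor "CON" on systems that have no such device is to map the name onto standard output, as in this sketch. The article does not show Listing Twelve here, so this is an assumption about the approach, and open_report() is a hypothetical name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: treat the output name "CON" (or "con") as the
 * terminal by returning stdout; otherwise open the named file for
 * writing. Returns NULL if the file cannot be opened. */
FILE *open_report(const char *outfile)
{
    if (strcmp(outfile, "CON") == 0 || strcmp(outfile, "con") == 0)
        return stdout;
    return fopen(outfile, "w");
}
```

A caller using this sketch should take care not to fclose() the result when it is stdout.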

Step 7: The Unix C compiler detected an error that neither the Microsoft nor the VMS compiler found. Replace line 868 in RW-Listing Two with the line in Listing Thirteen to remove the use of a variable called "errno" that was used but never declared and could cause a run-time error.

Step 8: Memory size limitations are not a concern on the virtual machines for the requirements of the C printer. Replace lines 95, 117, 140, 163, 188 in RW-Listing Two with the line in Listing Fourteen (page 88) to use the MS-DOS constant to inhibit CP's byte limit errors on VMS and Unix.

Compiling and Running CP

Step 1: After completing the steps required to modify the C source code, the code can be compiled on all of the supported operating systems using the make utilities and files.

Step 2: CP does not interpret C preprocessor directives, so compile the code with the C preprocessor and then run CP on the preprocessed code. Listings Fifteen (page 88) and Sixteen (page 88) contain the commands needed to run the C preprocessor on the CP source files on VMS and Unix, respectively.

Step 3: The cp.cpi file shown in Listing Seventeen (page 88) must be used as the list file to run CP on the preprocessed source code. If run on VMS or Unix, the -n option should be used to suppress IBM-type graphics characters in the output.

Enhancing CP

The following enhancements to the C Printer were developed to support the construction of design documents for software created by The Boeing Company.

Step 1: Path names can be very long in hierarchical directory structures, so it is necessary to modify CP in order that it will accept these long names in its list file. Insert the line in Listing Eighteen (page 88) between lines 18 and 19 of RW-Listing One and change the value of the LEN_FILE constant to the maximum size needed.

Step 2: When using long file names, the boxes that are displayed by CP get overrun. The b option has been added to CP to allow the size of the box to be varied at run time. See Listings Twenty-One through Twenty-Nine (pages 88-89) for the code and instructions needed to implement this option. The bounds-checking values (Listings Twenty-Eight and Twenty-Nine) and the default value (Listing Twenty-Five) can be changed to suit your needs. Listing Twenty-Seven corrects two errors in the help screen, which most likely result from text inadvertently left out of the original program. Due to lack of space, the details of the code will not be discussed. If you have any questions about this option or about any aspect of the C Printer, write to me at the address given at the beginning of this article.
