Removing Comments

Good afternoon. I'm trying to remove comments from an input:
Code:
#include <stdio.h>
#define MAXLINE 100
int getLine(char*, int);
void ...

I notice that your for loop condition is *str != '\0', but you never actually change str. Consequently, str[i] eventually goes out of bounds. Perhaps you intended to compare with str[i] != '\0' instead, or something like that.

I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.

I think you need to decide: do you want to iterate over the string with a pointer or with an index? If you want to use a pointer, then get rid of the i index. If you want to use an index, then you should not be changing the value of str. (Well, you can do both, but you need to be extra careful, and there's no reason to complicate matters here.)

I have taken the liberty of converting to use array index notation. Now, the problem here is that your function does not remove comments. Rather, it removes comment markers by replacing them with spaces.

When I think about the process of removing comments, I tend to think in terms of a state machine: at first, we begin scanning in a "non-comment" state: in this state, whatever input we get is kept as output, until an opening comment marker is detected. When an opening comment marker is detected, we enter a "comment" state: in this state, whatever input we get is discarded (or replaced by a space), until a closing comment marker is detected. When a closing comment marker is detected, we re-enter the "non-comment" state.

One thing to note is that the state must persist between calls of the function since the opening and closing comment markers may be on different lines.

Removing comments from a source stream is easier if you write a very simple finite-state machine: read the input one character at a time and, based on the current state of your machine, decide what to do with the character and which state to change to. The machine itself is usually just a loop.

If you consider C and C++ style comments, then you have the following states, and changes to a new state based on the character seen:

Code:

NORMAL_CODE: Within normal non-comment code
    /       ⇒ AFTER_SLASH
    "       ⇒ DOUBLE_QUOTED, note (A)
    '       ⇒ SINGLE_QUOTED, note (A)
    others  ⇒ NORMAL_CODE, note (A)
SINGLE_QUOTED: A substate of NORMAL_CODE, single-quoted character constants
    '       ⇒ NORMAL_CODE, note (A)
    others  ⇒ SINGLE_QUOTED, note (A)
DOUBLE_QUOTED: A substate of NORMAL_CODE, double-quoted strings
    "       ⇒ NORMAL_CODE, note (A)
    others  ⇒ DOUBLE_QUOTED, note (A)
AFTER_SLASH: After a single slash in normal code
    /       ⇒ CPP_COMMENT
    *       ⇒ C_COMMENT
    others  ⇒ NORMAL_CODE, note (B)
CPP_COMMENT: Within a // comment
    newline ⇒ NORMAL_CODE, note (C)
    others  ⇒ CPP_COMMENT
C_COMMENT: Within a /* comment
    *       ⇒ C_COMMENT_ASTERISK
    others  ⇒ C_COMMENT
C_COMMENT_ASTERISK: After a * within a /* comment
    *       ⇒ C_COMMENT_ASTERISK
    /       ⇒ NORMAL_CODE
    others  ⇒ C_COMMENT, note (D)

Note (A): Output the current character before the transition,
          so that you keep (don't filter out) NORMAL_CODE.
Note (B): You need to output a slash before the transition, because
          the slash that caused the original transition from NORMAL_CODE
          to AFTER_SLASH was not output. Then treat the current character
          exactly as NORMAL_CODE would (it may be a quote).
Note (C): You should output the newline before the transition.
          The newline is not part of the comment, really; the newline
          ends the comment, and the line the comment was on.
Note (D): No output needed. The asterisk was part of the comment,
          and you skip comments.

The arrows define transitions to new states (when a specific character is seen).

To understand how the code works, open up an example input in a second window, keeping one finger on the current state in the table above. Whenever you look at the next character of input, look below the state to see which state you need to move your finger to. The notes tell you if there are any side effects you also need to do. Then you just keep doing that until there is no more input!

One way to implement the above is a very straightforward loop. In pseudocode:
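
A minimal C sketch of such a loop, using the state names from the table. Several things here are my assumptions and not part of the table: the function name strip_comments, working on a string instead of a stream (dst must have room for at least as many characters as src, plus the terminator), and the backslash handling inside quotes. Comments are deleted outright, not replaced by a space.

```c
#include <assert.h>
#include <stddef.h>

enum state { NORMAL_CODE, SINGLE_QUOTED, DOUBLE_QUOTED, AFTER_SLASH,
             CPP_COMMENT, C_COMMENT, C_COMMENT_ASTERISK };

/* Copy src to dst without comments; returns the output length. */
static size_t strip_comments(const char *src, char *dst)
{
    enum state state = NORMAL_CODE;
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];
        switch (state) {
        case AFTER_SLASH:
            if (c == '/') { state = CPP_COMMENT; break; }
            if (c == '*') { state = C_COMMENT; break; }
            dst[n++] = '/';            /* note (B): emit the held-back slash */
            state = NORMAL_CODE;
            /* fall through: c is handled as normal code */
        case NORMAL_CODE:
            if (c == '/') { state = AFTER_SLASH; break; }
            if (c == '"') state = DOUBLE_QUOTED;
            else if (c == '\'') state = SINGLE_QUOTED;
            dst[n++] = c;              /* note (A) */
            break;
        case SINGLE_QUOTED:
        case DOUBLE_QUOTED:
            dst[n++] = c;              /* note (A) */
            if (c == '\\' && src[i + 1] != '\0')
                dst[n++] = src[++i];   /* assumption: copy an escaped char as-is */
            else if (c == (state == DOUBLE_QUOTED ? '"' : '\''))
                state = NORMAL_CODE;
            break;
        case CPP_COMMENT:
            if (c == '\n') { dst[n++] = c; state = NORMAL_CODE; }  /* note (C) */
            break;
        case C_COMMENT:
            if (c == '*') state = C_COMMENT_ASTERISK;
            break;
        case C_COMMENT_ASTERISK:
            if (c == '/') state = NORMAL_CODE;
            else if (c != '*') state = C_COMMENT;   /* note (D) */
            break;
        }
    }
    if (state == AFTER_SLASH)
        dst[n++] = '/';                /* input ended right after a lone slash */
    dst[n] = '\0';
    return n;
}
```

For a stream, the same switch should work with getchar() supplying c and putchar() replacing the stores to dst, except that the escape lookahead would then need a small state of its own.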

The above saves some code, because instead of duplicating the quoted cases in AFTER_SLASH, the above shares the same code by falling through to NORMAL_CODE in those cases.

While the above may be hard to follow on the first read, the entire comment-removing program, including the ability to ignore slashes and asterisks in string constants, is just 80 lines of code (nicely formatted, not compacted, no tricks)!

Finite state machines are a very useful tool for any programmer. I do recommend reading about them, even if you cannot grasp them at first. They are fundamentally simple, and most people use relaxed versions of them in their real life instinctively, all the time. It just tends to be difficult at first to wrap your mind around the concepts.

The hardest part is learning how to design a good state machine. The key, I think, is being very anal retentive about considering all cases, then switching to being as lazy as possible so you can trim out unneeded cases, often by folding them into existing ones.

Thanks for explaining to me what a state machine is and what I can do with it. I understand that I can control the flow according to the state I'm in. Also, I keep printing thanks to the streams provided by stdin and stdout. I'll follow laserlight's advice about replacing the comments with blanks or discarding the input. By the way, how can I discard it? With fflush(stdin)? Many thanks.