String Parsing

Just to introduce myself again: my name is Shaun and I'm a 21-year-old engineering student with about... six weeks of C++ experience.

My main concern isn't my ability to get the job done. I have enough Matlab experience and logical thinking to accomplish this task. My worry is that with all the amazing libraries available for C++, there's probably a MUCH more efficient way of doing this.

The number of spaces is not consistent. I need to extract the data from column three and do n-gram analysis on it. If anyone is interested as to what exactly n-grams are and how they're used, I'll be happy to explain. However, for the sake of brevity, I'll just provide an example; this should be sufficient.

For the string "Shaun" I would need to produce

S
h
a
u
n
Sh
ha
au
un
Sha
hau
aun
Shau
haun
Shaun

I should point out that I did NOT stop at "Shaun" because I had reached the length of the word. No matter how long the string is, I will only break it up into pieces with a maximum length of 5.
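For what it's worth, the breakdown above can be sketched with std::string::substr and two nested loops. This is only a minimal sketch: make_ngrams and its max_len parameter are hypothetical names, not something from the original post, and it assumes the pieces should come out shortest first, as in the "Shaun" list.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (name and signature are assumptions): returns every
// substring of `word` whose length is between 1 and max_len, shortest first.
std::vector<std::string> make_ngrams(const std::string& word,
                                     std::size_t max_len = 5)
{
    std::vector<std::string> grams;

    // Never ask for pieces longer than the word itself.
    const std::size_t longest = std::min(max_len, word.size());

    for (std::size_t n = 1; n <= longest; ++n) {
        // Slide a window of width n across the word.
        for (std::size_t start = 0; start + n <= word.size(); ++start) {
            grams.push_back(word.substr(start, n));
        }
    }
    return grams;
}
```

Calling make_ngrams("Shaun") gives the 15 pieces listed above. For a longer word the cap kicks in, so a 7-letter word yields pieces of length 1 through 5 only.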

So, using a column-based approach, I was able to accomplish this using a combination of Matlab and Excel. However, I'd like to do it in C++ with Visual Studio 7.1.

My idea is to first use regular expressions to look for a space followed by any number of optional spaces. I'd replace every match with a comma, thus giving me a file delimited by commas and not a varying number of spaces.
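A rough sketch of that replacement step, in case it helps. It assumes a compiler that ships std::regex (C++11 or later), which Visual Studio 7.1 does not; Boost.Regex offers a similarly named boost::regex_replace that could be used there instead. The function name spaces_to_commas is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: collapses every run of one or more spaces into a
// single comma. Assumes std::regex is available (C++11 or later).
std::string spaces_to_commas(const std::string& line)
{
    // " +" means a space followed by any number of further spaces.
    static const std::regex one_or_more_spaces(" +");
    return std::regex_replace(line, one_or_more_spaces, ",");
}
```

Note that a line with leading or trailing spaces would end up with a leading or trailing comma, so the input may need trimming first.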

Next, I can use the ifstream::get() function to break up the columns, discarding the first and second columns and writing the characters in the third column to an object str of the class string, while looking for a \n to stop on.

Once I have str, I can break it up using... some function. This is the part I really need your help on.

Once I have broken it up, I'll store the pieces somewhere (I can do this part later, it's more complicated and is my task for next week) and then loop through again, discarding columns 1 and 2 from the next line and so on.

That's where I stand. I'm installing Boost right now and reading up on its regular expression capabilities.

The operator>> stops at whitespace by default, so it is rather easy to read in values separated by spaces. If you know there are three values to a line and none have embedded spaces, then something like this would read in a line at a time (assume fin is an input stream):

Code:

while (fin >> val1 >> val2 >> val3)
{
    // process the three values here, or ignore the first two and process the third
}
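To make that loop concrete, here is a small self-contained sketch that keeps only the third column. It assumes every line has exactly three whitespace-separated fields with no embedded spaces; the function names and the column variables are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads three whitespace-separated fields at a time
// and collects the third one. operator>> skips any amount of whitespace,
// so the varying number of spaces between columns does not matter.
std::vector<std::string> read_third_column(std::istream& in)
{
    std::vector<std::string> third;
    std::string col1, col2, col3;
    while (in >> col1 >> col2 >> col3) {
        third.push_back(col3);  // columns one and two are simply ignored
    }
    return third;
}

// Convenience wrapper for trying it out on text held in memory.
std::vector<std::string> read_third_column_from_text(const std::string& text)
{
    std::istringstream in(text);
    return read_third_column(in);
}
```

With a real file, an std::ifstream can be passed to read_third_column directly, since it is an std::istream.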

That's much easier than using get().

You're absolutely correct, and I actually knew that rule.

I shoulda caught that one. Oh well, thanks, that probably saved me quite a bit of time already.

You might want to be careful there: strlen is the name of a function from <cstring>, which some library implementations might include in <string>. Combined with the using directive, this could cause a name collision with your strlen variable.
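One way to sidestep the problem, sketched under the assumption that the variable just holds the length of the word: give it a different name and qualify standard names with std:: instead of relying on a using directive. The names count_ngrams and word_len are made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical example: no "using namespace std;" and no variable called
// strlen, so nothing here can clash with the C library function.
std::size_t count_ngrams(const std::string& word, std::size_t max_len = 5)
{
    const std::size_t word_len = word.size();  // was "strlen" in spirit

    std::size_t total = 0;
    for (std::size_t n = 1; n <= max_len && n <= word_len; ++n) {
        total += word_len - n + 1;  // number of windows of width n
    }
    return total;
}
```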