Counting the Number of Words in a Text File in STL / C++

This post aims to illustrate the power of using the STL’s associative containers as a word counter. The program reads the entire contents of a text file, word by word, and keeps a running total of the number of occurrences of each word, all in just a few lines of code, discounting the bits that output the results.
There is a deceptively simple line of code

counter[ tok ]++;

that looks up the counter value for a given word and increments its value. It navigates the balanced tree used by the standard map (typically a red-black tree) to find the appropriate tree node, inserting a new node with a default-constructed value if the word has not been seen before, selects the T portion of the pair <Key,T>, and then increments it.
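To make that lookup-or-insert behaviour concrete, here is a rough sketch of what counter[ tok ]++ amounts to when spelled out by hand. It assumes a plain std::map<std::string, int> rather than the WordCounter type, and the function name increment_explicit is hypothetical, chosen only for this illustration.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: spells out what counter[tok]++ does for a
// std::map<std::string, int>.
void increment_explicit(std::map<std::string, int>& counter,
                        const std::string& tok)
{
    // Find the first node whose key is not less than tok (logarithmic time).
    std::map<std::string, int>::iterator it = counter.lower_bound(tok);

    // If the key is absent, insert it with a zero count, passing the
    // lower_bound position as an insertion hint.
    if (it == counter.end() || counter.key_comp()(tok, it->first)) {
        it = counter.insert(it, std::make_pair(tok, 0));
    }

    // 'second' is the T portion of the stored pair<const Key, T>.
    ++it->second;
}
```

In practice the one-line form is preferable; the expanded version is only meant to show the steps hidden behind operator[].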

The WordCounter also needs a default constructor that sets the counter value to zero. The first time the counter[ tok ]++; code runs for a particular word, the map creates a new WordCounter for it, and a constructor that left the integer uninitialised would leave it holding some arbitrary value contained in the memory.

As an example, let’s apply this to a sample piece of text, an excerpt from Shakespeare’s Hamlet. (Click here for Richard E. Grant’s superb rendition.)

I have of late – but wherefore I know not – lost all my mirth,
forgone all custom of exercises;
and indeed it goes so heavily with my disposition that this goodly frame,
the earth, seems to me a sterile promontory;
this most excellent canopy, the air, look you, this brave o’erhanging firmament,
this majestical roof fretted with golden fire,
why, it appears no other thing to me than a foul and pestilential congregation of vapours.
What a piece of work is a man! how noble in reason! how infinite in faculty!
in form and moving how express and admirable!
in action how like an angel! in apprehension how like a god!
the beauty of the world! the paragon of animals!
And yet to me, what is this quintessence of dust?
man delights not me: no, nor woman neither.

To get a word count something like this:

And so on…

Further improvement: removing unwanted characters

The output may be further refined by including a kind of filter to strip out any unwanted characters like ‘?’, ‘;’ etc. from each word before it is counted, and the output formatting can be improved by padding each word to a fixed width with the std::setw manipulator.
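One possible way to do this is sketched below. It assumes a simple blacklist of ASCII punctuation and a plain std::map<std::string, int>; the names strip_unwanted, count_clean_words and print_padded are hypothetical. Characters outside the blacklist, such as typographic apostrophes or dashes, are left untouched by this version.

```cpp
#include <iomanip>
#include <istream>
#include <map>
#include <ostream>
#include <string>

// Hypothetical filter: drop every character that appears in 'unwanted'.
std::string strip_unwanted(const std::string& word,
                           const std::string& unwanted = ".,;:!?")
{
    std::string cleaned;
    for (std::string::size_type i = 0; i < word.size(); ++i) {
        if (unwanted.find(word[i]) == std::string::npos) {
            cleaned += word[i];
        }
    }
    return cleaned;
}

// Count words after filtering; tokens that become empty are skipped.
std::map<std::string, int> count_clean_words(std::istream& in)
{
    std::map<std::string, int> counter;
    std::string tok;
    while (in >> tok) {
        const std::string cleaned = strip_unwanted(tok);
        if (!cleaned.empty()) {
            counter[cleaned]++;
        }
    }
    return counter;
}

// Print each word left-aligned and padded to 'width' columns,
// followed by its count.
void print_padded(std::ostream& os,
                  const std::map<std::string, int>& counter,
                  int width = 20)
{
    for (std::map<std::string, int>::const_iterator it = counter.begin();
         it != counter.end(); ++it) {
        os << std::left << std::setw(width) << it->first
           << it->second << "\n";
    }
}
```

Note that std::setw applies only to the next item written, so it has to be repeated for every word, whereas std::left stays in effect on the stream until changed.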

Giving the following output, with extra padding and minus the extraneous characters:

Further improvement: sorting the words and their counts

Now that we have the full set of words and their frequencies, we may wish to make the output more presentable by sorting them by count. As described in this other post, you cannot just sort a std::map like you can a std::vector, since a std::map always keeps its elements ordered by key. You have to first copy the std::map key-value pairs into a std::vector, and then sort that std::vector with some kind of comparison function or functor, as shown in this next example:
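A minimal sketch of that copy-then-sort approach, assuming plain int counts and a hypothetical functor named ByCountDescending that puts the most frequent words first and breaks ties alphabetically:

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;

// Comparison functor: higher counts first, ties broken alphabetically.
struct ByCountDescending {
    bool operator()(const WordCount& a, const WordCount& b) const
    {
        if (a.second != b.second) {
            return a.second > b.second;
        }
        return a.first < b.first;
    }
};

// Copy the map's key-value pairs into a vector and sort that instead.
std::vector<WordCount> sorted_by_count(const std::map<std::string, int>& counter)
{
    std::vector<WordCount> items(counter.begin(), counter.end());
    std::sort(items.begin(), items.end(), ByCountDescending());
    return items;
}
```

The map itself is left unchanged, so lookups by word still work after sorting. The tie-break on the word keeps the output order deterministic, which std::sort alone does not guarantee for equal counts.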

Hi guarang
1. This is an output streamer for printing the contents of the WordCounter ‘value’ part. It is invoked by the << (*it).second part of the 'for' loop, where (*it).second is the WordCounter bit of the std::map being used.
2. It just so happens that the default for the third template parameter of std::map is std::less, so you probably don't even need to define it here for this simple application.
3. Is this a question or your opinion? All this bit does is shove each line read from the input ifstream into a string called 'tok'.