How Can I Tell Whether a Phrase Occurs At Least Twice in a Text File?

Hey, Scripting Guy! How can I tell whether or not the phrase 226 transfer complete occurs at least twice in a text file?

-- JR

Hey, JR. You know, this was a tough one for us to answer; that’s because the Scripting Guys are always content with just one of everything. Take the Scripting Guy who writes this column, for example. He has one son; he writes one column; he was right one time in his life. (Editor’s Note: When was that?) A second piece of pie? No, thank you; one is plenty.

Um, what kind of pie are we talking about?

The point here (assuming that there is a point here) is that determining whether or not a particular phrase occurs more than once in a text file is something a Scripting Guy would never do; we’re not the greedy type. On the other hand, though, if someone asked us for help it wouldn’t be very polite to ignore them, would it? Tell you what: we’ll see what we can do.

Oh: and we’ll take that second piece of pie, too. Just to be … polite.

As you pointed out, JR, this task is actually a bit more complicated than it might first appear. Sure, you can use the InStr function to determine whether the string 226 transfer complete appears in the file (although even there we have a problem, as we’ll explain in a moment). However, InStr simply tells you where the first instance of the target phrase starts, or gives you back a 0 if the phrase can’t be found. In effect, that’s a yes-no answer: yes, the target phrase was found, or, no, the target phrase was not found. What InStr won’t tell you is how many times the target phrase can be found.
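To see what InStr does (and doesn’t) give you, here’s a minimal sketch. The sample text is made up for illustration; it simply stands in for whatever is in your file:

```vbscript
' Hypothetical sample text standing in for the contents of the file.
strContents = "150 opening connection" & vbCrLf & _
              "226 transfer complete" & vbCrLf & _
              "226 transfer complete"

' InStr returns the 1-based position of the first match, or 0 if there is none.
' The start position (1) must be supplied whenever a compare mode is given;
' vbTextCompare makes the search case-insensitive.
intPosition = InStr(1, strContents, "226 transfer complete", vbTextCompare)

If intPosition > 0 Then
    Wscript.Echo "Found, starting at character " & intPosition
Else
    Wscript.Echo "Not found"
End If
```

The phrase appears twice in the sample text, but InStr only ever reports the first occurrence.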

Of course, we could try using the Split function to split the contents of the file on the target phrase; that would give us an array that – after a little mathematical wizardry – would eventually tell us how many instances of the target phrase occurred. Except for one thing: as JR noted, there’s no guarantee that all the words of the target phrase will appear on the same line of the text file. For example, suppose we had this very simple text file:

226
transfer complete

Does the phrase 226 transfer complete appear in this file? Believe it or not, it doesn’t: if you try splitting the file on the phrase 226 transfer complete nothing will happen. Why not? Because of the carriage return-linefeed that appears at the end of the first line. Technically, this is the string that makes up the contents of our sample file:

"226" & vbCrLf & "transfer complete"

That’s a problem.
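Here’s a rough sketch of the Split approach and the way it falls down. The function name CountBySplit and the two sample strings are our own inventions for illustration:

```vbscript
' Hypothetical helper: counts occurrences of strPhrase by splitting on it.
Function CountBySplit(strText, strPhrase)
    Dim arrPieces
    ' Split returns an empty array for an empty string, so handle that first.
    If Len(strText) = 0 Then
        CountBySplit = 0
        Exit Function
    End If
    ' Split returns one more piece than there are delimiters, so the upper
    ' bound of the zero-based array equals the number of occurrences.
    arrPieces = Split(strText, strPhrase, -1, vbTextCompare)
    CountBySplit = UBound(arrPieces)
End Function

strSameLine = "226 transfer complete" & vbCrLf & "226 transfer complete"
strBroken   = "226" & vbCrLf & "transfer complete"

Wscript.Echo CountBySplit(strSameLine, "226 transfer complete")   ' 2
Wscript.Echo CountBySplit(strBroken, "226 transfer complete")     ' 0
```

The second call comes back with 0 because the carriage return-linefeed sits where the phrase expects a blank space.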

Now, admittedly, there are a couple of clever ways that we could manipulate this file and then still use the Split function to count up the number of times the target phrase appears. But the Scripting Guys are too lazy to do anything clever. Because of that, we used this script instead:
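Pieced together from the walkthrough that follows, the script looks roughly like this. Treat it as a reconstruction based on the description below rather than a guaranteed character-for-character copy of the listing:

```vbscript
Const ForReading = 1

' Read the entire file into memory, then close it.
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading)
strContents = objFile.ReadAll
objFile.Close

' Configure a case-insensitive search that finds every match, not just the first.
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.IgnoreCase = True
objRegEx.Global = True
objRegEx.Pattern = "226\W{1,}transfer\W{1,}complete"

' Run the search and report how many matches came back.
Set colMatches = objRegEx.Execute(strContents)
Wscript.Echo "Total Matches: " & colMatches.Count
```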

The secret to this approach is that we use a regular expression to search for the phrase 226 transfer complete. Why do we use a regular expression? That’s easy: regular expressions return a collection of all the matches found. To determine how many instances of our target phrase occur in the file all we have to do is determine how many items are in the collection of matches.

Of course, even with a regular expression we can’t just search for the phrase 226 transfer complete. Why not? Because we still face the problem of handling instances of the target phrase that get broken across lines:

226
transfer complete

Did we find a way to deal with that problem? To find out, just keep reading.

Note. Sorry, but we don’t want to spoil the suspense.

Let’s see if we can figure out how the script works. To begin with, we define a constant named ForReading and set the value to 1; we’ll use this constant in order to open the text file for, well, reading. We create an instance of the Scripting.FileSystemObject, then use this line of code to open the file:

Set objFile = objFSO.OpenTextFile("C:\Scripts\Test.txt", ForReading)

As soon as we have the file open we use the ReadAll method to read in the contents of the file and store that information in a variable named strContents. We need to do this because the FileSystemObject can’t search a file on disk; instead we need to search a copy of the file stored in memory. And because we now have that copy in memory, we use the Close method to close the file as soon as we’ve finished reading in the contents.
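One thing to watch out for: ReadAll raises an "Input past end of file" error if the file is empty. Here’s a slightly more defensive sketch of the read-and-close step; the function name ReadWholeFile is hypothetical, and returning an empty string for a missing or empty file is just one way you might choose to handle it:

```vbscript
Const ForReading = 1

' Hypothetical helper: returns the whole file as one string, or an empty
' string if the file is missing or has no content.
Function ReadWholeFile(strPath)
    Dim objFSO, objFile
    ReadWholeFile = ""
    Set objFSO = CreateObject("Scripting.FileSystemObject")
    If Not objFSO.FileExists(strPath) Then Exit Function
    ' ReadAll fails on a zero-byte file, so check the size first.
    If objFSO.GetFile(strPath).Size = 0 Then Exit Function
    Set objFile = objFSO.OpenTextFile(strPath, ForReading)
    ReadWholeFile = objFile.ReadAll
    objFile.Close
End Function

strContents = ReadWholeFile("C:\Scripts\Test.txt")
```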

Now the fun begins. We start out by creating an instance of the VBScript.RegExp object. We then configure two property values for the regular expression object:

• IgnoreCase. We set this value to True, which means our search will not be case sensitive. (In other words, 226 Transfer Complete and 226 transfer complete will both register as matches.)

• Global. We set this value to True to ensure that we locate all instances of the target phrase. If set to False the regular expression object would look for the first instance of the target phrase and then stop looking.
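If you’d like to see those two properties in action, here’s a small sketch. The CountMatches function and the sample string are made up for this demonstration, and the pattern is deliberately the plain phrase (with ordinary blank spaces) so that only the two properties are in play:

```vbscript
' Hypothetical helper: counts matches of strPattern using the given settings.
Function CountMatches(strText, strPattern, blnIgnoreCase, blnGlobal)
    Dim objRegEx
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.IgnoreCase = blnIgnoreCase
    objRegEx.Global = blnGlobal
    objRegEx.Pattern = strPattern
    CountMatches = objRegEx.Execute(strText).Count
End Function

strSample = "226 Transfer Complete" & vbCrLf & "226 transfer complete"

' IgnoreCase True, Global True: both lines are counted.
Wscript.Echo CountMatches(strSample, "226 transfer complete", True, True)    ' 2
' Global False: the search stops after the first match.
Wscript.Echo CountMatches(strSample, "226 transfer complete", True, False)   ' 1
' IgnoreCase False: only the all-lowercase line matches.
Wscript.Echo CountMatches(strSample, "226 transfer complete", False, True)   ' 1
```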

That brings us to this line of code:

objRegEx.Pattern = "226\W{1,}transfer\W{1,}complete"

As you might have guessed, the Pattern property represents our target phrase. If you squint your eyes and hold your monitor up to the light, you can probably see the words 226, transfer and complete in the value we’re assigning to the Pattern property. But what’s the deal with those two instances of \W{1,}?

Good question. We’ve already determined that we can’t just search for the phrase 226 transfer complete. Why not? Well, for one thing, the word 226 could be followed by a blank space; however, it could also be followed by a carriage return-linefeed. And that makes a big difference: 226 blank space is definitely not the same thing as 226 carriage return-linefeed.

Fortunately, regular expressions are designed to deal with ambiguous situations like that. What does \W{1,} mean? To begin with, the \ tells VBScript that the next character in the string is a special character; in other words, we’re saying, “Don’t look for a W. Instead, look for a ‘non-word’ character.” In regular expressions, a non-word character is any character that is not a letter, a number, or an underscore. The blank space isn’t any of those, and neither are the characters that make up a carriage return-linefeed, so \W can match a blank space as well as the characters in a carriage return-linefeed.

Cool, huh?
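A quick way to convince yourself is to test a few individual characters against \W. The IsNonWord function below is a hypothetical helper written just for this check:

```vbscript
' Hypothetical helper: True if the single character strChar matches \W.
Function IsNonWord(strChar)
    Dim objRegEx
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.Pattern = "\W"
    IsNonWord = objRegEx.Test(strChar)
End Function

Wscript.Echo IsNonWord(" ")     ' True: a blank space
Wscript.Echo IsNonWord(vbCr)    ' True: a carriage return
Wscript.Echo IsNonWord(vbLf)    ' True: a linefeed
Wscript.Echo IsNonWord("-")     ' True: punctuation counts, too
Wscript.Echo IsNonWord("a")     ' False: a letter
Wscript.Echo IsNonWord("7")     ' False: a number
Wscript.Echo IsNonWord("_")     ' False: the underscore is a word character
```

Notice that punctuation qualifies as well, which means the pattern would also match something like 226-transfer-complete. If that’s looser than you want, the whitespace class \s is one alternative you could try in place of \W; that swap is our suggestion, not something the script in this column does.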

So then what’s the {1,} for? What we’re doing here is specifying the number of non-word characters allowed to come after 226. The 1 tells the script that there must be at least one non-word character after 226. The comma followed by nothing tells the script that while there must be at least one non-word character there could be more than one; we’re fine with that. And that’s good, because, technically, a carriage return-linefeed actually consists of two characters: the carriage return and the linefeed. That’s why we can’t match just one character: criteria like that would find the blank space but not the carriage return-linefeed.
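The difference is easy to demonstrate with the broken-across-lines example from earlier. This is only a sketch: the sample string is ours, and the last pattern uses +, which is standard regular expression shorthand for {1,}:

```vbscript
' The target phrase split across two lines by a carriage return-linefeed.
strBroken = "226" & vbCrLf & "transfer complete"

Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.IgnoreCase = True

' Exactly one non-word character: matches the carriage return, but then
' runs into the linefeed where it expects the t in transfer.
objRegEx.Pattern = "226\Wtransfer\Wcomplete"
blnExactlyOne = objRegEx.Test(strBroken)
Wscript.Echo blnExactlyOne      ' False

' One or more non-word characters: swallows both halves of the line break.
objRegEx.Pattern = "226\W{1,}transfer\W{1,}complete"
blnOneOrMore = objRegEx.Test(strBroken)
Wscript.Echo blnOneOrMore       ' True

' The + quantifier means the same thing as {1,}.
objRegEx.Pattern = "226\W+transfer\W+complete"
blnPlus = objRegEx.Test(strBroken)
Wscript.Echo blnPlus            ' True
```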

Note. So how do we know all this stuff? Well, to tell you the truth, we don’t. What we do know, however, is how to look up regular expression syntax in the VBScript Language Reference.

Of course, we need to put this same little construction – \W{1,} – after the word transfer. That’s because we could end up with either a blank space or a carriage return-linefeed in there, too.

From here on out it’s easy. With this line of code we call the Execute method; in turn, this causes the regular expression object to start searching through strContents, looking for the Pattern we set just a second ago:

Set colMatches = objRegEx.Execute(strContents)

As we mentioned earlier, the Execute method returns a collection of all the matches found (a collection we named colMatches). To determine how many instances of the target phrase appear in the text file we simply echo back the value of the collection’s Count property:

Wscript.Echo "Total Matches: " & colMatches.Count
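And if all you really want is a yes-or-no answer to the original question (does the phrase occur at least twice?), you could compare the Count property to 2. Here’s one way that might look; the function name OccursAtLeastTwice and the sample text are hypothetical:

```vbscript
' Hypothetical helper: True if the target phrase occurs two or more times.
Function OccursAtLeastTwice(strText)
    Dim objRegEx
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.IgnoreCase = True
    objRegEx.Global = True
    objRegEx.Pattern = "226\W{1,}transfer\W{1,}complete"
    OccursAtLeastTwice = (objRegEx.Execute(strText).Count >= 2)
End Function

' Sample text: one instance on a single line, one broken across two lines.
strSample = "226 Transfer Complete" & vbCrLf & _
            "226" & vbCrLf & "transfer complete"

If OccursAtLeastTwice(strSample) Then
    Wscript.Echo "The phrase occurs at least twice."
Else
    Wscript.Echo "The phrase occurs fewer than two times."
End If
```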

That’s all we have to do.

Well, that and go ahead and take a third piece of pie. After all, we don’t want you to think we left that last piece of pie because we didn’t like it. Although we’d just as soon not eat it we don’t want to hurt anyone’s feelings.