Using Excel to Count Text Occurrences

SEARCH BLOG

[UPDATE 1/19/2015: A couple of comments have pointed out that COUNTIF would address this scenario in a single formula. That’s a great point…and one that had not occurred to me. The overall scenario here is pretty straightforward, so there are likely other equally efficient (or more efficient) ways to address the task. I’m leaving the post as is, because I think it’s a useful exercise on how nested Excel formulas can be used to parse text. And I’m also curious what other solutions might get proposed in the comments.]

I had this come up a couple of weeks ago with a client, and I realized it was something I’d done dozens of times…but had never written down the “how” on doing. So, here we go. This is a post about one very specific application of Excel, but it is also implicitly a post about how, with an intermediate level of knowledge of Excel, with a little bit of creativity, and a strong aversion to manually parsing/copying/pasting anything, a spreadsheet can accomplish a lot! And very quickly!

The Use Case

The use case where I’ve used this approach most often is with social media exports — most often, with Twitter. In the most recent situation, my client had an export of all tweets that used a specific conference hashtag. Her organization was trying to introduce a secondary (relevant) topic to the conversation around the conference, and they had a separate hashtag. So, she was looking to identify, from the 16,000 tweets at the event, what percent of them also included the hashtag that her organization was interested in? That’s a simple and reasonable ask, and, if the tweet volume is reasonable (let’s say less than 500,000), easy enough to do in under 2 minutes in Excel.

The Example

Obviously, I’m not going to use my client’s data here. But, it turns out that my own tweets are a reasonable proxy. I tweet sporadically, but I know that a decent chunk of my tweets use the “#measure” hashtag. So, how many of my tweets use that hashtag? Thanks to http://analytics.twitter.com, it’s easy enough for me to get an export of my tweets. I just exported the default, which was 1,356 tweets going back to early October 2013. Opening the .csv in Excel, it looks like this:

Simple enough. I just want to go through and add a flag to identify every row where column C contains the word “#measure.”

Step 1: Make It a Table and Add a Column for the Flag

This step isn’t strictly necessary, but Excel tables make soooooo many things more easy, that I’m including it here. If you’re, like, “What are you talking about? Isn’t data in rows and columns in a spreadsheet a ‘table’ already?” well… stop reading this post and go read this one. It’s two clicks to make the data into a table, so do that…and add a column where we’re going to put our flag:

Simple enough. I just want to go through and add a flag to identify every row where column C contains the word “#measure.”

Step 2: Use FIND() to Look for ‘#measure’

This is the core formula. All we need to do is add the FIND() formula to the rows in the first column to search column D (“[@[Tweet text]]”) for occurrences of “#measure:”

Once we add that formula to cell A2, it will autofill for all rows in the table and the table will now look like this:

That’s kind of ugly, isn’t it? But we now know that rows 7, 8, 12, and 19 all included the word “#measure,” because the FIND() formula tells us where in the cell the word started. All of the other rows didn’t include the word “#measure,” so they returned a #VALUE error.

The bulk of the work is done…but we’re not quite there yet, because we don’t yet have a pure “flag.”

Step 3: Use ISERROR() to Make a Flag

We can nest our original FIND() formula inside an ISERROR() formula. If we do that, then all of the #VALUE values will instead show as “TRUE,” and all of the situations where the FIND() formula returns an actual number will show as “FALSE.”

The resulting table:

<Whew> Isn’t that cleaner? Now, every value is either “TRUE” or “FALSE,” so we now have a true “flag.” But, this flag is a little confusing, because it’s “FALSE” whenever the tweet contains the hashtag “#measure.” That may be fine if we can just keep that straight and jump straight to step 5, but why not make it a bit more intuitive with one additional update to the formula?

Step 4: Use IF() to Flip the Flag

Since our ISERROR() is going to return a TRUE/FALSE response, we can nest the whole formula in an IF() statement to make those flags into a Yes/No flag that makes more intuitive sense:

The IF returns “Yes” instead of “FALSE” and returns “No” instead of “TRUE.” Not necessary for this exercise, but I went ahead and added a little conditional formatting to highlight the rows that include ‘#measure’ (based on whether the Column A value is “Yes”):

Step 5: A Case-Sensitivity Precaution

In this example, all of the tweets are my own, and I always use an all-lowercase “#measure.” But “FIND” is case-sensitive, so, what if I had used “#Measure” a few times? Or “#MEASURE?” Those would be mis-flagged using the above approach. So, it’s worth one more tweak to the formula to force the entire tweet to be all-lowercase before running the FIND() formula on it:

Note how the LOWER() addition is inside the FIND() function. Since Excel uses parentheses like plain old math does, the innermost parentheses (functions) will get evaluated first, and the first thing we want to do is make the tweet text all lowercase.

Step 6: Summarize with a Pivot Table

There are lots of ways this data could be summarized. You could just sort the table descending by the first column and see what row the last “Yes” occurs on. You could have used “1” and “0” rather than “Yes” and “No” in the formula and then just summed column A.

But, I’m never one to miss an opportunity to apply a pivot table. In a handful of clicks, we get our summary:

Voila! 14% of the tweets in the data set included the string “#measure” (regardless of case usage).

In Reality: Six Steps Were Three

When I most recently did this for a client, it wasn’t really six steps. It was three: 1) create the table, 2) plug in a formula, 3) generate a pivot table. But, I realize that just throwing out =IF(ISERROR(FIND(“#measure”,”LOWER([@[Tweet text]]))),”No”,”Yes”) can be a little intimidating. I do regularly iterate “from the inside out” when building formulas. The result can look messy, but not as messy as manually inspecting tweets!

Now…chime in with the other 10 ways this exercise could have been approached entirely differently!

10 Comments

GilesJanuary 19th, 2015

Now…chime in with the other 10 ways this exercise could have been approached entirely differently!

–> How about =COUNTIF([range],”*#measure*”)

Andy BattenJanuary 19th, 2015

Ditto; I find COUNTIF/IFS with wildcard match to be easier since it won’t produce an error, thus you can eliminate a formula from the example above.

As soon as you tell me that I’m more likely to be able to fully grok Python than I was able to comprehend R, Randy, I’ll track down a Coursera course and have at it. I aced the R course…but still wasn’t able to actually apply it.

If I used IFERROR, it would actually return the position of the string when the string existed, and I was looking for more of a binary flag. Right?

atteboyskieMarch 5th, 2015

Following your approach, this would be a little simpler:

=IFERROR(IF(SEARCH(“#measure”, E4, 1)>0, “Yes”, ),”No”)

The “SEARCH()” function works the same way as “FIND()” but isn’t case sensitive, so you could drop the “LOWER()” bit.

Or easier still, you could just filter the data (ctrl+shift+l), click the arrow at the top of the column with the tweets and search for “#measure” (again, not case sensitive). Then you could just select all of the cells in the column and look at the count at the bottom (or enter “Yes” in a column out to the side and drag it down to tag all of the cells that contain the value).

Tim has moved on from Analytics Demystified effective 12/31/2017 but his content lives on. If you have questions for Tim please send them to eric@analyticsdemystified.com directly and they will get routed.