I don't know about you, but my favorite thing to do is to pass off a built in Excel feature as one of my own genius report development ideas.

Me: "I made it so that you can double-click on any number inside the Pivot Table to drill into details"

Client: "ooh….nice feature Mike. You're a genius!"

Me: "Well what can I say...I give because I love."

.

Well that nifty trick of double-clicking inside a Pivot Table to drill down has a side effect; it leaves behind lots of worksheets. In most cases, we don't want to keep these sheets. In fact, they often become a nuisance, forcing us to take the time to clean them up by deleting them. This is especially a problem when you distribute pivot table reports to users who frequently drill into details. There is no guarantee they will every clean up the drilldown sheets. Although it is unlikely these sheets will cause issues, they can clutter up the workbook.

.

Today, I'll share with you a little VBA that removes the junk drill-down sheets left behind from double-clicking on a Pivot Table.

.

The basic premise of this VBA is actually very simple. When the user clicks into details, outputting a drill down sheet, the macro follows along and simply renames the output sheet so that the first 10 characters are "PivotDrill". Then before the workbook closes, the macro will find any sheet that starts with "PivotDrill" and deletes it. The implementation does get a bit tricky because you essentially have to two pieces of code. One piece goes in the Worksheet_BeforeDoubleClick event, while the other piece goes into the Workbook_BeforeClose event.

.

The first thing you'll need to do is enter this code into the Worksheet_BeforeDoubleClick event (for the sheet your Pivot Table sits in).

I would bet that every one of us has seen Customer lists from separate sources that can't be matched up because of spelling mistakes and inconsistent naming conventions and abbreviations. In fact, there are consulting firms whose entire business revolves around helping organizations reconcile their "Master Data". I can't count how many times I've seen entire projects launched to create a "gold source for master data". Yet here we are; manually matching up addresses, customer names, or people names from different data sources.

.

Well today, I'd like to introduce you to a free tool from Microsoft that can help ease your pain a little. This tool is a free Excel Add-in called Fuzzy Lookup. The Fuzzy Lookup Add-in enables you to perform similarity analysis to match textual strings that mean the same thing but are not spelled exactly the same. With this tool, you can match things like customer addresses from two separate data sources, or even find imperfect duplicates in a single dataset.

.

Understanding the Jaccard Index of Similarity

The magic behind the Fuzzy Lookup Add-in comes from the Jaccard Index of Similarity.

The Jaccard index, also known as the Jaccard similarity coefficient, was developed by Paul Jaccard as a way to document the distribution of different types of flora (yawn). Jaccard's index gave him a statistical way to measure similarities between sample sets. The gist of Jaccard's index is this:

.

You take two sample sets. You count the attributes shared by both samples (call this Z). You count the attributes observed in only one sample (call this X). You then count the attributes observed in only the other sample (call this Y). Finally, you can calculate the similarity index by using the three counts in this operation: Z / (Z+X+Y).

.

Confused? Don't worry. I have an example.

Let's say we want to compare the similarities between these two Excel nerds.

We define some attributes, then we tag each nerd with a 1 when each attribute evaluates to TRUE.

.

We can now get the count of the intersections and then perform the math.

As you can see, the similarity index is .57 or 57%.

The more similar two sample sets are, the closer Jaccard's index will be to 1.

In its simplest form, the same kind of analysis can be done on Textual Strings using the Jaccard Index. You can see in this example, the words hair and hare result in an index of .60.

By the way, in terms of textual matches, 60% is not a great score. In my experience, most textual matches below 70% index is suspect.

Understanding the Fuzzy Lookup Transformations

Ok. Now you know that the Fuzzy Lookup algorithm is based on the Jaccard Index of Similarity. But it's important to note that Fuzzy Lookup does not run on your raw data. Before running the Jaccard Index, Fuzzy Lookup applies some built-in transformation algorithms designed to improve its matching results.

The first thing to understand is that Fuzzy Lookup converts data using something called a Tokenizer. For example, the record {"DataPig", "Bacon Boulevard"} might be tokenized into the set, {" DataPig", "Bacon", "Boulevard"}. Each word is considered to be a token. Tokens are assigned weights based on frequency of use. High weights are applied to tokens that are uncommon. Low weights are applied to tokens that come up frequently. For example, frequent words such as "Boulevard" are given a lower weight than less frequent words such as "DataPig". Fuzzy Lookup gives you the opportunity to override the default token weights by supplying your own weights via an Excel Table.

.

Fuzzy Lookup also has algorithms and built-in dictionaries that allow for the automatic correction of spelling mistakes, word merge scenarios, string split scenarios, and string prefix corrections. All of these automatic transformation can be turned on and off by via the Fuzzy Lookup Configure dialog box. You can add custom Transformation rules by pointing the TransformationRowsetName property to your own Excel Table of rules.

Note: The sample workbook installed with the Fuzzy Lookup Add-in includes a Customization tab that shows you how to set up your own custom transformation table.

In this example, we have two Excel Tables (IndProfile and CustInfo). We want to match up the customer names.

.

Click the Fuzzy Lookup command button to launch the tool.

.

In the Fuzzy Lookup dialog box, do the following:

1. Select the two tables you want matched.

2. Select the columns that hold the data you want matched.

3. Click the join button to drop your two columns down to the Match Columns section.

4. Select all the columns you want included in your output. Be sure to include the FuzzyLookupSimilarity column.

5. Define how many matches you want per record. I recommend setting this to 1.

6. Define the Similarity Threshold. This is the Jaccard Index number. As mentioned before, in my experience, any match with an index of less than .70 is typically not that accurate. I usually set this to .60 just to see the results.

.

Once you have defined your selections, be sure to place your cursor in a new blank worksheet. Be warned, Fuzzy Lookup WILL OVERWRITE DATA.

Press the Go button to output the results.

You will see a new table that includes the Output Columns you selected (in step 4 above). The Similarity column will show you the Jaccard index for each match. Any row with no matches will get an index of 0.

.

It's important to review any matches with a similarity index less than 1.

Don't take for granted that Fuzzy Lookup will get everything right.

For example, you can see in Row 7 below, Fuzzy Lookup gave us a .86 index score on the match for NVISION Inc. and WORLD VISION INC. An index score of .86 is typically a good match. However, we can see that in this case, the two names are not the same.

.

If you want to perform a Fuzzy search for duplicates within the same table, simply reference only that table when setting up the search parameters. For instance, in this example we are looking for Fuzzy duplicates in the Org Name column of the IndProfile table.

.

I guess that's about it.

Fuzzy Lookup is one of those tools you really need to experiment with to get a feel for how helpful it can be.

Although it's not perfect, it has saved me hours of scripting and manual matching.

So before you burn out your eyeballs staring at two data sets again, take some time to play with Fuzzy Lookup. It may become your new best friend.

Hello all! Belated Happy New Year! I've been busy working on a new book and a few other projects. It's time to get back to blogging.

.

I recently saw this article explaining the difference between BI Analysts and Data Scientists.

On the surface, this article highlights the operational differences between BI Analysts and Data Scientists. But at the core, it's a view into how the BI industry has changed over the last few years. Here's a graphic from that article.

.

As I look at this graphic, I personally recognize many of the tasks outlined here.

I can remember a time when I felt the needs of my customers change from simple trending dashboards to more predictive analytics. In the last few years, many Excel analysts have been asked to do more Data Science'y kind of stuff. Stuff like:

Analyze the correlation between key demographic data and transactional metrics such as number of purchases, purchase amounts, and frequency of visits

Score customers based on customer satisfaction levels, likelihood to recommend, and capacity utilization

Analyze association rules; for example, if Customer A buys Product A, we can predict (with 90% confidence level) that there is an 85% likelihood that this customer will buy Product B.

Predict outcomes based on certain independent variables found through regression analysis; for example, a machine's likelihood to break down given a number of interrelated variables.

.

In the past, business intelligence primarily consisted of dashboards showing historic data on an agreed upon set of key performance metrics. Organizations today are demanding more dynamic predictive analysis, the ability to iteratively perform data discovery, and the freedom to take the hard left and right turns on data presentation. These managers often turn to Excel analysts to provide the needed analytics and visualization tools.

.

Over the new few months, I'd like to focus my blog posts here on some of the more advanced Data Science'y kind of stuff that we'll all be asked to do.

.

This leads me to the Excel Power BI Boot Camp I'm hosting with Bill Jelen (Mr. Excel) in Dallas from May 20th – May 22nd.

.

This 3-day event is aimed squarely at Excel analysts who find it increasingly necessary to:

Analyze large amounts of data and report those results in a meaningful way

Get better visibility into data with new Data Mining tools

Perform advanced Clustering and Market Basket analytics

Add interactive controls to your dashboards without VBA

Automate repetitive data cleansing and transformation tasks

Create eye-catching visualizations and Dashboards

Create map-based dashboards

Access external data sources to expand your message

.

Bill and I only have 25 seats left in the class. If you sign up before March 1st, you'll get a $100 discount off registration.

.

I'll be back tomorrow with an exciting new series of posts on the SQL Server Data Mining Add-in.

I didn't want to finish out the last full work week of the year without a blog entry.

So let me bring up something that has been bothering me for quite some time.

Take a look at these two statements, then point to the one that looks right to you.

.

If you're American, you probably chose the statement on the right. If you're English, I bet you chose the statement on the left. In researching articles for some of my statistics posts, I often encounter variations on how the abbreviated word for mathematics is used. American authors use the word math as in – "Do the Math". The English use the word maths as in – "Do the Maths".

.

As a proud overweight American, I find it odd to say "Do the Maths".

What's bothers me the most is the seemingly arbitrary plural designation given to the abbreviation for mathematics.

.

Here is the way I see it. Mathematics is not a pluralized word for Mathematic. You don't have 1 Mathematic and 5 Mathematics. You simply have Mathematics. In fact, the word Mathematics (the overarching name for the study of all things mathematical) is used solely in singular verb forms.

You would say "Mathematics is useful in business".

You would not say "Mathematics are useful in business".

.

Even the English agree.

They say "Maths is fun".

No one says "Maths are fun"

.

That's right, even though there is an S in Maths, it's not treated as a plural.

It's obvious that the English use the abbreviated term "maths" (with an S at the end) to account for the s in Mathematics. Ok, I see that. So the question now is why is there an S at the end of Mathematics in the first place? After all, we don't say Literatures or Musics.

.

Well, according to Wikipedia, The most probable reason is that the origin of the word Mathematics is the Greek word mathematika. In Greek, when a word ends in the letter A, that typically denotes that the word is plural. So when mathematika was being translated to English, the letter S was added to the end.

I checked my email today and saw a bevy of emails from friends and clients claiming that their workbooks broke over-night. When one person contacts me saying something is broken, my general attitude is "they're on crack". But when I get a flood of emails, that's a horse of another color.

.

Apparently, a recent Office Update introduced security "fixes" that break any workbook or code involving ActiveX Controls. I have to admit, I didn't even notice it. I guess my workbooks are sorely lacking in ActiveX Controls.

Most of us know that you can embed objects like Word Files, Outlook Messages, and even other Excel Files. Here's an example of a Word object I embedded in my spreadsheet. My customers can double-click on this object to open the Word document I've embedded. This is cool, but it sure does come out ugly. I'd like to make it more attractive.

.

.

Here are some steps I can take to make a better looking Embedded Object.

Click Insert->Object to open the Object dialog box.

In the Object dialog box, go the Create from File tab and click the Browse button to find the file I want embedded (I selected a document on the mating habits of koalas).

Click the Display as Icon check box.

.

Click the Change icon button.

In the Change Icon dialog box, select the icon that is just a blank space, and then clear the caption input so that no caption will be shown.

.Don't see a blank icon option?

You can click the Browse button on the Change Icon dialog box and choose any strange file type that doesn't have associated icons. My go-to file is the bootstat.dat file under the Windows directory (c:\windows\bootstat.dat). Again, you can pick any strange file type..

..

After pressing OK, I see an empty square. I can click on that square and go to the Format tab. On the Format tab, I click Shape Fill->Picture.

..

I can select a picture I want to use instead of that lame Word icon. I've got a cool picture of a Koala. I choose that and press the Insert button..

.

And voila! I've got a nice looking embedded object. When I double-click the object, my document on the mating habits of koalas opens.

.

Of course, you don't have to use pictures of over-sexed koalas.

You can use this technique to display your own application icons, info-graphics, or any other image that fits the theme of your report or dashboard. These embedded objects are great for distributing different kinds of files with your Excel workbook. You can use these to include things like help files, back-up documentation, change logs, etc.

Over the last few days, I've been fighting an issue with the comments on this blog.

Apparently, the spam robots have been posting over 25 comments per second on my blog, causing time outs and overloads. I'll never understand why anyone would think this practice of shot gunning comments would be an effective way of driving revenue.

.

In any case, I had to do some back-end cleanup of my WordPress SQL database and install a new Captcha mechanism to hopefully curb the spam bots.

.

I typically don't like Captcha tools, but I found one that fits the irreverent style of this blog

If and when you enter a comment, you'll need to solve a small puzzle.

Simply look at the picture on the right and click the image on the left that best corresponds with the picture.

No posts this week, but here's a gift for you. I made a Thanksgiving GIF you can share with the kids. I used a bunch of Excel shapes to make a cozy Thanksgiving movie. Who says Excel doesn't appeal to kids?