Unexpected Results Grouping by Pronunciation or Common Characters

Published: 23 Apr 2018
Last Modified Date: 26 Jun 2018

Issue

When you use Group By > Pronunciation, some variant spellings that look like obvious matches (such as "Organization" and "Organisation") may not be grouped, while seemingly unrelated values may be grouped together, such as "International Organization for Migration", "International Labour Organization", and "International Civil Aviation Organization". This issue is more prominent when the data is in a language other than English.

Additionally, when you use Group By > Common Characters, some seemingly unrelated values may be grouped together, such as "Aberdeen", "Bandera", and "Branden".

Environment

Tableau Prep 2018.1.1

Resolution

Manually correct the automatic groupings: deselect values that should not be grouped, and select values that should be grouped.

Cause

Group by Pronunciation considers the pronunciation of only a prefix of the string (the prefix length depends on the length of the string), so long values that begin with the same words can receive the same key. It is currently reliable only for English. For more information on the algorithm underlying this feature, see Metaphone3 at Wikipedia.

Group by Common Characters uses a 1-gram fingerprint method to group similar values. You can read more about the algorithm at Clustering In Depth on GitHub. Because the fingerprint records only which characters occur, not their order or how often they appear, this method may group strings that seem totally different. For example, "Tom Smith" and "Simo Mitthos" both produce a 1-gram fingerprint key of "himost", so these two values will be grouped together. This grouping method is most useful when misspellings consist of the same set of characters in a different order, or when values such as "Tom Smith" and "Smith, Tom" should be considered the same.
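
The following Python sketch illustrates how a 1-gram fingerprint key can be computed and why unrelated values collide. It follows the general n-gram fingerprint steps described in Clustering In Depth (lowercase, strip punctuation and whitespace, collect unique n-grams, sort, join). It is not Tableau Prep's actual implementation, and the function names ngram_fingerprint and group_by_common_characters are hypothetical names chosen for illustration.

```python
import unicodedata
from collections import defaultdict


def ngram_fingerprint(value, n=1):
    """Return an n-gram fingerprint key for a string (illustrative sketch only)."""
    # Fold accented characters to plain ASCII where possible (assumed normalization step).
    folded = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Lowercase and drop everything that is not a letter or digit.
    cleaned = "".join(ch for ch in folded.lower() if ch.isalnum())
    # Collect the unique n-grams; with n=1 these are the distinct characters.
    grams = {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}
    # Sort and join, so character order and repetition no longer matter.
    return "".join(sorted(grams))


def group_by_common_characters(values):
    """Group values that share the same 1-gram fingerprint key."""
    groups = defaultdict(list)
    for value in values:
        groups[ngram_fingerprint(value)].append(value)
    return dict(groups)


if __name__ == "__main__":
    sample = ["Tom Smith", "Smith, Tom", "Simo Mitthos", "Aberdeen", "Bandera", "Branden"]
    for key, members in group_by_common_characters(sample).items():
        print(key, members)
```

With this sketch, "Tom Smith", "Smith, Tom", and "Simo Mitthos" all map to the key "himost", and "Aberdeen", "Bandera", and "Branden" all map to "abdenr", which matches the groupings described in this article.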
