Removing Duplicate Data in Tableau Prep

Question

Duplicate data can occur in data for different reasons, and the best way to resolve the duplicates will vary. Below are the scenarios covered in this article:

When joining two tables that have a 1:many or a many:many relationship, then the measures values of the output are duplicated.

When unioning two or more tables, some records may be repeated between tables and thus are duplicated in the output.

Environment

Tableau Prep

Answer

CLICK TO EXPAND STEPS

Scenario 1: Duplicated Measures from a Join

In the Tableau Prep workbook "duplicates from join.tfl", the "January Inventory" table lists each product once with its total inventory. The "January Orders" table may have many rows for a product if multiple orders were for the same product. After joining the "Sales" and "Inventory" table, the measure [Inventory] has been duplicated. The first flow demonstrates the issue.

Option 1: Use a Union instead of a Join

Note: None of the values of [Inventory] will be directly related with any of the data from "January Orders". This may cause issues when creating calculations without aggregations, or in views that show a higher level of granularity then the field(s) from the join clause, which would be [Product] in this example

Option 2: Create a Calculation to Evenly Distribute Measure Values

We can use an aggregate step to find the number of orders per product and then divide inventory by the number of orders. Then final result will be that an equal portion of the inventory will be represented for each order. For example, there are 2 orders for hats. [Inventory] will be 100 for each hat order. [New Inventory] will be 50 for each order.

Please note: this may cause inaccurate inventory information in a view if only some orders for a product are filtered out.

Option 3: Use a FIXED expression in Tableau Desktop

The advantage of using a FIXED expression in Tableau Desktop is that level of detail (LOD) expressions have special aggregation rules. Therefore, a de-duplicated inventory field using FIXED could show the full, unduplicated inventory for a product whether there was only one order from that product in the view, or multiple.

CLICK TO EXPAND STEPS

Scenario 2: Duplicated Data after a Union of Data Sources with the Same Data

In the Tableau Prep workbook "duplicates from union.tfl", the "January Orders" is the first report of orders in January. The "January Orders (corrections)" table is a report released later with some new orders, but also corrections on orders that already exist in "January Orders". When the two tables are unioned, all rows from both tables are kept, which creates duplicated measures.

The first flow in "duplicates from union.tfl" demonstrates this issue.

Option 1: Keep all Data from one Table and Non-Matching Data from the Other Table

For this example, we want all of the data from "January Orders (corrections)" and only the orders from "January Orders" that do not exist in "January Orders (corrections)".

Create a new join between "January Orders" and "January Orders (corrections)"

Merge all matching fields, except the date fields and the measure values:

Click on the field whose values should supersede the other in the Profile Pane

Ctrl+click on the matching field

Right-click either field and select Merge Fields

When fields are merged, Tableau Prep does not keep both values (if there are two different values), but rather keeps the value from the field selected first if there is one.

For example, Ms. Horse's name was accidently recorded as Ms. House in the "January Orders" table and her correct name appears in "January Orders (corrections). When merging [Customer] and [Customer-1] I want to select [Customer-1] first.

You can verify which field will supersede the other from the name of the merged field. The name be formatted like [first field name & second field name]. The field name listed first will be the field that will supersede the other.

If the fields were merged in the wrong order, the merge can be undone by right-clicking the merge icon above the name of the merged field in the Profile Pane and selecting Remove.

Click the Create Calculated Field… button

In the Calculated Field dialog box that opens, do the following, and then click OK:

Name the calculated field. In this example, the calculated field is named "Most Recent Date"

In the formula field, create a calculation similar to the following:

IF [Date] > [Date-1]
THEN [Date]
ELSE IFNULL( [Date-1], [Date] )
END

The above calculation will return [Date] if it is "larger", or in the case of dates more recent, than [Date-1]. Otherwise the calculation will return [Date-1].

If either field is NULL, then the first condition will always be false. The IFNULL() function replaces [Date-1] with [Date] when [Date-1] is NULL. This means the calculation will always return a non-NULL value if there is one.

Remove [Date] and [Date-1]

Create a calculated field with a name like "New Cost" with a calculation similar to the following:

Option 3: Every Duplicate Should be Treated Differently

Merge all fields that either a) have the same values in both tables, or b) the values from one table should always supersede

Click the Create Calculated Field… button

In the Calculated Field dialog box that opens, do the following, and then click OK:

Name the calculated field. In this example, the calculated field is named "New Cost"

In the formula field, create a calculation similar to the following:

CASE [Order ID-1 & Order ID]
WHEN 5 THEN [Date]
WHEN 9 THEN [Date]
ELSE IFNULL([Date-1],[Date])
END

The above calculation will read through every value of [Order ID-1 & Order ID] and return the specified date field. To make things simpler, I only specified which IDs should return [Date] and then everything else will return [Date-1], or [Date] if [Date-1] is NULL, due to the ELSE statement.

The [Order ID-1 & Order ID] value is used in the CASE statement because this field will have a unique value for every row of data.

This also means that any new IDs added will return [Date-1] when this flow is re-run in the future.

Remove [Date] and [Date-1]

Repeat steps 4-6 for all fields where duplicates need to be handled on an individual basis

Did this article resolve the issue?

Thank you for providing your feedback on the effectiveness of the article.