Finding Duplicate Points in a Shapefile

[Editor's note: When building the 6,600 cities for Natural Earth vector, we ended up with 6 more town spots than town labels. This is bound to happen on larger projects. One could take the halving approach: select half of the features and check whether the number of symbols matches the number of text objects; if it does, skip that half, and if not, subdivide in half again and reevaluate. Or, if you use MAPublisher with Illustrator and/or Vectorworks to export a SHP file, you can open the DBF in Excel and use the COUNTIF function and conditional formatting to quickly identify the exact features to resolve. By sorting the resulting "true" and "false" columns on lat, long, and feature name, we can quickly evaluate whether multiple features share the same geographic location and compare their names. If the names are the same, assume one is a duplicate and remove it.]
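For readers who would rather script the check than do it in a spreadsheet, here is a minimal Python sketch of the same idea: group features by location, then compare names within each group. The field names (name, lat, long) and the sample rows are hypothetical stand-ins for whatever the exported DBF actually contains, and reading the DBF itself is left out.

```python
from collections import defaultdict


def find_colocated(records):
    """Group records by (lat, long) and keep locations with more than one feature."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["lat"], rec["long"])].append(rec["name"])
    return {loc: names for loc, names in groups.items() if len(names) > 1}


# Hypothetical sample rows standing in for the exported attribute table.
towns = [
    {"name": "Springfield", "lat": 39.80, "long": -89.64},
    {"name": "Salem", "lat": 44.94, "long": -123.04},
    {"name": "Springfield", "lat": 39.80, "long": -89.64},
]

colocated = find_colocated(towns)
for loc, names in colocated.items():
    # Same location and same name: probably a duplicate. Different names: look closer.
    same_name = len(set(names)) == 1
    print(loc, names, "likely duplicate" if same_name else "check names")
```

This matches coordinates exactly, so points that differ by a tiny rounding error will not be grouped; rounding the coordinates before grouping is one possible adjustment.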

You can locate duplicates in a range of data by using conditional formatting and the COUNTIF function. Here are the details on how to make that work.

Set up the first conditional formatting formula

I’ll start by setting up a conditional format for the first data cell. Later, I’ll copy that conditional format for the whole range.

In my example, cell A1 contains a column heading (Invoice), so I will select cell A2, and then click Conditional Formatting on the Format menu. The Conditional Formatting dialog box opens. The first box contains the text, Cell Value Is. If you click the arrow next to this box, you can choose Formula Is.

After you click Formula Is, the dialog box changes appearance. Instead of boxes for between x and y, there is now a single formula box. This formula box is incredibly powerful. You can use it to enter any formula that you can dream up, as long as that formula will evaluate to TRUE or FALSE.

In this case, we need to use a COUNTIF formula. The formula to type in the box is:

=COUNTIF(A:A,A2)>1

This formula says: Look through the entire range of column A. Count how many cells in that range have the same value as cell A2. Then, compare to see if that count is greater than 1.

When there are no duplicates, the count will always be 1; because cell A2 is in the range, we should find exactly one cell in column A that contains the same value as A2.

Note: In this formula, A2 represents the current cell, that is, the cell for which you are setting up the conditional format. So, if your data is in column E and you are setting up the first conditional format in cell E5, the formula would be =COUNTIF(E:E,E5)>1.
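To make the logic of the formula concrete, here is a rough Python sketch of what =COUNTIF(A:A,A2)>1 computes when it is applied to every cell in the column. The function name and the sample invoice numbers are made up for illustration. One known difference: COUNTIF ignores case when matching text, while this sketch compares values exactly.

```python
from collections import Counter


def flag_duplicates(values):
    """Return one flag per value: True if the value occurs more than once."""
    counts = Counter(values)  # how many times each value appears in the column
    return [counts[v] > 1 for v in values]


# Hypothetical invoice numbers standing in for column A.
invoices = ["1001", "1002", "1001", "1003"]
flags = flag_duplicates(invoices)
print(flags)  # [True, False, True, False]
```

Every occurrence of a repeated value is flagged, not just the second one, which is the same behavior the conditional format gives you once it is applied to the whole range.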

Choose a color to highlight duplicated entries

Now it is time to select an obnoxious (that is, obvious) format to identify any duplicates that are found. In the Conditional Formatting dialog box, click the Format button.

Click the Patterns tab and click a bright color swatch, like red or yellow. Then click OK to close the Format Cells dialog box.

You will see the selected format in the preview box. Click OK to close the Conditional Formatting dialog box, and…

Nothing happens. Wow. If this is your first time setting up conditional formatting, it would be really nice to get some feedback here that it worked. But, unless you are lucky enough that the data in cell A2 is a duplicate of the data in some other cell, the condition is FALSE and no formatting is applied.

This entry was posted on Wednesday, September 2nd, 2009 at 7:00 am and is filed under Best practices, Mapping, php scripting, Software.

One Response to “Finding Duplicate Points in a Shapefile”

You can also sort your field and use the EXACT function to compare the value in a cell to the value above or below it, or to a cell in a different field. It returns TRUE or FALSE. Copy the formula down your entire range and look for the TRUE results.
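The sort-and-compare approach can be sketched in Python as well. This is only an illustration with made-up names: it sorts the values, then flags each one that exactly matches the value just before it, which is roughly what copying an EXACT formula down a sorted column does.

```python
def adjacent_matches(values):
    """Sort the values, then flag each one that equals the value before it."""
    ordered = sorted(values)
    # The first sorted value has nothing above it, so it is never flagged.
    flags = [False] + [ordered[i] == ordered[i - 1] for i in range(1, len(ordered))]
    return list(zip(ordered, flags))


# Hypothetical feature names standing in for one sorted field.
names = ["Salem", "Springfield", "Austin", "Salem"]
result = adjacent_matches(names)
for name, is_repeat in result:
    print(name, is_repeat)
```

Unlike the COUNTIF approach, only the second and later copies of a value are flagged here, so the first occurrence stays unmarked.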