Answered by:

Query a document library to find dupliates and remove them

Question

We are trying to remove duplicate documents from our document library. The duplicates will be pulled based on a couple column values that match among documents found within the library, Such as:

- Customer ID

- Document Date

- Document Description

This removal process needs to be done carefully and unfortunately the actual delete of the doc needs to be done manually, once a user determines it's a true duplicate.

So, I need to generate list of all the potential duplicate documents based on the criteria noted above so a user can then manually go through the list of documents (which would be sorted also by the criteria above) to determine if the documents
are, in fact, duplicates and then delete any copies leaving just one.

So Far,

I have a sql query that will return the duplciates from our doc library, but I haven't found a way to apply this within our SharePoint site. I tried an external list, but since the results are returned as an actual sharepoint list, the
user can't view or delete any documents from the doc library when going through the list. I looked at the Content Query Webpart, but from what I could tell I couldn't set up the query to compare column values among documents, I could just add a
filter value for a column.

Any thoughts on how I can accomplish this? Hopefully using an OOB web part!

You can use PowerShell to get all the items from the list and group them by one or more fields. I've created an example, below, that prints the names of all the documents in the library, then groups all "duplicate documents" based on the CustomerId
and DocumentType being the same. Finally, the script prints each duplicate document, along with it's listitem id. As Alex suggests, you could use the listitem ID to update a boolean field on the listitem to indicate it's a duplicate.

Thanks for the reply.. I looked at this option, but with using the group I can't have it return just the documents that are potential duplicates. It returns everything in our doc library (which is a lot) and sorts it based on the grouping.

I'd like to just have it return documents that match on Customer ID and Document Description and return nothing if one only document is found. A list of duplicates will be considerable shorter and easier for a user to go through.

I didn't see a way where I could further filter on group count to only display if it's greater than 1.

Firstly, going in through SQL isn't something you should be doing often. It's bad practice and can lead to stability/performance issues. Querying is bad enough but never write to it.

If you have to do it through the UI the way i'd do it is to create a yes/no column called 'duplicate' and a view that only shows items with 'Yes' selected.

I'd then use powershell to identify all the items that match your list of IDs/descriptions and set the duplicate column for those items. They will then show up in your view.

Once you've manually deleted your documents you can delete the column. For a longer term/nicer solution you could use PowerShell to do the identification and marking all as one step, without the need to go into SQL.

You can use PowerShell to get all the items from the list and group them by one or more fields. I've created an example, below, that prints the names of all the documents in the library, then groups all "duplicate documents" based on the CustomerId
and DocumentType being the same. Finally, the script prints each duplicate document, along with it's listitem id. As Alex suggests, you could use the listitem ID to update a boolean field on the listitem to indicate it's a duplicate.

I see where this apporach is helpful in finding the duplicates and using a column to mark a document as a dup so it can pulled and grouped into a view. However, if I use Powershell and update in this manner, is there a way to have PS run this
script so when the view is accessed by my user in the SharePoint UI, it will be refreshed so the user is looking at the most recent list? Or would this script need to be run separately and the user would work from a list until it's updated?
This realtime updating of the view was why I was intially thinking I had to link the view to a sql query (read only) to allow the view to refresh at the point of access (similar to how an external list works).. I agree it's not a great option, but I
couldn't see anohter option... So if the PS will get me there, I'm all for it.

Microsoft is conducting an online survey to understand your opinion of the Technet Web site. If you choose to participate, the online survey will be presented to you when you leave the Technet Web site.