Search Best Practices: Avoid Duplicates and Near-Duplicates

Last Updated Dec 2009

By: Miles Kehoe

One characteristic of enterprise search is that users are often searching for a single document that they know exists, whether it is a proposal, a policy, or even the latest annual report. Yet, because the intranet makes it easy to share documents among co-workers, some content may be repeated two, three, or even dozens of times. And then there’s the issue of ‘near duplicates’ that can be caused by storing different versions of the same document on file shares.

There are at least four ways duplicate and near-duplicate content can end up in your intranet search results. No matter what the cause, this kind of thing bugs customers, frustrates content owners, and really angers employees: why can’t our search just be more like Google?

Let’s look at the most likely causes, and some ideas about how to avoid them.

Duplicates

As is so often the case, there are a number of ways duplicate content can find its way into your result list. The obvious one is copies: you create a document and email it to your team, and they save it on their network drive. Your search engine crawls the file shares nightly, and by tomorrow, there are a number of duplicate entries for your document in the search result list. This is a tough situation for enterprise search engines to handle, because the URL of each copy is slightly different. Consider the files /data/share/markb/plan.doc and /data/share/milesk/plan.doc: if I’ve saved a copy of a file from Mark’s directory into mine, everything is identical except the URL. Unfortunately, search spiders typically use the URL as the unique key, so the search engine sees these as two unique documents. About the only way to address this kind of duplicate is to maintain a checksum for each document and treat documents with matching checksums as one; most commercial search technologies do not do this by default, so there may be some assembly required. You may also try the ‘brute force’ method of checking the result list for duplicates just before displaying it to the user – but this method has its own downsides, such as extra work on every query and hit counts that no longer match what is displayed.
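To make the checksum idea concrete, here is a minimal sketch in Python. It is not how any particular search product does it: the function names, the (url, bytes) input shape, and the "url" and "checksum" keys on result hits are all assumptions made for illustration. It shows both approaches mentioned above: grouping documents by checksum at index time, and filtering a result list just before display.

```python
import hashlib


def content_checksum(data: bytes) -> str:
    """Return a SHA-256 hex digest of the raw document bytes."""
    return hashlib.sha256(data).hexdigest()


def index_by_checksum(documents):
    """Group URLs by content checksum.

    `documents` is an iterable of (url, raw_bytes) pairs, a stand-in for
    whatever your crawler hands to the indexer (an assumption).
    """
    groups = {}
    for url, data in documents:
        groups.setdefault(content_checksum(data), []).append(url)
    return groups


def dedupe_results(results):
    """'Brute force' variant: drop repeated checksums from a result list.

    `results` is a ranked list of dicts with hypothetical "url" and
    "checksum" keys; the first (highest-ranked) hit per checksum is kept.
    """
    seen = set()
    unique = []
    for hit in results:
        if hit["checksum"] not in seen:
            seen.add(hit["checksum"])
            unique.append(hit)
    return unique
```

Keep in mind that a byte-level checksum only catches exact copies. If someone opens the copy and re-saves it, the bytes will usually change even when the visible text does not, and the two files will no longer match.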

A less obvious way you can see duplicate content is a function of both how you’ve designed your web site and how you’ve got your web server configured. Consider two URLs: http://www.contoso.com/index.aspx and http://www.contoso.com/: chances are, they both display the same content. Depending on your web server, you may have defined these two URLs as aliases for the same page. If you specify the first as the starting URL for your spider, the content will be indexed using that full URL. However, if somewhere in your site you have a hyperlink back to ‘Home’, the spider will treat the page as new, because the URL differs from the original starting URL: it doesn’t include index.aspx. Instant duplicate content.

Sometimes your web server is configured to use a ‘re-direct’ between these two pages. When a user – or a spider – requests the first URL above, the web server redirects the request to the second page. For most search engine spiders/crawlers, the redirect is sufficient to identify the two URLs as the same content. In this case, no duplicates.

There is one more potential source of duplicate content we should cover here, one which has actually been an issue for us on our original web site. Consider the Contoso example above, but imagine that somewhere, deep in the site, a developer gets lazy and codes a link back to the home page as http://contoso.com. The web server handles this link fine, so over time this ‘shortcut’ may even propagate widely. The problem is, your spider/crawler sees this in much the same way it sees the index.aspx page above. If your web server handles this as an alias, you’ll likely find a large number of duplicate pages; if it handles the URL as a redirect, you’re likely safe.
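If your crawler lets you rewrite URLs before they are used as the document key, one way to tame these aliases is to reduce every URL to a single canonical form. The sketch below is only an illustration built on Python’s standard urllib.parse module: the list of default document names and the host alias table are assumptions you would replace with whatever your own web server actually treats as equivalent.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration: which file names the web server serves as
# the directory default, and which host names are aliases of each other.
DEFAULT_DOCUMENTS = {"index.aspx", "index.html", "default.aspx"}
HOST_ALIASES = {"contoso.com": "www.contoso.com"}


def canonical_url(url):
    """Return one canonical form for URLs that point at the same page."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)
    if parts.port:
        # Keep an explicit port; user names and passwords are dropped.
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail.lower() in DEFAULT_DOCUMENTS:
        # /index.aspx and / are treated as the same page.
        path = head + "/"
    # The fragment is discarded because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))
```

With these assumed tables, the three home page variants discussed above all collapse to http://www.contoso.com/, so a crawler keyed on the canonical URL would store the page once. Only add aliases you have confirmed serve identical content; otherwise you risk hiding pages that really are different.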

Near Duplicates

As you’ve seen, handling duplicates can take some forethought when it comes to search technology, but it’s a problem that can be solved. A tougher problem is ‘near duplicates’ – documents that are almost identical but not quite the same.

Sometimes near-duplicates show up when you’ve gone through a number of revisions to a document. You start with ‘v0.01’ and end up with ‘V1.0’, but when your search crawler/spider visits your file share, it finds all of the different versions along the way. There’s probably no good way of handling this kind of problem, short of ‘don’t do that’. You may find that you can arrange for the crawler to avoid temporary directories, or to only index specific directories where you keep your finished documents. If this problem is widespread in your organization, it may be time to look into a good content management system like Oracle Content Management or Microsoft SharePoint.
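Most crawlers express include and exclude rules in their own configuration, so treat the following Python only as a sketch of the logic. The directory names are hypothetical: it assumes finished documents live under /data/share/published and that working copies sit in folders with names like drafts or temp.

```python
from pathlib import PurePosixPath

# Hypothetical layout: only these roots hold finished documents.
INCLUDE_ROOTS = [PurePosixPath("/data/share/published")]
# Hypothetical folder names that hold working copies or old versions.
EXCLUDED_DIR_NAMES = {"tmp", "temp", "drafts", "archive"}


def should_index(path):
    """Return True if the crawler should index the file at `path`."""
    p = PurePosixPath(path)
    # Rule 1: the file must live somewhere under an included root.
    if not any(root in p.parents for root in INCLUDE_ROOTS):
        return False
    # Rule 2: none of its parent folders may be on the excluded list.
    return not any(part.lower() in EXCLUDED_DIR_NAMES for part in p.parent.parts)
```

Under these assumptions, /data/share/published/plan.doc would be indexed, while /data/share/published/drafts/plan_v0.01.doc and anything in a personal share would be skipped.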

There is another source of near-duplicate documents that’s even less obvious. In most companies, we’ve found that most folks create new documents by starting with an existing one. You find the last proposal your workgroup has done; copy it over, probably renaming it; then you customize the content as required – company name, product, delivery details, etc. Now you’ve got a proposal for your new prospect; but you also have a nearly duplicate document.
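If you want to find out how much of this you have, one common technique (not something your search engine necessarily does for you) is to compare documents by their overlapping word sequences, often called ‘shingles’, and score the overlap with Jaccard similarity. The sketch below assumes you have already extracted plain text from each document; the shingle size and the 0.5 threshold are arbitrary starting points, not recommendations.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items over all items."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.5):
    """Flag two texts as near-duplicates when shingle overlap is high."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```

Comparing every document against every other gets expensive quickly on a large file share, so a check like this is better suited to auditing a sample of your content than to running inside the search engine on every query.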

This kind of ‘create-by-copying’ has an even darker side: incorrect metadata. The document you started with was authored by Bill Smith for XYZ Corporation; when you copy it over and customize it for QRSCorp, there’s a better than even chance you do not go into the document properties to update the author and prospect fields. We’ve worked with a law firm where every partner profile had the name of a single junior partner in the metadata; he was very popular on the search result page! This kind of problem is really best handled by content management systems; but in an ideal world you’ll at least try to help out your search engine by updating the metadata!

Action

Now you know some of the problems that can cause duplicate and near-duplicate content. Some user testing of your search engine can help you understand if it’s a problem you have; and we’ve given you a few ideas of how you might craft a solution.

If you have experienced other ways duplicate content gets into your search results, let us know: leave a comment, or email us. And if you have any questions about solving your problem, feel free to give us a call at +1 408-446-3460.