I have been posed a query which I think will require some 'out of the box' thinking, so I thought I'd come here for some help.

I don't have a lot of details yet but should have more when I return to the office.

There is a thought that someone at my client's may have been double-invoicing contractors and billing twice for the same work. It's been identified that some invoices have exactly the same description of the work but different invoice dates, numbers, etc.

I have been asked whether I can go through all the invoices and attempt to locate ones with duplicated descriptions. At this point I don't know whether they are Word, Excel, or non-searchable PDFs; that is my second issue.

But I'm trying to get my head around a way to use software to automate this rather than doing a hard manual search through thousands of invoices.

I have considered using X-Ways or Intella and searching for the full description. However, as there are multiple invoices and descriptions, it's not a case of a single description being used over and over; there are potentially hundreds of different descriptions, each of which may have been used only two or three times.

UltraCompare is a great little tool for comparing two or three files at a time, but that doesn't really save me much time when I have 10,000 invoices.

Is anyone aware of software that can scan and compare numerous documents, with the ability to filter the results based on, say, the number of matching words, or proximity matches?
I'm thinking of finding some way to identify files that have, say, more than 20 matching words all within 100 characters (or something like that).

I would have to change the parameters depending on how templated the invoices are, but you get the idea.
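For what it's worth, the "N matching words within a window" idea maps fairly naturally onto word-shingle comparison: break each document into overlapping runs of words and measure how many runs two documents share. A minimal sketch in Python (the shingle size and similarity threshold are placeholders you would tune):

```python
import itertools

def shingles(text, n=5):
    """Break text into the set of overlapping n-word runs ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a, b, n=5):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def near_duplicates(texts, threshold=0.5):
    """Return index pairs of texts whose similarity meets the threshold."""
    hits = []
    for (i, a), (j, b) in itertools.combinations(enumerate(texts), 2):
        if similarity(a, b) >= threshold:
            hits.append((i, j))
    return hits
```

Comparing every pair is O(n²), so for 10,000 invoices you would want to bucket by a cheap key first (e.g. invoice total or first shingle), but the scoring idea is the same.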

If I understand your problem, you are looking for near-duplicates among a very large number of files in multiple formats.

I think that software designed to help filter out documents before examination may help. One such example might be orcatec.com/; I am fairly certain that Orcatec can accept documents in many formats.
_________________
Michael Cotgrove
www.cnwrecovery.com
www.goprorecovery.co.uk

I would convert everything to text first, then extract the descriptions, part numbers, and service numbers, and dump them to a file with a reference back to the source file (one row per description, or some other reference).

I would try to massage the text into a "fixed" format: trim whitespace, lower-case everything, and so on.
Thereafter, grep for the content. If the normalization turns out well, maybe even pivot tables could be used.
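The normalize-then-group step above could be sketched like this in Python (the exact normalization rules are assumptions you would tune against the real invoice text):

```python
import re
from collections import defaultdict

def normalize(description):
    """Lower-case, collapse runs of whitespace, drop stray punctuation."""
    text = description.lower().strip()
    text = re.sub(r"\s+", " ", text)        # collapse whitespace
    text = re.sub(r"[^\w\s./-]", "", text)  # strip punctuation noise
    return text

def group_duplicates(rows):
    """rows: iterable of (filename, description) pairs.
    Return normalized descriptions that appear in more than one file."""
    groups = defaultdict(list)
    for filename, description in rows:
        groups[normalize(description)].append(filename)
    return {desc: files for desc, files in groups.items() if len(files) > 1}
```

Anything this flags is an exact duplicate after normalization; the near-miss cases would still need a fuzzier comparison on top.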

Once you can identify a pattern, this will help you build a search that can produce similar results.

Next, write regular expressions that cull the data and help you refine the similarities. Yes, they are a bit difficult to write, but they are consistently the most accurate and fundamental way to search text data.
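As a hedged example: if the text dumps turned out to contain label/value lines such as "Invoice No: ..." and "Description: ..." (pure assumptions about the layout, which you would check first), the pull-out expressions might look like:

```python
import re

# These patterns assume lines like:
#   "Invoice No: INV-00123"
#   "Description: Site clearance works"
# Adjust them to whatever the real invoices actually contain.
INVOICE_NO = re.compile(r"Invoice\s*No\.?\s*:\s*(\S+)", re.IGNORECASE)
DESCRIPTION = re.compile(r"Description\s*:\s*(.+)", re.IGNORECASE)

def extract_fields(text):
    """Return (invoice_number, description); either may be None if absent."""
    no = INVOICE_NO.search(text)
    desc = DESCRIPTION.search(text)
    return (no.group(1) if no else None,
            desc.group(1).strip() if desc else None)
```

Running this over every text dump gives you the rows (file, invoice number, description) that the grouping and grep steps work on.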

If you are not comfortable writing your own regular expressions, you can freelance this work out; just verify afterwards that they work correctly.

I would recommend starting with general search parameters and narrowing down from there.

For example: find all files with a .doc, .docx, or .pdf extension where the customer ID = XXX-XXX and the creation date is within +/- 30 days.
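That kind of first pass could be scripted. A sketch in Python, filtering on extension and a date window (the folder, anchor date, and window are placeholders; the customer-ID filter would come from the extracted text, and I use file modification time here because true creation time is filesystem-dependent):

```python
import datetime
from pathlib import Path

WANTED_EXTENSIONS = {".doc", ".docx", ".pdf"}

def candidate_files(root, anchor_date, window_days=30):
    """Yield files under root with a wanted extension whose modification
    time falls within +/- window_days of anchor_date."""
    lo = anchor_date - datetime.timedelta(days=window_days)
    hi = anchor_date + datetime.timedelta(days=window_days)
    for path in Path(root).rglob("*"):
        if path.suffix.lower() not in WANTED_EXTENSIONS:
            continue
        mtime = datetime.datetime.fromtimestamp(path.stat().st_mtime)
        if lo <= mtime <= hi:
            yield path
```

Starting broad like this and then tightening the parameters matches the narrowing-down approach above.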