This workaround, using Thunderbird, allowed me to successfully remove 30,000 duplicate emails (from a collection of about 80,000 emails) in OS X's Mail.app. I spent a lot of time researching this question, and this is the only solution I found that worked.

My Mail.app emails got rather out of hand; I won't bore you with how. I had most emails at least twice, and some up to five times. I tried Andreas Amann's Mail Script for this, but, even though it was working OK, it only found about 200 duplicates in three hours, and was cooking my CPU at about 95%. There was no way that this could go on, so I cancelled the process. (Thanks anyway, AA.)

I looked at importing into Entourage, because there are some scripts to eliminate duplicates from there, but for reasons I shall not bore you with here, this proved a dead end.

The solution turned out to be the amazing add-on for Mozilla's Thunderbird called Remove Duplicate Messages (ALTERNATE). I had to use version 3.0 of Thunderbird, not the current version 3.01 (so thanks to those reviewers who reported that the add-on version 0.3.3 was failing with Thunderbird 3.01). I found the older version on the Thunderbird releases page. Below is the process I used. (It took several hours, because of the number of emails. Ironically, every stage except the actual analysis of duplicates takes ages. This makes using Thunderbird permanently quite tempting.)

Note: The following process needs to be done for each folder that resides in the On My Mac section of Mail. (I will have more to say about that soon.)

Use Mailbox » Archive Mailbox in Mail.app to create a proper mbox export file of the mail folder you want to de-duplicate. (The mailboxes that Mail keeps in the Mail folder inside your user's Library, with their extension .mbox, are not real mboxes.)

Follow these instructions to import your newly-created mbox file quickly and easily into Thunderbird. I think my use of Path Finder (instead of Finder) helped a bit here. After I did this, I did wait a long time for the Spotlight indexing in Thunderbird to finish -- not sure whether this was necessary, but I suspect it probably was.

Install the add-on mentioned above. Set its prefs for email matching criteria (in Thunderbird » Prefs » Manage Add-Ons) according to what works for you. I ran some tests with a small collection of dupes until I had these right. What worked for me was ticking Message ID plus a few others, but unticking Size, Lines, CC and Body. This did an excellent job of correctly collating the two to five copies of each email.
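To illustrate what matching on a chosen set of criteria means, here is a minimal Python sketch. It is not the add-on's code. The header list is an assumption: Message-ID is from the settings above, while From, Subject and Date are only stand-ins for the "few others". Size, line count, CC and body are deliberately left out of the key.

```python
from email import message_from_string

# Assumed field list: Message-ID is taken from the text above; From, Subject
# and Date are hypothetical stand-ins for the other ticked criteria.
MATCH_HEADERS = ("Message-ID", "From", "Subject", "Date")


def match_key(msg, headers=MATCH_HEADERS):
    """Build a tuple of trimmed, lower-cased header values for one message."""
    return tuple(str(msg.get(name) or "").strip().lower() for name in headers)


def group_duplicates(messages, headers=MATCH_HEADERS):
    """Return the groups of messages whose keys occur more than once."""
    groups = {}
    for msg in messages:
        groups.setdefault(match_key(msg, headers), []).append(msg)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    raw = "Message-ID: <1@example>\nFrom: a@example\nSubject: Hi\n\nbody\n"
    copies = [message_from_string(raw), message_from_string(raw)]
    print(len(group_duplicates(copies)))  # number of duplicate groups found
```

Because CC and the body are not part of the key, two copies that differ only there still land in the same group, which is the behaviour the unticked boxes ask for.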

Move the dupes to a chosen folder (e.g. trash). With each of my folders of about 40,000 emails, I had to wait roughly 30 seconds for the dialog box to appear. Then it took just two minutes to move about 13,000 dupes to the 'unneeded duplicates' folder I created. Amazing.

I did hit one snag, because I had 80,000 emails in a single folder in Mail. (I had moved them into one folder in the hope that I could run Andreas Amann's script all night. Like the Thunderbird add-on, the Mail script searches for dupes inside one folder at a time.)

Mail.app (on both my attempts) only created an archive mbox of about 4.3GB, which comprised some 43,000 emails; the remaining 36,000+ didn't make it into the archive! Luckily, I found when I had imported these into Thunderbird that they were in date order, and that no emails were dated earlier than 27Oct07. So I went back to Mail.app and created a new folder, then dragged all the pre-27Oct07 emails (the remaining 36,000+) into it, and created another Archive mbox. I then repeated the import process and everything worked.
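A quick way to notice a truncated archive like this is to count the messages in the exported file and look at its date range before importing it anywhere. The following is a rough Python sketch using the standard mailbox module; the function name summarize_mbox is made up for illustration, and messages with a missing or unparseable Date header are counted but ignored for the date range.

```python
import mailbox
from datetime import datetime, timezone
from email.utils import mktime_tz, parsedate_tz


def summarize_mbox(path):
    """Return (message_count, earliest_date, latest_date) for an mbox file."""
    count = 0
    stamps = []
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            count += 1
            parsed = parsedate_tz(str(msg.get("Date", "")))
            if parsed is None:
                continue  # no usable Date header; still counted above
            try:
                stamps.append(mktime_tz(parsed))
            except (ValueError, OverflowError):
                continue  # date fields out of range
    finally:
        box.close()
    if not stamps:
        return count, None, None

    def as_utc(stamp):
        return datetime.fromtimestamp(stamp, tz=timezone.utc)

    return count, as_utc(min(stamps)), as_utc(max(stamps))
```

If the count is well below what Mail shows for the folder, or the earliest date is later than expected, the export was probably cut short and the folder needs splitting as described above.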

Interestingly, it took close to half an hour in Mail.app for my MacBook Pro (Core2Duo at 2.2GHz) to even select those 36,000 emails -- be patient while the wheel spins! It then took nearly another hour to move them to the new folder and index them and their 6,000 attachments. (Again, I'm not sure that I really needed to wait for that indexing, but whatever.) Still, it was all wonderfully stable.

So, as the last step, I have reimported everything into Mail.app (File » Import mailboxes » files in mbox format). But I'm thinking of experimenting with Thunderbird as well, given the third-party geniuses who write extensions for it. It seems very zippy.

Finally, I get tired of reading blogs that tell me I don't need past emails. In my job (I'm a senior high school English teacher), I need them all the time. I write substantive replies to student X's questions, then rehash them for student Y, maybe years later. (It's much more complicated than that even, but you see my point!) One day I will delete the thousands I don't need, too.

I use Mac OS X's Mail app in conjunction with Gmail, and Gmail automatically takes care of duplicates. When an email is labeled in Gmail, it can show up more than once in Mail because the copies sit in separate folders. I guess Mail then does have duplicates, but those are duplicates that you wanted anyway.

tell application "Mail"
    activate
    repeat 500 times
        set theSelection to selection
        set theMessage to item 1 of theSelection
        set subj to subject of theMessage
        set recip to the recipients of theMessage
        set dats to date sent of theMessage
        set datr to date received of theMessage
        set theid to the message id of theMessage
        set siz to message size of theMessage
        tell application "System Events"
            key code 125 -- down arrow
        end tell
        set messagechanged to false
        repeat until messagechanged
            delay 0.25
            set theSelection2 to selection
            if (the (count of theSelection2) is equal to 0) then
                --display dialog "empty selection"
                set messagechanged to true
                say "skipping message"
            else
                set theMessage2 to item 1 of theSelection2
                set subj2 to subject of theMessage2
                set recip2 to the recipients of theMessage2
                set dats2 to date sent of theMessage2
                set datr2 to date received of theMessage2
                set theid2 to the message id of theMessage2
                set siz2 to message size of theMessage2
                if (theid2 is equal to theid and siz2 is equal to siz) then
                    tell application "System Events"
                        keystroke "x" using {command down}
                    end tell
                else
                    set messagechanged to true
                    beep
                end if
            end if
        end repeat
    end repeat
end tell
tell application "Script Editor" to activate

(*
Select Dups
Devin Bayer (http://t-0.be) - 2010
To use:
1. Select messages in Mail.app
2. Uncheck "Organize by Thread"
3. Run this script
4. Only duplicate messages will be selected
---- Performance ----
On my MacBook Pro 2.3GHz, I can scan
about 5000 messages a minute.
Only run this script using AppleScript Editor.
When run standalone or in Mail.app, the speed
(and CPU usage) is drastically reduced
---- Notes ----
If you want to use mail while this script is running,
please create a second message viewer window to work in.
*)
using terms from application "Mail"
    -- track the duplicate messages
    set dups to a reference to {}
    global dups
    set view to first message viewer of application "Mail"
    global view

    on progress(txt)
        display dialog txt ¬
            giving up after 1 with icon note
    end progress

    -- return a list of emails in msg
    to rcpt(msg)
        set emails to {}
        repeat with email in recipients in msg
            copy address of email to the end of emails
        end repeat
        return emails
    end rcpt

    -- compare two messages
    to compare(l, r)
        if r = none or l = none ¬
            or message size of l ≠ message size of r ¬
            or subject of l ≠ subject of r ¬
            or my rcpt(l) ≠ my rcpt(r) ¬
            then return false
        -- l and r are equal; mark one as a dup
        --set background color of r to red
        --set flagged status of r to true
        copy r to end of dups
    end compare

    -- set selected messages to dups
    on finish()
        if (count of dups) < 1 then return true
        try
            set selected messages of view to dups
        on error number -1712
            display alert "TIMEOUT"
            return false
        end try
        return true
    end finish

    my progress("retrieving list of selected messages")
    set sort column of view to size column
    set msgs to get selected messages of view
    set total to count of msgs

    -- initialize state
    set prev to none
    set pos to 0
    set failed to 0

    -- scan every message
    repeat with msg in msgs
        try
            with timeout of 1 second
                compare(msg, prev)
            end timeout
        on error number -1712
            set failed to failed + 1
        end try
        set prev to msg
        -- progress dialog
        if pos mod 10000 = 0 then
            my progress("Processing message " & pos & " of " & total)
        end if
        set pos to pos + 1
    end repeat

    set ok to false
    repeat while not ok
        display dialog "Done scanning! total: " & total ¬
            & " timeouts: " & failed ¬
            & " dups: " & (count of dups) ¬
            & ¬
            " (click OK to select dups)" giving up after 60
        set ok to my finish()
    end repeat
    return true
end using terms from

It occurred to me that this would be best handled using a rule action. That way, messages could be tested automatically as they arrived, as well as in bulk. The rule action script looks like this (untested):

using terms from application "Mail"
    on perform mail action with messages theMessages for rule theRule
        tell application "Mail"
            repeat with thisMessage in theMessages
                set theAccount to account of mailbox of thisMessage
                tell theAccount
                    tell mailbox "INBOX"
                        set theDupList to (every message whose ¬
                            message size = message size of thisMessage ¬
                            and subject = subject of thisMessage ¬
                            and recipients = recipients of thisMessage)
                        if (count of theDupList) > 1 and thisMessage ≠ last item of theDupList then
                            set background color of thisMessage to purple
                            -- delete thisMessage
                        end if
                    end tell
                end tell
            end repeat
        end tell
    end perform mail action with messages
end using terms from

Because this is an untested version, I have it set to mark the emails in purple, and I've commented out the delete line (ultimately you would want to uncomment that and delete emails automatically, but test first to make sure the script works as you want). This script takes the current email, checks whether any other emails in the INBOX of the same account have the same message size, subject line, and recipients, and marks/deletes the current email if there are duplicates (unless it is the oldest of the matching emails; the oldest is preserved so that at least one copy remains).
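The keep-the-oldest rule can be sketched outside Mail in a few lines of Python. This is only an illustration of the logic, not a translation of the AppleScript. The Msg record and its received field are hypothetical, and the sketch compares received times explicitly instead of relying on the order in which Mail returns matching messages.

```python
from collections import namedtuple

# Hypothetical stand-in for a Mail message; only the compared fields are modelled.
Msg = namedtuple("Msg", "size subject recipients received")


def dup_key(msg):
    """Key on size, subject and recipients, as the rule script does."""
    return (msg.size, msg.subject, tuple(sorted(msg.recipients)))


def duplicates_to_mark(messages):
    """Return every message that has an older copy with the same key."""
    oldest = {}
    for msg in messages:
        key = dup_key(msg)
        if key not in oldest or msg.received < oldest[key].received:
            oldest[key] = msg
    # Everything except the oldest copy in each group is a candidate for marking.
    return [msg for msg in messages if oldest[dup_key(msg)] is not msg]
```

A message that is the only one with its key is its own oldest copy and is never returned, so at least one copy of every email survives.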

To use this, copy the script into the AppleScript Editor and save it as a script file. In Mail, set up a rule whose action runs this script. The script will then run automatically on incoming emails, or you can run it on a given message or an entire mailbox using the Message -> Apply Rules menu item.

Assuming mbox format (such as used on some IMAP servers, or produced by "Archive Mailbox" in Mail.app), save this as something like /usr/local/bin/mbox-removedup.sh
then run it on the mbox in a terminal:
sh /usr/local/bin/mbox-removedup.sh ~/Inbox

The "formail" program is a part of the "procmail" package, which dates back to 1990. Since OS X is based (in part) on Unix, procmail is something that has been included by default as part of the operating system for as long as I can remember.

Note that this example will delete duplicates based exclusively on the content of the Message-id: header, which is SUPPOSED to be unique for each and every message, but in some cases is not. This is part of why the procmailex man page suggests keeping a "duplicates" mailbox which you can go through manually to see if there are any messages which were mistakenly believed to be duplicates. This is also part of why other AppleScript examples you may have seen will check more than just the value of the Message-id: header.
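As a rough illustration of the same idea in Python rather than formail (an assumption on my part, not the script referred to above), the sketch below splits an mbox by Message-ID and puts repeats into a separate "duplicates" mailbox for manual review instead of deleting them. The function name split_duplicates and the three paths are hypothetical; the source file is only read, never modified.

```python
import mailbox


def split_duplicates(src_path, keep_path, dup_path):
    """Copy first-seen messages to keep_path and repeated ones to dup_path.

    Returns (kept, moved). Messages without a Message-ID are always kept,
    because there is nothing reliable to compare them on.
    """
    seen = set()
    kept = moved = 0
    src = mailbox.mbox(src_path, create=False)
    keep = mailbox.mbox(keep_path)
    dups = mailbox.mbox(dup_path)
    try:
        for msg in src:
            msg_id = str(msg.get("Message-ID", "")).strip()
            if msg_id and msg_id in seen:
                dups.add(msg)  # repeat: set aside for manual checking
                moved += 1
            else:
                if msg_id:
                    seen.add(msg_id)
                keep.add(msg)
                kept += 1
        keep.flush()
        dups.flush()
    finally:
        src.close()
        keep.close()
        dups.close()
    return kept, moved
```

Called as split_duplicates("Inbox", "Inbox.dedup", "Inbox.dups"), it leaves the original untouched, so a wrongly matched message can always be recovered from the duplicates file.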