Removing duplicates from SMS Backup & Restore XML backup

I've been using this app to back up all my SMS. Wanna take a guess at how many SMS messages I have backed up right now? 30,000? Go higher. 50,000? Higher. I have over 70,000 SMS entries in my backup. And that's after using this cleanup Python script I made to filter out duplicates!

The Problem

I use SMS B&R for backing up SMS messages when I'm moving between ROMs because every ROM change requires a data wipe. I like to keep all my messages archived and intact because I'm sentimental like that. Maybe I configured it wrong or something but it got to a point where the SMS entries got duplicated and appeared twice, four times, or eight times. Hmmm. Powers of two. Obviously a backup-and-restore syndrome. Doubling every time. Not only does this consume a lot of storage space but it also consumes a lot of screen space. When I check back on old SMS to retrieve information, I have to scroll longer than usual because the SMS thread is made much longer by one SMS entry appearing eight times. My message, 8 times. Your reply, 8 times. My reply to your reply, 8 times. I'm sure you get the flow.

The problem isn't in the restore process but probably in the backup process. I opened up the backup output XML file (which is neatly formatted by the way) and saw the duplicates in there. Yeahp. Must have been a configuration issue or app bug where I kept backing up SMS entries into an XML file that already contained them. Might have happened like this:

SMS A and B are in my messages

SMS B&R saves them into backup.xml

SMS B&R restores them into my messages

There are now two A & B SMS entries in my messages

Rinse & repeat

This is how it grows by a factor of two. I think there's a check-for-duplicates option during restore but I skipped it? I forgot. I haven't used SMS B&R since December 2013 because I haven't moved to a new ROM. I'm satisfied so far with ProBAM (now called AOSB). More on this ROM in another post. Anyway, that's the problem. Duplicates. Duplicates everywhere. Not just one extra copy but up to seven. *facepalm*
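For the curious, here's a tiny sketch of that doubling. It only models my guess above (backup saves everything on the phone, restore appends everything without checking for duplicates), not how SMS B&R actually works internally. The names phone and backup are made up for the illustration.

```python
# Toy model of the guessed backup/restore cycle, not SMS B&R's real behavior.
phone = ["A", "B"]          # the two SMS entries we start with
copies_per_cycle = []

for _ in range(3):
    backup = list(phone)    # backup saves every entry currently on the phone
    phone = phone + backup  # restore appends them without a duplicate check
    copies_per_cycle.append(phone.count("A"))

print(copies_per_cycle)  # [2, 4, 8]
```

Three rounds of that and a single message shows up eight times, which matches what I saw in my threads.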

The Solution

Like the dork that I am, I wanted to solve this with code. And I did. You can try it out yourself if you had the same problem I did with SMS B&R. Check the AndroidSMSBackupRestoreCleaner GitHub repo. And what follows are the gory details!

My first approach was to manually parse the XML and do the filtering myself but like the lazy bastard that I am, I resorted to letting a SQLite primary key filter out the duplicates for me. You see, every SMS entry in the XML backup file has an epoch date (date) attribute and a number (address) attribute.

I used those two together as the compound primary key in my SQLite database because the same number can't be sending two messages in the same millisecond. Right? If this is false, I may have lost many messages. Ha. Whatevs.
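Here's a minimal sketch of the compound key idea, not the exact code from the repo. The table name, the column layout, the sample rows, and the use of INSERT OR IGNORE are my assumptions for the illustration; the only part taken from the description above is keying on the number (address) plus the date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The compound primary key makes each (address, date) pair unique.
conn.execute(
    "CREATE TABLE sms (address TEXT, date INTEGER, body TEXT, "
    "PRIMARY KEY (address, date))"
)

# Made-up sample rows for illustration only.
rows = [
    ("+15550001", 1386000000000, "hello"),
    ("+15550001", 1386000000000, "hello"),  # exact duplicate, gets skipped
    ("+15550002", 1386000000000, "hi"),     # same millisecond, other number
]

# OR IGNORE silently skips rows that would violate the primary key.
conn.executemany("INSERT OR IGNORE INTO sms VALUES (?, ?, ?)", rows)

kept = conn.execute("SELECT address, date, body FROM sms").fetchall()
print(len(kept))  # 2
```

Note the flip side: two genuinely different messages from the same number in the same millisecond would also collapse into one, which is exactly the risk I'm shrugging off above.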

So I simply used Python's built-in xml.etree.ElementTree to read the dirty backup XML file and loaded each <sms> entry into the database. SQLite did the filtering of duplicates for me by rejecting entries that have the same number and date (the compound primary key I set). What came after that was simply to write all the entries in the SMS database back into a new XML file, and update the SMS count at the root element. Voila. The lazy bastard wins.
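The whole round trip can be sketched roughly like this. Treat it as an illustration under assumptions rather than the script in the repo: the function name dedupe_sms, the smses root tag, the count attribute name, the sample messages, and storing each entry's attributes as JSON are all my choices for the example.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET


def dedupe_sms(root):
    """Return a new root element with duplicate <sms> entries dropped."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sms (address TEXT, date TEXT, attrs TEXT, "
        "PRIMARY KEY (address, date))"
    )
    # Load every <sms>; rows repeating an (address, date) pair are ignored.
    for sms in root.iter("sms"):
        conn.execute(
            "INSERT OR IGNORE INTO sms VALUES (?, ?, ?)",
            (sms.get("address"), sms.get("date"), json.dumps(sms.attrib)),
        )
    # Write the surviving rows back out in their original insertion order.
    cleaned = ET.Element(root.tag, root.attrib)
    for (attrs,) in conn.execute("SELECT attrs FROM sms ORDER BY rowid"):
        ET.SubElement(cleaned, "sms", json.loads(attrs))
    # Update the SMS count on the root element (attribute name assumed).
    cleaned.set("count", str(len(cleaned)))
    conn.close()
    return cleaned


# Made-up sample backup for illustration only.
dirty = ET.fromstring(
    '<smses count="3">'
    '<sms address="+15550001" date="1386000000000" type="1" body="hello" />'
    '<sms address="+15550001" date="1386000000000" type="1" body="hello" />'
    '<sms address="+15550002" date="1386000000500" type="2" body="hi back" />'
    "</smses>"
)
clean = dedupe_sms(dirty)
print(clean.get("count"))  # 2
```

To work on real files you would parse with ET.parse("backup.xml").getroot() and save with ET.ElementTree(clean).write("backup-clean.xml", encoding="utf-8", xml_declaration=True). Both file names are placeholders.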

Update Aug 18 2016: Hameer Abbasi has expanded the script. I haven't tested it but it looks like emoji support is there and he has added some other nice features! Check out his Git repo.

Update Mar 3 2016: A kind reader by the username VenomVendor has shared his open source, minimal permissions app solution to this. I haven't used it myself but how about you give Deduplicate SMS a try and let us know :)