Remove email (.eml) duplicates

0

I have a folder with around 50,000 emails in .eml format. There are many duplicates, even triplicates or quadruplicates, I suppose around 30,000 in total. I tried to remove the duplicates using the Mozilla Thunderbird add-on Remove Duplicate Messages (alternative), but it removed only a small part of them (a few hundred). Then I used Windows desktop apps such as Wise Duplicate Finder, Duplicate Cleaner Free, AllDup, Fast Duplicate Finder and Anti-Twin, using byte-by-byte comparison (60%), and none of those applications succeeded in finding the right duplicates (again, I managed to remove only a part of them, a few thousand this time).

I've attached an example of two emails that I have. Although they have slightly different source code (and different file names), they are basically the same: they were sent from the same email address, at the same time, and they have the same file size:

First email - message-1-34437.eml

Received: from e11mailgw02.isp.com ([212.200.12.195]) by mtain3.isp.com (Sun Java(tm) System Messaging Server 6.3-4.01 (built Aug  3 2007; 32bit)) with ESMTP id <0KKM00B5BQ1TKV40@mtain3.isp.com> for user@com; Tue, 02 Jun 2009 22:53:58 +0200 (CEST)
Received: from unknown (HELO vps.mafiascene.com) ([69.73.156.173]) by e11mailgw02.isp.com with ESMTP; Tue, 02 Jun 2009 22:53:57 +0200
Received: (qmail 24030 invoked by uid 48); Tue, 02 Jun 2009 16:53:51 -0400
Date: Tue, 02 Jun 2009 16:53:51 -0400
From: "Mafia Scene" <no-reply@mafiascene.com>
Subject: Mafia Scene Registration Confirmation
To: <user@com>
X-Priority: 3
X-MSMail-Priority: Normal
Importance: Normal
Message-ID: <20090602205351.24028.qmail@vps.mafiascene.com>
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Au0JAFEuJUpFSZyt/2dsb2JhbACOFhEBsRIRCAMEj2iCMR4IBAwEgSAF
X-IronPort-AV: E=McAfee;i="5300,2777,5634"; a="7766158"
X-MimeOLE: Produced By Microsoft MimeOLE V14.0.8089.726
Old-X-EsetId: 4FAA1F2928B4776950AC1F7F23E634
X-EsetId: 745B6128E6F033696B5D617DE9A773
X-EsetScannerBuild: 6455


Thank you for registering with Mafia Scene!



The details you registered your account with at 4:53pm EDT Tuesday - 2nd June 2009 are as follows:

Username: username 
Password: password

To active your account you MUST visit the following link WITHIN the next 24 HOURS.

http://mafiascene.com/modules.php?name=users&action=activate&id=c284c0e0a7a7aec0772709511b2b8f3e

Regards,

The Mafia Scene Staff


__________ Information from ESET NOD32 Antivirus, version of virus signature database 4124 (20090602) __________

The message was checked by ESET NOD32 Antivirus.

http://www.eset.com





__________ Information from ESET NOD32 Antivirus, version of virus signature database 4801 (20100124) __________

The message was checked by ESET NOD32 Antivirus.

http://www.eset.com

Second email - message-1-54557.eml

Received: from e11mailgw02.com ([212.200.12.195])
 by mtain3.isp.com
 (Sun Java(tm) System Messaging Server 6.3-4.01 (built Aug  3 2007; 32bit))
 with ESMTP id <0KKM00B5BQ1TKV40@mtain3.isp.com> for
 user@com; Tue, 02 Jun 2009 22:53:58 +0200 (CEST)
Received: from unknown (HELO vps.mafiascene.com) ([69.73.156.173])
 by e11mailgw02.com with ESMTP; Tue, 02 Jun 2009 22:53:57 +0200
Received: (qmail 24030 invoked by uid 48); Tue, 02 Jun 2009 16:53:51 -0400
Date: Tue, 02 Jun 2009 16:53:51 -0400
From: Mafia Scene <no-reply@mafiascene.com>
Subject: Mafia Scene Registration Confirmation
To: user@com
Message-id: <20090602205351.24028.qmail@vps.mafiascene.com>
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result:
 Au0JAFEuJUpFSZyt/2dsb2JhbACOFhEBsRIRCAMEj2iCMR4IBAwEgSAF
X-IronPort-AV: E=McAfee;i="5300,2777,5634"; a="7766158"
X-EsetId: 4FAA1F2928B4776950AC1F7F23E634


Thank you for registering with Mafia Scene!



The details you registered your account with at 4:53pm EDT Tuesday - 2nd June 2009 are as follows:

Username: username
Password: password

To active your account you MUST visit the following link WITHIN the next 24 HOURS.

http://mafiascene.com/modules.php?name=users&action=activate&id=c284c0e0a7a7aec0772709511b2b8f3e

Regards,

The Mafia Scene Staff


__________ Information from ESET NOD32 Antivirus, version of virus signature database 4124 (20090602) __________

The message was checked by ESET NOD32 Antivirus.

http://www.eset.com

Is there some way to detect such emails as duplicates?

Ljubisa Livac

Posted 2019-09-25T06:51:27.327

Reputation: 103

You must formulate programmable criteria of similarity (cutting off all headers and then comparing is one possible way) and apply them to each file; then you may use any available software that searches for duplicates, or write your own code. – Akina – 2019-09-25T07:58:40.230

In cases like this, where there's an automated message, even grep with a distinctive string can bring out a lot of emails. That (or something similar) can be used with the methods suggested in the answer. – ankii – 2019-09-25T09:07:33.170

Answers

3

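The headers differ (line folding, quoting, and several fields that exist in only one of the files), and the bodies differ as well (the first file carries an extra antivirus footer). Common duplicate finders can't tell that such files are the same message.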

You will have to brew up something of your own. For example, you could write a script to extract the information that's relevant to you, mark suspected duplicates, and apply some other technique to check whether each one is actually a duplicate. It's probably going to involve some manual work.

An easier first step could be to just cut off the headers and run the comparison.
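Cutting off the headers does not have to modify the files; it can happen in memory while a checksum is computed. Below is a minimal sketch, assuming Python 3 and plain-text bodies like the ones in the question. The names `body_digest`, `group_by_body` and `FOOTER_MARKER` are made up for illustration, and the footer marker is only a guess based on the sample emails, where one copy carries an additional antivirus footer.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumption: text from the sample emails. Everything from the first antivirus
# footer onward is ignored, because copies can differ in how many footers
# were appended.
FOOTER_MARKER = b"__________ Information from ESET NOD32 Antivirus"


def body_digest(raw: bytes) -> str:
    """Return a SHA-256 digest of the normalized body of one raw message."""
    text = raw.replace(b"\r\n", b"\n")
    # Headers end at the first blank line; the rest is the body.
    _headers, _sep, body = text.partition(b"\n\n")
    # Drop the footer(s), if present.
    body = body.partition(FOOTER_MARKER)[0]
    # Collapse whitespace so trailing spaces and blank lines do not matter.
    normalized = b" ".join(body.split())
    return hashlib.sha256(normalized).hexdigest()


def group_by_body(folder: str) -> dict:
    """Map each body digest to the .eml files sharing it (2 or more only)."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.eml")):
        groups[body_digest(path.read_bytes())].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Each group returned by `group_by_body` is only a list of suspected duplicates. Automated messages with identical text sent on different dates would land in the same group, so it is worth reviewing the groups (or combining the digest with a header such as Date) before deleting anything.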

Seth

Posted 2019-09-25T06:51:27.327

Reputation: 7 657

Thanks for taking the time to answer! I could definitely write a script to cut off the headers, but then I would run into another problem: the email files would become useless, because I wouldn't be able to search them later by email address, subject, sending date, etc. – Ljubisa Livac – 2019-09-25T21:17:47.340

Which is why you wouldn't just discard the files but rather build a checksum for the body, curate a list, find duplicates in that and delete the associated files. In short: you won't find a ready-to-go solution for what you need to do, as you deem the mails similar but from a computer's perspective they are not. You could also build a list with other attributes, find duplicates and check the associated files. – Seth – 2019-09-26T09:44:45.693
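As an illustration of the "list with other attributes" idea: in the two sample files the Message-ID value is identical even though the header name is spelled differently (`Message-ID` vs `Message-id`) and other headers are folded or quoted differently. A rough sketch, assuming Python 3 and its standard `email` package, could group files by that header. The names `message_key` and `group_by_key`, and the fallback on sender, date and subject, are illustrative choices and not something from the thread.

```python
from collections import defaultdict
from email.parser import BytesHeaderParser
from email.utils import parseaddr
from pathlib import Path


def message_key(raw: bytes) -> tuple:
    """Build a comparison key from the headers of one raw message."""
    headers = BytesHeaderParser().parsebytes(raw)

    def field(name: str) -> str:
        # Header lookup is case-insensitive; joining the split value
        # undoes line folding and extra spaces.
        return " ".join(str(headers.get(name, "")).split())

    msg_id = field("Message-ID")
    if msg_id:
        return ("id", msg_id)
    # Assumed fallback for messages without a Message-ID header:
    # bare sender address plus date and subject.
    sender = parseaddr(field("From"))[1].lower()
    return ("fallback", sender, field("Date"), field("Subject"))


def group_by_key(folder: str) -> dict:
    """Map each key to the .eml files sharing it (2 or more only)."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.eml")):
        groups[message_key(path.read_bytes())].append(path)
    return {key: paths for key, paths in groups.items() if len(paths) > 1}
```

The files are only read, never rewritten, so the headers stay searchable afterwards. Within each group one file can be kept and the rest moved to a separate folder for review before they are deleted for good.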