How can I find all the email-related files mixed into an old file archive?

1

Just to be clear, the question is not about accessing archives of emails, but about finding email-related files mixed "randomly" into a file archive with lots and lots of other files. Right now my focus is on finding the files; then I'll figure out what to do with them.

Ways I've thought of so far are:

  • Search the ASCII text of each file for "from", then manually review the results,
  • Create a list of all possible email clients during the date range of the files present, then create a list of known extensions for those clients, search for those extensions, and manually review the results,
  • I don't know; it seems like there might be a much easier way to do this, which is why I'm asking.
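A minimal sketch of the extension-based idea from the list above. The function name `find_by_ext` is hypothetical, and the extension list (.eml, .msg, .pst, .mbox, .dbx) is only an illustrative, incomplete assumption; it would need to be extended for whichever clients were in use. It also assumes a `find` that supports `-iname` (GNU and BSD find both do).

```shell
# Hypothetical helper: list files under a directory (default: .) whose
# names end in a few common mail-related extensions.
# The extension list is an assumption, not a complete inventory.
find_by_ext() {
  find "${1:-.}" -type f \( \
       -iname '*.eml'  -o \
       -iname '*.msg'  -o \
       -iname '*.pst'  -o \
       -iname '*.mbox' -o \
       -iname '*.dbx'  \)
}
```

Calling `find_by_ext /path/to/archive` prints one matching path per line. Files whose extension was changed or dropped will be missed, so this complements a content search rather than replacing it.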

blunders

Posted 2012-03-07T14:09:28.067

Reputation: 759

It wouldn't hurt to find some way to consolidate all old email into a single client. I managed to get old archives into Gmail once. That makes it easier to search. – William Jackson – 2012-03-07T14:14:41.513

@William Jackson: Just to be clear, the question is not about accessing archives of emails, but about finding email-related files mixed "randomly" into a file archive with lots and lots of other files. Right now my focus is on finding the files; then I'll figure out what to do with them. Am I misunderstanding your comment, or is my question not clear? Thanks! – blunders – 2012-03-07T14:18:23.733

I misunderstood your question. Thank you for the clarification! – William Jackson – 2012-03-07T14:20:15.080

Answers

1

You can grep for all files containing the words To:, From: and Subject: at the beginning of a line, which should cover pretty much all emails:

find . -type f -print0 \
  | xargs -0    grep -l '^To:'          \
  | xargs -I{}  grep -l '^From:'    '{}' \
  | xargs -I{}  grep -l '^Subject:' '{}'

This searches recursively, starting from the current directory, and prints the names of the files that contain all three headers.
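As a hedged variant of the pipeline above: `xargs -I{}` reads newline-separated names and still interprets quotes and backslashes, so file names containing those characters can break the later stages. The sketch below keeps the names null-separated through the whole chain instead. The function name `find_email_like` is hypothetical, and it assumes GNU grep and GNU xargs (`--null` on grep, `-r` on xargs).

```shell
# Hypothetical helper: list files under a directory (default: .) that
# contain lines starting with To:, From: and Subject:.
# Assumes GNU grep and GNU xargs.
find_email_like() {
  find "${1:-.}" -type f -print0 \
    | xargs -0 -r grep -l --null '^To:' \
    | xargs -0 -r grep -l --null '^From:' \
    | xargs -0 -r grep -l '^Subject:'
  # --null makes grep end each file name with a NUL byte for the next
  # xargs -0; -r stops xargs from running grep when no names arrive.
}
```

Because each grep receives many file names per invocation rather than one, this should also start far fewer processes than the `-I{}` form on a large archive.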

Marco

Posted 2012-03-07T14:09:28.067

Reputation: 4 015

+1 @Marco: Makes sense, and your approach would work at least for the Outlook files I just reviewed as ASCII text. Do you have experience looking for emails, and if so, is it uncommon for the (To, From, Subject) part of an email to be stored in binary? For example, although the Outlook "msg" files were binary, the (To, From, Subject) fields were in ASCII. Also, while I've used grep a little before, it's not clear to me why the command for "To" is different from the commands for "From" and "Subject". Thanks! – blunders – 2012-03-07T14:58:36.197

@blunders: I don't use Outlook; I use the maildir format, which stores mail in plain text. Situations like this make you dislike binary (especially proprietary) formats. The grep commands are all the same; what differs is the xargs call. The first xargs uses -0 to read null-separated names (because find was given -print0); the others use -I{}, which replaces the braces in the command line with each input line. That means grep receives the file names, not the braces. – Marco – 2012-03-07T15:16:24.160

+1 @Marco: Selecting your answer as the accepted answer, since filtering on the text at the start of a line was a great idea. In my case, after parsing a few thousand docs, I changed (To:,From,Subject) to (To:,From:,Subject:), since I was getting an oddly large number of false positives; I also made the search case-insensitive. As for Outlook, the point of the question is how to find "any" emails, not just ones in a target format; after I find "all" the email, I'll figure out what format they are in and what to do with them. Again, thanks! – blunders – 2012-03-07T17:18:10.733
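The case-insensitive search mentioned in the last comment could look roughly like the sketch below. It is not the exact command used in the thread: it swaps the xargs pipeline for chained `find -exec` tests, and the function name `find_email_like_ci` is hypothetical.

```shell
# Hypothetical helper: print files under a directory (default: .) that
# contain lines starting with To:, From: and Subject:, in any letter case.
find_email_like_ci() {
  # Each -exec acts as a test: find moves on to the next one only if
  # grep -q exits 0 (a match), and -print runs only if all three match.
  find "${1:-.}" -type f \
    -exec grep -qi '^To:' {} \; \
    -exec grep -qi '^From:' {} \; \
    -exec grep -qi '^Subject:' {} \; \
    -print
}
```

This runs up to three grep processes per file, so it is slower on a large archive, but file names never pass through a pipe, which sidesteps quoting problems entirely.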