How do I convert UTF-8 special characters in Bash?

6

2

I am writing a script that extracts and saves JPEG attachments from emails and passes them to ImageMagick. However, I live in Germany, and special characters such as "ö", "ä", "ü" and "ß" are quite common in email text and subjects.

I am extracting the subject with formail:

    SUBJECT=$(formail -zxSubject: <"$file")

and that results in:

  • =?UTF-8?Q?Meine_G=c3=bcte?=

("Meine Güte") or even worse

  • =?UTF-8?B?U2Now7ZuZSBHcsO8w59lIQ==?=

("Schöne Grüße!").

I want to use part of the subject as a filename and as an ImageMagick text annotation, which obviously doesn't work with the encoded form.

How do I convert this UTF-8 text to text with special characters in bash?

Thanks in advance! Markus

Markus

Posted 2015-09-12T23:16:20.093

Reputation: 239

Spaces between words and punctuation are missing as well :( – Markus – 2015-09-12T23:44:44.587

Answers

9

How do I convert this UTF-8 text to text with special characters in bash?

What you have isn't quite "UTF-8 text". You actually want plain UTF-8 text as output, as it's what Linux uses for "special characters" everywhere.

Your input, instead, is MIME (RFC 2047) encoded UTF-8. The "Q" marks Quoted-Printable mode, and "B" marks Base64 mode. Among others, Perl's Encode::MIME::Header can be used to decode both:

#!/usr/bin/env perl
use open qw(:std :utf8);
use Encode qw(decode);

while (my $line = <STDIN>) {
        print decode("MIME-Header", $line);
}

One-liner (see perldoc perlrun for an explanation of the switches):

perl -CS -MEncode -ne 'print decode("MIME-Header", $_)'

The input can mix plain text with Q-encoded and Base64-encoded words:

$ echo "Subject: =?UTF-8?Q?Meine_G=c3=bcte?=, \
                 =?UTF-8?B?U2Now7ZuZSBHcsO8w59lIQ==?=" | perl ./decode.pl
Subject: Meine Güte, Schöne Grüße!

A version in Python 3:

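#!/usr/bin/env python3
import email.header, sys

# decode_header() returns (text, charset) pairs; text is bytes when the input
# contains encoded words, but a plain str when it contains none.
words = email.header.decode_header(sys.stdin.read())
words = [s.decode(c or "utf-8") if isinstance(s, bytes) else s
         for (s, c) in words]
print("".join(words))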
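If Python is available anyway, its standard email package can also read the subject straight from the raw message, which might replace the formail step. The following is only a minimal sketch, assuming a reasonably recent Python 3 in which email.policy.default is available; the helper name read_subject is made up for illustration and does not come from the question.

```python
import email
import email.policy


def read_subject(raw_message: bytes) -> str:
    """Return the decoded Subject of a raw message (hypothetical helper)."""
    # With policy.default, header values are returned as decoded str objects,
    # so RFC 2047 encoded words are already converted to readable text.
    msg = email.message_from_bytes(raw_message, policy=email.policy.default)
    # A message without a Subject header yields None; map that to "".
    return str(msg["Subject"] or "")
```

From a shell script, this could be called with the message file's bytes, for example by reading sys.stdin.buffer and printing the result.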

user1686

Posted 2015-09-12T23:16:20.093

Reputation: 283 655

The one-liner didn't work for me (see @blami). But the pl code worked fine. When I used the last script, the Base64 part wasn't decoded at all. Not even when I tried only that part. Your explanation helped me as well. – Markus – 2015-09-13T01:16:35.690

6

The e-mail subject is itself a header, and headers must contain only ASCII characters. This is why a subject in UTF-8 (or any other non-ASCII charset) must be encoded.

This way of encoding non-ASCII characters into ASCII was originally described in RFC 1342; the current specification is RFC 2047.

Basically, an encoded subject has (as you've already listed in your examples) the following format:

=?charset?encoding?encoded-text?=

Depending on the encoding value, encoded-text is decoded either as quoted-printable (Q) or as Base64 (B). In the Q encoding, an underscore additionally stands for a space.

To get a human-readable form, you need to pass the encoded-text portion of the subject header value to a program that decodes it. I believe there are standalone commands for this (uudecode), but I prefer Perl one-liners:

For quoted-printable:

perl -pe 'use MIME::QuotedPrint; $_=MIME::QuotedPrint::decode($_);'

and for base64:

perl -pe 'use MIME::Base64; $_=MIME::Base64::decode($_);'

Be sure to pass only the encoded-text portion and not the whole subject header value.
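The two steps described here (isolating encoded-text, then decoding it according to the encoding field) can be sketched in Python with only the standard library. This is an illustration rather than a full RFC 2047 decoder: it handles a single encoded word, and the names ENCODED_WORD and decode_word are hypothetical.

```python
import base64
import quopri
import re

# One encoded word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")


def decode_word(word: str) -> str:
    """Decode a single encoded word (hypothetical helper for illustration)."""
    match = ENCODED_WORD.fullmatch(word.strip())
    if match is None:
        return word  # not an encoded word, hand it back unchanged
    charset, encoding, encoded_text = match.groups()
    if encoding.upper() == "B":
        raw = base64.b64decode(encoded_text)
    else:
        # header=True also turns "_" into a space, as the Q encoding requires
        raw = quopri.decodestring(encoded_text.encode("ascii"), header=True)
    # The bytes are only text once interpreted in the declared charset.
    return raw.decode(charset)
```

Because only the captured encoded-text group is passed to the decoder, the surrounding =?UTF-8?Q? and ?= markers do not end up in the result, and the underscore is turned into a space.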

blami

Posted 2015-09-12T23:16:20.093

Reputation: 276

Thank you for your comment! I learned a lot from your detailed explanation. However, I couldn't get the one-liners to run properly (in PuTTY on a Raspberry Pi). But I'm sure this is only because of my poor Linux skills. I used echo "=?UT...:" | perl ... and the result was "=?UTF-8?Q?Meine_Güte?". Only the "ü" was translated. @grawity 's provided .pl script has produced good results. – Markus – 2015-09-13T01:10:08.450