Why is the size of my email about a third bigger than the size of its attached files?

111

10

When attaching data to my emails, I noticed that Thunderbird calculates the total size of the resulting email as much bigger than the files I attached.

Here's a recent example: two images, one at 13MB and one at 3.6MB should in total be approximately 17MB. There were four lines of text. Thunderbird then asked me if I really wanted to send an email with a total size of 22MB.

Where is that difference coming from? 5MB of text sounds like a bit much.

arc_lupus

Posted 2016-10-26T20:45:07.923

Reputation: 1 177

2Note that this often affects things like maximum size. If I'm not mistaken Google mail usually allows email of at most 25MB, but the 25MB are computed after encoding, so you cannot send a 25MB image with an email, because when encoded it would actually be too big. – Bakuriu – 2016-10-27T06:26:20.267

4@Bakuriu's comment applies to Outlook+Exchange server as well. I suggest that the underlying question is actually Why do mail clients (often -- Tbird seems better than outlook again) report only the local file size when it's the base64-encoded size that matters? – Chris H – 2016-10-27T10:00:39.220

@MarcksThomas I don't want to argue against the appeal of having one all-including easily searchable source of knowledge against just having all knowledge easily searchable. But is it necessary? I don't think so. - I don't think that the question isn't useful at all, I just think it doesn't fulfil the basic requirements to keep the site free of unnecessary questions and makes it harder to find the really important stuff, that isn't answered anywhere else. That's what we should be doing! - arc_lupus, as I only lurk on this site, usually, my downvote doesn't count, yet. But as it is, it stands. – I'm with Monica – 2016-10-28T06:33:48.173

Answers

214

Your data was 17 MiB. There are 1024 KiB in an MiB. There are 1024 B in a KiB. There are 8 bits in a byte. So that's 142,606,336 bits.

Base64 encoding represents every six bits of input as a separate output byte, so we need 142,606,336 / 6, or about 23,767,722 bytes. Dividing by 1024 twice gives 22.67 MiB. So that's where the 22 MiB comes from.
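The arithmetic above can be checked with a short Python sketch. The helper name `base64_size` is made up for illustration, and it assumes padded base64 with no line breaks, so the result is two bytes larger than the rough bits-divided-by-six figure.

```python
import base64

MIB = 1024 * 1024


def base64_size(n_bytes: int) -> int:
    """Bytes of padded base64 output for n_bytes of input, no line breaks.

    Every group of up to 3 input bytes becomes 4 output characters.
    """
    return 4 * ((n_bytes + 2) // 3)


attachment_bytes = 17 * MIB          # the 17 MiB from the question
encoded_bytes = base64_size(attachment_bytes)

print(attachment_bytes * 8)          # 142606336 bits
print(encoded_bytes)                 # 23767724 bytes
print(round(encoded_bytes / MIB, 2)) # 22.67 MiB

# Sanity check of the formula against the standard-library encoder.
sample = bytes(range(256)) * 4
assert len(base64.b64encode(sample)) == base64_size(len(sample))
```

The same helper gives the 4/3 rule of thumb for any attachment size; the message text and headers come on top of that.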

Email is a pretty old technology and doesn't assume an 8-bit clean pipe.

David Schwartz

Posted 2016-10-26T20:45:07.923

Reputation: 58 310

80To decode that last line a little: base-64 is a way to encode attachments as text using a limited set of "guaranteed safe characters" that wouldn't get garbled by some intermediary equipment, such as a-z, A-Z, 0-9 – Yorik – 2016-10-26T21:53:43.590

65And, once you understand the math in David's excellent answer, you can just multiply the size of the attachments by 4/3 to get the size of the mail message that will be sent (plus the actual text). – Kent – 2016-10-26T23:13:51.453

12Even if email knew it has a full 8 bit pipe there would have to be encoding as it's fundamentally a text stream--some characters serve control functions and thus must not occur in your data. That being said, there are better encoding techniques but they haven't been adopted. – Loren Pechtel – 2016-10-27T04:24:36.303

4@LorenPechtel you can happily have an application/octet-stream part in a MIME message. All you have to do is choose a boundary that doesn't occur in the data. – OrangeDog – 2016-10-27T08:51:13.563

@Mehrdad I was saying you're both right: Copper.hat was saying the error checking/fixing happens at a higher level than the physical exchange of bytes (which it is) and you're saying it's a lower level than MIME-encoding/mail-transfer format (which it is). – TripeHound – 2016-10-27T15:16:10.557

9What base64 actually does is use 4 bytes for every 3 original bytes. While this sounds similar, it is important because the length is always a multiple of 4, and also because there is no reason to go down to the bit level. – njzk2 – 2016-10-27T15:44:25.487

1@Mehrdad Email doesn't actually have a binary representation, thus the need to re-encode binary data as text (a la base64). – jpaugh – 2016-10-27T18:14:42.193

In principle if you can send 8BITMIME, you could use an encoding that's a lot more efficient than base64 with something like 7.5 or more bits of binary data per byte (rather than 6 bits). You can't send pure binary because it has to be valid text, but you can get close to the same efficiency. – R.. GitHub STOP HELPING ICE – 2016-10-28T02:43:59.687

1Unfortunately there is no standard defining an encoding that will efficiently encode arbitrary binary data as a data stream that follows the rules of MIME "8bit". There IS an SMTP extension for transmitting mails containing arbitrary binary data but it does not seem to be widely supported. – plugwash – 2016-10-28T03:04:05.917

1@njzk2 Base64 encoded data is always an integer multiple of 4 bytes, except when it is not. Particularly, the end padding is optional in many implementations. – a CVn – 2016-10-28T14:12:46.450

@njzk2 https://tools.ietf.org/html/rfc4648#section-3.2 "Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise." (My emphasis.) Not sure if Internet e-mail allows or does not allow lack of padding. – a CVn – 2016-10-28T14:42:53.207

There still is no "8 bit clean pipe", an SMTP server will interpret what is sent to it over a TCP connection, at the very least looking for an end sequence (single dot or control-D) to a DATA command. So at the very least, an escape protocol to keep all valid such sequences out of the binary data will be needed. – rackandboneman – 2016-11-01T17:19:03.617

@rackandboneman the CHUNKING/BINARYMIME standard solves that problem by introducing the BDAT command which includes a length header. So the binary message data does not have to be scanned for an end sequence. – plugwash – 2016-11-04T15:19:20.103

50

Why is the email bigger?

Because the data is encoded in base64, which encodes each group of up to three bytes as a group of four printable ASCII characters. Typically, these printable characters are then split into lines.

The result is that the encoded data is just over 1⅓ times the size of the original data.
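To see where the "just over" comes from, here is a rough Python sketch. The function name `mime_base64_size` is hypothetical, and it assumes the usual MIME convention of at most 76 encoded characters per line, each line ending in a two-byte CRLF; a particular mail client may wrap differently.

```python
import base64


def mime_base64_size(n_bytes: int, line_len: int = 76, eol_len: int = 2) -> int:
    """Estimated size of base64 text wrapped into lines.

    Assumes line_len characters per line and eol_len bytes per line ending
    (2 for CRLF). Both defaults are assumptions about the sending client.
    """
    chars = 4 * ((n_bytes + 2) // 3)             # 3 bytes in, 4 characters out
    lines = (chars + line_len - 1) // line_len   # round up to whole lines
    return chars + lines * eol_len


# Groups of fewer than three bytes are padded with "=" to four characters.
print(base64.b64encode(b"a"))    # b'YQ=='
print(base64.b64encode(b"ab"))   # b'YWI='
print(base64.b64encode(b"abc"))  # b'YWJj'

# base64.encodebytes also wraps at 76 characters, but ends lines with a
# single "\n", so it should match the estimate when eol_len is 1.
data = bytes(1000)
assert len(base64.encodebytes(data)) == mime_base64_size(len(data), eol_len=1)

size = 17 * 1024 * 1024
ratio = mime_base64_size(size) / size
print(ratio)  # a little above 4/3 because of the line endings
```

Under these assumptions the line endings add roughly 2 bytes for every 76 characters, which is why the overall factor lands slightly above 4/3.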

Why is base64 used?

Email has a long history and was originally designed to carry text. Only byte values representing ASCII printable characters could reliably pass through the wide variety of email systems on the planet.

So MIME devised two schemes for encoding other data as ASCII text: "quoted-printable", designed for mostly ASCII text with a few other bits, and "BASE64" for arbitrary binary data.
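As a rough illustration of why there are two schemes, the following Python sketch encodes one mostly-ASCII payload and one binary payload both ways, using the standard-library `quopri` and `base64` modules. The sample payloads are invented for the comparison, and the exact sizes depend on the encoder's line-wrapping details.

```python
import base64
import quopri

# Invented sample payloads: mostly-ASCII text with one accented letter per
# line, and a block that cycles through every possible byte value.
text = "Café menu: soup, salad, bread, and coffee for two.\n".encode("utf-8") * 20
binary = bytes(range(256)) * 5

sizes = {}
for label, payload in (("mostly-ASCII text", text), ("binary", binary)):
    qp = quopri.encodestring(payload)   # quoted-printable
    b64 = base64.encodebytes(payload)   # base64 wrapped into lines
    sizes[label] = (len(payload), len(qp), len(b64))
    print(f"{label}: raw={len(payload)} quoted-printable={len(qp)} base64={len(b64)}")
```

Quoted-printable leaves ordinary ASCII characters alone and spends three characters on each byte it has to escape, so it should come out smaller for the text sample and noticeably larger than base64 for the binary one.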

There have been extensions to the SMTP protocol to try and remove these restrictions. First, 8BITMIME in 1994, which allowed higher octet values but unfortunately didn't remove limits related to line lengths and line endings, so was not suitable for arbitrary binary data; and then BINARYMIME in 1995, which allowed transfer of messages containing arbitrary binary data.

However, these standards have not seen widespread adoption. One problem is, what happens if one hop in the mail chain supports them but the next hop doesn't? The mail server then can't send the mail on as-is; it must either reject it as undeliverable and bounce it (which is unlikely to be acceptable to users), or convert it (which requires significant extra code in the mail server). Conversion is made especially painful by MIME rules regarding not using content transfer encodings on multipart types.

plugwash

Posted 2016-10-26T20:45:07.923

Reputation: 4 587

1I wonder why yEnc, on the other hand, was quite successful in Usenet at displacing UUE. Possibly because binary newsgroups put a much higher pressure on ISPs than an occasional binary email? – igorsk – 2016-10-30T19:09:44.140

2@igorsk: plus Usenet/NN was presented and understood as lossy, where you could publish an article and not all subscribers on all servers would necessarily receive it. There were (and largely remain) customs about quoting in a followup 'enough' of the previous article(s) that your followup can be understood by someone who didn't get the previous article(s). In contrast most (nonspammer) email senders expected 'the system' would get their message to the named recipient(s), although sometimes after hours or days; today people complain about even short delays. – dave_thompson_085 – 2016-10-31T22:35:29.650