Why do English characters require fewer bytes to represent than other alphabets?

31

5

When I put 'a' in a text file, it makes the file 2 bytes, but when I put, let's say, 'ա', which is a letter from the Armenian alphabet, it makes it 3 bytes.

What is the difference between alphabets for a computer?
Why does English take less space?

khajvah

Posted 2014-04-11T18:07:37.283

Reputation: 750

22

You should read this article by the founder of StackExchange: http://www.joelonsoftware.com/articles/Unicode.html

– Eric Lippert – 2014-04-11T20:44:20.620

22I don't think there is such a thing as "English characters". They are Roman. – Raphael – 2014-04-12T09:56:36.387

5@Raphael everybody knows what he is referring to though. But nice add. – Mathias Lykkegaard Lorenzen – 2014-04-12T12:09:15.057

Your problem is that you are using UTF-16 or something and not the better, more space-saving UTF-8. – Cole Johnson – 2014-04-12T20:40:09.493

1@Raphael Actually there are many Roman letters that aren't used in English, and thus aren't included in the ASCII character set. Most of them include modifiers, but those are still needed to properly render text in various Latin-derived languages other than English. – Wutaz – 2014-04-13T15:48:43.623

7@Raphael I don't think there is such a thing as “Roman characters”. They are Latin. – Blacklight Shining – 2014-04-13T20:47:08.990

@Raphael What about 'merican characters? – Martin Ueding – 2014-04-14T08:26:29.537

Answers

41

One of the first encoding schemes developed for use in mainstream computers was ASCII (American Standard Code for Information Interchange). It was developed in the 1960s in the United States.

The English alphabet uses part of the Latin alphabet (for instance, there are few accented words in English). There are 26 individual letters in that alphabet, not considering case. Any scheme that intends to encode English text also has to include the individual digits and punctuation marks.

The 1960s were also a time when computers didn't have the amount of memory or disk space that we have now. ASCII was developed to be a standard representation of a functional alphabet across American computers. At the time, the decision to make every ASCII character 8 bits (1 byte) long was made due to technical details of the day (the Wikipedia article mentions the fact that perforated tape held 8 bits in a position at a time). In fact, the original ASCII scheme can be transmitted using 7 bits; the eighth could be used for parity checks. Later developments expanded the original ASCII scheme to include several accented, mathematical and terminal characters.

With the increase of computer usage across the world, more and more people who spoke different languages got access to computers. That meant that, for each language, new encoding schemes had to be developed independently of the others, and those schemes would conflict if text was read on a terminal set up for a different language.

Unicode came as a solution to this proliferation of incompatible encodings, by merging all possible meaningful characters into a single abstract character set.

UTF-8 is one way to encode the Unicode character set. It is a variable-width encoding (i.e. different characters can have different sizes) and it was designed for backwards compatibility with the former ASCII scheme. As such, characters in the ASCII set remain one byte each, while any other character takes two or more bytes. UTF-16 is another way to encode the Unicode character set; in it, characters are encoded as either one or two 16-bit code units.

As stated in the comments, the 'a' character occupies a single byte while 'ա' occupies two bytes, which indicates a UTF-8 encoding. The extra byte in the question was due to the existence of a newline character at the end (which the OP found out about).
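
For anyone who wants to reproduce the numbers from the question, here is a minimal sketch in Python (my illustration; the question itself used echo and xxd on Linux, and the file is assumed to be saved as UTF-8):

    # Byte counts, assuming the text is encoded as UTF-8
    print(len('a'.encode('utf-8')))      # 1 -> the single ASCII letter
    print(len('ա'.encode('utf-8')))      # 2 -> U+0561 encodes as 0xD5 0xA1
    print(len('ա\n'.encode('utf-8')))    # 3 -> the trailing newline added by echo/editors
                                         #      accounts for the third byte the OP saw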

Doktoro Reichard

Posted 2014-04-11T18:07:37.283

Reputation: 4 896

@Damon Can you elaborate on why it's a very poor choice for computer text processing? – Milind R – 2014-12-13T16:27:02.583

1@MilindR: hard to fit into 600 chars... Unicode contains a lot of crap that nobody will ever need (do you speak Babylonian?), and it encodes a lot of crap that nobody will seriously need (Klingon, really? Numbers in circles? ANSI control codes?), some of which are in low numbers, making UTF-8 considerably less efficient for roman languages than it could be (at no extra cost). Also, it allows a considerable number of symbols being encoded in two or more ways (e.g. accented/umlauted characters). This requires considerable work ("normalization") that would actually not be necessary. – Damon – 2014-12-13T16:43:33.997

1Unicode makes the assumption that you may need every character that any human since the stone age has ever drawn at any time, as special characters. 2/3 of that could be solved easier, better, and more efficiently by using a different font or formatting hints (like, numbers in circles, or superscript numbers). It certainly "works", somehow, but it's wrong-headed on so many ends. – Damon – 2014-12-13T16:46:13.093

@Damon normalization is the natural consequence of an evolving standard. Numbers in circles, superscript numbers... well, I have to agree with you. Still, it seems UTF-8 is more at fault than anything else.. The code points aren't unworthy of existing, just unworthy of precious 7-bit space. On that note : http://programmers.stackexchange.com/questions/266292/how-do-i-create-a-new-unicode-encoding

– Milind R – 2014-12-13T17:16:42.180

-1 from me for the sloppy-sounding "first ... in mainstream" (sorry, my bad mood... hope we can do better) – n611x007 – 2017-01-17T12:25:11.370

The last byte codes the end of file. – Joce – 2014-04-11T18:54:32.717

That would make sense actually... although I don't see its effects in Notepad. – Doktoro Reichard – 2014-04-11T18:57:05.517

1Without the last byte, Notepad (or any other tool) wouldn't know when to stop reading from the storage medium. But end-of-file is not shown in such tools, of course. – Joce – 2014-04-11T19:07:31.527

26There is no last byte that codes the end of file, in any normal encoding or file format. When a program reads a file, end of file might be signalled by the OS in a special way, but that’s a different issue. – Jukka K. Korpela – 2014-04-11T19:09:30.317

I use linux if it helps. – khajvah – 2014-04-11T19:13:40.137

@Joce I understand that the EOF char isn't represented in Notepad; what I was referring to was Windows Explorer's representation of the size the file had, which by writing a char was at 1 byte. It would mean Explorer specifically forgets about the null char. – Doktoro Reichard – 2014-04-11T19:15:03.547

2

The ա character is 2 bytes (0xD5 0xA1) when encoded as UTF-8; the extra character (whatever it is) is present in both files. http://www.marathon-studios.com/unicode/U0561/Armenian_Small_Letter_Ayb

– Dan is Fiddling by Firelight – 2014-04-11T19:23:22.483

6@khajvah If you echo 'ա' > file.txt, or edit the file using some editors, they automatically add a newline after it. If you run xxd file.txt, the last byte will probably be a 0a, or line feed. – Daniel Beck – 2014-04-11T19:31:38.123

1@DanielBeck Yes, that's the case. echo added a newline at the end – khajvah – 2014-04-11T19:35:53.767

7@DoktoroReichard: Please clarify in the answer that Unicode is not an encoding; rather, it's an abstract character set, and UTF-16 and UTF-8 are encodings of Unicode codepoints. The last paragraphs of your answer mostly talk about UTF-8. But if a file uses UTF-16, then any codepoint, even the one for a, will use two bytes (or a multiple of two). – user1686 – 2014-04-11T20:22:46.780

6It's also probably worth emphasizing that the "extended ASCII" character sets are in fact not ASCII at all, and the number of different ways to utilize the eighth bit makes it all a big mess. Just use UTF-8 instead. – ntoskrnl – 2014-04-11T21:01:32.507

1

@ntoskrnl The extensions that IBM (and many, many others) made to the ASCII standard came about from a need to represent things that the existing characters couldn't, on the terminals of the time. Also, several European countries still use those character sets, despite the existence of Unicode.

– Doktoro Reichard – 2014-04-11T21:08:50.533

2ASCII proper is 7 bits, not 8 – mpez0 – 2014-04-12T00:12:56.800

The paragraph about Unicode is wrong, though. Unicode is not a solution to the existence of different terminals. On the contrary, Unicode is entirely unsuitable for what it is being used for. It is not a character set, but a grapheme encoding (for "anything man has ever written"), which includes graphemes of languages that no living person speaks and multiple ambiguous encodings for the same graphemes. This makes it an extremely poor choice for computer text processing, introducing many twists and pitfalls, and significant overhead (for such things as e.g. "normalization"). – Damon – 2014-04-14T09:20:31.007

@Damon it is a solution. I never said it was the best. – Doktoro Reichard – 2014-04-14T19:10:31.080

17

1 byte is 8 bits, and can thus represent up to 256 (2^8) different values.

For languages that require more possibilities than this, a simple 1 to 1 mapping can't be maintained, so more data is needed to store a character.

Note that generally, most encodings use the first 7 bits (128 values) for ASCII characters. That leaves the 8th bit, or 128 more values, for other characters. Add in accented characters, Asian languages, Cyrillic, etc., and you can easily see why 1 byte is not sufficient for keeping all characters.
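
To put rough numbers on that, here is a small Python sketch (my addition, not part of the original answer); 0x110000 is the size of the Unicode code space, U+0000 through U+10FFFF:

    print(2 ** 7)      # 128     values fit in 7 bits (the ASCII range)
    print(2 ** 8)      # 256     values fit in one 8-bit byte
    print(0x110000)    # 1114112 possible Unicode code points
    # A single byte can only name 256 characters, so anything beyond the
    # first 256 has to be spread across multiple bytes (or wider code units).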

ernie

Posted 2014-04-11T18:07:37.283

Reputation: 5 938

so here is the only answer actually explaining why more space is used – Félix Gagnon-Grenier – 2014-04-12T00:19:05.943

10

In UTF-8, ASCII characters use one byte, other characters use two, three, or four bytes.
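
A quick illustration of those size classes (a Python sketch I added; the sample characters are arbitrary choices):

    samples = ['a',    # ASCII letter              -> 1 byte
               'ա',    # Armenian small letter ayb -> 2 bytes
               '€',    # euro sign (U+20AC)        -> 3 bytes
               '😀']   # emoji outside the BMP     -> 4 bytes
    for ch in samples:
        print(ch, len(ch.encode('utf-8')))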

Jason

Posted 2014-04-11T18:07:37.283

Reputation: 5 925

1Can you elaborate on why this is? noting two encoding methods doesn't quite answer the question. – MaQleod – 2014-04-11T18:31:06.103

@MaQleod Unicode was created to replace ASCII. For backwards compatibility, the first 128 characters are the same. These 128 characters can be expressed with one byte. Additional bytes are added for additional characters. – Jason – 2014-04-11T18:45:41.943

I'm aware, but that is part of the answer to the question as to what makes the ASCII characters different. It should be explained to the OP. – MaQleod – 2014-04-11T18:50:53.070

@MaQleod It could also be said that the Unicode Consortium was mostly comprised of American corporations and was biased towards English-language characters. I thought a simple answer was better than a subjective one. – Jason – 2014-04-11T18:55:05.757

No American bias. Unicode is an extension of ISO-8859-1; the first 256 characters are the same. In turn, ISO-8859-1 is an extension of ASCII because most of Europe needed ASCII as a subset. – MSalters – 2014-04-11T23:15:11.653

15Not "in Unicode", in UTF8 - which is just one of several encodings of the Unicode character set. – Sebastian Negraszus – 2014-04-12T09:15:07.833

This answer isn't even accurate. In UTF-16 encoded Unicode (like C# and Java use) most characters, including the original ASCII set, take up 2 bytes, while very obscure characters take up 4. – KutuluMike – 2014-04-12T14:35:14.937

3

The number of bytes required for a character (which the question is apparently about) depends on the character encoding. If you use the ArmSCII encoding, each Armenian letter occupies just one byte. It’s not a good choice these days, though.

In the UTF-8 transfer encoding for Unicode, different characters need different numbers of bytes. In it, “a” takes just one byte (the idea about two bytes is some kind of a confusion), “á” takes two bytes, and the Armenian letter ayb “ա” takes two bytes too. Three bytes must be some kind of a confusion. In contrast, e.g. the Bengali letter a “অ” takes three bytes in UTF-8.

The background is simply that UTF-8 was designed to be very efficient for Ascii characters, fairly efficient for writing systems in Europe and surroundings, and less efficient for all the rest. This means that for basic Latin letters (which is what English text mostly consists of), only one byte is needed per character; for Greek, Cyrillic, Armenian, and a few others, two bytes are needed; all the rest needs more.

UTF-8 also has (as pointed out in a comment) the useful property that Ascii data (when represented as 8-bit units, which has been almost the only way for a long time) is trivially valid UTF-8, too.
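
That compatibility property is easy to check; the following Python sketch is my own illustration, not something from the answer:

    text = "plain ASCII text"
    # Encoding pure ASCII as ASCII and as UTF-8 yields exactly the same bytes,
    # so an old ASCII file is already a valid UTF-8 file.
    print(text.encode("ascii") == text.encode("utf-8"))   # True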

Jukka K. Korpela

Posted 2014-04-11T18:07:37.283

Reputation: 4 475

Thank you for the answer. The additional byte is there because the program I used automatically added a newline character to the end. – khajvah – 2014-04-11T19:37:46.397

1

I don't think UTF-8 was so much designed for efficiency with ASCII data as for compatibility. UTF-8 has the very nice property that 7-bit ASCII content (with the high bit set to zero) is identical to the same content encoded as UTF-8, so for tools that normally deal with ASCII, it's a drop-in replacement. No other Unicode encoding scheme has that property, to my knowledge. UTF-8 is also reasonably compact for most data, particularly if you stay within the realm of the Unicode BMP.

– a CVn – 2014-04-13T13:00:50.063

1@MichaelKjörling, I’ve added a reference to that feature. However, a major objection to Unicode in the early days was inefficiency, and UTF-16 doubles the size of data that is dominantly Ascii. UTF-8 means, e.g. for English text, that you only “pay” for the non-Ascii characters you use. – Jukka K. Korpela – 2014-04-13T15:28:21.977

3

Character codes in the 1960s (and long beyond) were machine-specific. In the 1980s I briefly used a DEC 2020 machine, which had 36-bit words and 5, 6 and 8 (IIRC) bits per character encodings. Before that, I used an IBM 370 series with EBCDIC. ASCII with 7 bits brought order, but things became a mess again with IBM PC "code pages" using all 8 bits to represent extra characters, like all sorts of box-drawing ones to paint primitive menus, and later with ASCII extensions like Latin-1 (8-bit encodings, with the first 7 bits like ASCII and the other half for "national characters" like ñ, Ç, or others). Probably the most popular was Latin-1, tailored to English and most European languages that use Latin characters (and accents and variants).

Writing text that mixed, say, English and Spanish went fine (just use Latin-1, a superset of both), but mixing anything that used different encodings (say, including a snippet of Greek or Russian, not to mention an Asian language like Japanese) was a veritable nightmare. Worst of all, Russian and particularly Japanese and Chinese had several popular, completely incompatible encodings.

Today we use Unicode, which is coupled with efficient encodings like UTF-8 that favor English characters (surprisingly, the encoding for English letters just so happens to correspond to ASCII), thus making many non-English characters use longer encodings.

vonbrand

Posted 2014-04-11T18:07:37.283

Reputation: 2 083

2

Windows 8.1, US/English. File with a single 'a' saved with Notepad:

  • Save AS ANSI 1 byte
  • Save AS Unicode 4 bytes
  • Save AS UTF-8 4 bytes

File with a single 'ա' saved with Notepad:

  • Save AS ANSI not possible
  • Save AS Unicode 4 bytes
  • Save AS UTF-8 5 bytes

A single 'a' is encoded as a single byte in ANSI. In Unicode, each character is usually 2 bytes, and there is also a 2-byte BOM (Byte Order Mark) at the beginning of the file. UTF-8 has a 3-byte BOM and the single-byte character.

As for 'ա', that character does not exist in the ANSI character set and can't be saved on my machine. The Unicode file is the same size as before, and the UTF-8 file is 1 byte larger because the character takes 2 bytes.
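
Those sizes can be approximated outside Notepad too. A rough Python sketch (my own, assuming, as described above, that Notepad's "ANSI" means Windows-1252, "Unicode" means UTF-16 LE with a BOM, and that its UTF-8 option also writes a BOM):

    import codecs

    def notepad_size(text, encoding, bom=b''):
        # File size = optional byte order mark + the encoded text
        return len(bom + text.encode(encoding))

    print(notepad_size('a', 'cp1252'))                            # 1  ("ANSI")
    print(notepad_size('a', 'utf-16-le', codecs.BOM_UTF16_LE))    # 4  ("Unicode")
    print(notepad_size('a', 'utf-8', codecs.BOM_UTF8))            # 4  (UTF-8 with BOM)
    print(notepad_size('ա', 'utf-16-le', codecs.BOM_UTF16_LE))    # 4  ("Unicode")
    print(notepad_size('ա', 'utf-8', codecs.BOM_UTF8))            # 5  (UTF-8 with BOM)
    # 'ա'.encode('cp1252') raises UnicodeEncodeError, matching "Save As ANSI not possible"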

If your machine is from a different region, you may have a different code page installed, which has different glyphs for the 128 values outside the 7-bit ASCII range. As @ntoskrnl mentioned, the code page for my machine would be Windows-1252, which is the default for US English.

Darryl Braaten

Posted 2014-04-11T18:07:37.283

Reputation: 121

4Notepad (and Windows in general) uses confusing terminology here. "ANSI" is a locale-dependent single byte encoding (Windows-1252 on English versions), and "Unicode" is UTF-16. – ntoskrnl – 2014-04-11T21:04:29.177

@ntoskrnl That is correct, but if you are looking in the drop box for encoding it says ANSI, which is why I mentioned if you have a different OEM codepage you may get different results. – Darryl Braaten – 2014-04-12T15:32:44.870

2

If you are interested in how characters are stored, you can go to www.unicode.org and look around. At the top of their main page is a link "Code Charts" that shows you all the character codes that are available in Unicode.

All in all, there are a bit over one million code points available in Unicode (not all of them are used). One byte can hold 256 different values, so you would need three bytes per character if you wanted to store every possible Unicode code point at a fixed width.

Instead, Unicode is usually stored in the "UTF-8" encoding, which uses fewer bytes for some characters and more for others. The first 128 code values are stored in a single byte, up to the first 2048 code values are stored in two bytes, up to 65536 are stored in three bytes, and the rest take four bytes. This has been arranged so that code values that are used more often take less space. A-Z, a-z, 0-9 and !@$%^&*()-[}{};':"|,./<>? and some that I forgot take one byte; almost all of English and 98% of German and French (just guessing) can be stored in one byte per character, and these are the characters that are used most. Cyrillic, Greek, Hebrew, Arabic and some others use two bytes per character. Indian languages, most of Chinese, Japanese, Korean, Thai, and tons of mathematical symbols can be written in three bytes per character. Rare things (if you ever want to write text in Linear A or Linear B), as well as emoji, take four bytes.

Another encoding is UTF-16. Everything that takes 1, 2 or 3 bytes in UTF-8 takes two bytes in UTF-16. That's an advantage if you have Chinese or Japanese text with very few Latin characters in between.
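
Both of those claims are easy to verify; here is a Python sketch I added (the boundary code points and the Japanese sample string are my own illustrative choices):

    # UTF-8 size classes at their boundaries
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
        print(f"U+{cp:04X} -> {len(chr(cp).encode('utf-8'))} byte(s) in UTF-8")
    # prints 1, 2, 2, 3, 3, 4 bytes respectively

    # CJK text: 3 bytes per character in UTF-8, 2 bytes per character in UTF-16
    japanese = "日本語のテキスト"               # 8 characters, all in the BMP
    print(len(japanese.encode("utf-8")))      # 24 bytes
    print(len(japanese.encode("utf-16-le")))  # 16 bytes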

About the reasons for the UTF-8 design: It has several advantages over other designs. They are:

Compatibility with US-ASCII characters

Reasonable compactness

Self-synchronisation: This means that if you are given part of a sequence of bytes which are characters in UTF-8 encoding, you can find out where each character starts. In some encodings, both xy and yx could be valid encodings of characters, so if you are given part of a sequence ... xyxyxyxyxyxy ... you cannot know what characters you have.

Sorting correctness: If you sort strings containing UTF-8 encoded characters by their byte values, then they are automatically sorted correctly according to their Unicode code point values (see the sketch after this list).

Compatibility with single-byte code: Most code that assumes single-byte values works correctly with UTF-8 encoded characters without changes.

Plus whatever reasons I forgot.
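
The sorting property in particular is easy to demonstrate. A small Python sketch (my addition; the sample strings are arbitrary):

    words = ["zebra", "Ωmega", "apple", "日本", "Ärger"]

    by_code_point = sorted(words)                                   # Python compares strings by code point
    by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))  # compare the raw UTF-8 bytes instead
    print(by_code_point == by_utf8_bytes)   # True: byte order and code point order agree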

gnasher729

Posted 2014-04-11T18:07:37.283

Reputation: 277