English characters need fewer bytes to represent them than characters in many other alphabets for a simple reason: the English alphabet is small. Its 26 letters, plus digits and common punctuation, all fit within the 128 values of the original ASCII scheme, so each character can be stored in a single byte. Characters from other alphabets fall outside that range, and a variable-width encoding such as UTF-8 needs two or more bytes for each of them. That is why a text file written in English is often smaller than a file holding the same number of characters from another alphabet.
While most of us have probably never stopped to think about it, not all alphabetical characters take the same number of bytes to represent. But why is that? Today’s SuperUser Q&A post has the answers to a curious reader’s question.
Today’s Question & Answer session comes to us courtesy of SuperUser—a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites.
Partial ASCII Chart screenshot courtesy of Wikipedia.
The Question
SuperUser reader khajvah wants to know why different alphabets take up different amounts of disk space when saved:
When I save a single English letter such as ‘a’ in a text file, the file takes up less space than when I save a single letter from another alphabet, such as the Armenian ‘ա’. What is the difference between alphabets on a computer? Why does English take up less space when saved?

Letters are letters, right? Maybe not! What is the answer to this alphabetical mystery?
The Answer
SuperUser contributors Doktoro Reichard and ernie have the answer for us. First up, Doktoro Reichard:
The English alphabet uses part of the Latin alphabet (for instance, there are few accented words in English). There are 26 individual letters in that alphabet, not counting case. Any scheme that sets out to encode the English alphabet would also have to include the individual numbers and punctuation marks.
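To put rough numbers on that (a quick Python sketch of my own, not part of the original answer), the standard library can count those characters for us:

import string

print(len(string.ascii_letters))   # 52: the 26 letters in lower and upper case
print(len(string.digits))          # 10: the digits 0-9
print(len(string.punctuation))     # 32: common punctuation marks
# 94 printable characters in all, which fits comfortably within a 7-bit code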
The 1960s was also a time when computers did not have the amount of memory or disk space that we have now. ASCII was developed to be a standard representation of a functional alphabet across all American computers. At the time, the decision to make every ASCII character 8 bits (1 byte) long was made due to technical details of the day (the Wikipedia article mentions the fact that perforated tape held 8 bits in a position at a time). In fact, the original ASCII scheme can be transmitted using 7 bits, and the eighth could be used for parity checks. Later developments expanded the original ASCII scheme to include several accented, mathematical, and terminal characters.
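As a small illustration (my own Python sketch, not from the original answer), every character in the original ASCII set has a code point below 128, so it fits in 7 bits and leaves the eighth bit of a byte free:

for ch in "Az0!~":
    # ord() gives the ASCII code point; format it as a 7-bit binary number
    print(ch, ord(ch), format(ord(ch), "07b"))
# e.g. 'A' -> 65 -> 1000001, '~' -> 126 -> 1111110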
With the recent increase of computer usage across the world, more and more people whose languages use other alphabets gained access to computers. That meant that, for each language, new encoding schemes had to be developed independently of the other schemes, and they would conflict if text was read on a terminal set up for a different language.
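The kind of conflict being described is easy to reproduce. In this illustrative Python sketch (not part of the original answer), the single byte 0xE4 means one thing under a Western European code page and something entirely different under a Cyrillic one:

raw = bytes([0xE4])
print(raw.decode("latin-1"))   # 'ä'  under the Western European Latin-1 scheme
print(raw.decode("cp1251"))    # 'д'  under the Cyrillic Windows-1251 scheme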
Unicode came into being as a solution to these conflicting schemes, merging all possible meaningful characters into a single abstract character set.
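In that abstract set, every character gets exactly one code point, regardless of how it is later stored on disk. A quick Python check (my own illustration):

print(hex(ord("a")))   # 0x61  -> Unicode code point U+0061
print(hex(ord("ա")))   # 0x561 -> U+0561, Armenian small letter ayb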
UTF-8 is one way to encode the Unicode character set. It is a variable-width encoding (i.e. different characters can have different sizes), and it was designed for backwards compatibility with the former ASCII scheme. As such, characters in the ASCII set remain one byte in size whilst any other characters are two or more bytes in size. UTF-16 is another way to encode the Unicode character set. In contrast to UTF-8, it encodes each character as either one or two 16-bit code units.
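The difference in sizes is easy to see by encoding a few sample characters both ways (a small Python sketch, not part of the original answer):

for ch in ("a", "ա", "€", "😀"):
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")   # little-endian, without a byte-order mark
    print(ch, len(utf8), "bytes in UTF-8,", len(utf16), "bytes in UTF-16")
# 'a' -> 1 and 2, 'ա' -> 2 and 2, '€' -> 3 and 2, '😀' -> 4 and 4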
As stated in other comments, the ‘a’ character occupies a single byte while ‘ա’ occupies two bytes, denoting a UTF-8 encoding. The extra byte in the original question was due to the existence of a newline character at the end.
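The situation described in the question can be reproduced with a few lines of Python (an illustrative sketch; the file name is my own):

import os

with open("test.txt", "w", encoding="utf-8", newline="\n") as f:
    f.write("a\n")                    # one ASCII letter plus a trailing newline

print(os.path.getsize("test.txt"))    # 2 bytes: 1 for 'a' and 1 for '\n'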
Followed by the answer from ernie:

A single byte is 8 bits and can therefore represent up to 256 (2^8) different values. For languages that require more possibilities than this, a simple one-to-one mapping cannot be maintained, so more data is needed to store each character.
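The arithmetic behind that is short (a Python sketch of my own, not part of the original answer):

print(2 ** 7)    # 128: what 7 bits can hold
print(2 ** 8)    # 256: the most a single 8-bit byte can distinguish
print(ord("ա"))  # 1377: the code point of 'ա', already far beyond one byte's range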
Note that generally, most encodings use the first 7 bits (128 values) for ASCII characters. That leaves the 8th bit, or 128 more values, for additional characters. Add in accented characters, Asian languages, Cyrillic, and so on, and you can easily see why 1 byte is not sufficient for holding all characters.

Have something to add to the explanation? Sound off in the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.