Character Encoding Issues: Why “ö” sometimes displays as “Ã¶”

Character encoding issues are surprisingly common, especially in languages that use mostly Latin characters plus a few special ones. This post explains why those strange “Ã¶” symbols appear and what actually goes wrong behind the scenes.

A few weeks ago I got an email which read (in German):

“Unser Versandpartner meldet sich telefonisch oder per Mail zur AnkÃ¼ndigung der Lieferung. Aufgrund des groÃŸen Bestellaufkommens kÃ¶nnen wir dies aktuell jedoch nicht garantieren.”

(Roughly: “Our shipping partner will contact you by phone or email to announce the delivery. Due to the high volume of orders, however, we currently cannot guarantee this.”)

This is a very common encoding phenomenon. But why does it happen? To understand the problem, we first need to talk about what encoding means.

How computers represent text: ASCII

Computers generally only work with numbers, so there needs to be some way to interpret a number as a character. Once you know how each character is represented, you can form any text as a sequence of numbers, which is what “strings” are made of. This mapping between numbers and characters is known as a character encoding.
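
As a quick illustration (Python is used here purely as an example language), the built-in functions ord and chr expose exactly this mapping between characters and numbers:

# A character is just a number under the hood.
print(ord("A"))                  # 65, the number assigned to "A"
print(chr(65))                   # A, and back again
print([ord(c) for c in "Hi"])    # [72, 105], a string as a sequence of numbers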

One of the simplest encodings is known as ASCII and assigns a unique number to each character. Although ASCII characters are stored in a byte, which can represent 256 possible values, ASCII only defines 128 of them, as originally one of the bits was used for error checking. Later, extended ASCII variants such as ISO 8859-1 made use of all 256 possible values. As there are only a few of them, the ASCII characters can easily be shown in a table:

No. Char    No. Char    No. Char    No. Char No. Char No. Char No. Char No. Char
0   Ctrl-@  16  Ctrl-P  32  (space) 48  0    64  @    80  P    96  `    112 p
1   Ctrl-A  17  Ctrl-Q  33  !       49  1    65  A    81  Q    97  a    113 q
2   Ctrl-B  18  Ctrl-R  34  "       50  2    66  B    82  R    98  b    114 r
3   Ctrl-C  19  Ctrl-S  35  #       51  3    67  C    83  S    99  c    115 s
4   Ctrl-D  20  Ctrl-T  36  $       52  4    68  D    84  T    100 d    116 t
5   Ctrl-E  21  Ctrl-U  37  %       53  5    69  E    85  U    101 e    117 u
6   Ctrl-F  22  Ctrl-V  38  &       54  6    70  F    86  V    102 f    118 v
7   Ctrl-G  23  Ctrl-W  39  '       55  7    71  G    87  W    103 g    119 w
8   Ctrl-H  24  Ctrl-X  40  (       56  8    72  H    88  X    104 h    120 x
9   Ctrl-I  25  Ctrl-Y  41  )       57  9    73  I    89  Y    105 i    121 y
10  Ctrl-J  26  Ctrl-Z  42  *       58  :    74  J    90  Z    106 j    122 z
11  Ctrl-K  27  Ctrl-[  43  +       59  ;    75  K    91  [    107 k    123 {
12  Ctrl-L  28  Ctrl-\  44  ,       60  <    76  L    92  \    108 l    124 |
13  Ctrl-M  29  Ctrl-]  45  -       61  =    77  M    93  ]    109 m    125 }
14  Ctrl-N  30  Ctrl-^  46  .       62  >    78  N    94  ^    110 n    126 ~
15  Ctrl-O  31  Ctrl-_  47  /       63  ?    79  O    95  _    111 o    127 DEL

This worked well for basic English text; however, there is one obvious flaw: you can only represent strings made up of these 128 characters. If you need a special character, like “ö”, it cannot be represented. Although there were local variants of extended ASCII, there was no uniform encoding everyone agreed on. This is where our next encoding comes into play, as it tries to fix this problem.
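
To make the limitation concrete, here is a small Python sketch: plain ASCII refuses to encode “ö”, while an extended-ASCII encoding such as ISO 8859-1 (Latin-1) fits it into a single byte (0xF6):

text = "schön"

try:
    text.encode("ascii")              # ASCII only knows 128 characters
except UnicodeEncodeError as error:
    print(error)                      # 'ascii' codec can't encode character '\xf6' ...

print(text.encode("latin-1"))         # b'sch\xf6n', ISO 8859-1 stores "ö" as one byte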

The UTF-8 encoding

UTF-8 is built as a superset of ASCII, which makes it backwards compatible with ASCII. In order to store more characters, it uses an approach called “variable-width encoding”: if a byte sequence starts with a special “leading byte”, it is interpreted as a 2-, 3- or 4-byte character.
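
As an illustration in Python, the two UTF-8 bytes of “ö” show these patterns: the leading byte starts with the bits 110 (announcing a 2-byte sequence) and the continuation byte starts with 10:

encoded = "ö".encode("utf-8")
print(encoded)                        # b'\xc3\xb6', two bytes instead of one
for byte in encoded:
    print(f"{byte:08b}")              # 11000011 (leading byte), 10110110 (continuation byte)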

This allows for more than enough characters to be stored while still being storage efficient for plain ASCII text. If you are working with a language that uses a lot of special characters, however, storage requirements increase, because most of your characters need a leading byte plus continuation bytes.

To address this, UTF-16 was invented to keep storage relatively efficient for such languages. It uses 16 bits (2 bytes) to store a single character, while still using variable-width encoding for any characters that cannot be represented in 16 bits.
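
A small Python comparison of the two encodings (the “utf-16-le” codec is used so that Python does not prepend a byte-order mark, which would skew the byte counts):

for character in ["a", "ö", "€", "😀"]:
    utf8 = character.encode("utf-8")
    utf16 = character.encode("utf-16-le")     # without the byte-order mark
    print(character, len(utf8), "byte(s) in UTF-8,", len(utf16), "byte(s) in UTF-16")

# a 1 byte(s) in UTF-8, 2 byte(s) in UTF-16
# ö 2 byte(s) in UTF-8, 2 byte(s) in UTF-16
# € 3 byte(s) in UTF-8, 2 byte(s) in UTF-16
# 😀 4 byte(s) in UTF-8, 4 byte(s) in UTF-16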

Choosing the wrong character encoding

As long as everyone agrees on one encoding, you can easily share your message by sending the bytes to your recipient. Problems start to arise when you send your message in a different encoding than your recipient expects.
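
Here is what the happy path looks like as a Python sketch: as long as both sides use the same encoding, the round trip is lossless:

message = "Ankündigung"
sent_bytes = message.encode("utf-8")      # the sender turns text into bytes
received = sent_bytes.decode("utf-8")     # the recipient uses the same encoding
print(received == message)                # True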

For example: someone sends me a message in UTF-8, representing “ö” as a 2-byte character. However, I interpret the bytes of the message as some form of extended ASCII, so each of the two bytes is shown as its own character, namely “Ã” and “¶”.
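
This exact mix-up is easy to reproduce in Python: encode as UTF-8, decode as ISO 8859-1, and every “ö” turns into “Ã¶”:

utf8_bytes = "schön".encode("utf-8")      # b'sch\xc3\xb6n'
print(utf8_bytes.decode("latin-1"))       # schÃ¶n, each UTF-8 byte is shown as its own character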

The same problem can also happen the other way around:

For example: someone sends me a message in extended ASCII, representing “ö” as a single byte. However, I now interpret that byte as UTF-8, and the original character ends up looking like this: “�”.
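
The reverse direction in Python: the single Latin-1 byte is not a valid UTF-8 sequence, so a strict decoder raises an error and a lenient one substitutes the replacement character “�”:

latin1_bytes = "schön".encode("latin-1")              # b'sch\xf6n'

try:
    latin1_bytes.decode("utf-8")                      # strict decoding fails
except UnicodeDecodeError as error:
    print(error)                                      # invalid start byte

print(latin1_bytes.decode("utf-8", errors="replace")) # sch�n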

How it can be prevented

Today, most services agree on UTF-8. Furthermore, to avoid misunderstandings between automated systems, almost all HTTP messages carry a special header that designates the content type and character set:

Content-Type: text/html; charset=UTF-8

To be extra safe, the HTML itself can also contain a meta tag that specifies the character set:

<meta charset="UTF-8">

It’s still ironic that in order to find out the character set, one first has to read a special header string, which itself is encoded in a character set.
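
As a sketch of how a client can honour that header, here is a small Python example using only the standard library; the header value and body bytes are made-up stand-ins for a real HTTP response:

from email.message import Message

header_value = "text/html; charset=UTF-8"        # hypothetical header sent by a server
body_bytes = "Ankündigung".encode("utf-8")       # hypothetical raw bytes of the response body

msg = Message()
msg["Content-Type"] = header_value
charset = msg.get_content_charset() or "utf-8"   # fall back to UTF-8 if no charset is declared

print(body_bytes.decode(charset))                # Ankündigung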

Conclusion

In German, as there are only a few special characters, the garbled text usually remains readable, so hardly anyone bothers to report character encoding issues.

Now, coming back to the original message I got: I’m very confident my email client uses UTF-8. What probably happened instead is that some backend process of the automated email system assumed the wrong character encoding at some point while processing the message.

If you’ve been on the internet, you’ve seen these character encoding errors; they are very common. If you happen to come across one, try to figure out who messed up the encoding and how!