
Character Encoding Explained: From ASCII to UTF-8

Why your text sometimes turns into question marks and weird symbols. A practical guide to character encoding.

RunToolz Team · January 16, 2026 · 4 min read

You open a file and see ü where a ü should be. Or a database returns ???? where someone's name should be. Or an email arrives with =?UTF-8?B? scattered through the subject line.

Welcome to the wonderful world of character encoding problems.

The Short History

Computers store numbers, not letters. So someone had to decide which number means which letter. In the 1960s, ASCII assigned numbers 0-127 to English letters, digits, and basic symbols. The letter "A" is 65. A space is 32. Simple.

But ASCII only covers 128 characters. That works for English. It doesn't work for German umlauts, Japanese kanji, Arabic script, or the thousands of other characters humans actually use.
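The ASCII mapping is easy to poke at from any JavaScript console -- a quick sketch:

```javascript
// ASCII is just a number-to-letter table, and it survives inside
// Unicode: the first 128 code points are the ASCII assignments.
console.log('A'.charCodeAt(0));       // 65
console.log(' '.charCodeAt(0));       // 32
console.log(String.fromCharCode(65)); // "A"
```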

The Chaos Before Unicode

Different regions invented their own encodings. Western Europe got ISO-8859-1. Japan got Shift-JIS. Russia got KOI8-R. China got GB2312. Each worked fine within its own ecosystem. The moment you mixed them, everything broke.

A file saved in one encoding and opened with another gives you mojibake -- that garbled mess of wrong characters you've probably seen. café becomes café when a UTF-8 file is read as ISO-8859-1.

Unicode Fixed the Mapping Problem

Unicode gave every character a unique number (called a "code point"). Latin A is U+0041. The snowman is U+2603. Every emoji, every script, every mathematical symbol gets its own code point. Over 150,000 characters and counting.

But Unicode is just the mapping. It doesn't say how to store those numbers as bytes. That's where encoding comes in.
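In JavaScript, codePointAt shows the mapping side of the picture -- the number a character has, before any byte storage is involved:

```javascript
// Every character has a Unicode code point, independent of how
// it is later stored as bytes.
console.log('A'.codePointAt(0).toString(16));  // "41"    -> U+0041
console.log('☃'.codePointAt(0).toString(16));  // "2603"  -> U+2603
console.log('😀'.codePointAt(0).toString(16)); // "1f600" -> U+1F600
```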

Ready to try it yourself? Count Characters and Bytes

UTF-8: The Encoding That Won

UTF-8 is the way most of the internet stores Unicode text. Its key trick: it uses a variable number of bytes per character.

  • ASCII characters (English letters, digits): 1 byte each
  • European accented characters: 2 bytes each
  • Asian characters (CJK): 3 bytes each
  • Emojis and rare symbols: 4 bytes each

This means pure ASCII text in UTF-8 is byte-for-byte identical to ASCII. Old systems don't break. But you can still represent any character in the world.
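Node's Buffer.byteLength makes those tiers visible -- one sample character from each:

```javascript
// Buffer.byteLength reports how many bytes a string needs in a
// given encoding -- here, UTF-8's 1-4 byte tiers.
console.log(Buffer.byteLength('A', 'utf8'));  // 1 (ASCII)
console.log(Buffer.byteLength('é', 'utf8'));  // 2 (accented Latin)
console.log(Buffer.byteLength('漢', 'utf8')); // 3 (CJK)
console.log(Buffer.byteLength('😀', 'utf8')); // 4 (emoji)
```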

As of today, over 98% of websites use UTF-8. The encoding war is over, and UTF-8 won.

UTF-8 vs UTF-16 vs UTF-32

UTF-8: Variable width (1-4 bytes). Efficient for English-heavy text. Web standard.

UTF-16: Variable width (2 or 4 bytes). Used internally by JavaScript, Java, and Windows. Every character is at least 2 bytes, so it's less efficient for ASCII text.

UTF-32: Fixed width (4 bytes per character). Simple but wasteful. Rarely used for storage or transmission.

JavaScript's string.length counts UTF-16 code units, not characters. That's why "😀".length returns 2, not 1.
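A quick demonstration of that code-unit counting, using the same emoji:

```javascript
// .length counts UTF-16 code units; 😀 (U+1F600) is above U+FFFF,
// so it needs a surrogate pair and counts as 2.
console.log('😀'.length);                     // 2 code units
console.log([...'😀'].length);                // 1 code point
console.log('😀'.charCodeAt(0).toString(16)); // "d83d" (high surrogate)
```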

When Encoding Goes Wrong

Reading a file with the wrong encoding. The bytes are fine, but they're being interpreted wrong. Solution: specify the correct encoding when opening.

Database charset mismatch. Your app sends UTF-8, but the database column is set to latin1. Characters outside ASCII get mangled. Solution: set your database to utf8mb4 (not just utf8 in MySQL, which only handles 3-byte characters).

HTTP missing charset header. If the server doesn't send Content-Type: text/html; charset=utf-8, browsers have to guess. They sometimes guess wrong.

Ready to try it yourself? URL Encode/Decode Text

Practical Tips

Always use UTF-8. Unless you have a very specific reason not to, UTF-8 is the right choice for everything.

Declare your encoding explicitly. In HTML: <meta charset="utf-8">. In HTTP: Content-Type: text/html; charset=utf-8. Don't make systems guess.

Be careful with string length. In JavaScript, character counting gets tricky with emojis and combining characters. Array.from(str).length counts code points (which fixes simple emoji), and the Intl.Segmenter API counts grapheme clusters -- what users actually perceive as characters.

Watch for BOM. The Byte Order Mark (U+FEFF) sometimes appears at the start of UTF-8 files. It's invisible but can break parsers. Some editors add it silently.


Character encoding isn't exciting, but understanding it saves hours of debugging. Use UTF-8 everywhere, declare it explicitly, and when you see garbled text, you'll know exactly where to look.