Every intro to programming teaches the same lie: strings are arrays of characters.
This made sense when characters fit in bytes and English was the only language that mattered. But that world is long gone, and we’re still pretending the lie is true.
There’s no such thing as “a character”.
Ask Rust for the length of 👨‍👩‍👧‍👦 and it says 25. Ask C# for the same string and it says 11. Ask a human and they say 1. They are all correct. They’re also all wrong. This isn’t a bug. It’s what happens when you try to put “characters” in an array.
```rust
fn main() {
    let s = "👨‍👩‍👧‍👦";
    println!("{}", s.len()); // 25
}
```

```csharp
public static void Main()
{
    string s = "👨‍👩‍👧‍👦";
    Console.WriteLine($"{s.Length}"); // 11
}
```

The lie we all learned
Early on, someone taught you that strings are arrays of characters. That made perfect sense in the ASCII world:
```
'H' -> 0x48 -> 01001000
```

One character, one byte, one slot in the array. Length, indexing, slicing: all trivial. This model worked so well it became gospel. But it only worked because of an assumption nobody said out loud: a character is whatever fits in a byte.
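Here is a minimal Rust sketch of that old model (the string and counts are my own example): as long as the input is pure ASCII, counting bytes and counting what a human would call characters happen to agree.

```rust
fn main() {
    let s = "Hi"; // pure ASCII
    for b in s.bytes() {
        println!("{:#04x}", b); // 0x48, 0x69 -- one byte per character
    }
    assert_eq!(s.len(), 2); // byte count and "character" count coincide
}
```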
Unicode broke the assumption, not the API
When text needed to represent all human languages, ASCII collapsed. Unicode came along and did something reasonable: it assigned every symbol a unique number called a code point.
- H → U+0048
- 👨 (MAN) → U+1F468
- 👩 (WOMAN) → U+1F469
- ☕ (HOT BEVERAGE) → U+2615
Code points are great. They answer “what is this symbol?” But they don’t answer “how many bytes does it take?” or “how do I store this?” That’s where encodings come in.
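In Rust, a char is a Unicode scalar value, so you can read those numbers straight off a string. A small sketch using the symbols above:

```rust
fn main() {
    for c in "H👨☕".chars() {
        // Print each code point in the familiar U+XXXX notation.
        println!("U+{:04X}", c as u32);
    }
    // U+0048
    // U+1F468
    // U+2615
}
```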
Encodings: everyone’s measuring different things
Computers store bytes. Encodings define how code points turn into bytes. And this is where the “character array” model dies completely.
UTF-8 uses one to four bytes per code point. ASCII characters take 1 byte; most emoji take 4. Rust’s strings are UTF-8, so s.len() returns bytes. The family emoji? 25 bytes.
UTF-16 uses 16-bit code units. Code points in the Basic Multilingual Plane (up to U+FFFF) take one unit. Everything else takes two units, called a surrogate pair. C#, Java, and JavaScript use UTF-16. That same emoji? 11 code units.
Neither of them is counting “characters”. They’re counting storage units.
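You can watch the two measurements diverge in one program. This sketch spells out the family emoji with escapes (MAN, WOMAN, GIRL, BOY joined by U+200D) so nothing is hidden by font rendering:

```rust
fn main() {
    // MAN + ZWJ + WOMAN + ZWJ + GIRL + ZWJ + BOY
    let s = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
    println!("{}", s.len());                  // 25 UTF-8 bytes
    println!("{}", s.encode_utf16().count()); // 11 UTF-16 code units
}
```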
But wait, it gets worse
Even code points aren’t what humans see. The coffee emoji ☕ is actually two code points:
- U+2615 (HOT BEVERAGE)
- U+FE0F (VARIATION SELECTOR-16)
That family emoji 👨👩👧👦? Seven code points joined together:
- 👨(MAN) + ZERO WIDTH JOINER + 👩(WOMAN) + ZERO WIDTH JOINER + 👧(GIRL) + ZERO WIDTH JOINER + 👦(BOY)
What humans perceive as a single character is called a grapheme cluster. Most programming languages don’t count these by default because doing so requires the Unicode text segmentation rules, sometimes language-specific context, and is genuinely expensive.
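Rust’s standard library stops at code points; grapheme clusters need the UAX #29 segmentation rules, which in Rust usually come from the third-party unicode-segmentation crate. A sketch, assuming that crate is in Cargo.toml:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // The same family emoji, built from its seven code points.
    let s = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
    println!("{}", s.chars().count());         // 7 code points
    println!("{}", s.graphemes(true).count()); // 1 grapheme cluster
}
```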
So what is the length of 👨👩👧👦?
| What you count | Length |
|---|---|
| UTF-8 bytes | 25 |
| UTF-16 code units | 11 |
| Unicode code points | 7 |
| Grapheme clusters | 1 |
All valid. All answering different questions. None of them “characters.”
The real problem
Strings were never character arrays. Even in ASCII, they were byte arrays that just happened to align with human intuition about “characters.” When we moved to Unicode, we kept the API but lost the alignment. Now every string operation is secretly making you choose:
- Are you measuring storage? (bytes/code units)
- Are you measuring Unicode’s answer? (code points)
- Are you measuring human perception? (grapheme clusters)
The API doesn’t change. The question does.
UTF-8 won not because it solved this mess, but because it was the least-worst compromise. It was backward compatible with ASCII, space-efficient for English, and capable of representing every symbol humanity has ever created.
But it’s still just bytes pretending to be characters.
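That backward compatibility is easy to demonstrate: a pure-ASCII string is the same run of bytes whether you label it ASCII or UTF-8.

```rust
fn main() {
    let ascii: &[u8] = b"Hello";          // ASCII bytes
    let utf8: &[u8] = "Hello".as_bytes(); // the same text as UTF-8
    assert_eq!(ascii, utf8); // identical: 0x48 0x65 0x6C 0x6C 0x6F
}
```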
Next time you call .length() on a string, remember: you’re not asking “how many characters are in this string?” You’re asking “how many [bytes|code units|code points] are in this encoding?”
The API won’t tell you which one. You’re expected to know.
That’s not a bug. That’s what happens when you try to preserve a mental model that was only true for 128 symbols.