
It's useful if you want array-like semantics (e.g. O(1) lookup) on Unicode text strings, because you have a fixed size for every code point, unlike UTF-8. Python, for example, uses it internally.
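A quick sketch of what that buys you: Python 3 strings are sequences of code points (stored internally at a fixed width per string, per PEP 393), so indexing is constant-time, while the same text encoded as UTF-8 has variable-width characters:

```python
s = "héllo\N{SNOWMAN}"
assert s[5] == "\N{SNOWMAN}"  # O(1) code point lookup
assert len(s) == 6            # length in code points

# The same text in UTF-8 is variable-width (1+2+1+1+1+3 bytes):
b = s.encode("utf-8")
assert len(b) == 9
```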


And it compresses just as well as UTF-8 for transfer/storage purposes.


Except code point indexing simply isn’t useful.

In the words of the article: “The choice of UTF-32 (or Python 3-style code point sequences) arises from wanting the wrong thing.”


I think that's the issue here. People disagree on how useful or not useful it is. It's maybe not ideal, but I don't think it's anywhere near so bad as to be entirely not useful. Strings-are-sequences-of-bytes is worse in my opinion. Python literally used to have that. It was worse.
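To illustrate why strings-are-sequences-of-bytes was worse: with Python 3's `bytes` type (which is roughly what Python 2 strings were), indexing gives you integers rather than characters, and a naive slice can cut a multi-byte character in half:

```python
b = "héllo".encode("utf-8")
assert b[1] == 0xC3  # first byte of 'é', not 'é' itself

# Slicing at an arbitrary byte offset splits 'é' across the boundary:
try:
    b[:2].decode("utf-8")
except UnicodeDecodeError:
    pass  # the slice is not valid UTF-8 on its own
```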


The problem with what Python used to have is that the encoding wasn’t fixed.

I’ll agree with you that strings-are-sequences-of-bytes is bad. That’s painful compiler-flag, codepage, &c. territory.

But what’s not bad is strings-are-sequences-of-code-units. That’s what Rust has, for example. Rust strings aren’t sequences of bytes, but of UTF-8 code units, and the two are semantically very different.



