October 8th, 2024

A popular but wrong way to convert a string to uppercase or lowercase

The article highlights common mistakes in C++ string case conversions, emphasizing the limitations of `std::tolower` and `std::towlower`, and recommends using `LCMapStringEx` or ICU functions for accuracy.

The article discusses common mistakes in converting strings to uppercase or lowercase in C++. A prevalent method passes `std::tolower` to `std::transform`, which is wrong for several reasons. First, `std::tolower` is not an addressable function, so it has to be wrapped in a lambda. Second, it is designed for narrow characters, so it misbehaves when applied to wide characters (`wchar_t`). Even when corrected to use `std::towlower`, the method converts one `wchar_t` at a time. In UTF-16, characters outside the Basic Multilingual Plane are represented by pairs of `wchar_t` values, so a per-unit conversion cannot map them correctly. The method also cannot account for context-sensitive case mappings, or for uppercase and lowercase forms that differ in length.

The article suggests using `LCMapStringEx` for case mapping, or the ICU library functions `u_strToUpper` and `u_strToLower`, as more reliable alternatives. These operate on whole strings and take character encoding and locale-specific rules into account.
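A minimal sketch of the lambda wrapper described above, assuming narrow strings whose only cased characters are ASCII and a program running in the default "C" locale. The name `lower_narrow` is hypothetical and does not come from the article. The parameter is `unsigned char` because passing a negative `char` value to `std::tolower` is undefined behavior.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: lowercases a narrow string one char at a time.
// Only suitable when every cased character is ASCII.
std::string lower_narrow(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) {
                       // Wrap std::tolower in a lambda instead of taking its address.
                       return static_cast<char>(std::tolower(c));
                   });
    return s;
}
```

This fixes only the addressability and signedness problems. It still maps one element to exactly one element, so it inherits every other limitation the article lists.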

- Common string case conversion methods in C++ are often incorrect.

- `std::tolower` and `std::towlower` have limitations with wide characters and context-sensitive mappings.

- UTF-16 encoding can complicate character transformations due to paired representations.

- Case mappings may change string lengths, leading to potential errors.

- Recommended solutions include using `LCMapStringEx` or ICU library functions for accurate conversions.
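To illustrate the paired representation mentioned in the list above, the following sketch decodes UTF-16 code units into code points. `decode_utf16` is a hypothetical helper, not something from the article, and it passes unpaired surrogates through unchanged, which is a simplification.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: turns UTF-16 code units into Unicode code points.
std::vector<char32_t> decode_utf16(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool is_high = u >= 0xD800 && u <= 0xDBFF;
        const bool next_is_low =
            i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (is_high && next_is_low) {
            // Combine the surrogate pair into one code point above U+FFFF.
            const char32_t hi = static_cast<char32_t>(u) - 0xD800;
            const char32_t lo = static_cast<char32_t>(s[i + 1]) - 0xDC00;
            out.push_back(0x10000 + (hi << 10) + lo);
            ++i;  // skip the low surrogate that was just consumed
        } else {
            // BMP character, or an unpaired surrogate passed through as-is.
            out.push_back(u);
        }
    }
    return out;
}
```

For example, U+10400 (DESERET CAPITAL LETTER LONG I) occupies the two code units 0xD801 and 0xDC00, so a conversion that looks at one code unit at a time never sees the character it would need to map.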

The Byte Order Fiasco

Handling endianness in C/C++ is error-prone. The article emphasizes deserializing integers correctly to prevent undefined behavior, and adhering to the C standard to avoid unexpected compiler optimizations. Its code examples demonstrate portable deserialization using masking and shifting. Understanding these concepts matters for robust C code, even though APIs for byte swapping are available.

Some sanity for C and C++ development on Windows

C and C++ development on Windows historically struggled due to limited native standard library support, causing compatibility issues, especially with non-ASCII characters. Workarounds like libwinsane address these limitations, but challenges persist despite recent improvements in Unicode support.

A Type for Overload Set

The article explores the challenges of C++ overload sets, discussing problems that arise with standard functions because they are overloaded. It introduces proposal P3312, which proposes a unique type to address these limitations.

Strlcpy and how CPUs can defy common sense

The article compares the performance of `strlcpy` in OpenBSD and glibc, revealing glibc's faster execution despite double traversal, emphasizing instruction-level parallelism and advocating for sized strings for efficiency.

tolower() small string performance

Tony Finch's blog post analyzes the performance of the `tolower()` function for small strings, revealing that scalar code is faster for strings under 5 bytes, while AVX-512 excels for longer strings.

25 comments

By @SleepyMyroslav - 7 months

In gamedev there is a simple rule: don't try to do any of that.

If it is text the game needs to show to the user, then every version of the text that is needed is translated text. The programmer will never know if context or locale will need word order changes or anything complicated. Just trust the translation team.

If the text is coming from the user, then change the design until it's not needed to 'convert'. There are major issues just in showing the user back what he entered, because the font for editing and for displayed text could be different. Not even mentioning RTL and other issues.

Once people learn about localization, questions like why a programming language does not do this 'simple text operation' are just a newcomer detector. :)

By @blenderob - 7 months

It is issues like this due to which I gave up on C++. There are so many ways to do something and every way is freaking wrong!

An acceptable solution is given at the end of the article:

> If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.

Makes you wonder why this isn't part of the C++ standard library itself. Every revision of the C++ standard brings with it more syntax and more complexity in the language. But as a user of C++ I don't need more syntax and more complexity in the language. I do need more standard library functions that solve these ordinary real-world programming problems.

By @appointment - 7 months

The key takeaway here is that you can't correctly process a string if you don't know what language it's in. That includes variants of the same language with different rules, e.g. en-US and en-GB or es-MX and es-ES.

If you are handling multilingual text the locale is mandatory metadata.

By @vardump - 7 months

As always, Raymond is right. (And as usual, I could guess it's him before even clicking the link.)

That said, 99% of the time when doing an upper- or lowercase operation you're interested just in the 7-bit ASCII range of characters.

For the remaining 1%, there's ICU library. Just like Raymond Chen mentioned.
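A minimal sketch of the ASCII-only case described in this comment, assuming the input is ASCII or an ASCII-compatible encoding such as UTF-8. `ascii_to_lower` is a hypothetical name. Unlike `std::tolower`, it does not consult the locale, and it leaves every byte outside `A` to `Z` untouched.

```cpp
#include <string>

// Hypothetical helper: lowercases only the 26 ASCII capital letters.
// Bytes belonging to multi-byte UTF-8 sequences are never modified.
std::string ascii_to_lower(std::string s) {
    for (char& c : s) {
        if (c >= 'A' && c <= 'Z') {
            c = static_cast<char>(c - 'A' + 'a');
        }
    }
    return s;
}
```

This is appropriate for protocol keywords, identifiers and similar controlled text. It is not a substitute for ICU on text written by or shown to users.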

By @Animats - 7 months

> From the article: "I find it quaint that Unicode character names are ALL IN CAPITAL LETTERS, in case you need to put them in a Baudot telegram or something."

I had to do that. When we had our steampunk telegraph office at steampunk conventions [1], people could text in a message via SMS, it would be printed on a Model 14 or 15 Teletype, put in an envelope, and hand-delivered. People would use emoji in messages, and the device could only print Baudot, or International Telegraphic Alphabet #2, which is upper case only with some symbols.

Emoji translation would cause the machine to hammer out

    (RED-HEART)

or whatever emoji description was needed.

Used the emoji list at [2], an older version.

[1] https://vimeo.com/124065314

[2] http://unicode.org/emoji/charts-beta/full-emoji-list.html

By @alkonaut - 7 months

Handle text in two ways: either it's controlled by you and you can do simple, efficient, and naive processing, or it's not (it's translated resources, or user input) and you can't.

For the former case, you don't need any complex logic. A very typical example would be: I'm serializing a field or constructing a URL, so I want the variable name "Someproperty" as a lowercase string. The lowercase transform is completely naive. I know exactly what the range of possible characters is, and they aren't going to be Turkish or emoji, not least because I have asserted they won't be. And THIS is what the regular programming functions for upper/lower case are for. They are important, and they are most often correct. Because for all the other cases (i18n, user input, ...) you probably don't want to do toUpper/toLower at all to begin with!

For example, suppose you present a message to the user from resources, so your code is translate("USER_DIALOG_QUESTION_ABOUT_FISH"), which you look up knowing it will be in sentence case, and you want to present it as uppercase. What will you do? Here you likely can't, and shouldn't, do toUpper(translate(resourceKey)). Just use two resources if you want to correctly transform text. The toUpper function isn't made for this.

Trying to use a complex i18n-ready toUpper/toLower only helps part of the way. It still might not understand whether two S are contracted or whether something is a proper noun and must stay capitalized. So it adds complexity and still isn't correct. Just use two resources!

By @PhilipRoman - 7 months

Thought this was going to be about and-not-ing bytes with 0x20. Wrong for most inputs but sure as hell faster than anything else.
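A sketch of the bit trick this comment refers to, shown mainly to make its failure mode concrete. `upper_by_clearing_bit5` is a hypothetical name. ASCII uppercase and lowercase letters differ only in bit 5 (0x20), so clearing that bit uppercases letters, but it also rewrites every other byte that happens to have the bit set.

```cpp
#include <string>

// Hypothetical helper: clears bit 5 (0x20) of every byte.
// Correct for ASCII letters only; digits, spaces and punctuation get mangled.
std::string upper_by_clearing_bit5(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(static_cast<unsigned char>(c) & ~0x20u);
    }
    return s;
}
```

With this function `"abc"` becomes `"ABC"`, but `'{'` (0x7B) becomes `'['` (0x5B) and a space (0x20) becomes a NUL byte.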

By @cyxxon - 7 months

Small nitpick: the example "LATIN SMALL LETTER SHARP S (“ß” U+00DF) uppercases to the two-character sequence “SS”:³ Straße ⇒ STRASSE" is slightly wrong, it seems to me, as we now do actually have an uppercase version of that, so it should uppercase to "Latin Capital Letter Sharp S" (U+1E9E). The double-S thing is still widely used, though.
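A small sketch of why the "SS" mapping quoted above cannot be expressed as an element-by-element transform. It assumes the traditional ß to "SS" rule and handles only ASCII letters plus U+00DF. `upper_with_sharp_s` is a hypothetical name, and this is in no way a general uppercasing routine.

```cpp
#include <string>

// Hypothetical helper: uppercases ASCII letters and expands U+00DF to "SS".
// The output can be longer than the input, which std::transform cannot express.
std::u16string upper_with_sharp_s(const std::u16string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (char16_t c : in) {
        if (c == 0x00DF) {
            out += u"SS";  // one input code unit becomes two output code units
        } else if (c >= u'a' && c <= u'z') {
            out += static_cast<char16_t>(c - u'a' + u'A');
        } else {
            out += c;  // everything else is copied unchanged
        }
    }
    return out;
}
```

Here the six code units of "Straße" produce the seven code units of "STRASSE". Mapping to U+1E9E instead, as the comment suggests, would keep the length the same for this particular character.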

By @himinlomax - 7 months

> And in certain forms of the French language, capitalizing an accented character causes the accent to be dropped: à Paris ⇒ A PARIS.

That's incorrect, using diacritics on capital letters is always the preferred form, it's just that dropping them is acceptable as it was often done for technical reasons.

By @ChrisMarshallNY - 7 months

I generally just use the language-supported tolower/upper() (or similar) routines. I assume that they take things like UTF and alternative type systems into account.

I'm not sure about other languages, but Swift has pretty intense String support[0], and can go quite a long ways.

Someone actually wrote a whole book about just Swift Strings[1].

[0] https://docs.swift.org/swift-book/documentation/the-swift-pr...

[1] https://flight.school/books/strings/

By @serbuvlad - 7 months

The real insights here are that strings in C++ suck and UTF-16 is extremely unintuitive.

By @zzo38computer - 7 months

First, you should consider if you even need case folding; for many uses it will be unnecessary, anyways.

Furthermore, the proper way to do case folding will depend on such things as the character set, the language, the specific context of the text being converted (e.g. in some cases specific letters are required, such as abbreviations of the names of SI units), etc. And then, it is not necessarily only "uppercase" and "lowercase", anyways.

There might even be different ways to do it within the same language, possibly with disagreements about usage (e.g. the German Eszett did not have an official capital form until 2017, although apparently some type designers made one anyway, and it was in Unicode before then).

If the character set is Unicode, then there is not actually one correct way to do it, despite what the Unicode Conspiracy insists.

Also, for some uses a specific way of doing it is required (due to the way a file format or a protocol or whatever works), so in such a case, if the character set is something other than ASCII, you cannot just assume that it will always work in the same way.

You also cannot necessarily depend on the locale for such a thing, since it might depend on the data, as well.

These things can be bad enough, but Unicode just makes them worse. If a program requires a specific case folding, it may not work with the wrong version of Unicode, and that can be a security issue and/or cause other problems.

(Another problem, which applies even if you do not use case folding, is that some people think that all text is or should be Unicode and that one character set is suitable for everything. Actually, one character set cannot be suitable for everything, regardless of what character set it is. Even if it was (which it isn't), it wouldn't be Unicode.)

By @high_na_euv - 7 months

In cpp basic things are hard

By @flareback - 7 months

He gave 4 examples of how it's done incorrectly, but zero actual examples of doing it correctly.

By @HPsquared - 7 months

I thought this was going to be about adding or subtracting 32. Old school.

By @codr7 - 7 months

C++, where every line of code is a book waiting to be written.

By @guerrilla - 7 months

C is hard. It seems like C++ just made things way harder. I don't regret skipping it. Why not just go right to Java, C#, JS, Haskell, etc. and do what you need in C?

By @account42 - 7 months

A popular but wrong way to do Unicode

> wchar_t

By @PoignardAzur - 7 months

So I'm going to be that guy and say it:

Man, I'm happy we don't need to deal with this crap in Rust, and we can just use String::to_lowercase. Not having to worry about things makes coding fun.

By @the_gorilla - 7 months

Why are some functions addressable in C++ and others not? Seems like a pointless design oversight.

By @ahartmetz - 7 months

...and that is why you use QString if you are using the Qt framework. QString is a string class that actually does what you want when used in the obvious way. It probably helps that it was mostly created by people with "ASCII+" native languages. Or with customers that expect not exceedingly dumb behavior. The methods are called QString::toUpper() and QString::toLower() and take only the implicit "this" argument, unlike Win32 LCMapStringEx() which takes 5-8 arguments...