A popular but wrong way to convert a string to uppercase or lowercase
The article highlights common mistakes in C++ string case conversions, emphasizing the limitations of `std::tolower` and `std::towlower`, and recommends using `LCMapStringEx` or ICU functions for accuracy.
The article discusses common mistakes in converting strings to uppercase or lowercase in C++. A prevalent method uses `std::transform` with `std::tolower`, which is incorrect for several reasons. First, `std::tolower` is not an addressable function, so it has to be wrapped in a lambda rather than passed directly. Second, it is designed for narrow characters and can misbehave when applied to wide characters (`wchar_t`). Even when corrected to use `std::towlower`, the method converts one `wchar_t` at a time and so cannot account for context-sensitive case mappings. It also breaks on UTF-16, where characters outside the Basic Multilingual Plane are represented by pairs of `wchar_t` values that cannot be converted independently. Additionally, the uppercase and lowercase forms of a string may differ in length, which an in-place, element-by-element transformation cannot express. The article suggests using `LCMapStringEx` for case mapping, or the ICU library functions `u_strToUpper` and `u_strToLower`, as more reliable alternatives. These functions operate on whole strings and take locale-specific rules into account.
- Common string case conversion methods in C++ are often incorrect.
- `std::tolower` and `std::towlower` have limitations with wide characters and context-sensitive mappings.
- UTF-16 encoding can complicate character transformations due to paired representations.
- Case mappings may change string lengths, leading to potential errors.
- Recommended solutions include using `LCMapStringEx` or ICU library functions for accurate conversions.
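To make the UTF-16 point concrete, here is a minimal sketch in standard C++ (no ICU or Windows APIs). The helper names `is_high_surrogate`, `is_low_surrogate`, and `decode_utf16` are illustrative and do not come from the article. The sketch only decodes; it is meant to show why a per-code-unit `std::towlower` never sees a whole character outside the Basic Multilingual Plane.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if the UTF-16 code unit is a high (leading) surrogate.
constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if the UTF-16 code unit is a low (trailing) surrogate.
constexpr bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Decodes UTF-16 code units into Unicode code points.
// Assumption: unpaired surrogates are passed through unchanged;
// a production decoder might substitute U+FFFD instead.
std::vector<char32_t> decode_utf16(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (is_high_surrogate(u) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) {
            // Combine the pair into one code point above U+FFFF.
            const char32_t hi = static_cast<char32_t>(u) - 0xD800;
            const char32_t lo = static_cast<char32_t>(s[i + 1]) - 0xDC00;
            out.push_back(0x10000 + (hi << 10) + lo);
            ++i;  // skip the low surrogate we just consumed
        } else {
            out.push_back(u);
        }
    }
    return out;
}
```

For example, `decode_utf16(u"\U00010400")` yields the single code point U+10400 (DESERET CAPITAL LETTER LONG I) from a string whose `size()` is 2. Its lowercase form, U+10428, is also a surrogate pair, so no per-unit mapping can produce it.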
Related
The Byte Order Fiasco
Handling endianness in C/C++ programming poses challenges, emphasizing correct integer deserialization to prevent undefined behavior. Adherence to the C standard is crucial to avoid unexpected compiler optimizations. Code examples demonstrate proper deserialization techniques using masking and shifting for system compatibility. Mastery of these concepts is vital for robust C code, despite available APIs for byte swapping.
Some sanity for C and C++ development on Windows
C and C++ development on Windows historically struggled due to limited native standard library support, causing compatibility issues, especially with non-ASCII characters. Workarounds like libwinsane address these limitations, but challenges persist despite recent improvements in Unicode support.
A Type for Overload Set
The article explores C++ overload set challenges, discussing issues with standard functions encountering problems due to overloading. It introduces proposal P3312 for a unique type to address these limitations, emphasizing the need for a more efficient solution.
Strlcpy and how CPUs can defy common sense
The article compares the performance of `strlcpy` in OpenBSD and glibc, revealing glibc's faster execution despite double traversal, emphasizing instruction-level parallelism and advocating for sized strings for efficiency.
tolower() small string performance
Tony Finch's blog post analyzes the performance of the `tolower()` function for small strings, revealing that scalar code is faster for strings under 5 bytes, while AVX-512 excels for longer strings.
If it is text the game needs to show to the user, then every version of the text that is needed is a translated text. The programmer will never know whether context or locale will require word-order changes or anything more complicated. Just trust the translation team.
If the text is coming from the user, then change the design until there is no need to 'convert' it. There are major issues just showing users back what they entered, because the font for editing and the font for displayed text could be different. Not to mention RTL and other issues.
Once people learn about localization, questions like why a programming language does not do this 'simple text operation' are just a newcomer detector. :)
An acceptable solution is given at the end of the article:
> If you use the International Components for Unicode (ICU) library, you can use u_strToUpper and u_strToLower.
Makes you wonder why this isn't part of the C++ standard library itself. Every revision of the C++ standard brings with it more syntax and more complexity in the language. But as a user of C++ I don't need more syntax and more complexity in the language. What I do need is more standard library functions that solve these ordinary real-world programming problems.
If you are handling multilingual text the locale is mandatory metadata.
That said, 99% of the time when doing an upper- or lowercase operation you're interested in just the 7-bit ASCII range of characters.
For the remaining 1%, there's the ICU library. Just like Raymond Chen mentioned.
I had to do that. When we had our steampunk telegraph office at steampunk conventions [1], people could text in a message via SMS, it would be printed on a Model 14 or 15 Teletype, put in an envelope, and hand-delivered. People would use emoji in messages, and the device could only print Baudot, or International Telegraphic Alphabet #2, which is upper case only with some symbols.
Emoji translation would cause the machine to hammer out
(RED-HEART)
or whatever emoji description was needed. Used the emoji list at [2], an older version.
[1] https://vimeo.com/124065314
[2] http://unicode.org/emoji/charts-beta/full-emoji-list.html
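A translation step like the one described could be sketched roughly as follows. This is only a guess at the shape of such code, not the commenter's implementation: the function name `describe_for_teletype`, the two-entry table, and the `?` fallback are all assumptions made for illustration, and real ITA2 supports fewer symbols than the ASCII pass-through here allows.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical sketch: replaces emoji with a parenthesised description and
// upper-cases ASCII letters for an upper-case-only printer.
std::string describe_for_teletype(const std::u32string& text) {
    // Tiny illustrative table; a real one would be generated from the emoji list.
    static const std::unordered_map<char32_t, std::string> names = {
        {U'\u2764', "RED-HEART"},
        {U'\U0001F600', "GRINNING-FACE"},
    };
    std::string out;
    for (char32_t cp : text) {
        if (cp == 0xFE0F) continue;  // drop the emoji variation selector
        auto it = names.find(cp);
        if (it != names.end()) {
            out += '(' + it->second + ')';
        } else if (cp >= U'a' && cp <= U'z') {
            out += static_cast<char>(cp - U'a' + U'A');  // ASCII-only upper-casing
        } else if (cp < 0x80) {
            out += static_cast<char>(cp);  // simplification: pass other ASCII through
        } else {
            out += '?';  // assumed fallback for anything not in the table
        }
    }
    return out;
}
```

With this table, the input `U"hi \u2764"` comes out as `HI (RED-HEART)`.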
For the former case, you don't need any complex logic. A very typical example would be: I'm serializing a field or constructing a URL, so I want the variable name "Someproperty" as a lowercase string. The lowercase transform is completely naive. I know exactly what the range of possible characters is, and they aren't going to be Turkish or emoji, not least because I have asserted they won't be. And THIS is what the regular programming functions for upper/lower case are for. They are important, and they are most often correct. Because for all the other cases (i18n, user input, ...) you probably don't want to do toUpper/toLower at all to begin with!
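A naive transform of that kind might look like the sketch below. The function name `ascii_to_lower` is made up for illustration, and the assertion encodes the assumption stated above that the input is known to be 7-bit ASCII. It deliberately avoids `std::tolower`, so it does not depend on the current locale.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Lowercases ASCII letters only. Precondition (asserted): every byte is
// 7-bit ASCII, as is typical for identifiers, field names, or URL parts.
std::string ascii_to_lower(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        assert(static_cast<unsigned char>(c) < 0x80 && "ASCII-only input expected");
        // Map 'A'..'Z' to 'a'..'z' and leave everything else untouched.
        out.push_back((c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c);
    }
    return out;
}
```

Here `ascii_to_lower("Someproperty")` returns `"someproperty"`, and digits or punctuation pass through unchanged.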
For example, suppose you present a message to the user from resources, so your code is translate("USER_DIALOG_QUESTION_ABOUT_FISH"). You look it up knowing it will be in sentence case, but you want to present it in uppercase. What will you do? Here you likely can't, and shouldn't, do toUpper(translate(resourceKey)). Just use two resources if you want to correctly transform text. The toUpper function isn't made for this.
Trying to use a complex i18n-ready toUpper/toLower only helps part of the way. It still might not understand whether two S's are contracted, or whether something is a proper noun and must stay capitalized. So it adds complexity and still isn't correct. Just use two resources!
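The "two resources" idea can be sketched as below. Everything here is hypothetical: the `_UPPER` key suffix, the English strings, and the in-memory table stand in for whatever resource system and translations a real project uses.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical resource lookup: the translators supply each casing variant
// as its own entry, so the code never calls toUpper on translated text.
const std::string& translate(const std::string& key) {
    static const std::unordered_map<std::string, std::string> resources = {
        {"USER_DIALOG_QUESTION_ABOUT_FISH", "Do you like fish?"},
        {"USER_DIALOG_QUESTION_ABOUT_FISH_UPPER", "DO YOU LIKE FISH?"},
    };
    static const std::string missing;  // assumed behaviour: empty string for unknown keys
    auto it = resources.find(key);
    return it != resources.end() ? it->second : missing;
}
```

The caller picks the variant by key, for instance `translate("USER_DIALOG_QUESTION_ABOUT_FISH_UPPER")`, and each language's translators decide what the uppercase form should be.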
That's incorrect: using diacritics on capital letters is always the preferred form. Dropping them is merely acceptable, because it was often done for technical reasons.
I'm not sure about other languages, but Swift has pretty intense String support[0], and can go quite a long ways.
Someone actually wrote a whole book about just Swift Strings[1].
[0] https://docs.swift.org/swift-book/documentation/the-swift-pr...
Furthermore, the proper way to do case folding will depend on such things as the character set, the language, the specific context of the text being converted (e.g. in some cases specific letters are required, such as abbreviations of the names of SI units), etc. And then, it is not necessarily only "uppercase" and "lowercase", anyways.
There might even be different ways to do it within the same language, possibly with disagreements about usage (e.g. the German Eszett did not have an official capital form until 2017, although apparently some type designers made one anyway, and it was in Unicode before then).
If the character set is Unicode, then there is not actually one correct way to do it, despite what the Unicode Conspiracy insists.
Also, some uses require one specific way of doing it (because of how a file format, a protocol, or the like works), so if the character set is something other than ASCII you cannot just assume that it will always work in the same way.
You also cannot necessarily depend on the locale for such a thing, since it might depend on the data, as well.
These things are bad enough as they are, but Unicode makes them worse. If a program requires a specific case folding, it may not work because it uses the wrong version of Unicode, and that can be a security issue and/or cause other problems.
(Another problem, which applies even if you do not use case folding, is that some people think that all text is or should be Unicode and that one character set is suitable for everything. Actually, one character set cannot be suitable for everything, regardless of what character set it is. Even if it was (which it isn't), it wouldn't be Unicode.)
> wchar_t
Man, I'm happy we don't need to deal with this crap in Rust, and we can just use String::to_lowercase. Not having to worry about things makes coding fun.