
Unable To Get Uppercase To 'ß' (German Character Called Eszett)

Hello, I have to convert a string column to uppercase, but when 'ß' is present in the string, it gets changed to 'SS' during the conversion. I understand that this is…

Solution 1:

Each Python version is built against a specific Unicode version. Unicode 10.0.0, for example, has the capital letter available (it has had it since Unicode 5.1, I believe), but its names list still gives the old upper/lower mapping:

00DF ß LATIN SMALL LETTER SHARP S
       = Eszett
       - German
       - uppercase is "SS"
       - nonstandard uppercase is 1E9E ẞ
1E9E ẞ LATIN CAPITAL LETTER SHARP S
       - lowercase is 00DF ß

Even the latest standard at the time of this answer, 13.0.0 (though this change was made in 11.0.0), appears to allow discretion as to how to convert lower to upper:

00DF ß LATIN SMALL LETTER SHARP S
       = Eszett
       - German
       - not used in Swiss High German
       - uppercase is "SS" or 1E9E ẞ
1E9E ẞ LATIN CAPITAL LETTER SHARP S
       - not used in Swiss High German
       - lowercase is 00DF ß

The following table maps some Python versions to the Unicode version each one uses:

 Python     Unicode
--------    -------
   3.5.9      8.0.0
  3.6.11      9.0.0
   3.7.8     11.0.0
3.8.4rc1     12.1.0
 3.9.0b4     13.0.0
3.10.0a0     13.0.0
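To find out which row applies to your own interpreter, you can ask the standard library directly. This is a minimal sketch; the helper name unicode_version_report is made up for illustration, while sys.version_info and unicodedata.unidata_version are standard:

```python
import sys
import unicodedata


def unicode_version_report():
    """Return (python_version, unicode_version) for the running interpreter."""
    # Join major, minor and micro into a dotted string such as "3.8.2".
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    # unidata_version is the Unicode database version this build ships with.
    return python_version, unicodedata.unidata_version


if __name__ == "__main__":
    py, uni = unicode_version_report()
    print(f"Python {py} uses Unicode {uni}")
```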

So you may well have to wait for a later version of Unicode (and a Python that uses that Unicode version) where the mapping is a little less wishy-washy than 'uppercase is "SS" or 1E9E ẞ'. But this may actually be precluded by the Unicode stability policy, which states, in part:

If two characters form a case pair in a version of Unicode, they will remain a case pair in each subsequent version of Unicode. If two characters do not form a case pair in a version of Unicode, they will never become a case pair in any subsequent version of Unicode.

You can make a case pair from a newly introduced character, assuming that the one you want to pair it with is not already paired, but that's not allowed here since:

  • this "new" character was introduced way back in Unicode 5.1; and
  • the character we'd want to pair it with is already paired.

My reading of that leads me to believe that the only way to fix this without violating that policy would be to introduce two new characters in a case pair, something like:

ß LATIN SMALL LETTER SHARP S THAT IS LOWER OF ẞ
ẞ LATIN CAPITAL LETTER SHARP S THAT IS UPPER OF ß

However, I'm not sure that'll ever get past the Unicode consortium silliness filters :-)

For an immediate fix, you can simply force that specific character to whatever you want it to be, before applying the inbuilt case change, something like:

to_be_uppered.replace('ß', 'ẞ').upper()
to_be_lowered.replace('ẞ', 'ß').lower()

The latter appears to be unnecessary, at least on my version, Python 3.8.2. I include it just in case an earlier Python version needs it. It may even be worth putting these into custom my_upper() and my_lower() functions, if it turns out there are more cases like this that you need to handle.
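A minimal sketch of what those two wrappers might look like is below. The names my_upper() and my_lower() come from the suggestion above; the two constants are only illustrative, and other special cases would need their own replace step:

```python
SMALL_SHARP_S = "\u00df"    # ß
CAPITAL_SHARP_S = "\u1e9e"  # ẞ


def my_upper(text):
    """Uppercase text, mapping ß to ẞ instead of the default "SS"."""
    # Swap the character first so that str.upper() never sees ß.
    return text.replace(SMALL_SHARP_S, CAPITAL_SHARP_S).upper()


def my_lower(text):
    """Lowercase text, mapping ẞ to ß explicitly before lowercasing."""
    return text.replace(CAPITAL_SHARP_S, SMALL_SHARP_S).lower()


if __name__ == "__main__":
    print(my_upper("straße"))
    print(my_lower("STRA\u1e9eE"))
```

Because ẞ has no uppercase mapping of its own, the upper() call leaves it untouched and only affects the remaining characters.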

Solution 2:

That is the behaviour in many other languages as well. You can work around it like this:

my_string.replace('ß', 'ẞ').upper()

Solution 3:

Apply unil's workaround (+1):

my_string.replace('ß', 'ẞ').upper()

I can't see any other solution, due to some kind of political correctness found in Unicode documents:

  • from Character Properties, Case Mappings & Names FAQ:

  • Q: Is all of the Unicode case mapping information in UnicodeData.txt?

    A: No. The UnicodeData.txt file includes all of the one-to-one case mappings. Since many parsers were built with the expectation that UnicodeData.txt would have at most a single character in each case mapping field, the file SpecialCasing.txt was added to provide the one-to-many mappings, such as the one needed for uppercasing ß (U+00DF LATIN SMALL LETTER SHARP S). In addition, CaseFolding.txt contains additional mappings used in case folding and caseless matching. For more information, see Section 5.18, Case Mappings in The Unicode Standard.

  • Q: Why does ß (U+00DF LATIN SMALL LETTER SHARP S) not uppercase to U+1E9E LATIN CAPITAL LETTER SHARP S by default?

    A: In standard German orthography, the sharp s ("ß") used to be exclusively uppercased to a sequence of two capital S characters. This longstanding practice is reflected in the default case mappings in Unicode. A capital form of ß is sometimes preferred for typographic reasons or to avoid ambiguity, such as in uppercase names as found in passports. It is encoded in the Unicode Standard as U+1E9E. While this character is not widely used, it is now recognized in the official orthography as an optional uppercase form of ß in addition to "SS". Because it is only an optional alternative, the original mapping to "SS" is retained in the Unicode character properties.

  • from SpecialCasing.txt

    The German es-zed is special--the normal mapping is to SS.

  • from UnicodeData.txt (see Uppercase mapping and Lowercase mapping fields as defined in UnicodeData File Format): lowercase mapping is defined for Latin Capital Letter Sharp S while uppercase mapping for Latin Small Letter Sharp S is not…

 00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
 1E9E;LATIN CAPITAL LETTER SHARP S;Lu;0;L;;;;;N;;;;00DF;
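The effect of those data files can be observed from Python itself. The sketch below assumes Python 3.3 or later, where str.upper(), str.lower() and str.casefold() apply the full (one-to-many) mappings; the helper name describe_case_behaviour is hypothetical:

```python
import unicodedata


def describe_case_behaviour(char):
    """Collect the name, category and case conversions of one character."""
    return {
        "name": unicodedata.name(char),
        "category": unicodedata.category(char),
        "upper": char.upper(),        # full mapping, may be several characters
        "lower": char.lower(),
        "casefold": char.casefold(),  # used for caseless matching
    }


if __name__ == "__main__":
    for char in ("\u00df", "\u1e9e"):  # ß and ẞ
        print(describe_case_behaviour(char))
```

For ß, the "upper" entry is the two-character string "SS", while for ẞ the "lower" entry is ß. Both characters case-fold to "ss", so caseless comparison via casefold() treats them as equal even though upper() and lower() do not round-trip.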
