UI: Text Object International Case Change #106581

Harley Acheson · 2023-04-05T03:36:14+02:00

Harley Acheson commented

2023-04-05 03:36:14 +02:00

Allow Text Object operator FONT_OT_case_set to correctly transform the case
of strings written in almost all scripts that differentiate letter case.

While editing a text object there is a "Text" menu that contains "To Uppercase" and "To Lowercase", which currently only operate on the lower ASCII characters. This PR makes them work with almost all bicameral scripts. This includes all languages using Latin, Cyrillic, Greek, Coptic, Armenian, and other alphabets. Obviously this assumes that the loaded font properly supports the language.

There are some caveats though. These are only one-to-one mappings, so the result can't always be correct for uppercase Σ, since it has two lowercase forms depending on word position. Similarly, lowercase ß won't become "SS".
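To make the caveat concrete, here is a minimal sketch of why a per-code-point mapping cannot handle these two cases. The function names (`example_upper_1to1`, `example_lower_1to1`, `example_convert`) are hypothetical and cover only the code points needed for the illustration; they are not the functions added by this PR.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-to-one uppercase mapper (illustration only). */
static uint32_t example_upper_1to1(uint32_t c)
{
  if (c >= 'a' && c <= 'z') {
    return c - 32;
  }
  /* U+00DF (sharp s) would need two output code points ("SS"),
   * which a one-in, one-out function cannot produce, so it is left as-is. */
  return c;
}

/* Hypothetical one-to-one lowercase mapper (illustration only). */
static uint32_t example_lower_1to1(uint32_t c)
{
  if (c >= 'A' && c <= 'Z') {
    return c + 32;
  }
  if (c == 0x03A3) { /* Greek capital sigma. */
    /* Always the regular small sigma U+03C3; the function sees one code
     * point at a time, so it cannot choose final sigma U+03C2. */
    return 0x03C3;
  }
  return c;
}

/* Convert in place: the output length always equals the input length. */
static void example_convert(uint32_t *str, size_t len, uint32_t (*map)(uint32_t))
{
  for (size_t i = 0; i < len; i++) {
    str[i] = map(str[i]);
  }
}
```

Because `example_convert` never changes the string length and never looks at neighboring characters, both limitations follow directly from the one-to-one design.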


Wupperlower2.gif

120 KiB

❤️ 2

Harley Acheson requested review from Campbell Barton 2023-04-05 03:37:39 +02:00

Campbell Barton commented

2023-04-05 05:18:39 +02:00

I'd rather avoid setlocale, both because temporarily setting it could back-fire and because the exact result depends on the operating system.

Attached a patch which uses a bad-level call into Python, which isn't great - but on balance I'd prefer to use this for consistency.

NOTE: resolving bad level calls is a fairly simple project which is worth looking into - but that can be done separately from this patch.


editfont_case.diff

2.9 KiB

Campbell Barton requested changes 2023-04-05 06:53:28 +02:00

Campbell Barton left a comment

Requested changes in reply.

Harley Acheson force-pushed TextObjectCase from 2241b7c81a to 4209c4990d

2023-04-05 20:02:33 +02:00


Harley Acheson commented

2023-04-05 20:12:41 +02:00

> @ideasman42 - I'd rather avoid setlocale...

Makes sense.

Looking into this further, the upper/lower mappings do not actually differ depending on locale. It is just that towupper and towlower don't work correctly unless you set a "*.utf8" locale.
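As a rough illustration of the approach being avoided, the sketch below temporarily switches `LC_CTYPE`, calls `towupper`, and restores the previous setting. The helper name `example_towupper_in_locale` is hypothetical, and any locale name passed to it (for example "en_US.UTF-8") is an assumption that may not exist on a given system. Note that `setlocale` changes process-wide state, and the result for non-ASCII input depends on the platform's C library.

```c
#include <locale.h>
#include <string.h>
#include <wctype.h>

/* Uppercase one wide character under the named LC_CTYPE locale, then restore
 * the previous locale. Returns 1 on success, 0 if the locale is unavailable.
 * Hypothetical helper for illustration only. */
static int example_towupper_in_locale(const char *locale_name, wint_t in, wint_t *r_out)
{
  /* Copy the current locale name: the string returned by setlocale()
   * may be overwritten by later calls. */
  char saved[256];
  const char *current = setlocale(LC_CTYPE, NULL);
  if (current == NULL || strlen(current) >= sizeof(saved)) {
    return 0;
  }
  strcpy(saved, current);

  if (setlocale(LC_CTYPE, locale_name) == NULL) {
    return 0; /* Locale not installed on this system. */
  }
  *r_out = towupper(in);

  /* Restore the process-wide setting. */
  setlocale(LC_CTYPE, saved);
  return 1;
}
```

Only ASCII behavior is reliable across platforms here; whether a character such as U+00E9 is converted depends on which locale could be set, which is the inconsistency discussed above.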

So instead created BLI_utf32_upper and BLI_utf32_lower using data from these pages:

* https://www.ibm.com/docs/en/i/7.2?topic=tables-unicode-lowercase-uppercase-conversion-mapping-table
* https://www.ibm.com/docs/en/i/7.2?topic=tables-unicode-uppercase-lowercase-conversion-mapping-table

The above is just data. The ICU itself is also GPL-Compatible (X license).

Campbell Barton commented

2023-04-06 05:01:14 +02:00

This PR highlights the need for a unicode library in Blender; so far we've been getting by without one, but it's limiting.

One of the more important uses of unicode case conversion is case-insensitive search, which would be nice to support.

My concern with this PR is that it's adding unicode conversions that are slow (using a binary search, unlike Python's which indexes into arrays), and inlines unicode data which isn't updated based on changes to the unicode spec.
Performance won't be an issue with this operator, but using these for case-insensitive search could be an issue.

Python's unicode utilities are very close to what we need, as they support case conversion and category information such as alpha/decimal/digit/space/printable, so we could extract this into our own library, along with its script to automate updates from the unicode consortium.
The whole-string conversion also supports lowercase ß properly, noted as a TODO in this PR.

Personally I think my patch is OK, although not ideal, as it's just postponing us using a more general unicode library. Listing some possible alternatives:

- Investigate existing libraries for unicode manipulation.
- Extract Python's unicode utility functions into an intern/ library.
- Move this operator to Python.
- Check the performance of BLI_utf32_upper / BLI_utf32_lower compared to Python's functions; a binary search may have acceptable performance, even for interactive search.
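To show why per-character cost matters for the search use case, here is a minimal sketch of a case-insensitive substring test over code-point arrays. Both function names are hypothetical, the lowercase stand-in only covers ASCII and Latin-1, and comparing simple lowercase mappings is an assumption made for brevity; it is not full Unicode case folding.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a per-code-point lowercase function (hypothetical;
 * covers ASCII and Latin-1 only). */
static uint32_t example_to_lower(uint32_t c)
{
  if (c >= 'A' && c <= 'Z') {
    return c + 32;
  }
  if (c >= 0xC0 && c <= 0xDE && c != 0xD7) { /* Skip the multiplication sign. */
    return c + 32;
  }
  return c;
}

/* Return 1 if `needle` occurs in `haystack` ignoring case, otherwise 0.
 * The conversion runs for every compared character, so its cost is
 * multiplied by the amount of text being searched. */
static int example_contains_nocase(const uint32_t *haystack,
                                   size_t haystack_len,
                                   const uint32_t *needle,
                                   size_t needle_len)
{
  if (needle_len > haystack_len) {
    return 0;
  }
  for (size_t start = 0; start + needle_len <= haystack_len; start++) {
    size_t i = 0;
    while (i < needle_len &&
           example_to_lower(haystack[start + i]) == example_to_lower(needle[i])) {
      i++;
    }
    if (i == needle_len) {
      return 1;
    }
  }
  return 0;
}
```

In an interactive search this loop would run on every keystroke against every candidate string, which is where a slow per-character lookup could become noticeable.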


Harley Acheson force-pushed TextObjectCase from 4209c4990d to 5c2c4b6735

2023-04-06 20:22:32 +02:00


Harley Acheson commented

2023-04-06 20:37:23 +02:00

> Check the performance of BLI_utf32_upper / BLI_utf32_lower compared to Python's functions, a binary search may have acceptable performance, even for interactive search.

I hadn't anticipated any uses that would require fast performance, so I have updated this patch to maximize it.

It now directly calculates the upper/lower offset for the ranges where this can be done (lower Latin, parts of extended Latin, Armenian, Georgian, Enclosed letterforms, and Fullwidth letterforms).

It now also only does the binary search of the character arrays if we are in three specific ranges where direct calculation is not possible. It early exits as much as possible.
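The strategy described above can be sketched roughly as follows. This is not the PR's code: the function name `example_to_upper`, the table `example_pairs`, and the choice of ranges are assumptions for illustration, and the table holds only four irregular pairs where a real one would hold many more.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
  uint32_t lower;
  uint32_t upper;
} ExampleCasePair;

/* Illustrative subset of irregular pairs, sorted by the lowercase value. */
static const ExampleCasePair example_pairs[] = {
    {0x00FF, 0x0178}, /* y with diaeresis. */
    {0x0180, 0x0243}, /* b with stroke. */
    {0x0195, 0x01F6}, /* hv (hwair). */
    {0x019E, 0x0220}, /* n with long right leg. */
};

static uint32_t example_to_upper(uint32_t c)
{
  /* Early exit: ASCII needs no table at all. */
  if (c < 0x80) {
    return (c >= 'a' && c <= 'z') ? c - 32 : c;
  }
  /* Ranges with a constant offset are computed directly. */
  if (c >= 0x00E0 && c <= 0x00FE && c != 0x00F7) { /* Latin-1, skip division sign. */
    return c - 32;
  }
  if (c >= 0x0561 && c <= 0x0586) { /* Armenian. */
    return c - 48;
  }
  if (c >= 0x24D0 && c <= 0x24E9) { /* Enclosed a - z. */
    return c - 26;
  }
  if (c >= 0xFF41 && c <= 0xFF5A) { /* Fullwidth a - z. */
    return c - 32;
  }

  /* Only search when the value lies inside the span covered by the table. */
  const size_t count = sizeof(example_pairs) / sizeof(example_pairs[0]);
  if (c < example_pairs[0].lower || c > example_pairs[count - 1].lower) {
    return c;
  }

  /* Binary search over the sorted irregular pairs. */
  size_t lo = 0, hi = count;
  while (lo < hi) {
    const size_t mid = lo + (hi - lo) / 2;
    if (example_pairs[mid].lower == c) {
      return example_pairs[mid].upper;
    }
    if (example_pairs[mid].lower < c) {
      lo = mid + 1;
    }
    else {
      hi = mid;
    }
  }
  return c; /* No mapping: return the input unchanged. */
}
```

Most common input takes one of the early branches, so the binary search only runs for code points that fall in the irregular regions.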


Harley Acheson force-pushed TextObjectCase from 5c2c4b6735 to 6f7aeb4070

2023-04-06 23:31:46 +02:00


Campbell Barton approved these changes 2023-04-07 05:15:03 +02:00

Campbell Barton reviewed 2023-04-07 05:18:40 +02:00

source/blender/blenlib/BLI_string_utf8.h Outdated

						
				@ -174,0 +177,4 @@
				 * mappings so this doesn't work correctly for uppercase Σ (two lowercase forms) and lowercase ß
				 * won't become "SS".
				 */
				char32_t BLI_utf32_upper(char32_t wc);

Campbell Barton commented

2023-04-07 05:18:40 +02:00

Prefer BLI_str_utf32_char_to_upper / BLI_str_utf32_char_to_lower - which is in keeping with BLI_str_* API.

We could also add a BLI_string_utf32.h, or consider renaming BLI_string_utf8.h to BLI_string_unicode.h since it doesn't make sense to add utf32 functions to a utf8 named header.

Suggest to do this as part of a separate commit though.


Harley marked this conversation as resolved

Campbell Barton reviewed 2023-04-07 05:19:07 +02:00

source/blender/blenlib/intern/string_utf8.c Outdated

						
				@ -402,0 +420,4 @@
				  if (wc <= U'\x24E9' && wc >= U'\x24D0') { /* Enclosed ⓐ - ⓩ */
				    return wc - 26;
				  }
				  if (wc <= U'\xFF5A' && wc >= U'\xFF41') { /* Fullwidth ａ - ｚ */

Campbell Barton commented

2023-04-07 05:19:07 +02:00

Avoid unicode in comments; `a` and `z` are fine here.

Harley marked this conversation as resolved

Campbell Barton reviewed 2023-04-07 05:20:09 +02:00

source/blender/blenlib/intern/string_utf8.c Outdated

						
				@ -402,0 +417,4 @@
				    /* Armenian & Georgian */
				    return wc - 48;
				  }
				  if (wc <= U'\x24E9' && wc >= U'\x24D0') { /* Enclosed ⓐ - ⓩ */