UI: Text Object International Case Change #106581

Merged
Harley Acheson merged 1 commits from Harley/blender:TextObjectCase into main 2023-04-07 23:40:53 +02:00
Member

Allow Text Object operator FONT_OT_case_set to correctly transform the case
of strings written in almost all scripts that differentiate letter case.


While editing a text object, the "Text" menu contains "To Uppercase" and "To Lowercase" entries that currently operate only on the lower ASCII characters. This PR makes them work with almost all bicameral scripts, including languages written in the Latin, Cyrillic, Greek, Coptic, Armenian, and other alphabets. This assumes that the loaded font properly supports the language.

Wupperlower2.gif

There are some caveats though. These are only one-to-one mappings so can't always be correct for uppercase Σ since it has two lowercase forms depending on word position. Similarly lowercase ß won't become "SS".
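To illustrate the ß caveat: a per-character mapping returns exactly one code point, so it cannot produce "SS". A whole-string conversion would need an output that can grow. The following is only a sketch under that assumption; `sketch_utf32_upper_expand` and `utf32_char` are hypothetical names, not part of this PR, and only ASCII letters plus ß are handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical alias for a UTF-32 code point, used for illustration only. */
typedef uint32_t utf32_char;

/* Hypothetical whole-string upper-casing that expands U+00DF (sharp s) to "SS".
 * Only ASCII a-z and U+00DF are converted, everything else is copied as-is.
 * Writes at most `dst_cap` code points and returns the number of code points
 * the full result needs, which can exceed `src_len`. */
size_t sketch_utf32_upper_expand(const utf32_char *src,
                                 size_t src_len,
                                 utf32_char *dst,
                                 size_t dst_cap)
{
  size_t out = 0;
  for (size_t i = 0; i < src_len; i++) {
    const utf32_char wc = src[i];
    if (wc == 0x00DF) {
      /* One input code point becomes two output code points. */
      if (out < dst_cap) {
        dst[out] = 'S';
      }
      out++;
      if (out < dst_cap) {
        dst[out] = 'S';
      }
      out++;
    }
    else {
      const utf32_char upper = (wc >= 'a' && wc <= 'z') ? wc - 32 : wc;
      if (out < dst_cap) {
        dst[out] = upper;
      }
      out++;
    }
  }
  return out;
}
```

Because the result can be longer than the input, a caller would have to size the destination from the returned count. A one-to-one mapping avoids that, at the cost of leaving ß unchanged.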

Harley Acheson requested review from Campbell Barton 2023-04-05 03:37:39 +02:00

I'd rather avoid setlocale, both because temporarily setting it could back-fire and because the exact result depends on the operating system.

Attached a patch which uses a bad-level call into Python, which isn't great - but on balance I'd prefer to use this for consistency.


NOTE: resolving bad level calls is a fairly simple project which is worth looking into - but that can be done separately from this patch.

Campbell Barton requested changes 2023-04-05 06:53:28 +02:00
Campbell Barton left a comment
Owner

Requested changes in reply.
Harley Acheson force-pushed TextObjectCase from 2241b7c81a to 4209c4990d 2023-04-05 20:02:33 +02:00 Compare
Author
Member

@ideasman42 - I'd rather avoid setlocale...

Makes sense.

Looking into this further, the upper/lower mappings do not actually differ depending on locale. It is just that towupper and towlower don't work correctly unless you set a "*.utf8" locale.
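For context, the locale-dependent approach being avoided would look something like the sketch below. It is not code from this PR: `sketch_towupper_in_locale` is a hypothetical helper, and which locale names are installed (and what `towupper` returns for non-ASCII input in each) depends on the operating system, which is the concern raised above.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: temporarily switch LC_CTYPE, call towupper(), then
 * restore the previous locale. Returns `wc` unchanged if the requested
 * locale cannot be set. Changing the locale affects the whole process,
 * so this is not safe while other threads rely on it. */
wint_t sketch_towupper_in_locale(const char *locale_name, wint_t wc)
{
  char saved[256];
  const char *current = setlocale(LC_CTYPE, NULL);
  if (current == NULL || strlen(current) >= sizeof(saved)) {
    return wc;
  }
  /* Copy the name: the returned string may be overwritten by later calls. */
  strcpy(saved, current);

  if (setlocale(LC_CTYPE, locale_name) == NULL) {
    return wc; /* Locale not available on this system. */
  }
  const wint_t result = towupper(wc);
  setlocale(LC_CTYPE, saved);
  return result;
}
```

The result for ASCII input is the same everywhere, but for anything else it hinges on a UTF-8 locale being present, so a locale-independent table gives more predictable behavior.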

So instead I created BLI_utf32_upper and BLI_utf32_lower using data from these pages:

  • https://www.ibm.com/docs/en/i/7.2?topic=tables-unicode-lowercase-uppercase-conversion-mapping-table
  • https://www.ibm.com/docs/en/i/7.2?topic=tables-unicode-uppercase-lowercase-conversion-mapping-table

The above is just data. The ICU itself is also GPL-Compatible (X license).

This PR highlights the need for a Unicode library in Blender. So far we've been getting by without one, but it's limiting.

One of the more important uses of unicode-case conversion is case-insensitive-search which would be nice to support.

My concern with this PR is that it's adding Unicode conversions that are slow (using a binary search, unlike Python's, which indexes into arrays), and that it inlines Unicode data which isn't updated based on changes to the Unicode spec.
Performance won't be an issue with this operator, but using these for case-insensitive search could be an issue.

Python's Unicode utilities are very close to what we need, as they support case conversion and category information such as alpha/decimal/digit/space/printable, so we could extract this into our own library, along with its script to automate updates from the Unicode Consortium.
The whole-string conversion also supports lowercase ß properly, noted as a TODO in this PR.


Personally I think my patch is OK, although not ideal - as it's just postponing us using a more general unicode library, listing some possible alternatives.

  • Investigate existing libraries for unicode manipulation... or.
  • Extract Python's unicode utility functions into an intern/ library.
  • Move this operator to Python.
  • Check the performance of BLI_utf32_upper / BLI_utf32_lower compared to Python's functions, a binary search may have acceptable performance, even for interactive search.
Harley Acheson force-pushed TextObjectCase from 4209c4990d to 5c2c4b6735 2023-04-06 20:22:32 +02:00 Compare
Author
Member
  • Check the performance of BLI_utf32_upper / BLI_utf32_lower compared to Python's functions, a binary search may have acceptable performance, even for interactive search.

I hadn't anticipated any uses that would require fast performance, but I have updated this patch to maximize it.

It now directly calculates the upper/lower offset for the ranges where this can be done (lower Latin, parts of extended Latin, Armenian, Georgian, Enclosed letterforms, and Fullwidth letterforms).

It now also only does the binary search of the character arrays if we are in three specific ranges where direct calculation is not possible. It early exits as much as possible.
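The strategy described above (fixed offsets for regular ranges, a binary search only where no fixed offset exists, and early exits otherwise) could be sketched roughly as follows. This is not the code from the patch: `sketch_utf32_to_upper`, `utf32_char`, and `SketchCasePair` are hypothetical names, the ranges shown are a small subset, and the irregular table holds only a few pairs for illustration rather than the full mapping data.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical alias for a UTF-32 code point, used for illustration only. */
typedef uint32_t utf32_char;

typedef struct {
  utf32_char lower;
  utf32_char upper;
} SketchCasePair;

/* Illustrative subset of irregular pairs, sorted by `lower`. */
static const SketchCasePair sketch_irregular_pairs[] = {
    {0x00B5, 0x039C}, /* MICRO SIGN -> GREEK CAPITAL LETTER MU */
    {0x00FF, 0x0178}, /* y with diaeresis */
    {0x0131, 0x0049}, /* dotless i -> I */
    {0x017F, 0x0053}, /* long s -> S */
    {0x03C2, 0x03A3}, /* final sigma -> SIGMA */
};

utf32_char sketch_utf32_to_upper(utf32_char wc)
{
  /* Early exit: ASCII. */
  if (wc < 0x80) {
    return (wc >= 'a' && wc <= 'z') ? wc - 32 : wc;
  }
  /* Ranges with a fixed offset between lower and upper case. */
  if (wc >= 0x00E0 && wc <= 0x00FE && wc != 0x00F7) { /* Latin-1, skip division sign. */
    return wc - 32;
  }
  if (wc >= 0x0430 && wc <= 0x044F) { /* Basic Cyrillic. */
    return wc - 32;
  }
  if (wc >= 0x0561 && wc <= 0x0586) { /* Armenian. */
    return wc - 48;
  }
  if (wc >= 0x24D0 && wc <= 0x24E9) { /* Enclosed (a) - (z). */
    return wc - 26;
  }
  if (wc >= 0xFF41 && wc <= 0xFF5A) { /* Fullwidth a - z. */
    return wc - 32;
  }
  /* Early exit: outside the span covered by the irregular table. */
  if (wc < 0x00B5 || wc > 0x03C2) {
    return wc;
  }
  /* Binary search of the sorted irregular pairs. */
  size_t lo = 0;
  size_t hi = sizeof(sketch_irregular_pairs) / sizeof(sketch_irregular_pairs[0]);
  while (lo < hi) {
    const size_t mid = lo + (hi - lo) / 2;
    if (sketch_irregular_pairs[mid].lower == wc) {
      return sketch_irregular_pairs[mid].upper;
    }
    if (sketch_irregular_pairs[mid].lower < wc) {
      lo = mid + 1;
    }
    else {
      hi = mid;
    }
  }
  return wc; /* No mapping found, e.g. U+00DF stays unchanged. */
}
```

Most input takes one of the range checks and never reaches the search, which is the point of the early exits.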

Harley Acheson force-pushed TextObjectCase from 5c2c4b6735 to 6f7aeb4070 2023-04-06 23:31:46 +02:00 Compare
Campbell Barton approved these changes 2023-04-07 05:15:03 +02:00
Campbell Barton reviewed 2023-04-07 05:18:40 +02:00
@ -174,0 +177,4 @@
* mappings so this doesn't work correctly for uppercase Σ (two lowercase forms) and lowercase ß
* won't become "SS".
*/
char32_t BLI_utf32_upper(char32_t wc);

Prefer BLI_str_utf32_char_to_upper / BLI_str_utf32_char_to_lower - which is in keeping with BLI_str_* API.

We could also add a BLI_string_utf32.h, or consider renaming BLI_string_utf8.h to BLI_string_unicode.h since it doesn't make sense to add utf32 functions to a utf8 named header.

Suggest to do this as part of a separate commit though.

Harley marked this conversation as resolved
Campbell Barton reviewed 2023-04-07 05:19:07 +02:00
@ -402,0 +420,4 @@
if (wc <= U'\x24E9' && wc >= U'\x24D0') { /* Enclosed ⓐ - ⓩ */
return wc - 26;
}
if (wc <= U'\xFF5A' && wc >= U'\xFF41') { /* Fullwidth ａ - ｚ */

Avoid Unicode in comments; a and z are fine here.
Harley marked this conversation as resolved
Campbell Barton reviewed 2023-04-07 05:20:09 +02:00
@ -402,0 +417,4 @@
/* Armenian & Georgian */
return wc - 48;
}
if (wc <= U'\x24E9' && wc >= U'\x24D0') { /* Enclosed ⓐ - ⓩ */

Avoid Unicode in comments; (a) and (z) are fine here (or use their ID as well if that helps).
Harley marked this conversation as resolved
Harley Acheson added this to the Module: User Interface project 2023-04-07 19:20:05 +02:00
Harley Acheson force-pushed TextObjectCase from 6f7aeb4070 to 8456d396bc 2023-04-07 23:14:20 +02:00 Compare
Author
Member

@blender-bot build

Harley Acheson merged commit e369bf4a6d into main 2023-04-07 23:40:53 +02:00
Harley Acheson deleted branch TextObjectCase 2023-04-07 23:40:54 +02:00
Reference: blender/blender#106581