Refactor: UTF-8 Character Defines #109163

Merged
Harley Acheson merged 2 commits from Harley/blender:utf8_defines into main 2023-06-26 06:05:26 +02:00
Member

Use defined UTF-8 Universal character names in place of byte escape
sequences and literals.


We have good support for displaying non-ASCII Unicode characters in the interface, and we have been increasingly doing so. Current pending examples include #106388 and #108210.

However, our uses of these involve different styles and duplication. Some are literals like "↓", some are byte escape sequences like "\xe2\x96\xb8" or {0xe2, 0x87, 0xa7, 0x0}, and some are universal character names like \u2715.

This PR defines the characters we use in a single place in a consistent way, in a new header BLI_string_utf8_symbols.h. This does seem to make it all much easier to follow and better to extend.

Harley Acheson added this to the User Interface project 2023-06-20 17:33:43 +02:00
Harley Acheson requested review from Campbell Barton 2023-06-20 17:33:52 +02:00
Author
Member

@blender-bot build

Member

In theory this sounds like a nice cleanup, however currently it breaks the translations which use macros (N_(), IFACE_(), TIP_()) to extract the messages for translation.

This is because the extraction for these messages uses regexes to parse source files, and there is no preprocessing to evaluate macros.

"✕ (Ax + B)",
"• Mesh: %s vertices, %s edges, %s faces",
"• Volume",
"• Curve: %s points, %s splines",
"• Edit Curves: %s, %s",
"• Point Cloud: %s points",
"• Instances: %s",

are simply no longer extracted.

I don’t know how to fix this translation issue, short of using a C/C++ preprocessor from Python to identify and replace defines in those strings.

I believe settling on a single way to express Unicode characters and using that consistently would be a better solution.
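
To make the failure mode concrete, here is a minimal sketch of a regex-based extractor. It is a simplified stand-in, not Blender's actual extraction code: the function name `extract_messages`, the pattern, and the `EXAMPLE_BULLET` macro are all assumptions for illustration. It only recognizes `IFACE_("...")` with a single plain string literal, so a macro in front of the literal makes the whole message invisible to it.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical extractor: scans raw source text, with no preprocessing,
// for IFACE_("...") calls whose argument is one plain string literal.
std::vector<std::string> extract_messages(const std::string &source)
{
  static const std::regex pattern(R"re(IFACE_\(\s*"((?:[^"\\]|\\.)*)"\s*\))re");
  std::vector<std::string> messages;
  for (std::sregex_iterator it(source.begin(), source.end(), pattern), end; it != end; ++it) {
    messages.push_back((*it)[1].str());
  }
  return messages;
}
```

Given the source text `IFACE_("\u2022 Volume")` this returns one message, while `IFACE_(EXAMPLE_BULLET " Volume")` returns none, since the text after the opening parenthesis is an identifier rather than a quote.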

Member

BTW here are a few other characters that could be changed if you think it makes sense:

diff --git a/source/blender/editors/space_sequencer/sequencer_edit.cc b/source/blender/editors/space_sequencer/sequencer_edit.cc
index 84f506ff006..d42e9cd6a9b 100644
--- a/source/blender/editors/space_sequencer/sequencer_edit.cc
+++ b/source/blender/editors/space_sequencer/sequencer_edit.cc
@@ -2741,9 +2741,9 @@ void SEQUENCER_OT_swap_data(wmOperatorType *ot)
  * \{ */
 
 static const EnumPropertyItem prop_change_effect_input_types[] = {
-    {0, "A_B", 0, "A -> B", ""},
-    {1, "B_C", 0, "B -> C", ""},
-    {2, "A_C", 0, "A -> C", ""},
+    {0, "A_B", 0, "A → B", ""},
+    {1, "B_C", 0, "B → C", ""},
+    {2, "A_C", 0, "A → C", ""},
     {0, nullptr, 0, nullptr, nullptr},
 };
 
diff --git a/source/blender/makesrna/intern/rna_wm.c b/source/blender/makesrna/intern/rna_wm.c
index 73138049833..573c9950028 100644
--- a/source/blender/makesrna/intern/rna_wm.c
+++ b/source/blender/makesrna/intern/rna_wm.c
@@ -300,10 +300,10 @@ const EnumPropertyItem rna_enum_event_type_items[] = {
     {EVT_PAGEDOWNKEY, "PAGE_DOWN", 0, "Page Down", "PgDown"},
     {EVT_ENDKEY, "END", 0, "End", ""},
     RNA_ENUM_ITEM_SEPR,
-    {EVT_MEDIAPLAY, "MEDIA_PLAY", 0, "Media Play/Pause", ">/||"},
+    {EVT_MEDIAPLAY, "MEDIA_PLAY", 0, "Media Play/Pause", "⏯"},
     {EVT_MEDIASTOP, "MEDIA_STOP", 0, "Media Stop", "Stop"},
-    {EVT_MEDIAFIRST, "MEDIA_FIRST", 0, "Media First", "|<<"},
-    {EVT_MEDIALAST, "MEDIA_LAST", 0, "Media Last", ">>|"},
+    {EVT_MEDIAFIRST, "MEDIA_FIRST", 0, "Media First", "⏮"},
+    {EVT_MEDIALAST, "MEDIA_LAST", 0, "Media Last", "⏭"},
     RNA_ENUM_ITEM_SEPR,
     {KM_TEXTINPUT, "TEXTINPUT", 0, "Text Input", "TxtIn"},
     RNA_ENUM_ITEM_SEPR,

image
image

Author
Member

> In theory this sounds like a nice cleanup, however currently it breaks the translations which use macros (N_(), IFACE_(), TIP_()) to extract the messages for translation.

Good point. Glad I asked you to check.

> This is because the extraction for these messages uses regexes to parse source files, and there is no preprocessing to evaluate macros.

I only looked quickly, but most of these look fixable. Most are printfs so we could move the bullet from the format string to an argument instead. That doesn't fix "✕ (Ax + B)" though, but this might be worth it even if that one remains an inline universal character?
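
A rough sketch of the "bullet as an argument" idea, assuming a hypothetical `EXAMPLE_STR_UTF8_BULLET` define and a hypothetical helper `format_with_bullet` (neither is from this PR). The format string stays a plain literal that a regex can see, and the bullet is supplied at run time.

```cpp
#include <cstdio>
#include <string>

#define EXAMPLE_STR_UTF8_BULLET "\u2022" /* hypothetical name */

// The format carries "%s" where the bullet used to be written inline;
// the bullet define is passed as the first argument.
std::string format_with_bullet(const char *format, const char *value)
{
  char buffer[128];
  std::snprintf(buffer, sizeof(buffer), format, EXAMPLE_STR_UTF8_BULLET, value);
  return std::string(buffer);
}
```

A call such as `format_with_bullet("%s Instances: %s", "3")` would then produce the bulleted line. The trade-off is that the bullet is no longer part of the translatable message itself.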

Might just wait for Campbell to wade in. But yes, I would be okay to just use universal characters everywhere instead. I was surprised how hard it is to tell which characters are which from the encoded byte values. The universal names are so much nicer since they match the Unicode code point value.
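
As a small worked example of why the byte form is hard to read, the sketch below decodes the two byte sequences mentioned in the PR description into their code points. The helper `decode_utf8_3` is hypothetical and handles only well-formed three-byte sequences, with no validation.

```cpp
#include <cstdint>

// Decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx):
// 4 payload bits from the first byte, 6 from each continuation byte.
constexpr uint32_t decode_utf8_3(unsigned char b0, unsigned char b1, unsigned char b2)
{
  return (uint32_t(b0 & 0x0F) << 12) | (uint32_t(b1 & 0x3F) << 6) | uint32_t(b2 & 0x3F);
}

// "\xe2\x96\xb8" is U+25B8, and {0xe2, 0x87, 0xa7} is U+21E7.
static_assert(decode_utf8_3(0xE2, 0x96, 0xB8) == 0x25B8, "U+25B8");
static_assert(decode_utf8_3(0xE2, 0x87, 0xA7) == 0x21E7, "U+21E7");
```

Written as `\u25B8` and `\u21E7`, the same characters can be looked up directly in a Unicode chart without doing this arithmetic.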

Member

> we could move the bullet from the format string to an argument instead.

I’d be a bit wary of that, nothing guarantees that the bullet point is the proper character to use in all languages. Looking at the Japanese translation, they use "・" instead of "•". (Also in French, the character traditionally used for list items is the em dash, though the bullet point is increasingly used because it’s the default option of word processors, so really it’s acceptable.)

> That doesn't fix "✕ (Ax + B)" though, but this might be worth it even if that one remains an inline universal character?

Yes, in many instances it may still be worth it, but I fear new translation issues could be introduced later if this define system becomes part of the style guidelines.

Harley Acheson force-pushed utf8_defines from e0bdfee1e7 to 409601e0e2 2023-06-21 00:33:28 +02:00
Author
Member

> I’d be a bit wary of that, nothing guarantees that the bullet point is the proper character to use in all languages. Looking at the Japanese translation, they use "・" instead of "•". (Also in French, the character traditionally used for list items is the em dash, though the bullet point is increasingly used because it’s the default option of word processors, so really it’s acceptable.)

Yes, I should have thought of that.

You are probably right. But... is that a different problem though? I mean this use of Unicode bullet is happening in just one file. And that file has five usages that are translated and three that are not, so would be a mishmash if it does differ by language.

Maybe we need a #define UI_BULLET_CHAR TIP_("\u2022") somewhere?


In general this seems fine, although a table of named UTF-8 defines doesn't have much in common with a UTF-8 API.

This could be a separate header: e.g. BLI_string_utf8_symbols.h.

Campbell Barton requested changes 2023-06-21 05:54:40 +02:00
Campbell Barton left a comment
Owner

Requesting a separate header, otherwise LGTM.

Member

> You are probably right. But... is that a different problem though? I mean this use of Unicode bullet is happening in just one file. And that file has five usages that are translated and three that are not, so would be a mishmash if it does differ by language.

Ooh, good catch! Yes, this should indeed be fixed, I’ll add it to my list for later.

> Maybe we need a #define UI_BULLET_CHAR TIP_("\u2022") somewhere?

If the only uses for the bullet point are in this file, it could work, but to me it doesn’t seem great. Firstly there is no clear benefit for translators because upon extraction the escaped character is converted to Unicode, so we see the actual bullet point instead of "\u2022". Secondly, this character is part of the message so it is useful for us to see it in the .po file in its entirety, as it gives context.

In addition, at one point I wanted to translate a single character, but for @mont29 it was a bad idea for multiple reasons, including performance. This might be another such situation.

Author
Member

> it breaks the translations which use macros (N_(), IFACE_(), TIP_()) to extract the messages for translation...This is because the extraction for these messages uses regexes to parse source files, and there is no preprocessing to evaluate macros.

Are you sure? I thought that only applied to N_(), which does nothing and just acts as a translation marker. The others, like IFACE_(), TIP_() look to return a translated string when called using BLT_pgettext. So those should work with the bullets in the strings as I had it earlier?

Member

> Are you sure? I thought that only applied to N_(), which does nothing and just acts as a translation marker. The others, like IFACE_(), TIP_() look to return a translated string when called using BLT_pgettext.

I am definitely sure: these both do the translation and extract the message to the .po files. Take a look at bl_i18n_utils/settings.py for more detail on which patterns from the source code get extracted.

> So those should work with the bullets in the strings as I had it earlier?

I tested this PR yesterday and unless I did something wrong, the messages I mentioned were all of those that disappeared [EDIT: disappeared from the .po files] after applying the patch and updating the .po files using the UI translation add-on.

Author
Member

> @pioverfour - disappeared from the .po files...

Yes, I was not thinking about that part of it. Thanks for being patient with my lack of knowledge there.

> BTW here are a few other characters that could be changed if you think it makes sense:

Yes, those look like a great idea. But would have to be a separate change.

So... how to proceed?

I think we are all in agreement that using universal character format is nicer. And Campbell is okay with it and the naming, but wants this in a new header.

What do you think of me going forward but without making any changes to node_draw.cc? That is the only file with the translation issues mentioned, and it is already using universal characters anyway. This PR would still clean up a lot of (my) mess and encourage future uses to be a bit cleaner.

Member

> Thanks for being patient with my lack of knowledge there.

Others are so patient with mine 😅

> What do you think of me going forward but without making any changes to node_draw.cc?

Sounds good to me!

Harley Acheson force-pushed utf8_defines from 409601e0e2 to ed08af5532 2023-06-24 18:32:57 +02:00
Campbell Barton requested changes 2023-06-26 02:53:09 +02:00
Campbell Barton left a comment
Owner

Requesting removal of ASCII UTF32 code-points.

@@ -0,0 +42,4 @@
/* Unicode characters as UTF-32 codepoints. Last portion should include the official assigned name.
* Please do not add defines here that are not actually in use. */
#define BLI_STR_UTF32_SPACE U'\u0020' /* */

There doesn't seem to be much overall benefit to including ASCII UTF-32 code-points (which can be written as plain text).
If a developer needs to use 10+ more characters, they would be expected to add every one as a define here and come up with unambiguous names for each. If a define becomes unused, we have to remember to remove it. It seems like unnecessary busywork and added ambiguity since, for example, it's not obvious which direction a SLASH is, and the same may be true of other ASCII characters.

Prefer to keep inline uint32_t(' '), uint32_t('/') as-is.
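
A brief sketch of the point being made, assuming an ASCII-compatible execution character set (true of the usual compilers and platforms). The constant names are hypothetical.

```cpp
#include <cstdint>

// For ASCII characters the plain character literal already is the code point,
// so a named define adds no information.
constexpr uint32_t example_space = uint32_t(' ');
constexpr uint32_t example_slash = uint32_t('/');

static_assert(example_space == 0x20u, "SPACE is U+0020");
static_assert(example_slash == 0x2Fu, "SOLIDUS is U+002F");
static_assert(example_slash == uint32_t(U'/'), "same value as the UTF-32 literal");
```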

Harley Acheson added 1 commit 2023-06-26 05:18:46 +02:00
335bf9222e Removing UTF32 defines (buildbot/vexp-code-patch-coordinator: build done)
Author
Member

> @ideasman42 - Requesting removal of ASCII UTF32 code-points.

Done. Yes, they weren't doing much.

Campbell Barton approved these changes 2023-06-26 05:48:17 +02:00
Author
Member

@blender-bot build

Harley Acheson merged commit 4a80d0b6d5 into main 2023-06-26 06:05:26 +02:00
Harley Acheson deleted branch utf8_defines 2023-06-26 06:05:27 +02:00