Refactor: UTF-8 Character Defines #109163

Harley Acheson · 2023-06-20T17:33:23+02:00

Harley Acheson commented

2023-06-20 17:33:23 +02:00

Use defined UTF-8 Universal character names in place of byte escape
sequences and literals.

We have good support for displaying non-ASCII Unicode characters in the interface, and we have been increasingly doing so. Current pending examples include #106388 and #108210.

However, our uses of these involve different styles and duplication. We write some as literals, like "↓", some as escape sequences like "\xe2\x96\xb8" and {0xe2, 0x87, 0xa7, 0x0}, and some as universal character names like \u2715.

This PR defines the characters we use in a single place in a consistent way, in a new header BLI_string_utf8_symbols.h. This makes it all much easier to follow and to extend.
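As a rough illustration of why the universal-character spelling is interchangeable with the byte escapes, the sketch below compares both forms. The `EXAMPLE_` define names and the helper functions are hypothetical and are not taken from the header in this PR; the comparison also assumes the compiler uses a UTF-8 execution character set (the GCC and Clang default; MSVC needs `/utf-8`).

```cpp
#include <cstring>

// Hypothetical names for illustration only; the header proposed in this PR
// may name its defines differently.
#define EXAMPLE_STR_UTF8_SMALL_RIGHT_TRIANGLE "\u25b8" /* ▸ */
#define EXAMPLE_STR_UTF8_UPWARDS_WHITE_ARROW "\u21e7"  /* ⇧ */

// True when the universal character name produces the same bytes as the
// hand-written UTF-8 escape sequence (assumes a UTF-8 execution character set).
inline bool matches_byte_escapes(const char *named, const char *escaped)
{
  return std::strcmp(named, escaped) == 0;
}

inline bool triangle_spellings_match()
{
  return matches_byte_escapes(EXAMPLE_STR_UTF8_SMALL_RIGHT_TRIANGLE, "\xe2\x96\xb8");
}

inline bool arrow_spellings_match()
{
  return matches_byte_escapes(EXAMPLE_STR_UTF8_UPWARDS_WHITE_ARROW, "\xe2\x87\xa7");
}
```

Under that assumption both functions should return true, so swapping a byte escape for a named define would not change the bytes handed to the UI.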

Harley Acheson added this to the User Interface project 2023-06-20 17:33:43 +02:00

Harley Acheson requested review from Campbell Barton 2023-06-20 17:33:52 +02:00

Harley Acheson commented

2023-06-20 18:19:28 +02:00

@blender-bot build

Harley Acheson referenced this pull request

2023-06-20 22:03:11 +02:00

UI: replace "x" with multiplication sign when displaying calculations #106388

Damien Picard commented

2023-06-20 22:59:07 +02:00

In theory this sounds like a nice cleanup, however currently it breaks the translations which use macros (N_(), IFACE_(), TIP_()) to extract the messages for translation.

This is because the extraction for these messages uses regexes to parse source files, and there is no preprocessing to evaluate macros.

"✕ (Ax + B)",
"• Mesh: %s vertices, %s edges, %s faces",
"• Volume",
"• Curve: %s points, %s splines",
"• Edit Curves: %s, %s",
"• Point Cloud: %s points",
"• Instances: %s",

are simply no longer extracted.

I don’t know how to fix this translation issue, short of using a C/C++ preprocessor from Python to identify and replace defines in those strings.
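The failure mode described above can be sketched with a simplified, regex-based extractor. This is only an approximation written for this thread: the real patterns live in the Python i18n tooling and differ from the one below, and `extract_msgid` and `EXAMPLE_BULLET` are hypothetical names.

```cpp
#include <regex>
#include <string>

// Simplified stand-in for a regex-based message extractor. It only recognizes
// a translation macro whose argument is a single plain string literal.
// Returns the captured message, or an empty string when nothing matches.
inline std::string extract_msgid(const std::string &source_line)
{
  static const std::regex pattern(
      R"re((?:N_|IFACE_|TIP_)\(\s*"((?:[^"\\]|\\.)*)"\s*\))re");
  std::smatch match;
  if (std::regex_search(source_line, match, pattern)) {
    return match[1].str();
  }
  return "";
}
```

With this sketch, a source line such as `TIP_("Volume")` yields `Volume`, whereas `TIP_(EXAMPLE_BULLET " Volume")` yields an empty string: without a preprocessing step the extractor sees an identifier where it expects the opening quote, so the message is never picked up.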

I believe settling on a single way to express Unicode characters and using that consistently would be a better solution.

Damien Picard referenced this pull request

2023-06-20 23:11:45 +02:00

UI: replace "x" with multiplication sign when displaying calculations #106388

Damien Picard commented

2023-06-20 23:21:18 +02:00

BTW here are a few other characters that could be changed if you think it makes sense:

diff --git a/source/blender/editors/space_sequencer/sequencer_edit.cc b/source/blender/editors/space_sequencer/sequencer_edit.cc
index 84f506ff006..d42e9cd6a9b 100644
--- a/source/blender/editors/space_sequencer/sequencer_edit.cc
+++ b/source/blender/editors/space_sequencer/sequencer_edit.cc
@@ -2741,9 +2741,9 @@ void SEQUENCER_OT_swap_data(wmOperatorType *ot)
  * \{ */
 
 static const EnumPropertyItem prop_change_effect_input_types[] = {
-    {0, "A_B", 0, "A -> B", ""},
-    {1, "B_C", 0, "B -> C", ""},
-    {2, "A_C", 0, "A -> C", ""},
+    {0, "A_B", 0, "A → B", ""},
+    {1, "B_C", 0, "B → C", ""},
+    {2, "A_C", 0, "A → C", ""},
     {0, nullptr, 0, nullptr, nullptr},
 };
 
diff --git a/source/blender/makesrna/intern/rna_wm.c b/source/blender/makesrna/intern/rna_wm.c
index 73138049833..573c9950028 100644
--- a/source/blender/makesrna/intern/rna_wm.c
+++ b/source/blender/makesrna/intern/rna_wm.c
@@ -300,10 +300,10 @@ const EnumPropertyItem rna_enum_event_type_items[] = {
     {EVT_PAGEDOWNKEY, "PAGE_DOWN", 0, "Page Down", "PgDown"},
     {EVT_ENDKEY, "END", 0, "End", ""},
     RNA_ENUM_ITEM_SEPR,
-    {EVT_MEDIAPLAY, "MEDIA_PLAY", 0, "Media Play/Pause", ">/||"},
+    {EVT_MEDIAPLAY, "MEDIA_PLAY", 0, "Media Play/Pause", "⏯"},
     {EVT_MEDIASTOP, "MEDIA_STOP", 0, "Media Stop", "Stop"},
-    {EVT_MEDIAFIRST, "MEDIA_FIRST", 0, "Media First", "|<<"},
-    {EVT_MEDIALAST, "MEDIA_LAST", 0, "Media Last", ">>|"},
+    {EVT_MEDIAFIRST, "MEDIA_FIRST", 0, "Media First", "⏮"},
+    {EVT_MEDIALAST, "MEDIA_LAST", 0, "Media Last", "⏭"},
     RNA_ENUM_ITEM_SEPR,
     {KM_TEXTINPUT, "TEXTINPUT", 0, "Text Input", "TxtIn"},
     RNA_ENUM_ITEM_SEPR,

[Attachments: two screenshots, image.png (16 KiB) and image.png (20 KiB)]

Harley Acheson commented

2023-06-20 23:30:00 +02:00

> In theory this sounds like a nice cleanup, however currently it breaks the translations which use macros (N_(), IFACE_(), TIP_()) to extract the messages for translation.

Good point. Glad I asked you to check.

> This is because the extraction for these messages uses regexes to parse source files, and there is no preprocessing to evaluate macros.

I only looked quickly, but most of these look fixable. Most are printfs so we could move the bullet from the format string to an argument instead. That doesn't fix "✕ (Ax + B)" though, but this might be worth it even if that one remains an inline universal character?

Might just wait for Campbell to wade in. But yes, I would be okay to just use universal characters everywhere instead. I was surprised how hard it is to tell which characters are which from the encoded byte values. The universal names are so much nicer since they match the unicode 32-bit codepoint value.
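The idea of moving the bullet out of the format string and into an argument could look roughly like the sketch below. `EXAMPLE_BULLET` and `format_instances_line` are hypothetical names used only for illustration; the actual code in node_draw.cc is not reproduced here.

```cpp
#include <cstdio>
#include <string>

// Hypothetical define; assumes a UTF-8 execution character set.
#define EXAMPLE_BULLET "\u2022"

// The format string no longer contains the bullet, so a regex-based extractor
// would see a plain literal. The bullet is supplied as the first argument.
inline std::string format_instances_line(const char *bullet, const char *count)
{
  char buffer[128];
  std::snprintf(buffer, sizeof(buffer), "%s Instances: %s", bullet, count);
  return buffer;
}
```

One consequence of this design is that the translatable message becomes `%s Instances: %s`, so the bullet character itself is no longer part of what translators see or can replace.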

👍 1

Damien Picard commented

2023-06-21 00:07:54 +02:00

> we could move the bullet from the format string to an argument instead.

I’d be a bit wary of that, nothing guarantees that the bullet point is the proper character to use in all languages. Looking at the Japanese translation, they use "・" instead of "•". (Also in French, the character traditionally used for list items is the em dash, though the bullet point is increasingly used because it’s the default option of word processors, so really it’s acceptable.)

> That doesn't fix "✕ (Ax + B)" though, but this might be worth it even if that one remains an inline universal character?

Yes, in many instances it may still be worth it, but I fear new translation issues could be introduced later if this define system becomes part of the style guidelines.

Harley Acheson force-pushed utf8_defines from e0bdfee1e7 to 409601e0e2

2023-06-21 00:33:28 +02:00

Harley Acheson commented

2023-06-21 00:40:33 +02:00

> I’d be a bit wary of that, nothing guarantees that the bullet point is the proper character to use in all languages. Looking at the Japanese translation, they use "・" instead of "•". (Also in French, the character traditionally used for list items is the em dash, though the bullet point is increasingly used because it’s the default option of word processors, so really it’s acceptable.)

Yes, I should have thought of that.

You are probably right. But... is that a different problem though? I mean this use of Unicode bullet is happening in just one file. And that file has five usages that are translated and three that are not, so would be a mishmash if it does differ by language.

Maybe we need a #define UI_BULLET_CHAR TIP_("\u2022") somewhere?

Campbell Barton commented

2023-06-21 05:54:07 +02:00

In general this seems fine although a table of named utf8 defines doesn't have so much in common with a UTF8 API.

This could be a separate header: e.g. BLI_string_utf8_symbols.h.

Campbell Barton requested changes 2023-06-21 05:54:40 +02:00

Campbell Barton left a comment

Requesting a separate header, otherwise LGTM.

Damien Picard commented

2023-06-21 23:58:14 +02:00

> You are probably right. But... is that a different problem though? I mean this use of Unicode bullet is happening in just one file. And that file has five usages that are translated and three that are not, so would be a mishmash if it does differ by language.

Ooh, good catch! Yes, this should indeed be fixed, I’ll add it to my list for later.

> Maybe we need a #define UI_BULLET_CHAR TIP_("\u2022") somewhere?

If the only uses for the bullet point are in this file, it could work, but to me it doesn’t seem great. Firstly there is no clear benefit for translators because upon extraction the escaped character is converted to Unicode, so we see the actual bullet point instead of "\u2022". Secondly, this character is part of the message so it is useful for us to see it in the .po file in its entirety, as it gives context.

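In addition, at [one point](https://archive.blender.org/developer/differential/0015/0015392/D15392.id53353.html#inline-132006) I wanted to translate a single character, but for @mont29 it was a bad idea for multiple reasons, including performance. This might be another such situation.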

Harley Acheson commented

2023-06-22 00:29:30 +02:00

> it breaks the translations which use macros (N_(), IFACE_(), TIP_()) to extract the messages for translation...This is because the extraction for these messages uses regexes to parse source files, and there is no preprocessing to evaluate macros.

Are you sure? I thought that only applied to N_(), which does nothing and just acts as a translation marker. The others, like IFACE_(), TIP_() look to return a translated string when called using BLT_pgettext. So those should work with the bullets in the strings as I had it earlier?

Damien Picard commented

2023-06-22 00:56:39 +02:00

> Are you sure? I thought that only applied to N_(), which does nothing and just acts as a translation marker. The others, like IFACE_(), TIP_() look to return a translated string when called using BLT_pgettext.

I am definitely sure: these both do the translation and extract the message to the .po files. Take a look at [bl_i18n_utils/settings.py](https://projects.blender.org/blender/blender/src/branch/main/scripts/modules/bl_i18n_utils/settings.py#L241) for more detail on which patterns from the source code get extracted.

> So those should work with the bullets in the strings as I had it earlier?

I tested this PR yesterday and unless I did something wrong, the messages I mentioned were all of those that disappeared [EDIT: disappeared from the .po files] after applying the patch and updating the .po files using the UI translation add-on.

Harley Acheson commented

2023-06-22 17:35:15 +02:00

> @pioverfour - disappeared from the .po files...

Yes, I was not thinking about that part of it. Thanks for being patient with my lack of knowledge there.

> BTW here are a few other characters that could be changed if you think it makes sense:

Yes, those look like a great idea. But would have to be a separate change.

So... how to proceed?

I think we are all in agreement that using universal character format is nicer. And Campbell is okay with it and the naming, but wants this in a new header.

What do you think of me going forward but without making any changes to node_draw.cc? That one is the only file with the mentioned translation issues, and it is using universal characters anyway. This PR would still be cleaning up a lot of (my) mess and would encourage future uses to be a bit cleaner.

Damien Picard commented

2023-06-23 01:12:07 +02:00

> Thanks for being patient with my lack of knowledge there.

Others are so patient with mine 😅

> What do you think of me going forward but without making any changes to node_draw.cc?

Sounds good to me!

Harley Acheson force-pushed utf8_defines from 409601e0e2 to ed08af5532

2023-06-24 18:32:57 +02:00

Campbell Barton requested changes 2023-06-26 02:53:09 +02:00

Campbell Barton left a comment

Requesting removal of ASCII UTF32 code-points.

source/blender/blenlib/BLI_string_utf8_symbols.h Outdated

						
@@ -0,0 +42,4 @@
/* Unicode characters as UTF-32 codepoints. Last portion should include the official assigned name.
 * Please do not add defines here that are not actually in use. */
#define BLI_STR_UTF32_SPACE U'\u0020' /*   */

Campbell Barton commented

2023-06-26 02:50:40 +02:00

There doesn't seem to be much overall benefit in including ASCII UTF-32 code points (which can be written as plain text).
If a developer needs to use 10+ more characters, they would be expected to add every one as a define here and come up with unambiguous names for each. If a define becomes unused, we have to remember to remove it. It seems like unnecessary busywork and added ambiguity, since e.g. it's not so obvious which direction a SLASH is, and the same may hold for other ASCII characters.

Prefer to keep inline uint32_t(' '), uint32_t('/') as-is.
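A small sketch of the inline style preferred here, assuming an ASCII-compatible execution character set. `is_slash_codepoint` is a hypothetical helper, not a function from the code under review.

```cpp
#include <cstdint>

// For ASCII characters the code point equals the plain character value, so an
// inline cast documents itself without a named define.
static_assert(uint32_t(' ') == 0x20u, "space is U+0020");
static_assert(uint32_t('/') == 0x2Fu, "forward slash is U+002F");

// Hypothetical helper: the direction of each slash is visible at the call
// site, which a define named SLASH would not make obvious.
inline bool is_slash_codepoint(uint32_t codepoint)
{
  return codepoint == uint32_t('/') || codepoint == uint32_t('\\');
}
```

The literals carry the meaning directly, which is the argument made above against maintaining a table of ASCII defines.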

Harley Acheson added 1 commit 2023-06-26 05:18:46 +02:00