diff --git a/src/docs/userguide/utf8.diviner b/src/docs/userguide/utf8.diviner new file mode 100644 index 0000000000..fe7f91dbd9 --- /dev/null +++ b/src/docs/userguide/utf8.diviner @@ -0,0 +1,62 @@ +@title User Guide: UTF-8 and Character Encoding +@group userguide + +How Phabricator handles character encodings. + += Overview = + +Phabricator stores all internal text data as UTF-8, processes all text data +as UTF-8, outputs in UTF-8, and expects all inputs to be UTF-8. Principally, +this means that you should write your source code in UTF-8. In most cases this +does not require you to change anything, because ASCII text is a subset of +UTF-8. + += Detecting and Repairing Files = + +It is recommended that you write source files only in ASCII text, but +Phabricator fully supports UTF-8 source files. However, it won't currently do +encoding transformation, so if you have source files which are not valid UTF-8 +you may run into issues. + +If you have a project which isn't valid UTF-8 because a few files have random +binary nonsense in them, there is a script in libphutil which can help you +identify and fix them: + + project/ $ libphutil/scripts/utils/utf8.php + +Generally, run this script on all source files with "-t" to find files with bad +byte ranges, and then run it without "-t" on each file to identify where there +are problems. For example: + + project/ $ find . -type f -name '*.c' -print0 | xargs -0 -n256 ./utf8 -t + ./hello_world.c + +If this script exits without output, you're in good shape and all the files that +were identified are valid UTF-8. If it found some problems, you need to repair +them. You can identify the specific problems by omitting the "-t" flag: + + project/ $ ./utf8.php hello_world.c + FAIL hello_world.c + + 3 main() + 4 { + 5 printf ("Hello World<0xE9><0xD6>!\n"); + 6 } + 7 + +This shows the offending bytes on line 5 (in the actual console display, they'll +be highlighted). Often a codebase will mostly be valid UTF-8 but have a few +scattered files that have other things in them, like curly quotes which someone +copy-pasted from Word into a comment. In these cases, you can just manually +identify and fix the problems pretty easily. + +If you have a prohibitively large number of UTF-8 issues in your source code, +Phabricator doesn't include any default tools to help you process them in a +systematic way. You could hack up ##utf8.php## as a starting point, or use other +tools to batch-process your source files. + +NOTE: If you have a project which uses a //different encoding// for source +files, there is no easy way to get it working with Phabricator or Arcanist right +now. If it's not reasonable to switch to UTF-8, tell us more about your use case +and we can evaluate supporting it. Since tools like Git don't work well with +other encodings, the prevailing assumption is that this is a rare situation. \ No newline at end of file