Use unicode mode when tokenizing strings like user realnames
Summary:
Fixes T9732. We currently tokenize strings (like user realnames) in the default non-Unicode mode, where patterns like `\s` can match a single byte inside a multibyte UTF-8 character under some locales.
Add the `/u` modifier so tokenization is Unicode-aware instead.
Test Plan:
The behavior of `\s` depends on environmental settings like LC_ALL.
With LC_ALL set to "C", the byte `\xA0` is not considered a whitespace character.
With LC_ALL set to "en_US", it is, so the string "\xE5\xBF\xA0" (the UTF-8 encoding of "忠") is split at its final byte:
```
$ php -r 'setlocale(LC_ALL, "C"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
1
$ php -r 'setlocale(LC_ALL, "en_US"); echo count(preg_split("/\s/", "\xE5\xBF\xA0")) . "\n";'
2
```
To reproduce the original issue, I added an explicit:
```
setlocale(LC_ALL, "en_US");
```
...call before the `preg_split()` call. This caused "忠" to be improperly split.
I then added "/u", and observed proper tokenization.
Reviewers: chad
Reviewed By: chad
Subscribers: qiu8310
Maniphest Tasks: T9732
Differential Revision: https://secure.phabricator.com/D14441
			
			
@@ -107,7 +107,7 @@ abstract class PhabricatorTypeaheadDatasource extends Phobject {
       return array();
     }
 
-    $tokens = preg_split('/\s+|[-\[\]]/', $string);
+    $tokens = preg_split('/\s+|[-\[\]]/u', $string);
     return array_unique($tokens);
   }
 
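The changed line can be exercised in isolation. The following is a minimal sketch, not the actual Phabricator method: the function name `tokenize_sketch` and the empty-string guard are assumptions made for illustration, and only the `preg_split()` pattern is taken from the diff.

```php
<?php
// Hypothetical helper for illustration only; the name and the guard below
// are assumptions, not code from PhabricatorTypeaheadDatasource.
function tokenize_sketch($string) {
  if (!strlen($string)) {
    return array();
  }

  // With "/u", PCRE treats the pattern and subject as UTF-8, so a
  // continuation byte such as \xA0 is never matched on its own.
  $tokens = preg_split('/\s+|[-\[\]]/u', $string);

  // Reindex so the result is a plain list after duplicates are dropped.
  return array_values(array_unique($tokens));
}

// "\xE5\xBF\xA0" is the UTF-8 encoding of "忠" (U+5FE0).
$tokens = tokenize_sketch("\xE5\xBF\xA0 Smith-Jones");
echo implode("|", $tokens) . "\n";
```

Because the subject is decoded as UTF-8 before matching, "忠" should come back as a single token regardless of the `setlocale()` setting, alongside "Smith" and "Jones".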
	 epriestley