# 26. Unicode in ES6
This chapter explains the improved support for Unicode that ECMAScript 6 brings. For a general introduction to Unicode, read Chap. “[Unicode and JavaScript](http://speakingjs.com/es5/ch24.html)” in “Speaking JavaScript”.
## 26.1 Unicode is better supported in ES6
There are three areas in which ECMAScript 6 has improved support for Unicode:
- Unicode escapes for code points beyond 16 bits: `\u{···}`
  - Can be used in identifiers, string literals, template literals and regular expression literals. They are explained in the next section.
- [Strings](ch_strings.html#ch_strings):
  - Iteration honors Unicode code points.
  - Read code point values via `String.prototype.codePointAt()`.
  - Create a string from code point values via `String.fromCodePoint()`.
- [Regular expressions](ch_regexp.html#ch_regexp):
  - New flag `/u` (plus boolean property `unicode`) improves handling of surrogate pairs.
Additionally, ES6 is based on Unicode version 5.1.0, whereas ES5 is based on Unicode version 3.0.
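The string and regular expression features listed above can be tried out with a short sketch. The variable names are made up for illustration; the rocket character (code point 0x1F680) serves as an example of a code point beyond 16 bits.

```javascript
// One astral code point (U+1F680 ROCKET) between two ASCII letters.
const sample = 'a\u{1F680}b';

// `length` counts UTF-16 code units, so the rocket contributes 2.
const codeUnitCount = sample.length; // 4

// Iteration (spread, for-of) honors code points.
const codePointCount = [...sample].length; // 3

// codePointAt() reads a whole code point, starting at a code unit index.
const rocketValue = sample.codePointAt(1); // 0x1F680

// String.fromCodePoint() creates a string from code point values.
const rocket = String.fromCodePoint(0x1F680);

// With flag /u, the dot matches a whole code point instead of one code unit.
const dotWithoutU = /^.$/.test(rocket); // false
const dotWithU = /^.$/u.test(rocket); // true
const hasUnicodeFlag = /^.$/u.unicode; // true
```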
## 26.2 Escape sequences in ES6
There are three parameterized escape sequences for representing characters in JavaScript:
- Hex escape (exactly two hexadecimal digits): `\xHH`
```
> '\x7A' === 'z'
true
```
- Unicode escape (exactly four hexadecimal digits): `\uHHHH`
```
> '\u007A' === 'z'
true
```
- Unicode code point escape (1 or more hexadecimal digits): `\u{···}`
```
> '\u{7A}' === 'z'
true
```
Unicode code point escapes are new in ES6. They let you specify code points beyond 16 bits. If you wanted to do that in ECMAScript 5, you had to encode each code point as two UTF-16 code units (a *surrogate pair*). These code units could be expressed via Unicode escapes. For example, the following statement logs a rocket (code point 0x1F680) to most consoles:
```
console.log('\uD83D\uDE80');
```
With a Unicode code point escape you can specify code points that need more than 16 bits directly:
```
console.log('\u{1F680}');
```
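To make the relationship between the two notations concrete, the following sketch computes the surrogate pair of a code point using the standard UTF-16 encoding arithmetic. The helper `toSurrogatePair()` is hypothetical and only handles code points above 0xFFFF.

```javascript
// Hypothetical helper: splits a code point above 0xFFFF into its
// UTF-16 lead and trail surrogates.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000; // 20 bits remain
  const lead = 0xD800 + (offset >> 10); // upper 10 bits
  const trail = 0xDC00 + (offset & 0x3FF); // lower 10 bits
  return [lead, trail];
}

const [leadSurrogate, trailSurrogate] = toSurrogatePair(0x1F680);
// leadSurrogate is 0xD83D, trailSurrogate is 0xDE80

// Both notations produce the same string.
const rebuilt = String.fromCharCode(leadSurrogate, trailSurrogate);
const sameString = rebuilt === '\u{1F680}'; // true
```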
### 26.2.1 Where can escape sequences be used?
The escape sequences can be used in the following locations:

| | `\uHHHH` | `\u{···}` | `\xHH` |
|---|---|---|---|
| Identifiers | ✔ | ✔ | |
| String literals | ✔ | ✔ | ✔ |
| Template literals | ✔ | ✔ | ✔ |
| Regular expression literals | ✔ | Only with flag `/u` | ✔ |

Identifiers:
- A 4-digit Unicode escape `\uHHHH` becomes a single code point.
- A Unicode code point escape `\u{···}` becomes a single code point.
```
> const hello = 123;
> hell\u{6F}
123
```
String literals:
- Strings are internally stored as UTF-16 code units.
- A hex escape `\xHH` contributes a UTF-16 code unit.
- A 4-digit Unicode escape `\uHHHH` contributes a UTF-16 code unit.
- A Unicode code point escape `\u{···}` contributes the UTF-16 encoding of its code point (one or two UTF-16 code units).
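The code unit counts described above can be checked via `length` and `charCodeAt()`, both of which operate on UTF-16 code units. This is a small sketch with illustrative variable names:

```javascript
// All three escapes denote the same single code unit for 'z'.
const viaHex = '\x7A';
const viaUnicode = '\u007A';
const viaCodePoint = '\u{7A}';
const allEqual = viaHex === viaUnicode && viaUnicode === viaCodePoint; // true

// A code point escape beyond 16 bits contributes two code units.
const astral = '\u{1F680}';
const astralUnits = [astral.charCodeAt(0), astral.charCodeAt(1)];
// astral.length is 2, astralUnits is [0xD83D, 0xDE80]
```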
Template literals:
- In template literals, escape sequences are handled as in string literals.
- In tagged templates, how escape sequences are interpreted depends on the tag function. It can choose between two interpretations:
  - Cooked: escape sequences are interpreted as in string literals.
  - Raw: escape sequences are not interpreted; they are kept as the sequence of characters written in the source.
```
> `hell\u{6F}` // cooked
'hello'
> String.raw`hell\u{6F}` // raw
'hell\\u{6F}'
```
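A tag function receives both interpretations at once: the cooked strings are the elements of its first argument, and the raw strings are available via that argument's property `raw`. The tag function `cookedAndRaw()` below is a hypothetical example, not part of the language:

```javascript
// Hypothetical tag function that returns both interpretations
// of the first template string.
function cookedAndRaw(templateStrings) {
  return {
    cooked: templateStrings[0],
    raw: templateStrings.raw[0],
  };
}

const interpretations = cookedAndRaw`hell\u{6F}`;
// interpretations.cooked is 'hello'
// interpretations.raw is 'hell\\u{6F}' (backslash, u, {, 6, F, })
```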
Regular expressions:
- Unicode code point escapes are only allowed if the flag `/u` is set. Without that flag, `\u{3}` is interpreted as three times the character `u`:
```
> /^\u{3}$/.test('uuu')
true
```
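For comparison, the following sketch runs the same pattern with and without the flag `/u`. With the flag, `\u{3}` denotes the code point 0x3 and no longer matches `'uuu'`. The variable names are illustrative.

```javascript
const withoutU = /^\u{3}$/; // "u" repeated three times
const withU = /^\u{3}$/u; // the code point U+0003

const legacyMatchesUuu = withoutU.test('uuu'); // true
const legacyMatchesCodePoint = withoutU.test('\u0003'); // false

const unicodeMatchesUuu = withU.test('uuu'); // false
const unicodeMatchesCodePoint = withU.test('\u0003'); // true

// With /u, code points beyond 16 bits can be written directly.
const rocketPattern = /^\u{1F680}$/u;
const matchesRocket = rocketPattern.test('\uD83D\uDE80'); // true
```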
### 26.2.2 Escape sequences in the ES6 spec
Various information:
- The spec treats source code as a sequence of Unicode code points: “[Source Text](http://www.ecma-international.org/ecma-262/6.0/#sec-source-text)”
- Unicode escape sequences in identifiers: “[Names and Keywords](http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords)”
- Strings are internally stored as sequences of UTF-16 code units: “[String Literals](http://www.ecma-international.org/ecma-262/6.0/#sec-literals-string-literals)”
- Strings – how various escape sequences are translated to UTF-16 code units: “[Static Semantics: SV](http://www.ecma-international.org/ecma-262/6.0/#sec-static-semantics-sv)”
- Template literals – how various escape sequences are translated to UTF-16 code units: “[Static Semantics: TV and TRV](http://www.ecma-international.org/ecma-262/6.0/#sec-static-semantics-tv-and-trv)”
#### 26.2.2.1 Regular expressions
The spec distinguishes between BMP patterns (flag `/u` not set) and Unicode patterns (flag `/u` set). Sect. “[Pattern Semantics](http://www.ecma-international.org/ecma-262/6.0/#sec-pattern-semantics)” explains that they are handled differently and how.
As a reminder, here is how grammar rules are parameterized in the spec:
- If a grammar rule `R` has the subscript `[U]` then that means there are two versions of it: `R` and `R_U`.
- Parts of the rule can pass on the subscript via `[?U]`.
- If a part of a rule has the prefix `[+U]` it only exists if the subscript `[U]` is present.
- If a part of a rule has the prefix `[~U]` it only exists if the subscript `[U]` is not present.
You can see this parameterization in action in Sect. “[Patterns](http://www.ecma-international.org/ecma-262/6.0/#sec-patterns)”, where the subscript `[U]` creates separate grammars for BMP patterns and Unicode patterns:
- IdentityEscape: In BMP patterns, many characters can be prefixed with a backslash and are interpreted as themselves (for example: if `\u` is not followed by four hexadecimal digits, it is interpreted as `u`). In Unicode patterns that only works for the following characters (which frees up `\u` for Unicode code point escapes): `^ $ \ . * + ? ( ) [ ] { } |`
- RegExpUnicodeEscapeSequence: `"\u{" HexDigits "}"` is only allowed in Unicode patterns. In those patterns, lead and trail surrogates are also grouped to help with UTF-16 decoding.
Sect. “[CharacterEscape](http://www.ecma-international.org/ecma-262/6.0/#sec-characterescape)” explains how various escape sequences are translated to *characters* (roughly: either code units or code points).
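The practical effect of the two grammars can be observed from JavaScript. The sketch below assumes an engine that also implements the web-compatibility grammar for BMP patterns (as browsers and Node.js do). It uses the `RegExp` constructor so that syntax errors can be caught at runtime; the helper `compiles()` is hypothetical.

```javascript
// Hypothetical helper: reports whether a pattern is syntactically valid
// for the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// IdentityEscape: \a is accepted in BMP patterns, rejected in Unicode patterns.
const identityInBmp = compiles('\\a', ''); // true
const identityInUnicode = compiles('\\a', 'u'); // false
const syntaxCharInUnicode = compiles('\\$', 'u'); // true

// RegExpUnicodeEscapeSequence: \u{...} is available in Unicode patterns.
const codePointEscape = compiles('\\u{1F680}', 'u'); // true

// Grouping of surrogates: in a Unicode pattern, the lead and trail
// surrogate inside the character class form one code point.
const rocketChar = '\u{1F680}';
const classWithoutU = /^[\uD83D\uDE80]$/.test(rocketChar); // false
const classWithU = /^[\uD83D\uDE80]$/u.test(rocketChar); // true
```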
## Further reading
“[JavaScript has a Unicode problem](https://mathiasbynens.be/notes/javascript-unicode)” (by Mathias Bynens) explains new Unicode features in ES6.