# 23. New regular expression features

This chapter explains new regular expression features in ECMAScript 6. It helps if you are familiar with ES5 regular expression features and Unicode. If necessary, consult the following two chapters of “Speaking JavaScript”:

- “[Regular Expressions](http://speakingjs.com/es5/ch19.html)”
- “[Unicode and JavaScript](http://speakingjs.com/es5/ch24.html)”

## 23.1 Overview

The following regular expression features are new in ECMAScript 6:

- The new flag `/y` (sticky) anchors each match of a regular expression to the end of the previous match.
- The new flag `/u` (unicode) handles surrogate pairs (such as `\uD83D\uDE80`) as code points and lets you use Unicode code point escapes (such as `\u{1F680}`) in regular expressions.
- The new data property `flags` gives you access to the flags of a regular expression, just like `source` already gives you access to the pattern in ES5:

  ```
  > /abc/ig.source // ES5
  'abc'
  > /abc/ig.flags // ES6
  'gi'
  ```

- You can use the constructor `RegExp()` to make a copy of a regular expression:

  ```
  > new RegExp(/abc/ig).flags
  'gi'
  > new RegExp(/abc/ig, 'i').flags // change flags
  'i'
  ```

## 23.2 New flag `/y` (sticky)

The new flag `/y` changes two things while matching a regular expression `re` against a string:

- Anchored to `re.lastIndex`: The match must start at `re.lastIndex` (the index after the previous match). This behavior is similar to the `^` anchor, but with that anchor, matches must always start at index 0.
- Match repeatedly: If a match was found, `re.lastIndex` is set to the index after the match. This behavior is similar to the `/g` flag. Like `/g`, `/y` is normally used to match multiple times.

The main use case for this matching behavior is tokenizing, where you want each match to immediately follow its predecessor. An example of tokenizing via a sticky regular expression and `exec()` is given later.

Let’s look at how various regular expression operations react to the `/y` flag. The following tables give an overview. I’ll provide more details afterwards.

Methods of regular expressions (`re` is the regular expression that a method is invoked on):

| | Flags | Start matching | Anchored to | Result if match | No match | `re.lastIndex` |
|---|---|---|---|---|---|---|
| `exec()` | – | 0 | – | Match object | `null` | unchanged |
| | `/g` | `re.lastIndex` | – | Match object | `null` | index after match |
| | `/y` | `re.lastIndex` | `re.lastIndex` | Match object | `null` | index after match |
| | `/gy` | `re.lastIndex` | `re.lastIndex` | Match object | `null` | index after match |
| `test()` | (Any) | (like `exec()`) | (like `exec()`) | `true` | `false` | (like `exec()`) |

Methods of strings (`str` is the string that a method is invoked on, `r` is the regular expression parameter):

| | Flags | Start matching | Anchored to | Result if match | No match | `r.lastIndex` |
|---|---|---|---|---|---|---|
| `search()` | –, `/g` | 0 | – | Index of match | `-1` | unchanged |
| | `/y`, `/gy` | 0 | 0 | Index of match | `-1` | unchanged |
| `match()` | – | 0 | – | Match object | `null` | unchanged |
| | `/y` | `r.lastIndex` | `r.lastIndex` | Match object | `null` | index after match |
| | `/g` | After prev. match (loop) | – | Array with matches | `null` | 0 |
| | `/gy` | After prev. match (loop) | After prev. match | Array with matches | `null` | 0 |
| `split()` | –, `/g`, `/y`, `/gy` | After prev. match (loop) | – | Array with strings between matches | `[str]` | unchanged |
| `replace()` | – | 0 | – | First match replaced | No repl. | unchanged |
| | `/y` | `r.lastIndex` | `r.lastIndex` | First match replaced | No repl. | index after match |
| | `/g` | After prev. match (loop) | – | All matches replaced | No repl. | 0 |
| | `/gy` | After prev. match (loop) | After prev. match | All matches replaced | No repl. | 0 |

### 23.2.1 `RegExp.prototype.exec(str)`

If `/g` is not set, matching always starts at the beginning, but skips ahead until a match is found. `REGEX.lastIndex` is not changed.

```
const REGEX = /a/;

REGEX.lastIndex = 7; // ignored
const match = REGEX.exec('xaxa');
console.log(match.index); // 1
console.log(REGEX.lastIndex); // 7 (unchanged)
```

If `/g` is set, matching starts at `REGEX.lastIndex` and skips ahead until a match is found.
`REGEX.lastIndex` is set to the position after the match. That means that you receive all matches if you loop until `exec()` returns `null`.

```
const REGEX = /a/g;

REGEX.lastIndex = 2;
const match = REGEX.exec('xaxa');
console.log(match.index); // 3
console.log(REGEX.lastIndex); // 4 (updated)

// No match at index 4 or later
console.log(REGEX.exec('xaxa')); // null
```

If only `/y` is set, matching starts at `REGEX.lastIndex` and is anchored to that position (no skipping ahead until a match is found). `REGEX.lastIndex` is updated similarly to when `/g` is set.

```
const REGEX = /a/y;

// No match at index 2
REGEX.lastIndex = 2;
console.log(REGEX.exec('xaxa')); // null

// Match at index 3
REGEX.lastIndex = 3;
const match = REGEX.exec('xaxa');
console.log(match.index); // 3
console.log(REGEX.lastIndex); // 4
```

Setting both `/y` and `/g` is the same as only setting `/y`.

### 23.2.2 `RegExp.prototype.test(str)`

`test()` works the same as `exec()`, but it returns `true` or `false` (instead of a match object or `null`) when matching succeeds or fails:

```
const REGEX = /a/y;

REGEX.lastIndex = 2;
console.log(REGEX.test('xaxa')); // false

REGEX.lastIndex = 3;
console.log(REGEX.test('xaxa')); // true
console.log(REGEX.lastIndex); // 4
```

### 23.2.3 `String.prototype.search(regex)`

`search()` ignores the flag `/g` and `lastIndex` (which is not changed, either). Starting at the beginning of the string, it looks for the first match and returns its index (or `-1` if there was no match):

```
const REGEX = /a/;

REGEX.lastIndex = 2; // ignored
console.log('xaxa'.search(REGEX)); // 1
```

If you set the flag `/y`, `lastIndex` is still ignored, but the regular expression is now anchored to index 0.

```
const REGEX = /a/y;

REGEX.lastIndex = 1; // ignored
console.log('xaxa'.search(REGEX)); // -1 (no match)
```

### 23.2.4 `String.prototype.match(regex)`

`match()` has two modes:

- If `/g` is not set, it works like `exec()`.
- If `/g` is set, it returns an Array with the string parts that matched, or `null`.

If the flag `/g` is not set, `match()` captures groups like `exec()`:

```
{
    const REGEX = /a/;

    REGEX.lastIndex = 7; // ignored
    console.log('xaxa'.match(REGEX).index); // 1
    console.log(REGEX.lastIndex); // 7 (unchanged)
}
{
    const REGEX = /a/y;

    REGEX.lastIndex = 2;
    console.log('xaxa'.match(REGEX)); // null

    REGEX.lastIndex = 3;
    console.log('xaxa'.match(REGEX).index); // 3
    console.log(REGEX.lastIndex); // 4
}
```

If only the flag `/g` is set then `match()` returns all matching substrings in an Array (or `null`). Matching always starts at position 0.

```
const REGEX = /a|b/g;
REGEX.lastIndex = 7;
console.log('xaxb'.match(REGEX)); // ['a', 'b']
console.log(REGEX.lastIndex); // 0
```

If you additionally set the flag `/y`, then matching is still performed repeatedly, while anchoring the regular expression to the index after the previous match (or 0).

```
const REGEX = /a|b/gy;

REGEX.lastIndex = 0; // ignored
console.log('xab'.match(REGEX)); // null
REGEX.lastIndex = 1; // ignored
console.log('xab'.match(REGEX)); // null

console.log('ab'.match(REGEX)); // ['a', 'b']
console.log('axb'.match(REGEX)); // ['a']
```

### 23.2.5 `String.prototype.split(separator, limit)`

The complete details of `split()` [are explained in Speaking JavaScript](http://speakingjs.com/es5/ch19.html#String.prototype.match).
For ES6, it is interesting to see how things change if you use the flag `/y`. They don’t: `split()` does not match with the regular expression itself, but with a sticky copy whose `lastIndex` it moves through the string one position at a time. Whether `/y` is set on the original therefore makes no difference, and separators are found anywhere in the string:

```
> 'x##'.split(/#/y)
[ 'x', '', '' ]
> '##x'.split(/#/y)
[ '', '', 'x' ]
> '#x#'.split(/#/y)
[ '', 'x', '' ]
> '##'.split(/#/y)
[ '', '', '' ]
```

As usual, you can use groups to put parts of the separators into the result array:

```
> '##'.split(/(#)/y)
[ '', '#', '', '#', '' ]
```

### 23.2.6 `String.prototype.replace(search, replacement)`

Without the flag `/g`, `replace()` only replaces the first match:

```
const REGEX = /a/;

// One match
console.log('xaxa'.replace(REGEX, '-')); // 'x-xa'
```

If only `/y` is set, you also get at most one match, but that match is anchored to `lastIndex`. As with `exec()`, `lastIndex` is updated afterwards.

```
const REGEX = /a/y;

// Anchored to index 0, no match
console.log('xaxa'.replace(REGEX, '-')); // 'xaxa'
console.log(REGEX.lastIndex); // 0

// Anchored to index 1, one match
REGEX.lastIndex = 1;
console.log('xaxa'.replace(REGEX, '-')); // 'x-xa'
console.log(REGEX.lastIndex); // 2
```

With `/g` set, `replace()` replaces all matches:

```
const REGEX = /a/g;

// Multiple matches
console.log('xaxa'.replace(REGEX, '-')); // 'x-x-'
```

With `/gy` set, `replace()` replaces all matches, but each match is anchored to the end of the previous match:

```
const REGEX = /a/gy;

// Multiple matches
console.log('aaxa'.replace(REGEX, '-')); // '--xa'
```

The parameter `replacement` can also be a function, [consult “Speaking JavaScript” for details](http://speakingjs.com/es5/ch19.html#String.prototype.replace).

### 23.2.7 Example: using sticky matching for tokenizing

The main use case for sticky matching is *tokenizing*, turning a text into a sequence of tokens. One important trait of tokenizing is that tokens are fragments of the text and that there must be no gaps between them. Therefore, sticky matching is perfect here.

```
function tokenize(TOKEN_REGEX, str) {
    const result = [];
    let match;
    while (match = TOKEN_REGEX.exec(str)) {
        result.push(match[1]);
    }
    return result;
}

const TOKEN_GY = /\s*(\+|[0-9]+)\s*/gy;
const TOKEN_G = /\s*(\+|[0-9]+)\s*/g;
```

In a legal sequence of tokens, sticky matching and non-sticky matching produce the same output:

```
> tokenize(TOKEN_GY, '3 + 4')
[ '3', '+', '4' ]
> tokenize(TOKEN_G, '3 + 4')
[ '3', '+', '4' ]
```

If, however, there is non-token text in the string then sticky matching stops tokenizing, while non-sticky matching skips the non-token text:

```
> tokenize(TOKEN_GY, '3x + 4')
[ '3' ]
> tokenize(TOKEN_G, '3x + 4')
[ '3', '+', '4' ]
```

The behavior of sticky matching during tokenizing helps with error handling: illegal text stops the tokenizer instead of being silently skipped.

### 23.2.8 Example: manually implementing sticky matching

If you wanted to manually implement sticky matching, you’d do it as follows: The function `execSticky()` works like `RegExp.prototype.exec()` in sticky mode.

```
function execSticky(regex, str) {
    // Anchor the regex to the beginning of the string.
    // The non-capturing group keeps alternations such as a|b intact.
    const matchSource = '^(?:' + regex.source + ')';
    // Ensure that instance property `lastIndex` is updated
    let matchFlags = regex.flags; // ES6 feature!
    if (!regex.global) {
        matchFlags = matchFlags + 'g';
    }
    const matchRegex = new RegExp(matchSource, matchFlags);

    // Ensure we start matching `str` at `regex.lastIndex`
    const matchOffset = regex.lastIndex;
    const matchStr = str.slice(matchOffset);
    const match = matchRegex.exec(matchStr);

    // Like exec(): reset `lastIndex` and return null if there is no match
    if (match === null) {
        regex.lastIndex = 0;
        return null;
    }

    // Translate indices from `matchStr` to `str`
    regex.lastIndex = matchRegex.lastIndex + matchOffset;
    match.index = match.index + matchOffset;
    return match;
}
```

## 23.3 New flag `/u` (unicode)

The flag `/u` switches on a special Unicode mode for a regular expression. That mode has two features:

1. You can use Unicode code point escape sequences such as `\u{1F42A}` for specifying characters via code points. Normal Unicode escapes such as `\u03B1` only have a range of four hexadecimal digits (which equals the basic multilingual plane).
2. “Characters” in the regular expression pattern and the string are code points (not UTF-16 code units). Code units are converted into code points.

[A section in the chapter on Unicode](ch_unicode.html#sec_escape-sequences) has more information on escape sequences. I’ll explain the consequences of feature 2 next. Instead of Unicode code point escapes (e.g., `\u{1F680}`), I’m using two UTF-16 code units (e.g., `\uD83D\uDE80`). That makes it clear that surrogate pairs are grouped in Unicode mode and works in both Unicode mode and non-Unicode mode.

```
> '\u{1F680}' === '\uD83D\uDE80' // code point vs. surrogate pairs
true
```

### 23.3.1 Consequence: lone surrogates in the regular expression only match lone surrogates

In non-Unicode mode, a lone surrogate in a regular expression is even found inside (surrogate pairs encoding) code points:

```
> /\uD83D/.test('\uD83D\uDC2A')
true
```

In Unicode mode, surrogate pairs become atomic units and lone surrogates are not found “inside” them:

```
> /\uD83D/u.test('\uD83D\uDC2A')
false
```

Actual lone surrogates are still found:

```
> /\uD83D/u.test('\uD83D \uD83D\uDC2A')
true
> /\uD83D/u.test('\uD83D\uDC2A \uD83D')
true
```

### 23.3.2 Consequence: you can put code points in character classes

In Unicode mode, you can put code points into character classes and they won’t be interpreted as two characters anymore.

```
> /^[\uD83D\uDC2A]$/u.test('\uD83D\uDC2A')
true
> /^[\uD83D\uDC2A]$/.test('\uD83D\uDC2A')
false

> /^[\uD83D\uDC2A]$/u.test('\uD83D')
false
> /^[\uD83D\uDC2A]$/.test('\uD83D')
true
```

### 23.3.3 Consequence: the dot operator (`.`) matches code points, not code units

In Unicode mode, the dot operator matches code points (one or two code units). In non-Unicode mode, it matches single code units. For example:

```
> '\uD83D\uDE80'.match(/./gu).length
1
> '\uD83D\uDE80'.match(/./g).length
2
```

### 23.3.4 Consequence: quantifiers apply to code points, not code units

In Unicode mode, quantifiers apply to code points (one or two code units). In non-Unicode mode, they apply to single code units.
For example:

```
> /\uD83D\uDE80{2}/u.test('\uD83D\uDE80\uD83D\uDE80')
true

> /\uD83D\uDE80{2}/.test('\uD83D\uDE80\uD83D\uDE80')
false
> /\uD83D\uDE80{2}/.test('\uD83D\uDE80\uDE80')
true
```

## 23.4 New data property `flags`

In ECMAScript 6, regular expressions have the following data properties:

- The pattern: `source`
- The flags: `flags`
- Individual flags: `global`, `ignoreCase`, `multiline`, `sticky`, `unicode`
- Other: `lastIndex`

As an aside, `lastIndex` is the only instance property now; all other data properties are implemented via internal instance properties and getters such as [`get RegExp.prototype.global`](http://www.ecma-international.org/ecma-262/6.0/#sec-get-regexp.prototype.global).

The property `source` (which already existed in ES5) contains the regular expression pattern as a string:

```
> /abc/ig.source
'abc'
```

The property `flags` is new. It contains the flags as a string, with one character per flag:

```
> /abc/ig.flags
'gi'
```

You can’t change the flags of an existing regular expression (`ignoreCase` etc. have always been immutable), but `flags` allows you to make a copy where the flags are changed:

```
function copyWithIgnoreCase(regex) {
    // Repeating a flag is a syntax error, so only add `i` if it is missing
    const flags = regex.ignoreCase ? regex.flags : regex.flags + 'i';
    return new RegExp(regex.source, flags);
}
```

The next section explains another way to make modified copies of regular expressions.

## 23.5 `RegExp()` can be used as a copy constructor

In ES6 there are two variants of the constructor `RegExp()` (the second one is new):

- `new RegExp(pattern : string, flags = '')`
  A new regular expression is created as specified via `pattern`. If `flags` is missing, the empty string `''` is used.
- `new RegExp(regex : RegExp, flags = regex.flags)`
  `regex` is cloned. If `flags` is provided then it determines the flags of the copy.
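
As a small sketch of what cloning means in practice, the snippet below checks that a copy is a separate object that shares the pattern but starts with its own `lastIndex`. The variable names (`original`, `sameFlags`, `newFlags`, `summary`) are made up for this illustration.

```javascript
// Sketch: a copy made via the RegExp constructor is a separate object.
const original = /a/g;
original.lastIndex = 3;

const sameFlags = new RegExp(original);      // flags default to original.flags
const newFlags = new RegExp(original, 'iy'); // flags replaced, pattern kept

const summary = {
    sameSource: sameFlags.source === original.source, // true
    sameFlagsCopied: sameFlags.flags,                  // 'g'
    newFlagsCopied: newFlags.flags,                    // 'iy'
    distinctObjects: sameFlags !== original,           // true
    copyLastIndex: sameFlags.lastIndex,                // 0 (not copied)
    originalLastIndex: original.lastIndex,             // 3 (untouched)
};
console.log(summary);
```

Because the copy starts at `lastIndex` 0 and never writes to the original, it is a convenient way to match without side effects on a regular expression that other code also uses.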

The following interaction demonstrates the latter variant:

```
> new RegExp(/abc/ig).flags
'gi'
> new RegExp(/abc/ig, 'i').flags // change flags
'i'
```

Therefore, the `RegExp` constructor gives us another way to change flags:

```
function copyWithIgnoreCase(regex) {
    // Repeating a flag is a syntax error, so only add `i` if it is missing
    const flags = regex.ignoreCase ? regex.flags : regex.flags + 'i';
    return new RegExp(regex, flags);
}
```

### 23.5.1 Example: an iterable version of `exec()`

The following function `execAll()` is an iterable version of `exec()` that fixes several issues with using `exec()` to retrieve all matches of a regular expression:

- Looping over the matches is unnecessarily complicated (you call `exec()` until it returns `null`).
- `exec()` mutates the regular expression, which means that side effects can become a problem.
- The flag `/g` must be set. Otherwise, only the first match is returned.

```
function* execAll(regex, str) {
    // Make sure flag /g is set (without repeating it, which would be
    // a syntax error) and regex.lastIndex isn't changed
    const flags = regex.global ? regex.flags : regex.flags + 'g';
    const localCopy = new RegExp(regex, flags);
    let match;
    while (match = localCopy.exec(str)) {
        yield match;
    }
}
```

Using `execAll()`:

```
const str = '"fee" "fi" "fo" "fum"';
const regex = /"([^"]*)"/;

// Access capture of group #1 via destructuring
for (const [, group1] of execAll(regex, str)) {
    console.log(group1);
}
// Output:
// fee
// fi
// fo
// fum
```

## 23.6 String methods that delegate to regular expression methods

The following string methods now delegate some of their work to regular expression methods:

- `String.prototype.match` calls `RegExp.prototype[Symbol.match]`.
- `String.prototype.replace` calls `RegExp.prototype[Symbol.replace]`.
- `String.prototype.search` calls `RegExp.prototype[Symbol.search]`.
- `String.prototype.split` calls `RegExp.prototype[Symbol.split]`.

For more information, consult Sect.
“[String methods that delegate regular expression work to their parameters](ch_strings.html#sec_delegating-string-methods-regexp)” in the chapter on strings.

## Further reading

If you want to know in more detail how the regular expression flag `/u` works, I recommend the article “[Unicode-aware regular expressions in ECMAScript 6](https://mathiasbynens.be/notes/es6-unicode-regex)” by Mathias Bynens.
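
To experiment with `/u` while reading, a tiny helper such as the following may be useful. It is only a sketch: `countCodePoints` is a hypothetical name, not something the chapter or the language defines. It relies on the fact that, in Unicode mode, each match of a single-character pattern is a whole code point.

```javascript
// Hypothetical helper (not from the chapter): counts code points by letting
// a /u regular expression treat each surrogate pair as a single character.
function countCodePoints(str) {
    // [\s\S] instead of the dot, so that line terminators are counted as well
    const matches = str.match(/[\s\S]/gu);
    return matches === null ? 0 : matches.length;
}

const rocket = '\uD83D\uDE80'; // U+1F680, encoded as a surrogate pair
console.log(rocket.length);                        // 2 (UTF-16 code units)
console.log(countCodePoints(rocket));              // 1 (code point)
console.log(countCodePoints('a' + rocket + 'b'));  // 3
console.log(countCodePoints(''));                  // 0
```

Dropping the `u` flag from the pattern makes the helper count code units again, which is a quick way to see the difference between the two modes.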