The proposal “RegExp Unicode Property Escapes” by Mathias Bynens is at stage 4. This blog post explains how it works.
JavaScript lets you match characters by mentioning the “names” of sets of characters. For example, \s stands for “whitespace”:
> /^\s+$/u.test('\t \n\r')
true
The proposal lets you additionally match characters by mentioning their Unicode character properties (what those are is explained next) inside the curly braces of \p{}. Two examples:
> /^\p{White_Space}+$/u.test('\t \n\r')
true
> /^\p{Script=Greek}+$/u.test('μετά')
true
As you can see, one of the benefits of property escapes is that they make regular expressions more self-descriptive. Additional benefits will become clear later.
Before we delve into how property escapes work, let’s examine what Unicode character properties are.
In the Unicode standard, each character has properties – metadata describing it. Properties play an important role in defining the nature of a character. Quoting the Unicode Standard, Sect. 3.3, D3:
The semantics of a character are determined by its identity, normative properties, and behavior.
These are a few examples of properties:
Name: a unique name, composed of uppercase letters, digits, hyphens and spaces. For example:
    Name = LATIN CAPITAL LETTER A
    Name = GRINNING FACE
General_Category: categorizes characters. For example:
    General_Category = Lowercase_Letter
    General_Category = Currency_Symbol
White_Space: used for marking invisible spacing characters, such as spaces, tabs and newlines. For example:
    White_Space = True
    White_Space = False
Age: the version of the Unicode Standard in which a character was introduced. For example, the Euro sign € was added in version 2.1 of the Unicode Standard:
    Age = 2.1
Block: a contiguous range of code points. Blocks don’t overlap and their names are unique. For example:
    Block = Basic_Latin (range U+0000..U+007F)
    Block = Cyrillic (range U+0400..U+04FF)
Script: a collection of characters used by one or more writing systems. For example:
    Script = Greek
    Script = Hebrew
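Some of these properties can be tested directly with the \p{} syntax shown at the start of this post. The following is a minimal sketch, assuming an engine that supports property escapes; the helper name hasProperty is made up for this illustration and is not part of the proposal.

```javascript
// Hypothetical helper: does every code point of `str` have the given property?
// `propertyExpression` is whatever goes between the curly braces of \p{}.
function hasProperty(propertyExpression, str) {
  const regExp = new RegExp('^\\p{' + propertyExpression + '}+$', 'u');
  return regExp.test(str);
}

console.log(hasProperty('General_Category=Lowercase_Letter', 'a')); // true
console.log(hasProperty('General_Category=Currency_Symbol', '€'));  // true
console.log(hasProperty('White_Space', '\t'));                      // true
console.log(hasProperty('Script=Hebrew', '\u05D0'));                // true (HEBREW LETTER ALEF)
console.log(hasProperty('Script=Greek', 'a'));                      // false
```

Building the regular expression from a string keeps the sketch short; in real code you would normally write the literal, such as /^\p{Script=Hebrew}+$/u.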
The following types of properties exist:
Enumerated: General_Category is an enumerated property.
Boolean: the values are True and False. Boolean properties are also called binary, because they are like markers that characters either have or not. White_Space is a binary property.
Catalog: Age and Script are catalog properties.
Miscellaneous: Name is a miscellaneous property.
Properties and property values are matched as follows:
Case, spaces, underscores and hyphens are ignored when comparing names: "General_Category", "general category", "-general-category-", "GeneralCategory" are all considered to be the same property.
PropertyAliases.txt and PropertyValueAliases.txt define alternative ways of referring to properties and property values. For example:
    Property names: General_Category, gc
    Property values (all names on a line are equivalent):
        Lowercase_Letter, Ll
        Currency_Symbol, Sc
        True, T, Yes, Y
        False, F, No, N
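The aliases for General_Category and its values can be tried out in regular expressions. A small sketch, assuming an engine with property escapes; the variable names are chosen for this example only. All four patterns use different names for the same property and the same value, so they should behave identically:

```javascript
// Four spellings of "General_Category is Lowercase_Letter",
// combining the long and short names of the property and of the value.
const lowercaseLetterPatterns = [
  /^\p{General_Category=Lowercase_Letter}$/u,
  /^\p{General_Category=Ll}$/u,
  /^\p{gc=Lowercase_Letter}$/u,
  /^\p{gc=Ll}$/u,
];

// For each pattern: [matches 'a', matches 'A']
const aliasResults = lowercaseLetterPatterns.map(
  (regExp) => [regExp.test('a'), regExp.test('A')]
);
console.log(aliasResults);
// [ [ true, false ], [ true, false ], [ true, false ], [ true, false ] ]

// The same works for Currency_Symbol and its alias Sc:
console.log(/^\p{gc=Currency_Symbol}$/u.test('€')); // true
console.log(/^\p{gc=Sc}$/u.test('€'));              // true
```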
Unicode property escapes look like this:
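(1) Match all characters whose property prop has the value value: \p{prop=value}
(2) Match all characters that do not have a property prop whose value is value: \P{prop=value}
(3) Match all characters whose binary property bin_prop is True: \p{bin_prop}
(4) Match all characters whose binary property bin_prop is False: \P{bin_prop}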
Forms (3) and (4) can also be used as an abbreviation for General_Category. For example: \p{Lowercase_Letter} is an abbreviation for \p{General_Category=Lowercase_Letter}.
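To make the syntax concrete, here is a sketch with one regular expression per form: \p{prop=value}, \P{prop=value}, \p{bin_prop} and \P{bin_prop}, plus the General_Category abbreviation. The variable names are made up for this example, and it assumes an engine that supports property escapes.

```javascript
const greek = /^\p{Script=Greek}+$/u;          // \p{prop=value}
const notGreek = /^\P{Script=Greek}+$/u;       // \P{prop=value}
const uppercase = /^\p{Uppercase}+$/u;         // \p{bin_prop}
const notUppercase = /^\P{Uppercase}+$/u;      // \P{bin_prop}
const lowerLetter = /^\p{Lowercase_Letter}+$/u; // short for General_Category=Lowercase_Letter

const alphaBetaGamma = '\u03B1\u03B2\u03B3'; // αβγ

console.log(greek.test(alphaBetaGamma));    // true
console.log(notGreek.test(alphaBetaGamma)); // false
console.log(notGreek.test('abc'));          // true
console.log(uppercase.test('ABC'));         // true
console.log(notUppercase.test('ABC'));      // false
console.log(notUppercase.test('abc'));      // true
console.log(lowerLetter.test('abc'));       // true
console.log(lowerLetter.test('ABC'));       // false
```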
Important: In order to use property escapes, regular expressions must have the flag /u. Without /u, \p is the same as p.
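The effect of the flag is easy to overlook, because the pattern is still accepted without it. The following sketch compares the same pattern with and without /u (the variable names are chosen for this example). Without the flag, \p is just p and the curly braces are ordinary characters, so the pattern matches the literal text p{Ll}:

```javascript
const withoutFlagU = /^\p{Ll}$/;
const withFlagU = /^\p{Ll}$/u;

// Without /u: no property escape, the pattern means the literal text 'p{Ll}'
console.log(withoutFlagU.test('a'));     // false
console.log(withoutFlagU.test('p{Ll}')); // true

// With /u: \p{Ll} matches one lowercase letter
console.log(withFlagU.test('a'));        // true
console.log(withFlagU.test('p{Ll}'));    // false
```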
Things to note:
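Property escapes do not support loose matching: property names and values must be written exactly as they appear in PropertyAliases.txt and PropertyValueAliases.txt.
The supported properties include:
    General_Category
    Script
    Script_Extensions
    Binary properties such as Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, Any, ASCII, Assigned, ID_Start, ID_Continue, Join_Control, Emoji_Presentation, Emoji_Modifier, Emoji_Modifier_Base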
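Property escapes in JavaScript only accept property names and values in their exact spelling. With the flag /u, an unknown name is a syntax error rather than a silent non-match. The sketch below checks a few spellings; isValidPropertyEscape is a hypothetical helper written for this example, and it goes through the RegExp constructor so that the error can be caught at runtime.

```javascript
// Hypothetical helper: can `propertyExpression` be used inside \p{}?
function isValidPropertyEscape(propertyExpression) {
  try {
    new RegExp('\\p{' + propertyExpression + '}', 'u');
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) {
      return false;
    }
    throw err; // anything else is unexpected
  }
}

console.log(isValidPropertyEscape('General_Category=Lowercase_Letter')); // true
console.log(isValidPropertyEscape('gc=Ll'));                             // true
console.log(isValidPropertyEscape('Script_Extensions=Greek'));           // true
console.log(isValidPropertyEscape('general category=Lowercase_Letter')); // false
console.log(isValidPropertyEscape('GeneralCategory=Ll'));                // false
console.log(isValidPropertyEscape('lowercase_letter'));                  // false
```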
Matching whitespace:
> /^\p{White_Space}+$/u.test('\t \n\r')
true
Matching letters:
> /^\p{Letter}+$/u.test('πüé')
true
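One practical use of \p{Letter} is pulling words out of text. Without the flags /u and /i, \w only covers ASCII letters, digits and the underscore, so it cuts words at accented letters. A minimal sketch; the function name extractWords is made up for this example:

```javascript
// Returns all runs of letters (in any script) found in `text`.
function extractWords(text) {
  return text.match(/\p{Letter}+/gu) || [];
}

const sentence = 'caf\u00E9 au lait, 2 cups'; // 'café au lait, 2 cups'

console.log(extractWords(sentence));
// [ 'café', 'au', 'lait', 'cups' ]

// For comparison: \w stops at the non-ASCII letter é
console.log('caf\u00E9'.match(/\w+/g));
// [ 'caf' ]
```

Note that digits are not letters, which is why the 2 is skipped.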
Matching Greek letters:
> /^\p{Script=Greek}+$/u.test('μετά')
true
Matching Latin letters:
> /^\p{Script=Latin}+$/u.test('Grüße')
true
> /^\p{Script=Latin}+$/u.test('façon')
true
> /^\p{Script=Latin}+$/u.test('mañana')
true
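One thing to keep in mind when matching by script: characters such as spaces and most punctuation are shared between scripts and have the script Common, so a pattern that only allows Script=Latin rejects ordinary sentences. The following sketch shows the effect and one possible way around it (allowing Common as well); the variable names are made up for this example.

```javascript
const latinOnly = /^\p{Script=Latin}+$/u;
const latinOrCommon = /^[\p{Script=Latin}\p{Script=Common}]+$/u;

const word = 'Gr\u00FC\u00DFe';            // 'Grüße'
const phrase = 'Gr\u00FC\u00DFe, Welt!';   // 'Grüße, Welt!'
const greekWord = '\u03BC\u03B5\u03C4\u03B1'; // 'μετα'

console.log(latinOnly.test(word));          // true
console.log(latinOnly.test(phrase));        // false: ',', ' ' and '!' are not Latin
console.log(latinOrCommon.test(phrase));    // true
console.log(latinOrCommon.test(greekWord)); // false
```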
Matching lone surrogate characters:
> /^\p{Surrogate}+$/u.test('\u{D83D}')
true
> /^\p{Surrogate}+$/u.test('\u{DE00}')
true
Note that Unicode code points in astral planes (such as emojis) are composed of two JavaScript characters (a leading surrogate and a trailing surrogate). Therefore, you’d expect the previous regular expression to match the emoji 😀, which is all surrogates:
> '😀'.length
2
> '😀'.charCodeAt(0).toString(16)
'd83d'
> '😀'.charCodeAt(1).toString(16)
'de00'
However, with the /u flag, property escapes match code points, not JavaScript characters:
> /^\p{Surrogate}+$/u.test('😀')
false
In other words, 😀 is considered to be a single character:
> /^.$/u.test('😀')
true
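The difference between code points and JavaScript characters (code units) can be seen side by side. A short sketch, assuming an engine with property escapes; the emoji is written as a code point escape so that the example does not depend on the file encoding:

```javascript
const grinningFace = '\u{1F600}'; // 😀

// Two code units, but one code point
console.log(grinningFace.length);      // 2
console.log([...grinningFace].length); // 1 (string iteration goes by code point)

// Without /u, the dot matches a single code unit
console.log(/^.$/.test(grinningFace));  // false
console.log(/^..$/.test(grinningFace)); // true

// With /u, the dot and property escapes see a single code point
console.log(/^.$/u.test(grinningFace));                      // true
console.log(/^\p{Emoji_Presentation}$/u.test(grinningFace)); // true
```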
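V8 5.8+ implements this proposal; it is switched on via the flag --harmony_regexp_property:
Node.js: node --harmony_regexp_property
    To check which version of V8 Node.js uses: npm version
Chrome: open chrome://version/ to check the version of V8 and to find the path of the executable. For example:
    /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
    Then start Chrome with the flag:
    '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome' --js-flags="--harmony_regexp_property"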
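If your code has to run on engines with and without this feature, you can test for it at runtime. The following is one possible sketch, not an official detection mechanism; the function name supportsPropertyEscapes is made up for this example. It uses the RegExp constructor, because a regular expression literal with unsupported syntax would prevent the whole script from being parsed.

```javascript
// Hypothetical feature test: returns true if \p{...} works with the flag /u.
function supportsPropertyEscapes() {
  try {
    // Engines without the feature throw a SyntaxError here.
    return new RegExp('^\\p{Lowercase_Letter}$', 'u').test('a');
  } catch (err) {
    return false;
  }
}

if (supportsPropertyEscapes()) {
  console.log('Property escapes are available');
} else {
  console.log('Property escapes are not available');
}
```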
JavaScript: the section on the flag /u (unicode) in “Exploring ES6”
The Unicode standard: PropList.txt, PropertyAliases.txt, PropertyValueAliases.txt