UnicodeSets use regular-expression syntax to allow for arbitrary set operations (Union, Intersection, Difference) on sets of Unicode characters. The base sets can be specified explicitly, such as [a-m w-z], or using a combination of Unicode properties such as the following, for the Arabic script characters that have a canonical decomposition:
[[:script=arabic:]&[:decompositiontype=canonical:]]
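As a rough illustration of the set operations, the sketch below models the two example sets with ordinary Python sets. It is not how the demo is implemented. Python's standard library has no Script property, so the hypothetical helper `looks_arabic` approximates "Arabic script" by checking whether the character name starts with ARABIC; treat that as an assumption, not as an equivalent of [:script=arabic:].

```python
import sys
import unicodedata


def char_range(first, last):
    """Return the set of characters from first to last, inclusive."""
    return {chr(cp) for cp in range(ord(first), ord(last) + 1)}


# [a-m w-z] is the union of two explicit ranges.
base = char_range("a", "m") | char_range("w", "z")


def has_canonical_decomposition(ch):
    """True if the character has a decomposition mapping without a <tag>."""
    mapping = unicodedata.decomposition(ch)
    return bool(mapping) and not mapping.startswith("<")


def looks_arabic(ch):
    """ASSUMPTION: approximate the Arabic script by the character name."""
    return unicodedata.name(ch, "").startswith("ARABIC")


all_chars = [chr(cp) for cp in range(sys.maxunicode + 1)]
arabic = {ch for ch in all_chars if looks_arabic(ch)}
canonical = {ch for ch in all_chars if has_canonical_decomposition(ch)}

# Intersection, mirroring the & operator in the UnicodeSet above.
arabic_canonical = arabic & canonical
```

The exact membership depends on the Unicode version bundled with the Python interpreter, so counts may differ from what the online demo reports.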
Enter a UnicodeSet into the Input box, and hit Show Set. You can also choose certain combinations of options for display, such as abbreviated or not.
The values you use are encapsulated into a URL for reference, such as
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=\p{sc:Greek}
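If you want to build such a URL programmatically, a minimal sketch follows. Only the base URL and the `a` query parameter come from this page; the function name `unicodeset_url` is hypothetical, and leaving `:`, `/` and `=` unencoded is a choice made here to resemble the links on this page, not a documented requirement.

```python
from urllib.parse import quote

BASE_URL = "https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp"


def unicodeset_url(expression):
    """Percent-encode a UnicodeSet expression into the 'a' query parameter."""
    # Characters such as \, {, }, [, ] and & are encoded; ':', '/' and '='
    # are left as-is to keep the URL readable.
    return f"{BASE_URL}?a={quote(expression, safe=':/=')}"


print(unicodeset_url(r"\p{sc:Greek}"))
```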
If you add properties to the Group By box, you can sort the results by property values. For example, if you set it to General_Category Numeric_Value (or the short form gc nv), you'll see the results sorted first by the general category of the characters, and then by the numeric value.
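The two-level ordering can be pictured as sorting on a tuple key. The sketch below uses Python's `unicodedata` module as a stand-in for the demo's property data; the function names are hypothetical, and placing characters without a numeric value last within each category, then breaking ties by code point, are assumptions made for this example.

```python
import unicodedata


def group_by_key(ch):
    """Sort key: general category first, then numeric value."""
    numeric_value = unicodedata.numeric(ch, None)
    return (
        unicodedata.category(ch),   # e.g. "Ll", "Nd", "Nl"
        numeric_value is None,      # ASSUMPTION: non-numeric characters last
        numeric_value if numeric_value is not None else 0.0,
        ord(ch),                    # ASSUMPTION: code point as tie-breaker
    )


def sort_for_display(chars):
    """Return the characters ordered by (gc, nv)."""
    return sorted(chars, key=group_by_key)


print(sort_for_display("b9\u2167a1"))
```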
UnicodeSets are defined according to the description in UTS #35: Locale Data Markup Language (LDML), but have some useful extensions in these online demos.
Properties can be specified either with Perl-style notation (\p{script=arabic}) or with POSIX-style notation ([:script=arabic:]). Properties and values can either use a long form (like script) or a short form (like sc).
No argument is equivalent to “Yes”; this is mostly useful with binary properties, like \p{isLowercase}.
The following examples illustrate the syntax with a particular property-value pair: the property age and the value 3.2:
The : can be used in place of =. (Mostly because : doesn't require percent-encoding in URLs.)
\p{age:3.2} and [:age:3.2:]
The Perl and POSIX syntaxes for negation are \P{...} and [:^...:], respectively. The characters ≠ and ! are added for convenience:
\p{age≠3.2} and [:age≠3.2:]
\p{age!=3.2} and [:age!=3.2:]
\p{age!:3.2} and [:age!:3.2:]
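To make the equivalences between these spellings concrete, here is a small, hypothetical parser that reduces a single property expression to a (property, value, negated) triple. It is a sketch of the notation described above, not the demo's parser. One assumption is labeled in the code: combining an outer negation with a negating operator is treated as a double negation.

```python
import re

# Outer notation: \p{...}, \P{...}, [:...:] or [:^...:].
_OUTER = re.compile(
    r"""^(?:
          \\(?P<perl>[pP])\{(?P<perl_body>[^}]*)\}
        | \[:(?P<caret>\^?)(?P<posix_body>.*):\]
        )$""",
    re.VERBOSE,
)
# Inner notation: property, operator, value.
_INNER = re.compile(r"^(?P<prop>\w+)(?P<op>!=|!:|≠|=|:)(?P<value>.*)$")
_NEGATING_OPS = {"!=", "!:", "≠"}


def parse_property(expression):
    """Return (property, value, negated) for one property expression."""
    outer = _OUTER.match(expression)
    if outer is None:
        raise ValueError(f"not a property expression: {expression!r}")
    if outer.group("perl"):
        outer_negated = outer.group("perl") == "P"
        body = outer.group("perl_body")
    else:
        outer_negated = outer.group("caret") == "^"
        body = outer.group("posix_body")
    inner = _INNER.match(body)
    if inner is None:
        # No argument is equivalent to "Yes".
        prop, op, value = body, "=", "Yes"
    else:
        prop, op, value = inner.group("prop", "op", "value")
    # ASSUMPTION: an outer negation plus a negating operator cancel out.
    negated = outer_negated != (op in _NEGATING_OPS)
    return prop, value, negated


print(parse_property(r"\p{age!=3.2}"))
```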
For the name property, regular expressions can be used for the value, enclosed in /.../. For example, in the following expression, the first term will select all those Unicode characters whose names contain “CJK”. The rest of the expression will then subtract the ideographic characters, showing that these can be used in arbitrary combinations.
[[[:name=/CJK/:]-[:ideographic:]]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B:name=/CJK/:%5D-%5B:ideographic:%5D%5D)
[[:name=/\bDOT$/:]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B:name=/%5CbDOT$/:%5D)
[[:block=/(?i)arab/:]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B:block=/(?i)arab/:%5D)
[[:toNFKC=/\./:]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B:toNFKC=/%5C./:%5D)
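The name-based examples above can be approximated offline. The sketch below scans every code point and keeps those whose character name contains a match for the pattern. It uses Python's `re` module as a stand-in for the demo's regex engine, so details of regex syntax may differ, and the function name is hypothetical.

```python
import re
import sys
import unicodedata


def chars_with_name_matching(pattern):
    """Return the characters whose Unicode name contains a regex match."""
    regex = re.compile(pattern)
    matches = set()
    for cp in range(sys.maxunicode + 1):
        # Code points without a name (for example unassigned ones) are skipped.
        name = unicodedata.name(chr(cp), None)
        if name is not None and regex.search(name):
            matches.add(chr(cp))
    return matches


# Roughly corresponds to [:name=/\bDOT$/:]
dot_chars = chars_with_name_matching(r"\bDOT$")
print(len(dot_chars))
```

The result depends on the Unicode version that ships with the Python interpreter, which may be older or newer than the one used by the demo.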
Some particularly useful regex features are:
\b means a word break, ^ means the start of the string, and $ means the end. So /^DOT\b/ means the word DOT at the start.
(?i) means case-insensitive matching.
Caveats:
… /.../ pattern.
… [:...:] syntax on the outside, such as:
[:Block=/Aegean_Numbers/:] returns a different number of characters than [:Block=Aegean_Numbers:], because it skips Unassigned code points.
[:Block=aegeannumbers:] works, but [:Block=/aegeannumbers/:] fails; you have to use [:Block=/Aegean_Numbers/:] or [:Block=/(?i)aegean_numbers/:].
Property values can be compared to those for other properties, using the syntax @...@. For example:
\p{idna2003!=@uts46@}
\p{idna2003!=@uts46@}&\p{age=3.2}
There is a special property “cp” that returns the code point itself. For example:
\p{toLowercase!=@cp@}
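In plain terms, this set contains every character that changes when lowercased. A minimal sketch of the same comparison, using Python's `str.lower` as a stand-in for the toLowercase function (an assumption; results depend on the Unicode version of the interpreter):

```python
import sys

# Characters whose lowercase mapping differs from the character itself,
# i.e. an approximation of \p{toLowercase!=@cp@}.
changed_by_lowercase = {
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if chr(cp).lower() != chr(cp)
}

print("A" in changed_by_lowercase, "a" in changed_by_lowercase)
```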
You can see a full listing of the possible properties on https://util.unicode.org/UnicodeJsps/properties.jsp. The standard Unicode properties are supported, plus the extra ICU properties. There are some additional properties just in this demo. The easiest way to see the properties for a range of characters is to use a set like [:Greek:] in the Input, and then set the Group By box to the property name.
Normally, \p{isX} is equivalent to \p{toX=@cp@}. There are some exceptions and missing cases.
Note: The Unassigned, Surrogate, and Private Use code points are skipped in the generation of some of these sets.
The following provides details for some cases.
Unicode defines a number of string casing functions in Section 3.13 Default Case Algorithms. These string functions can also be applied to single characters.
Warning: the first three sets may be somewhat misleading: isLowercase means that the character is the same as its lowercase version, which includes all uncased characters. To get those characters that are cased characters and lowercase, use [[:isLowercase:]&[:isCased:]]
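The pitfall in the warning is easy to reproduce. In the sketch below, `is_lowercase` follows the description given here (the character equals its own lowercase version), using Python's `str.lower` as a stand-in. The helper `is_cased_approx` is a hypothetical simplification: it treats a character as cased when its general category is Lu, Ll or Lt, which is narrower than the real Cased property.

```python
import unicodedata


def is_lowercase(ch):
    """True when the character is unchanged by lowercasing."""
    return ch.lower() == ch


def is_cased_approx(ch):
    """ASSUMPTION: approximate Cased by general category Lu, Ll or Lt."""
    return unicodedata.category(ch) in {"Lu", "Ll", "Lt"}


def is_cased_lowercase(ch):
    """Rough equivalent of [[:isLowercase:]&[:isCased:]]."""
    return is_lowercase(ch) and is_cased_approx(ch)


for ch in "aA1":
    print(ch, is_lowercase(ch), is_cased_lowercase(ch))
```

The digit 1 passes `is_lowercase` because lowercasing leaves it unchanged, but it is filtered out once the cased check is added.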
The binary testing operations take no argument:
The string functions are also provided, and require an argument. For example:
[:toLowercase=a:]
[:toCaseFold=a:]
[:toUppercase=A:]
[:toTitlecase=A:]
Note: The Unassigned, Surrogate, and Private Use code points are skipped in the generation of the sets.
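Each of the four casing examples above asks "which characters map to this string?". A minimal sketch of that inverse lookup, using Python's built-in string methods as stand-ins for the Unicode casing functions (an assumption; the helper name `chars_mapping_to` is hypothetical, and unlike the demo this scan does not skip any code points):

```python
import sys


def chars_mapping_to(transform, target):
    """Return every character that transform() maps to target."""
    return {
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if transform(chr(cp)) == target
    }


to_lowercase_a = chars_mapping_to(str.lower, "a")     # like [:toLowercase=a:]
to_casefold_a = chars_mapping_to(str.casefold, "a")   # like [:toCaseFold=a:]
to_uppercase_A = chars_mapping_to(str.upper, "A")     # like [:toUppercase=A:]
to_titlecase_A = chars_mapping_to(str.title, "A")     # like [:toTitlecase=A:]

print(sorted(to_lowercase_a))
```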
Unicode defines a number of string normalization functions in UAX #15: Unicode Normalization Forms. These string functions can also be applied to single characters.
[:toNFC=a:]
[:toNFD=A\u0300:]
[:toNFKC=A:]
[:toNFKD=A\u0300:]
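The normalization examples above can be reproduced with `unicodedata.normalize`. This is a sketch rather than the demo's implementation: the helper name is hypothetical, surrogate code points are skipped to mirror the note about skipped code points, and the results depend on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def chars_normalizing_to(form, target):
    """Return every character whose normalization under form equals target."""
    result = set()
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        if unicodedata.normalize(form, chr(cp)) == target:
            result.add(chr(cp))
    return result


to_nfd = chars_normalizing_to("NFD", "A\u0300")   # like [:toNFD=A\u0300:]
to_nfkc = chars_normalizing_to("NFKC", "A")       # like [:toNFKC=A:]

print([f"U+{ord(ch):04X}" for ch in sorted(to_nfkc)])
```

The NFKC set is larger than the NFC one because compatibility variants, such as the fullwidth letter U+FF21, also normalize to A.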