# UCA (UTS #10)
## DUCET
For each new Unicode version, once the repertoire is final and
the character properties are reasonably stable (in the run-up to the beta),
Ken inserts all of the new characters into the default sort order.
For the last few releases, he has documented his incremental progress with valuable notes
sent to the properties mailing list (formerly the ucd-dev list).
Markus has been incorporating the incremental file changes, together with the notes, into this repo.
See the history of commits that changed decomps.txt and allkeys.txt.
(We lost some of that history in the Unicode server crash of 2020.)
- For UCA 15.1 see https://github.com/unicode-org/unicodetools/pull/403
- For UCA 15 see https://github.com/unicode-org/unicodetools/pull/246
- For UCA 14 see https://github.com/unicode-org/unicodetools/pull/71
- For the collection of notes for UCA 10 see ducet.md.
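For orientation, mappings in allkeys.txt pair characters with their collation elements.
The lines below show the general format; the weight values are illustrative only and
change from version to version:
```
0041  ; [.2075.0020.0008] # LATIN CAPITAL LETTER A
0061  ; [.2075.0020.0002] # LATIN SMALL LETTER A
0301  ; [.0000.0024.0002] # COMBINING ACUTE ACCENT
```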
## Before generating
(Same prerequisite as for [security data](../security.md).)
First, in CLDR, update the script metadata:
https://cldr.unicode.org/development/updating-codes/updating-script-metadata
We need the script “ID Usage” (e.g., Limited_Use) and script sample characters
for the CLDR/ICU FractionalUCA.txt data.
## Tools & tests
1. Note: This will only work after building the UCD files for this version.
Starting with Unicode 15, those files should always be in
https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
2. We also need the UCA/DUCET files in
https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/uca/dev
When they first become available for a new version, or when they are updated:
1. Get the updated UCA files (allkeys.txt & decomps.txt) from Ken, or run the sifter tool.
1. Update the input files for the UCA tools in
{this repo}/unicodetools/data/uca/dev
3. You will use `org.unicode.text.UCA.Main` as your main class.
Normally use the command-line options `writeCollationValidityLog ICU`
(a sketch of a possible invocation follows this list).
Possible additional options (VM arguments):
- -DNODATE (suppresses date output, to avoid gratuitous diffs during
development)
- -DAUTHOR (suppresses only the author suffix from the date)
- -DAUTHOR=XYZ (sets the author suffix to " \[XYZ\]")
Using the `writeCollationValidityLog` option tests whether the UCA files are valid.
It will create a file: `{Generated}/UCA/{version}/CheckCollationValidity.html`
1. Review this file. It lists errors. Some of them are actually
warnings that indicate possible problems; the text marks these,
for example: "These are not necessarily errors, but should be examined
for *possible* errors". In those cases, review the items to
make sure that there are no inadvertent problems.
2. If an item is not so marked, it is a true error and must be fixed.
3. At the end there is a section **11. Coverage**, with two subsections:
1. In UCDxxx, but not in allkeys: check these over to make sure that
they are all characters that should get ***implicit*** weights.
2. In allkeys, but not in UCD: these should be ***only*** contractions.
Check them over to make sure they look right as well.
(The `ICU` option also builds the files needed by CLDR & ICU, and tests additional aspects.)
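For concreteness, here is a sketch of one possible invocation. The launch mechanism
(IDE run configuration, Maven exec, or a plain `java` command) and the classpath
depend on your local build setup, so treat the command line below as illustrative
rather than canonical:
```
# Illustrative only: substitute your real classpath for <unicodetools-classpath>.
java -DNODATE -cp <unicodetools-classpath> org.unicode.text.UCA.Main writeCollationValidityLog ICU
```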
### UCA for CLDR & ICU
To build all the UCA files used by CLDR and ICU, use the option:
```
ICU
```
They will be built into:
```
{Generated}/UCA/{version}/
```
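The exact set of generated files varies by version and options; based on the files
mentioned elsewhere on this page, expect output roughly like:
```
{Generated}/UCA/{version}/
    CheckCollationValidity.html
    UCA_Rules_NoCE.txt
    NFSafeSets.txt
    CollationAuxiliary/
        FractionalUCA.txt
        ...
```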
Sometimes there are errors and the tool stops with an exception,
especially the first time we run the tool for a new Unicode version.
A common error occurs when we add a script (or just some additional characters)
to a group of scripts that share a compressible primary lead byte
in the CLDR/ICU FractionalUCA.txt data file, and
we now get too many primary weights for that lead byte.
Check the console output for error messages.
(Sometimes the stdout/stderr output is out of order,
so an error message may not line up with its immediate surroundings.)
For example:
```
[76 F3 02] # Osage first primary
last weight: 76 F3 FE
[76 F5 02] # CANADIAN-ABORIGINAL first primary
error in class PrimaryWeight: overflow of compressible lead byte last weight: 77 0D ED
[77 0E 02] # OGHAM first primary
last weight: 77 0E B8
...
reordering group {Tale Talu Lana Cham Bali Java Mong Olck Cher Osge Cans Ogam} marked for compression but uses more than one lead byte 76..77
```
To fix this, read the comments in the PrimariesToFractional constructor and
adjust the code there that sets per-script collation “properties”.
In particular, look for existing adjustments with comments like
“Ancient script, avoid lead byte overflow.”
You may just need to change the script in one of these existing lines,
starting a new lead byte at an earlier script than in the previous version.
Watch for new scripts that get precious two-byte primaries even though
they have very small user communities.
Making them use three-byte primary weights also helps avoid lead byte overflows.
On the other hand, when a minor script is cased (has lowercase+uppercase forms),
it may make sense to use two-byte primaries in order to minimize the
size of the binary ICU data file. (The tool should default to doing this.)
This is a judgment call. See Cherokee, Deseret, Osage, and Vithkuqi for examples.
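The names below are hypothetical, not the actual PrimariesToFractional API; this is
only a sketch of the kinds of per-script adjustments that the comments in that
constructor describe (starting a new lead byte before a script, or forcing a minor
script to three-byte primaries):
```
// Hypothetical sketch -- not the real PrimariesToFractional code or API.
// Start a new compressible lead byte at Ogham instead of at a later script,
// so that the previous group no longer overflows its lead byte.
scriptProps(UCD_Types.OGHAM_SCRIPT).startNewLeadByte();

// Give a small, uncased ancient script three-byte primaries to save lead-byte space...
scriptProps(UCD_Types.RUNIC_SCRIPT).useThreeBytePrimaries();

// ...but keep two-byte primaries for a cased minor script, to keep the ICU data small.
scriptProps(UCD_Types.OSAGE_SCRIPT).useTwoBytePrimaries();
```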
After running the tool, diff the main mapping file and look for bad changes
(for example, more bytes per weight for common characters).
```
~/unitools/mine/src$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ~/cldr/uni/src/common/uca/FractionalUCA.txt > ../frac-14.0.txt
~/unitools/mine/src$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/15.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-15.0.txt
~/unitools/mine/src$ meld ../frac-14.0.txt ../frac-15.0.txt
```
CLDR root data files are checked into `$CLDR_SRC/common/uca/`:
```
cp {Generated}/UCA/{version}/CollationAuxiliary/* $CLDR_SRC/common/uca/
```
See the [Unihan for CLDR](../unihan.md) page for generating new versions of
CJK collation tailorings and transliterator (transform) rules.
> :point_right: **Note**: Some of the following is outdated.
> Markus has been keeping a
> [log of what he has been doing in the ICU repo](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/changes.txt).
> Look there for the latest (top-most) section “collation: CLDR collation root, UCA DUCET”.
----
1. NFSkippable
1. Obsolete: ICU does not actually need/use this file any more.
1. A file needed by ICU is generated with the same tool. Just use
the input parameter "NFSkippable" to generate the file NFSafeSets.txt.
This is also generated by default if you build the ICU files.
1. You should then build a set of the ICU files for the previous version, if
you don't have them. Use the options:
```
version 4.2.0 ICU
```
Or whatever the last version was.
2. Now you will want to compare versions. The key file is `UCA_Rules_NoCE.txt`.
It contains the rules expressed in ICU format, which
allows for comparison across versions of the UCA without spurious differences in
the numeric weights getting in the way.
1. Do a diff between the last and current versions of these files, and
verify that all the differences are either new characters or changes
authorized by the UTC.
Review the generated data; compare files, use
[blankweights.sed](https://github.com/unicode-org/cldr/blob/master/tools/scripts/uca/blankweights.sed)
or similar:
```
~/unitools/mine/Generated$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ~/cldr/uni/src/common/uca/FractionalUCA.txt > ../frac-9.txt
~/unitools/mine/Generated$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed uca/10.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-10.txt && meld ../frac-9.txt ../frac-10.txt
```
Copy all generated files to unicode.org for review & staging by Ken & editors.
Once the files look good:
* Make sure there is a CLDR ticket for the new UCA version.
* Create a branch for it.
* Copy the generated `CollationAuxiliary/*` files to the CLDR branch at `common/uca/` and commit for review.
```
~/unitools/mine$ cp Generated/uca/15.0.0/CollationAuxiliary/* ~/cldr/uni/src/common/uca/
```
Ignore files that were copied but are not version-controlled, that is,
`git status` shows a question mark status for them.
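For example, a quick way to see which of the copied files git actually tracks
(the untracked file name below is made up for illustration):
```
~/cldr/uni/src$ git status --short common/uca/
 M common/uca/FractionalUCA.txt
?? common/uca/SomeUntrackedFile.txt
~/cldr/uni/src$ git add -u common/uca/   # stages only files git already tracks
```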
### UCA for previous version
Some of the tools code works only with the latest UCD/UCA versions. When I
(Markus) worked on the UCA 7 files while UCD/UCA 8 were under way,
I set `version 7.0.0` on the command line and made the following temporary
code changes (not committed to the repository):
```
Index: org/unicode/text/UCA/UCA.java
===================================================================
--- org/unicode/text/UCA/UCA.java (revision 742)
+++ org/unicode/text/UCA/UCA.java (working copy)
@@ -1354,7 +1354,7 @@
{0x10FFFE},
{0x10FFFF},
{UCD_Types.CJK_A_BASE, UCD_Types.CJK_A_LIMIT},
- {UCD_Types.CJK_BASE, UCD_Types.CJK_LIMIT},
+ {UCD_Types.CJK_BASE, 0x9FCC+1}, // TODO: restore for UCA 8.0! {UCD_Types.CJK_BASE, UCD_Types.CJK_LIMIT},
{0xAC00, 0xD7A3},
{0xA000, 0xA48C},
{0xE000, 0xF8FF},
@@ -1361,7 +1361,7 @@
{UCD_Types.CJK_B_BASE, UCD_Types.CJK_B_LIMIT},
{UCD_Types.CJK_C_BASE, UCD_Types.CJK_C_LIMIT},
{UCD_Types.CJK_D_BASE, UCD_Types.CJK_D_LIMIT},
- {UCD_Types.CJK_E_BASE, UCD_Types.CJK_E_LIMIT},
+ // TODO: restore for UCA 8.0! {UCD_Types.CJK_E_BASE, UCD_Types.CJK_E_LIMIT},
{0xE0000, 0xE007E},
{0xF0000, 0xF00FD},
{0xFFF00, 0xFFFFD},
Index: org/unicode/text/UCD/UCD.java
===================================================================
--- org/unicode/text/UCD/UCD.java (revision 743)
+++ org/unicode/text/UCD/UCD.java (working copy)
@@ -1345,7 +1345,7 @@
if (ch <= 0x9FCC && rCompositeVersion >= 0x60100) {
return CJK_BASE;
}
- if (ch <= 0x9FD5 && rCompositeVersion >= 0x80000) {
+ if (ch <= 0x9FD5 && rCompositeVersion > 0x80000) { // TODO: restore ">=" when really going to 8.0!
return CJK_BASE;
}
if (ch <= 0xAC00)
Index: org/unicode/text/UCD/UCD_Types.java
===================================================================
--- org/unicode/text/UCD/UCD_Types.java (revision 742)
+++ org/unicode/text/UCD/UCD_Types.java (working copy)
@@ -24,7 +24,7 @@
// 4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
// 9FD5;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
CJK_BASE = 0x4E00,
- CJK_LIMIT = 0x9FD5+1,
+ CJK_LIMIT = 0x9FCC+1, // TODO: restore for UCD 8.0! 0x9FD5+1,
CJK_COMPAT_USED_BASE = 0xFA0E,
CJK_COMPAT_USED_LIMIT = 0xFA2F+1,
```