UCA (UTS #10)

DUCET

For each new Unicode version, once the repertoire is final and the character properties are mostly stable (around the start of the beta), Ken inserts all of the new characters into the default sort order.

For the last few releases, he has documented his incremental progress in valuable notes sent to the properties mailing list (formerly the ucd-dev list). Markus has been checking the incremental file changes, together with those notes, into this repo.

See the history of commits that changed decomps.txt and allkeys.txt. (We lost some of that history in the Unicode server crash of 2020.)

Before generating

(Same prerequisite as for security data.)

First, in CLDR, update the script metadata: https://cldr.unicode.org/development/updating-codes/updating-script-metadata

We need the script “ID Usage” values (e.g., Limited_Use) and the script sample characters for the CLDR/ICU FractionalUCA.txt data.

Tools & tests

  1. Note: This will only work after building the UCD files for this version. Starting with Unicode 15, those files should always be in https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
  2. We also need the UCA/DUCET files in https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/uca/dev. When they first become available for a new version, or when they are updated:
    1. Download the updated UCA files from Ken (allkeys.txt & decomps.txt), or generate them by running the sifter tool.
    2. Use them to update the input files for the UCA tools, at {this repo}/unicodetools/data/uca/dev (see the copy sketch after this list).
  3. You will use org.unicode.text.UCA.Main as your main class. Normally use the command-line options writeCollationValidityLog ICU; see the example invocation after this list. Possible additional options (VM arguments):
    • -DNODATE (suppresses date output, to avoid gratuitous diffs during development)
    • -DAUTHOR (suppresses only the author suffix from the date)
    • -DAUTHOR=XYZ (sets the author suffix to " [XYZ]")
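
If you received updated files from Ken, copy them over the tools' input files first. A sketch, assuming the downloads are in ~/Downloads and the repo checkout is at ~/unitools/mine/src (adjust both paths to your setup):

    cp ~/Downloads/allkeys.txt ~/unitools/mine/src/unicodetools/data/uca/dev/
    cp ~/Downloads/decomps.txt ~/unitools/mine/src/unicodetools/data/uca/dev/

A typical invocation of the tool might then look like this (also a sketch: the classpath placeholder depends on how you built the project, and an IDE run configuration with the same main class, program arguments, and VM arguments works just as well):

    java -DNODATE -cp {unicodetools classpath} org.unicode.text.UCA.Main writeCollationValidityLog ICU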

Using the writeCollationValidityLog option tests whether the UCA files are valid. It will create a file: {Generated}/UCA/{version}/CheckCollationValidity.html

  1. Review this file. It lists errors; some of them are really warnings that indicate possible problems. The text marks these, for example: “These are not necessarily errors, but should be examined for possible errors”. Review such items to make sure that there are no inadvertent problems.
  2. Anything not marked like that is a true error and must be fixed.
  3. At the end there is section 11, Coverage, with two subsections:
    1. In UCDxxx, but not in allkeys: check that these are exactly the characters that should get implicit weights.
    2. In allkeys, but not in UCD: these should only be contractions; check that they look right as well.

(The ICU option also builds the files needed by CLDR & ICU, and tests additional aspects.)

UCA for CLDR & ICU

To build all the UCA files used by CLDR and ICU, use the option:

ICU

They will be built into:

{Generated}/UCA/{version}/

Sometimes there are errors, and the tool stops with an exception, especially the first time we run the tool for a new Unicode version.

A common error occurs when we add a script (or just some additional characters) to a group of scripts that share a compressible primary lead byte in the CLDR/ICU FractionalUCA.txt data file, and we now get too many primary weights for that lead byte. Check the console output for error messages. (Sometimes the stdout/stderr output is out of order, so an error message may not match its immediate surroundings.) For example:

[76 F3 02]  # Osage first primary
last weight: 76 F3 FE
[76 F5 02]  # CANADIAN-ABORIGINAL first primary
error in class PrimaryWeight: overflow of compressible lead byte last weight: 77 0D ED
[77 0E 02]  # OGHAM first primary
last weight: 77 0E B8
...
reordering group {Tale Talu Lana Cham Bali Java Mong Olck Cher Osge Cans Ogam} marked for compression but uses more than one lead byte 76..77

For that, read the comments in the PrimariesToFractional constructor and adjust the code there that sets per-script collation “properties”. In particular, look for existing adjustments with comments like “Ancient script, avoid lead byte overflow.” You may just need to change the script in one of these existing lines, starting a new lead byte for an earlier script than in the previous version.
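To locate those adjustments, you can search for the comment text quoted above; a sketch, assuming the source layout implied by the prompts elsewhere on this page (the path to PrimariesToFractional.java may differ in your checkout):

    ~/unitools/mine/src$ grep -in "avoid lead byte overflow" org/unicode/text/UCA/PrimariesToFractional.java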

Watch for new scripts that get precious two-byte primaries even though they have very small user communities. Making such scripts use three-byte primary weights also helps avoid lead byte overflows. On the other hand, when a minor script is cased (has lowercase+uppercase forms), it may make sense to use two-byte primaries in order to minimize the size of the binary ICU data file. (The tool should default to doing this.) It is a judgment call; see Cherokee, Deseret, Osage, and Vithkuqi for examples.

After running the tool, diff the main mapping file and look for bad changes (for example, more bytes per weight for common characters). The blankweights.sed script blanks out the actual weight values, so that the diff shows mapping and structural changes rather than the pervasive weight renumbering between versions:

~/unitools/mine/src$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ~/cldr/uni/src/common/uca/FractionalUCA.txt > ../frac-14.0.txt
~/unitools/mine/src$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ../Generated/UCA/15.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-15.0.txt
~/unitools/mine/src$ meld ../frac-14.0.txt ../frac-15.0.txt

CLDR root data files are checked into $CLDR_SRC/common/uca/

cp {Generated}/UCA/{version}/CollationAuxiliary/* $CLDR_SRC/common/uca/

See the Unihan for CLDR page for generating new versions of CJK collation tailorings and transliterator (transform) rules.

:point_right: Note: Some of the following is outdated. Markus has been keeping a log of what he has been doing in the ICU repo. Look there for the latest (top-most) section “collation: CLDR collation root, UCA DUCET”.


  1. NFSkippable

    1. Obsolete: ICU does not actually need/use this file any more.
    2. Historical: ICU used to need a file generated with the same tool; passing the input parameter “NFSkippable” generated the file NFSafeSets.txt. This was also generated by default when building the ICU files.
  2. You should then build a set of the ICU files for the previous version, if you don't have them. Use the options:

    version 4.2.0 ICU
    

    Or whatever the last version was.

  3. Now, you will want to compare versions. The key file is UCA_Rules_NoCE.txt. It contains the rules expressed in ICU format, which allows for comparison across versions of UCA without spurious variations of the numbers getting in the way.

    1. Diff the last and current versions of these files, and verify that every difference is either a new character or a change authorized by the UTC (see the example below).
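
    A sketch of such a diff (version numbers are placeholders, and the location and casing of the generated uca directory may differ, as in the commands below):

    ~/unitools/mine/Generated$ diff uca/14.0.0/UCA_Rules_NoCE.txt uca/15.0.0/UCA_Rules_NoCE.txt | less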

Review the generated data; compare files, use blankweights.sed or similar:

~/unitools/mine/Generated$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed ~/cldr/uni/src/common/uca/FractionalUCA.txt > ../frac-9.txt
~/unitools/mine/Generated$ sed -r -f ~/cldr/uni/src/tools/scripts/uca/blankweights.sed uca/10.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-10.txt && meld ../frac-9.txt ../frac-10.txt

Copy all generated files to unicode.org for review & staging by Ken & editors.

Once the files look good:

  • Make sure there is a CLDR ticket for the new UCA version.
  • Create a branch for it (see the sketch after this list).
  • Copy the generated CollationAuxiliary/* files to the CLDR branch at common/uca/ and commit for review.
    ~/unitools/mine$ cp Generated/uca/15.0.0/CollationAuxiliary/* ~/cldr/uni/src/common/uca/
    
    Ignore copied files that are not version-controlled, that is, files for which git status shows a question-mark (untracked) status.
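
    A possible sequence for the branch and commit (a sketch; branch naming is a matter of taste, and CLDR-NNNNN stands for the real ticket number):

    ~/cldr/uni/src$ git checkout -b cldr-nnnnn-uca
    ~/cldr/uni/src$ cp ~/unitools/mine/Generated/uca/15.0.0/CollationAuxiliary/* common/uca/
    ~/cldr/uni/src$ git add common/uca/
    ~/cldr/uni/src$ git commit -m 'CLDR-NNNNN new UCA version: root collation data files'

    Then push the branch and open a pull request linked to the ticket.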

UCA for previous version

Some of the tools code only works with the latest UCD/UCA versions. When I (Markus) worked on UCA 7 files while UCD/UCA 8 were under way, I set version 7.0.0 on the command line and made the following temporary code changes (not committed to the repository):

Index: org/unicode/text/UCA/UCA.java
===================================================================
--- org/unicode/text/UCA/UCA.java (revision 742)
+++ org/unicode/text/UCA/UCA.java (working copy)
@@ -1354,7 +1354,7 @@
         {0x10FFFE},
         {0x10FFFF},
         {UCD_Types.CJK_A_BASE, UCD_Types.CJK_A_LIMIT},
-        {UCD_Types.CJK_BASE, UCD_Types.CJK_LIMIT},
+        {UCD_Types.CJK_BASE, 0x9FCC+1},  // TODO: restore for UCA 8.0!  {UCD_Types.CJK_BASE, UCD_Types.CJK_LIMIT},
         {0xAC00, 0xD7A3},
         {0xA000, 0xA48C},
         {0xE000, 0xF8FF},
@@ -1361,7 +1361,7 @@
         {UCD_Types.CJK_B_BASE, UCD_Types.CJK_B_LIMIT},
         {UCD_Types.CJK_C_BASE, UCD_Types.CJK_C_LIMIT},
         {UCD_Types.CJK_D_BASE, UCD_Types.CJK_D_LIMIT},
-        {UCD_Types.CJK_E_BASE, UCD_Types.CJK_E_LIMIT},
+        // TODO: restore for UCA 8.0!  {UCD_Types.CJK_E_BASE, UCD_Types.CJK_E_LIMIT},
         {0xE0000, 0xE007E},
         {0xF0000, 0xF00FD},
         {0xFFF00, 0xFFFFD},

Index: org/unicode/text/UCD/UCD.java
===================================================================
--- org/unicode/text/UCD/UCD.java (revision 743)
+++ org/unicode/text/UCD/UCD.java (working copy)
@@ -1345,7 +1345,7 @@
             if (ch <= 0x9FCC && rCompositeVersion >= 0x60100) {
                 return CJK_BASE;
             }
-            if (ch <= 0x9FD5 && rCompositeVersion >= 0x80000) {
+            if (ch <= 0x9FD5 && rCompositeVersion > 0x80000) {  // TODO: restore ">=" when really going to 8.0!
                 return CJK_BASE;
             }
             if (ch <= 0xAC00)

Index: org/unicode/text/UCD/UCD_Types.java
===================================================================
--- org/unicode/text/UCD/UCD_Types.java (revision 742)
+++ org/unicode/text/UCD/UCD_Types.java (working copy)
@@ -24,7 +24,7 @@
     // 4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
     // 9FD5;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
     CJK_BASE = 0x4E00,
-    CJK_LIMIT = 0x9FD5+1,
+    CJK_LIMIT = 0x9FCC+1,  // TODO: restore for UCD 8.0!  0x9FD5+1,
 
     CJK_COMPAT_USED_BASE = 0xFA0E,
     CJK_COMPAT_USED_LIMIT = 0xFA2F+1,