Unihan for CLDR

Run GenerateUnihanCollators

Run this step several times during the Unicode beta process, as part of propagating data from Unicode/UCA to CLDR to ICU. See the section “Unihan collators” in icu4c/source/data/unidata/changes.txt.

Unicode Unihan tools code location: org/unicode/draft/GenerateUnihanCollators.java

There are text files in the same folder, for example patchPinyin.txt, that provide overrides for bug fixes.

:construction: TODO: Review the patch*.txt overrides and remove (comment out) those that no longer change the data because the Unihan data has since been updated. This is probably best done in the tool itself, by detecting when an override does not change the data.

Run org.unicode.draft.GenerateUnihanCollators. This creates various files in $CLDR_DIR/../Generated/cldr/han.

Many of these are log files or show fixes to properties. The important results are:

  1. Han-Latin.txt
  2. strokeT.txt
  3. strokeT_short.txt
  4. pinyin.txt
  5. pinyin_short.txt
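Before moving on, it can help to confirm that all five result files were actually written. The following is a minimal sketch, not part of the CLDR tools: the function name `check_han_outputs` is hypothetical, and it assumes the five files sit directly in the output directory passed as its argument.

```shell
# Hypothetical helper (not part of the CLDR tools): report which of the five
# important result files are missing or empty in the given output directory.
# Assumption: the files are written directly into that directory.
check_han_outputs() {
  dir="$1"
  missing=0
  for f in Han-Latin.txt strokeT.txt strokeT_short.txt pinyin.txt pinyin_short.txt; do
    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $f"
      missing=1
    fi
  done
  # Exit status 0 if every file is present and non-empty, 1 otherwise.
  return "$missing"
}
```

A call such as `check_han_outputs "$CLDR_DIR/../Generated/cldr/han"` would then print one line per problem file and return a non-zero status if any are missing.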

Run GenerateUnihanCollatorFiles

Code location: org/unicode/draft/GenerateUnihanCollatorFiles.java

Run org.unicode.draft.GenerateUnihanCollatorFiles.

This merges #2-#4 into common/collation/zh.xml. It reads from $CLDR_DIR/common/collation/zh.xml and writes to $CLDR_DIR/../Generated/cldr/han/replace/zh.xml.

It also merges #1 into common/transforms/Han-Latin.xml. It reads from $CLDR_DIR/common/transforms/Han-Latin.xml and writes to $CLDR_DIR/../Generated/cldr/han/replace/Han-Latin.xml.

After running the tool, compare the original with the output.

cd $CLDR_DIR
meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
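Where a graphical tool such as meld is not available, a plain unified diff gives the same information in the terminal. This is a sketch under assumptions: the function name `diff_han_replacements` is hypothetical, and it takes the CLDR source root and the directory holding the generated replacement files as arguments.

```shell
# Hypothetical helper (not part of the CLDR tools): print a unified diff
# between each original file and its generated replacement.
# $1 = CLDR source root, $2 = directory with the generated replacement files.
diff_han_replacements() {
  src="$1"
  gen="$2"
  changed=0
  for rel in common/collation/zh.xml common/transforms/Han-Latin.xml; do
    new="$gen/$(basename "$rel")"
    # diff exits non-zero when the files differ (or cannot be read).
    if ! diff -u "$src/$rel" "$new"; then
      changed=1
    fi
  done
  # Exit status 0 if nothing changed, 1 if at least one file differs.
  return "$changed"
}
```

Run from anywhere, for example with the CLDR source root as the first argument and the `../Generated/cldr/han/replace` directory as the second. The output can be piped through a pager for review.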

Copy the output back into the CLDR source tree.

cd $CLDR_DIR
cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml

Run the CLDR unit tests. If they pass and the changes look good, commit.

Details:

The tool searches for marker lines of the following form and replaces all lines between each START/END pair.

# START AUTOGENERATED <type> (<comment>)
...
# END AUTOGENERATED <type> (<comment>)

An error is generated if the file contains none of these AUTOGENERATED markers, or if the types do not match. The type is mapped to a filename by code in the tool's source file. To add a new type, follow the existing pattern there.
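To make the marker mechanism concrete, here is a minimal sketch of the same idea in shell and awk. It is an illustration only and not the tool's actual implementation: the function name `replace_autogenerated` is hypothetical, it assumes the markers start at the beginning of a line, and it only detects a missing block, not a mismatched START/END type.

```shell
# Hypothetical sketch (not the tool's code): replace the lines between the
# START and END markers of one type with the contents of a replacement file.
# $1 = marker type, $2 = replacement file, $3 = target file.
# The result is written to standard output; the target is not modified.
replace_autogenerated() {
  awk -v kind="$1" -v repl="$2" '
    # START marker for this type: keep it, then emit the replacement lines.
    index($0, "# START AUTOGENERATED " kind " (") == 1 {
      print
      while ((getline line < repl) > 0) print line
      close(repl)
      skipping = 1
      found = 1
      next
    }
    # END marker for this type: stop skipping; the marker itself is printed below.
    index($0, "# END AUTOGENERATED " kind " (") == 1 { skipping = 0 }
    # Print everything that is not old content between the markers.
    !skipping { print }
    # Fail if the target had no block of the requested type.
    END { if (!found) exit 1 }
  ' "$3"
}
```

For example, `replace_autogenerated pinyin new_lines.txt input.txt > output.txt` would copy `input.txt` to `output.txt` with the content of the `pinyin` block swapped out, where the type name and all three file names are placeholders.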