This should be done several times during the Unicode beta process, as part of going from Unicode/UCA to CLDR to ICU. See the section “Unihan collators” in icu4c/source/data/unidata/changes.txt.
Unicode Unihan tools code location: org/unicode/draft/GenerateUnihanCollators.java
There are text files in the same folder, for example patchPinyin.txt, that provide overrides for bug fixes.
:construction: TODO: Review the patch*.txt overrides and remove (comment out) ones that do not change the data any more because the Unihan data was updated. Probably do this in the tool: Detect that an override does not change the data.
Run org.unicode.draft.GenerateUnihanCollators. This creates various files in $CLDR_DIR/../Generated/cldr/han
Many of these are log files or show fixes to properties. The important results are
Code location: org/unicode/draft/GenerateUnihanCollatorFiles.java
Run org.unicode.draft.GenerateUnihanCollatorFiles.
This merges #2-#4 into common/collation/zh.xml. It reads from $CLDR_DIR/common/collation/zh.xml and writes to $CLDR_DIR/../Generated/cldr/han/replace/zh.xml.
It also merges #1 into common/transforms/Han-Latin.xml: $CLDR_DIR/common/transforms/Han-Latin.xml -> $CLDR_DIR/../Generated/cldr/han/replace/Han-Latin.xml.
After running the tool, compare the original with the output.
    cd $CLDR_SRC
    meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
    meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
Copy the output back into the CLDR source tree.
    cd $CLDR_SRC
    cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
    cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
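The compare-and-copy steps can also be scripted. The following is a minimal sketch, not part of the CLDR tooling: `copy_if_changed` is a hypothetical helper name, and it uses `cmp` for a quick byte-for-byte check in place of a visual review with meld. It only reports whether a file changed, so the changes should still be reviewed before committing.

```shell
#!/bin/sh
# Hypothetical helper (not part of CLDR): copy a generated file over its
# source-tree counterpart only when the two files differ.
# Usage: copy_if_changed <generated-file> <target-file>
copy_if_changed() {
    generated=$1
    target=$2
    if [ ! -f "$generated" ]; then
        echo "missing generated file: $generated" >&2
        return 1
    fi
    if cmp -s "$generated" "$target"; then
        # Identical content: leave the source tree untouched.
        echo "unchanged: $target"
    else
        cp "$generated" "$target"
        echo "updated: $target"
    fi
}

# Example invocation, assuming $CLDR_SRC points at the CLDR source tree:
# cd "$CLDR_SRC"
# copy_if_changed ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
# copy_if_changed ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
```

An "unchanged" result for both files suggests the Unihan data update did not affect the collation or transform data.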
Run CLDR unit tests. If the tests pass and the changes look good, then commit.
Details:
This tool searches for lines of the following form, and replaces all lines between them.
    # START AUTOGENERATED <type> (<comment>)
    ...
    # END AUTOGENERATED <type> (<comment>)
An error is generated if the file contains none of these AUTOGENERATED marker lines, or if the types are mismatched. The type is mapped to a filename by code in the tool's source file. To add a type, follow the pattern that is already there.
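To illustrate the marker mechanism, here is a rough sketch in shell and awk. It is not the Java tool's implementation: `replace_autogenerated` is a hypothetical function, the replacement text comes from a file named by the caller rather than from the tool's type-to-filename mapping, and the marker lines themselves are assumed to be kept unchanged.

```shell
#!/bin/sh
# Hypothetical sketch (not the real tool): replace the lines between
#   # START AUTOGENERATED <type> (<comment>)
#   # END AUTOGENERATED <type> (<comment>)
# with the contents of a replacement file. The result goes to stdout.
# Usage: replace_autogenerated <input-file> <type> <replacement-file>
replace_autogenerated() {
    awk -v type="$2" -v repl="$3" '
        BEGIN {
            start_marker = "# START AUTOGENERATED " type " ("
            end_marker   = "# END AUTOGENERATED " type " ("
        }
        index($0, start_marker) > 0 {
            print                               # keep the START marker line
            while ((getline line < repl) > 0)   # emit the new block content
                print line
            close(repl)
            skipping = 1
            found = 1
            next
        }
        index($0, end_marker) > 0 {
            skipping = 0                        # END marker is printed below
        }
        !skipping { print }
        END {
            # Fail when no block of this type exists or its END is missing.
            if (!found || skipping) exit 1
        }
    ' "$1"
}

# Example, with hypothetical file names:
# replace_autogenerated zh.xml pinyin pinyin-rules.txt > zh.new.xml
```

A nonzero exit status here plays the role of the tool's error for a missing or mismatched marker; the output should be discarded in that case.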