# Unihan for CLDR
## Run GenerateUnihanCollators
This should be done several times during the Unicode beta process, as part of
going from Unicode/UCA to CLDR to ICU. See the section "Unihan collators" in
[icu4c/source/data/unidata/changes.txt](https://github.com/unicode-org/icu/blob/main/icu4c/source/data/unidata/changes.txt).
Unicode Unihan tools code location:
[org/unicode/draft/GenerateUnihanCollators.java](https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/java/org/unicode/draft/GenerateUnihanCollators.java)
There are text files in the same folder, for example patchPinyin.txt, that
provide overrides for bug fixes.
:construction: **TODO**: Review the patch\*.txt overrides and remove (or comment out) those that
no longer change the data because the Unihan data has been updated. This is probably
best done in the tool, by detecting overrides that do not change the data.
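The check suggested in the TODO could look roughly like the sketch below. It is not part of `GenerateUnihanCollators`; the class name, method name, and the map-based representation of the data are all hypothetical, and the sample values are placeholders rather than real Unihan readings. The real patch files would first need to be parsed into such maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the unicodetools code base.
class RedundantOverrides {
    /**
     * Returns the keys (in sorted order) of overrides whose value is already
     * identical to the value in the base data, i.e. overrides that no longer
     * change anything.
     */
    static List<String> findRedundant(Map<String, String> base,
                                      Map<String, String> overrides) {
        List<String> redundant = new ArrayList<>();
        for (Map.Entry<String, String> e : new TreeMap<>(overrides).entrySet()) {
            String current = base.get(e.getKey()); // null if the base has no entry
            if (e.getValue().equals(current)) {
                redundant.add(e.getKey());
            }
        }
        return redundant;
    }

    public static void main(String[] args) {
        // Placeholder data for illustration only.
        Map<String, String> base = new HashMap<>();
        base.put("U+4E00", "yi1");
        base.put("U+4E01", "ding1");
        Map<String, String> overrides = new HashMap<>();
        overrides.put("U+4E00", "yi1");    // same as base: redundant
        overrides.put("U+4E01", "zheng1"); // differs from base: still needed
        System.out.println("Redundant overrides: " + findRedundant(base, overrides));
    }
}
```

An override whose key is missing from the base data is treated as still needed, since it adds information rather than repeating it.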
Run `org.unicode.draft.GenerateUnihanCollators`.
This creates various files in $CLDR_DIR/../Generated/cldr/han.
Many of these are log files or show fixes to properties. The important
results are:
1. Han-Latin.txt
2. strokeT.txt
3. strokeT_short.txt
4. pinyin.txt
5. pinyin_short.txt
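Before moving on to the next step, it can help to confirm that all five result files were actually written. The following is a minimal sketch of such a check; the class and method names are hypothetical, and the output directory is passed in as an argument rather than derived from `CLDR_DIR`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the unicodetools code base.
class CheckHanOutputs {
    // The five result files listed above.
    static final List<String> EXPECTED = List.of(
            "Han-Latin.txt",
            "strokeT.txt",
            "strokeT_short.txt",
            "pinyin.txt",
            "pinyin_short.txt");

    /** Returns the names of expected files that are not present in dir. */
    static List<String> missing(Path dir) {
        List<String> result = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!Files.isRegularFile(dir.resolve(name))) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Pass the output directory, e.g. the .../Generated/cldr/han folder.
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        List<String> missing = missing(dir);
        System.out.println(missing.isEmpty()
                ? "All expected files are present."
                : "Missing: " + missing);
    }
}
```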
## Run GenerateUnihanCollatorFiles
Code location:
[org/unicode/draft/GenerateUnihanCollatorFiles.java](https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/java/org/unicode/draft/GenerateUnihanCollatorFiles.java)
Run `org.unicode.draft.GenerateUnihanCollatorFiles`.
This merges #2-#4 into common/collation/zh.xml. It reads from
$CLDR_DIR/common/collation/zh.xml and writes to
$CLDR_DIR/../Generated/cldr/han/replace/zh.xml.
It also merges #1 into common/transforms/Han-Latin.xml:
$CLDR_DIR/common/transforms/Han-Latin.xml ->
$CLDR_DIR/../Generated/cldr/han/replace/Han-Latin.xml.
After running the tool, compare the original with the output.
```
cd $CLDR_DIR
meld common/collation/zh.xml ../Generated/cldr/han/replace/zh.xml
meld common/transforms/Han-Latin.xml ../Generated/cldr/han/replace/Han-Latin.xml
```
Copy the output back into the CLDR source tree.
```
cd $CLDR_DIR
cp ../Generated/cldr/han/replace/zh.xml common/collation/zh.xml
cp ../Generated/cldr/han/replace/Han-Latin.xml common/transforms/Han-Latin.xml
```
Run CLDR unit tests.
If the tests pass and the changes look good, then commit.
Details:
This tool searches for lines of the following form, and replaces all lines between
them.
```
# START AUTOGENERATED <type> (<comment>)
...
# END AUTOGENERATED <type> (<comment>)
```
An error is generated if the file contains none of these AUTOGENERATED markers, or
if there is a mismatch in the type. The type is mapped to a filename by code
in GenerateUnihanCollatorFiles.java; to add a new type, follow the existing pattern.
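The marker-based replacement described above can be illustrated with a small sketch. This is not the code in `GenerateUnihanCollatorFiles`; the class name, the regular expression, and the map from type to replacement lines are assumptions made for the example (the real tool maps the type to a filename and reads the lines from there). It also assumes that a "mismatch in the type" means an END marker whose type differs from the preceding START marker, or a type with no known replacement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not the actual tool code.
class AutogeneratedReplacer {
    // Matches "# START AUTOGENERATED <type> (<comment>)" and the END form.
    // The type is assumed to be everything before the optional "(...)" comment.
    private static final Pattern MARKER = Pattern.compile(
            "^\\s*# (START|END) AUTOGENERATED ([^(]+?)\\s*(?:\\(.*\\))?\\s*$");

    /** Replaces the lines between each START/END pair with generated lines. */
    static List<String> replaceSections(List<String> lines,
                                        Map<String, List<String>> generated) {
        List<String> out = new ArrayList<>();
        String openType = null; // type of the section we are inside, if any
        int sections = 0;
        for (String line : lines) {
            Matcher m = MARKER.matcher(line);
            if (!m.matches()) {
                if (openType == null) {
                    out.add(line); // outside a section: keep as-is
                }
                continue;          // inside a section: drop the old line
            }
            String kind = m.group(1);
            String type = m.group(2);
            if (kind.equals("START")) {
                if (openType != null) {
                    throw new IllegalArgumentException("START inside section: " + line);
                }
                List<String> replacement = generated.get(type);
                if (replacement == null) {
                    throw new IllegalArgumentException("Unknown type: " + type);
                }
                out.add(line);
                out.addAll(replacement);
                openType = type;
                sections++;
            } else {
                if (!type.equals(openType)) {
                    throw new IllegalArgumentException("Mismatched END: " + line);
                }
                out.add(line);
                openType = null;
            }
        }
        if (openType != null) {
            throw new IllegalArgumentException("Missing END for type: " + openType);
        }
        if (sections == 0) {
            throw new IllegalArgumentException("No AUTOGENERATED sections found");
        }
        return out;
    }
}
```

The marker lines themselves are kept in the output, so the tool can be run again on its own result.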