This file provides instructions for building and running the UnicodeTools, which can be used to:
WARNING!!
Some of the tasks within the Unicode Tools generate output files that can also be input files into other steps. For this purpose, we create a folder named `Generated` to store these files. This folder can be a subfolder inside the local working copy root (called an “In-source build” workspace layout), or this folder can be outside (ex: a sibling folder) the local working-copy root (called an “Out-of-source build” workspace layout). Both workspace styles are described below.
Out-of-source builds keep a separation between source files of the repository and their generated output files, which are not tracked in the repository. Out-of-source builds allow developers to maintain a clean view of changes to tracked source files, without mixing generated output files. (Out-of-source builds are also useful for C++ repositories in which multiple configurations can be invoked to generate independent sets of makefiles that result in corresponding different output compiled binary files.)
For an out-of-source build, create the folders and clone the repositories into them:

    mkdir -p unicodetools/mine/src
    mkdir -p cldr/mine/src

    git clone https://github.com/unicode-org/unicodetools.git unicodetools/mine/src
    git clone https://github.com/unicode-org/cldr.git cldr/mine/src

Create the `Generated` folder structure as a sibling to the local working copy root:

    mkdir -p unicodetools/mine/Generated/BIN

For an in-source build, clone the repositories:

    git clone https://github.com/unicode-org/unicodetools.git
    git clone https://github.com/unicode-org/cldr.git

In the `unicodetools` local working copy, create the `Generated/BIN` folder structure:

    cd unicodetools; mkdir -p Generated/BIN
Currently, some tests run on the generated output files of a tool (ex: in order to test the validity of the output files). After converting these tests into standard JUnit tests, these unit tests are then run in isolation by default. Our code has been updated to support this behavior because it now checks for generated files in the `Generated` directory, and falls back to the repository's checked-in version when a command does not invoke the generation of a new version.
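The fallback described above can be pictured with a small sketch. This is not the actual Unicode Tools lookup code: the class name, the `resolve` method, and the relative path used in `main` are hypothetical, and the real tools may locate files differently.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper; the real lookup code in Unicode Tools may differ. */
final class GeneratedFallbackSketch {
    /**
     * Returns the freshly generated copy of a file if one exists under genDir,
     * otherwise the checked-in copy under repoDir.
     */
    static Path resolve(Path genDir, Path repoDir, String relativePath) {
        Path generated = genDir.resolve(relativePath);
        return Files.isRegularFile(generated) ? generated : repoDir.resolve(relativePath);
    }

    public static void main(String[] args) {
        // The property names come from this document; the defaults and the
        // relative path below are assumptions made for illustration only.
        Path gen = Path.of(System.getProperty("UNICODETOOLS_GEN_DIR", "Generated"));
        Path repo = Path.of(System.getProperty("UNICODETOOLS_REPO_DIR", "."));
        System.out.println(resolve(gen, repo, "UCD/Scripts.txt"));
    }
}
```

With this shape, a test that runs in isolation simply reads the checked-in file, while a test that runs after a generation step picks up the new output.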
(Note: The following example values for Java system properties are paths to local working copies that are organized using the out-of-source build workspace layout, as described above.)
Property | Example Value |
---|---|
CLDR_DIR | /usr/local/google/home/mscherer/cldr/mine/src |
IMAGES_REPO_DIR | /usr/local/google/home/mscherer/images/mine/src |
UNICODETOOLS_REPO_DIR | /usr/local/google/home/mscherer/unitools/mine/src |
UNICODETOOLS_GEN_DIR | /usr/local/google/home/mscherer/unitools/mine/Generated |
UVERSION | 14.0.0 |
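As a rough illustration of how a Java program can consume these properties, the sketch below reads a `-D` option with `System.getProperty` and fails early when a directory is missing. The class and method names are hypothetical and are not taken from the Unicode Tools source.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical startup check; not part of the Unicode Tools code base. */
final class ToolPropertiesSketch {
    /** Reads a -Dname=path system property and checks that it names an existing directory. */
    static Path requireDir(String name) {
        String value = System.getProperty(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing JVM option -D" + name + "=<path>");
        }
        Path dir = Path.of(value);
        if (!Files.isDirectory(dir)) {
            throw new IllegalStateException(name + " is not a directory: " + dir);
        }
        return dir;
    }

    public static void main(String[] args) {
        // Property names are the ones listed in the table above.
        for (String name : new String[] {"CLDR_DIR", "UNICODETOOLS_REPO_DIR", "UNICODETOOLS_GEN_DIR"}) {
            System.out.println(name + " = " + requireDir(name));
        }
    }
}
```

Failing fast with the property name in the message makes a misconfigured workspace layout easier to spot than a later file-not-found error.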
Like other projects, Unicode Tools uses a source formatter to ensure a consistent code style automatically, and it uses a single common formatter to avoid spurious diff noise in code reviews. This is now enforced via a formatter that is configured in the Maven build via a Maven plugin and checked by continuous integration on pull requests.
When creating pull requests, you can check the formatting locally using the command `mvn spotless:check`. You can apply the formatter's changes using the command `mvn spotless:apply`. Continuous integration errors for formatting can be fixed by committing the changes resulting from applying the formatter locally and pushing the new commit.
Some IDEs can integrate the formatter via plugins, which can minimize the need to manually run the formatter separately. The following links for specific IDEs may work:
`android-formatting.xml`.

`android-formatting.xml` link mentioned for Eclipse (ex: `"java.format.settings.url": "https://raw.githubusercontent.com/aosp-mirror/platform_development/master/ide/eclipse/android-formatting.xml",`). Also use the profile name corresponding to that XML file (ex: `"java.format.settings.profile": "Android",`).

If `Maven` and `Existing Maven Projects` don't appear as a top-level category and sub-option in the initial Import screen of the wizard, then the Eclipse plugin for Maven support has not been installed yet; see above.

`Build and Test`
`${workspace_loc:/unicodetools-parent}`
`package`
`-ea`

For the tools to work, you need to set the JVM system properties according to your workspace layout. Depending on which tool you are running, you may need some or all of the properties listed above in General Setup for Maven.
For command-line users:
`-Dvar1=path1 -Dvar2=path2 ...`

For Eclipse users:
`-Dvar1=path1 -Dvar2=path2 ...`

Please also enable assertions when running commands so that failed assertions don’t just slip through.
Command-line users:
Set the `MAVEN_OPTS` environment variable to include the `-ea` JVM option in its string value, in either of these forms:

    export MAVEN_OPTS="-ea"; mvn compile exec:java -Dexec.mainClass=...
    MAVEN_OPTS="-ea" mvn compile exec:java -Dexec.mainClass=...

Eclipse users:
Set `-ea` (enable assertions) in your Preferences or in your Run/Debug configurations.

All commands must be run in the root of the `unicodetools` repository local working copy directory.
Common tasks for Unicode Tools are listed below, with example CLI commands and the example argument values that they need:

Make Unicode Files:

Out-of-source build:

    mvn -s .github/workflows/mvn-settings.xml compile exec:java -Dexec.mainClass="org.unicode.text.UCD.Main" -Dexec.args="version 14.0.0 build MakeUnicodeFiles" -am -pl unicodetools -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=14.0.0

In-source build:

    MAVEN_OPTS="-ea" mvn compile exec:java -Dexec.mainClass="org.unicode.text.UCD.Main" -Dexec.args="version 14.0.0 build MakeUnicodeFiles" -am -pl unicodetools -DCLDR_DIR=$(cd ../cldr ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd Generated; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=14.0.0

Build and Test:

Out-of-source build:

    MAVEN_OPTS="-ea" mvn package -DCLDR_DIR=$(cd ../../../cldr/mine/src ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd ../Generated ; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=14.0.0

In-source build:

    MAVEN_OPTS="-ea" mvn package -DCLDR_DIR=$(cd ../cldr ; pwd) -DUNICODETOOLS_GEN_DIR=$(cd Generated; pwd) -DUNICODETOOLS_REPO_DIR=$(pwd) -DUVERSION=14.0.0
See the corresponding GitHub Actions Continuous Integration workflow file for other commonly used tools and specifics on how to invoke them at the command line.
For each individual command in Unicode Tools described above, you can configure a Launch Configuration in one of two ways.
`UCD Make Unicode Files`
`-am -pl unicodetools compile exec:java` (the argument for the subproject list flag `-pl` assumes that the class with the main method is in the subdirectory `unicodetools/src/main/java`)
(name = `exec.mainClass`, value = `"org.unicode.text.UCD.Main"`; name = `exec.args`, value = `"version 15.0.0 build MakeUnicodeFiles"`)

`UCD Make Unicode Files`
`unicodetools`
`org.unicode.text.UCD.Main`
`version 15.0.0 build MakeUnicodeFiles`
`-ea`

If you get the error `Error: Could not find or load main class org.unicode.text.UCD.Main Caused by: java.lang.ClassNotFoundException ...`, then you must run the Build and Test run config for Maven to build the yet-uncompiled Java classes into `./unicodetools/target/classes`.
:point_right: Note: This is a mess. See https://unicode-org.atlassian.net/browse/ICU-21757
See the top level `pom.xml` under `<properties>`.

Set `icu.version` in the top level `pom.xml` to the version string, such as `70.0.1-SNAPSHOT-cldr-2021-09-15`.

Set `cldr.version` in the top level `pom.xml` to this version string, which has 0.0.0 and a git hash in it, such as `0.0.0-SNAPSHOT-bfa39570be`.
pom.xml
40.0-SNAPSHOT
Set `cldr.version` to `40.0-SNAPSHOT` and this version will be used.

The input data files for the Unicode Tools have been checked into the repo since 2012-dec-21:
This is inside the unicodetools file tree, and the Java code has been updated to assume that. Any old Eclipse setup needs its path variables checked.
For details see Input data setup.
To generate new data files, you can run the `org.unicode.text.UCD.Main` class (yes, the `Main` class has a `main()` function) with program arguments `build MakeUnicodeFiles`. You may optionally include e.g. `version 14.0.0` if you wish to just generate the files for a single version. Make sure you have the VM arguments set up as described above.
Starting with Unicode 15, we are developing most of the Unicode data files in this Unicode Tools project, and publish them to the Public folder only for alpha/beta/final releases. That is, we are reversing the flow of files.
See data workflow. (Based on issue #144.)
We are also no longer generating and posting files with version suffixes. (We now generate files into an output folder with the Unicode version number.)
Except: Some files, such as Unihan and ucdxml data files, are developed elsewhere, and we continue to ingest them as before.
Starting with Unicode 15, we keep the latest versions of data files in unversioned “dev” folders in this repo.
See data workflow.
All of the following have `version 15.0.0` (or whatever the latest version is) in the options given to Java.
Example changes for adding Unicode 15 version numbers: See the second commit of https://github.com/unicode-org/unicodetools/pull/156. Also, you must update the version number in the CI build scripts in `.github/workflows/`.
Example changes for adding properties: https://github.com/unicode-org/unicodetools/pull/40. Throughout these steps we will walk through updating unicodetools to support Unicode 15 or 14.
Firstly, fetch the latest data files for this version from https://www.unicode.org/Public/14.0.0/ucd/, matching your new version number. If this does not exist, request this be created from ucd-dev@googlegroups.com. You may also need to fetch the emoji files from https://www.unicode.org/Public/emoji/13.1, using a previous version if a new one does not exist.
You may need to use the tools from Input data setup to desuffix the files (removing the -dN suffixes). Copy these into `unicodetools/data/emoji/14.0` and `unicodetools/data/ucd/14.0.0-Update` to set up the inputs correctly. For some updates you may need to pull in other files (uca, security, idna, etc.); see Input data setup for more information.
Now, update the following files:
`MakeUnicodeFiles.txt` (find in Eclipse via Navigate/Resource or Ctrl+Shift+R)

    Generate: .*
    CopyrightYear: 2021 (or whatever)
    ....
    File: DerivedAge
    ..... add a value for the latest version at the bottom:
    Value: V14_0

Update `String[] LONG_AGE` and `String[] SHORT_AGE` in `UCD_Names.java`.

Update `latestVersion` and `lastVersion` in `org.unicode.text.utility.Settings.java`, for example:

    public static final String latestVersion = "14.0.0";
    public static final String lastVersion = "13.1.0"; // last released version

Update `LIMIT_AGE` and `AGE_VERSIONS` in `UCD_Types.java`.

Update enum `AGE_Values` in `UcdPropertyValues.java`.

Update `searchPath` in `org.unicode.text.utility.Utility.java`.
If there are new CJK characters (if there are changes to entries in UnicodeData.txt that are for `<CJK Ideograph ..., First>` etc.), `UCD.java` and `UCD_Types.java` need to be updated to handle these ranges. See PR #171 and PR #47 for examples.

For CJK, you'll first need to compute the composite version, as `(major << 16) | (minor << 8) | update`. E.g. Unicode 14 is `0xe0000`. Since the ranges change based on the version, the code here needs to be updated in a version-aware way.

If any range has changed its end point, say, CJK Extension C, update `CJK_C_LIMIT` in `UCD_Types.java` (make sure to update the comment next to it with the latest Unicode version).

Then edit `mapToRepresentative()` in `UCD.java` to add the range. Make sure the range is added only for the latest Unicode version, by using sections like `if (ch <= 0x2B737 && rCompositeVersion >= 0xe0000)`.

If a new range has been introduced, add it to `UCD_Types.java` near `CJK_E_BASE`, add it to `mapToRepresentative()`, and update `hasComputableName` and `get()` in `UCD.java` to add the first character.

Also search (case-insensitively) unicodetools for 2A700 (start of Extension C) and add the new range accordingly.

When `CJK_LIMIT` moves, search for 9FCC and update near there as necessary.

If the main Tangut block has been extended, then in `UCD.java` `mapToRepresentative()` add another per-version block for returning `TANGUT_BASE`.
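The composite version arithmetic and the version-gated range test described above can be checked with a short sketch. The class and method names here are hypothetical; only the bit layout `(major << 16) | (minor << 8) | update` and the shape of the condition come from the text, and the real code in `UCD.java` is organized differently.

```java
/** Hypothetical illustration of the composite version; not the actual UCD.java code. */
final class CompositeVersionSketch {
    /** Packs major.minor.update as (major << 16) | (minor << 8) | update. */
    static int compositeVersion(int major, int minor, int update) {
        return (major << 16) | (minor << 8) | update;
    }

    /** Parses a three-part version string such as "14.0.0". */
    static int parse(String version) {
        String[] parts = version.split("\\.");
        return compositeVersion(
                Integer.parseInt(parts[0]),
                Integer.parseInt(parts[1]),
                Integer.parseInt(parts[2]));
    }

    /**
     * True if ch is at or below an end point that only applies from the
     * given composite version onward, mirroring the condition quoted above.
     */
    static boolean withinLimitForVersion(int ch, int limitInclusive, int sinceComposite, int compositeVersion) {
        return ch <= limitInclusive && compositeVersion >= sinceComposite;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(parse("14.0.0"))); // prints e0000
        System.out.println(Integer.toHexString(parse("13.1.0"))); // prints d0100
        // The end point and threshold are the example values from the text.
        System.out.println(withinLimitForVersion(0x2B735, 0x2B737, 0xe0000, parse("13.1.0"))); // prints false
    }
}
```

Because the major version occupies the highest bits, a plain integer comparison such as `>= 0xe0000` orders versions correctly as long as each component stays below 256.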
You can now run the steps in “Generating new data” above to attempt to generate the files. It will likely error due to missing enum values for new blocks and scripts.
Compare Blocks.txt to the old version (or check the errors from your attempt to generate new files). For all the new ones:
`ShortBlockNames.txt` (you need to know what the short name is; you can find it in `PropertyValueAliases.txt`)

`UcdPropertyValues.java` enum `Block_Values`

`ShortBlockNames` and see if you still get errors.

`UcdPropertyValues.java` enum `Script_Values`, in alphabetical order

`UCD_Types.java` below `SCRIPT_CODE`, in alphabetical order grouped by Unicode version. Update `LIMIT_SCRIPT` to use the name of the new last script

`SCRIPT` and `LONG_SCRIPT` in `UCD_Names.java`, in alphabetical order grouped by Unicode version. (Important: this must be in the same order as the previous one.)

`DerivedAge.txt` lines for the new version, copy them into the input `Scripts.txt` file, and change the new version number to the appropriate script (which can be new or old or `Common` etc.). Then run UCD Main again and check the generated `Scripts.txt`.

Make a pull request to incorporate these updates, and upload the generated files in a way that can be shared with ucd-dev.
Unicode 15+:
... instead of posting draft files elsewhere and re-ingesting them later.
Ideally, diff the files to check for any discrepancies. The script will do this automatically; you can search the output for lines that say “Found difference in `<filename>`”. Note, however, that it will only display the first line of the diff, so if there are additional discrepancies you may miss them.
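If the console output has been captured to a file, the search can be automated. The sketch below is an illustration only: the class name is hypothetical, the log file argument is an assumption, and it assumes that the rest of each matching line after “Found difference in” is the file name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical log scanner; the exact message layout is an assumption. */
final class DiffLineScanSketch {
    static final String MARKER = "Found difference in ";

    /** Returns the distinct file names reported after the marker, in order of appearance. */
    static List<String> filesWithDifferences(List<String> outputLines) {
        return outputLines.stream()
                .filter(line -> line.contains(MARKER))
                .map(line -> line.substring(line.indexOf(MARKER) + MARKER.length()).trim())
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // args[0]: a file holding the captured console output (assumption).
        List<String> lines = Files.readAllLines(Path.of(args[0]));
        filesWithDifferences(lines).forEach(System.out::println);
    }
}
```

Since only the first differing line is reported per file, the resulting list is best used as a to-do list for a full diff of each named file.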
When you run, it will break if there are new enum property values.
Note: For more information and newer code see the pages
To fix that:
Go into `org.unicode.text.UCD/` `UCD_Names.java` and `UCD_Types.java`. (These contain ugly items that should be enums nowadays.)

Find the property (easiest is to search for some other properties in the enum). Add at end in `UCD_Types`. Be sure to update the limit, like

    LIMIT_SCRIPT = Mandaic + 1;

Then in `UCD_Names`, change the corresponding name entry, both the full and abbreviated names. Follow the format of the existing values.

For example:

In `UCD_Names.java` in `BIDI_CLASS` add

    "LRI", "RLI", "FSI", "PDI",

In `UCD_Names.java` in `LONG_BIDI_CLASS` add

    "LeftToRightIsolate", "RightToLeftIsolate", "FirstStrongIsolate", "PopDirectionalIsolate",

In `UCD_Types.java` add & adjust

    BIDI_LRI = 20,
    BIDI_RLI = 21,
    BIDI_FSI = 22,
    BIDI_PDI = 23,
    LIMIT_BIDI_CLASS = 24;
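The pattern behind these steps is a set of integer constants with a trailing limit, plus name arrays indexed by those constants. The sketch below uses an invented property with made-up values to show why all three places must change together; none of the names are from `UCD_Types.java` or `UCD_Names.java`.

```java
/** Hypothetical property tables; the names and values are invented for illustration. */
final class PropertyValueTablesSketch {
    // Existing values, followed by a newly added value at the end.
    static final int VALUE_A = 0;
    static final int VALUE_B = 1;
    static final int VALUE_C = 2;
    static final int VALUE_NEW = 3;
    // The limit always follows the last value, so it moves when a value is added.
    static final int LIMIT_VALUE = VALUE_NEW + 1;

    // Name arrays are indexed by the constants above and must stay in the same order.
    static final String[] SHORT_NAMES = {"A", "B", "C", "N"};
    static final String[] LONG_NAMES = {"Alpha", "Beta", "Gamma", "New_Value"};

    /** Fails if a value was added without updating both name arrays and the limit. */
    static void checkConsistent() {
        if (SHORT_NAMES.length != LIMIT_VALUE || LONG_NAMES.length != LIMIT_VALUE) {
            throw new IllegalStateException(
                    "Name arrays and limit disagree: short=" + SHORT_NAMES.length
                            + ", long=" + LONG_NAMES.length + ", limit=" + LIMIT_VALUE);
        }
    }

    public static void main(String[] args) {
        checkConsistent();
        System.out.println(SHORT_NAMES[VALUE_NEW] + " / " + LONG_NAMES[VALUE_NEW]); // prints N / New_Value
    }
}
```

A length check like this is one possible guard against forgetting one of the three edits; whether the real code has an equivalent check is not stated here.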
Some changes may cause collisions in the UnicodeMaps used for derived properties. You'll find that out with an exception like:

    Exception in thread "main" java.lang.IllegalArgumentException: Attempt to reset value for 17B4 when that is disallowed. Old: Control; New: Extend
        at org.unicode.text.UCD.ToolUnicodePropertySource$28.<init>(ToolUnicodePropertySource.java:578)
Add new scripts like other new property values. In addition, make sure there are ISO 15924 script codes, and collect CLDR script metadata. See
http://cldr.unicode.org/development/updating-codes/updating-script-metadata
http://www.unicode.org/iso15924/codechanges.html
If there are new break rules (or changes), see Segmentation-Rules.
MakeUnicodeFiles.txt
This file drives the production of the derived Unicode files. The first three lines contain parameters that you may want to modify from time to time:

    Generate: .*script.* // this is a regular expression. Use .* for all files
    CopyrightYear: 2010 // Pick the current year

`build MakeUnicodeFiles`
`version 6.3.0` or similar.

You'll see it build the 5.0 files, with something like the following results:

    Writing UCD_Data
    Data Size: 109,802
    Wrote Data 109802
For each version, the tools build a set of binary data in BIN that contains the information for that release. This is done automatically, or you can do it manually with the following Program Arguments.
As options, use: `version 5.0.0 build`
This builds a compressed format of all the UCD data (except blocks and Unihan) into the BIN directory. Don't worry about the voluminous console messages, unless one says “FAIL”.
You have to do this manually if you change any of the data files in that version! This ought to have build files, but I haven't gotten around to it.
Note: if for any reason you modify the binary format of the BIN files, you also have to bump the value in that file:
static final byte BINARY_FORMAT = 8; // bumped if binary format of UCD changes
    Diff_PropList-5.0.0d10.txt.bat
    OLDER-Diff_PropList-5.0.0d10.txt.bat
    UNCHANGED-Diff_PropertyValueAliases-5.0.0d10.txt.bat
Unicode 15+: See above; commit new input data, run tools, review output, copy back to input, commit, pull request...
We no longer post files to FTP folders, nor publish individual files without consistent changes in others.
Note: Also build and run the New Unicode Properties programs, since they have some additional checks.
Open in Package Explorer
Run>Run As... Java Application
Will create the following file of results:
{Generated}/UnicodeTestResults.txt
Options:
The console output shows whether any problems are found. Thus in the following case there was one failure:

    ParseErrorCount=0
    TestFailureCount=1
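When the console output is captured by a script, the two counters can be extracted mechanically. The following sketch is an assumption-laden illustration: the class is hypothetical, and it relies only on the `Name=number` form shown above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical summary parser for the two counters shown above. */
final class InvariantSummarySketch {
    private static final Pattern COUNT =
            Pattern.compile("(ParseErrorCount|TestFailureCount)=(\\d+)");

    /** Collects each counter found in the output, summing repeated occurrences. */
    static Map<String, Integer> counts(String consoleOutput) {
        Map<String, Integer> result = new HashMap<>();
        Matcher m = COUNT.matcher(consoleOutput);
        while (m.find()) {
            result.merge(m.group(1), Integer.parseInt(m.group(2)), Integer::sum);
        }
        return result;
    }

    /** True only if both counters are present and both are zero. */
    static boolean passed(String consoleOutput) {
        Map<String, Integer> c = counts(consoleOutput);
        return c.size() == 2 && c.values().stream().allMatch(v -> v == 0);
    }

    public static void main(String[] args) {
        System.out.println(passed("ParseErrorCount=0\nTestFailureCount=1")); // prints false
    }
}
```

Treating a missing counter as a failure avoids reporting success when the run aborted before printing its summary.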
Note that since 2022-May (Unicode 15) we have a `TestTestUnicodeInvariants` JUnit wrapper that runs `TestUnicodeInvariants` with default options, and which is one of our CI build bot tests.
The header of the result file explains the syntax of the tests.
Open that file and search for `**** START Test Failure`.
Each such point provides a dump of comparison information.
    # Canonical decompositions (minus exclusions) must be identical across releases
    [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] = [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion]

    FALSE
    **** START Error Info ****

    In [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion], but not in [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] :

    # Total code points: 0

    Not in [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion], but in [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] :

    1B06 # Lo BALINESE LETTER AKARA TEDUNG
    1B08 # Lo BALINESE LETTER IKARA TEDUNG
    1B0A # Lo BALINESE LETTER UKARA TEDUNG
    1B0C # Lo BALINESE LETTER RA REPA TEDUNG
    1B0E # Lo BALINESE LETTER LA LENGA TEDUNG
    1B12 # Lo BALINESE LETTER OKARA TEDUNG
    1B3B # Mc BALINESE VOWEL SIGN RA REPA TEDUNG
    1B3D # Mc BALINESE VOWEL SIGN LA LENGA TEDUNG
    1B40..1B41 # Mc [2] BALINESE VOWEL SIGN TALING TEDUNG..BALINESE VOWEL SIGN TALING REPA TEDUNG
    1B43 # Mc BALINESE VOWEL SIGN PEPET TEDUNG

    # Total code points: 11

    In both [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion], and in [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] :

    00C0..00C5 # L& [6] LATIN CAPITAL LETTER A WITH GRAVE..LATIN CAPITAL LETTER A WITH RING ABOVE
    00C7..00CF # L& [9] LATIN CAPITAL LETTER C WITH CEDILLA..LATIN CAPITAL LETTER I WITH DIAERESIS
    00D1..00D6 # L& [6] LATIN CAPITAL LETTER N WITH TILDE..LATIN CAPITAL LETTER O WITH DIAERESIS
    ...
    30F7..30FA # Lo [4] KATAKANA LETTER VA..KATAKANA LETTER VO
    30FE # Lm KATAKANA VOICED ITERATION MARK
    AC00..D7A3 # Lo [11172] HANGUL SYLLABLE GA..HANGUL SYLLABLE HIH

    # Total code points: 12089

    **** END Error Info ****
The input file is unicodetools/src/main/resources/org/unicode/text/UCD/UnicodeInvariantTest.txt.
`-DSHOW_FILES`
`unicode-test-results` in the “Artifacts” section.

Instructions moved to the uca tools main page.
To build all the charts, use org.unicode.text.UCA.Main, with the option:
charts
They will be built into
http://unicode.org/draft/charts/
Once UCA is released, then copy those files up to the right spots in the Unicode site: