Changing UCD Properties & Printout

Simple Properties

Properties built directly from the data files have their constants defined in UCD_Types and their names in UCD_Names. To add a value, you normally add a constant and update the corresponding LIMIT value in UCD_Types, then add a short name and a long name in UCD_Names.
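The edits have to stay in step: each name array needs one entry per constant, and the LIMIT value has to be one more than the highest constant. Below is a minimal sketch of that bookkeeping using a made-up three-value property. The names DemoTypes, DP_*, LIMIT_DEMO, and isConsistent are hypothetical and are not part of UCD_Types or UCD_Names.

```java
public class DemoTypes {
    // Hypothetical property values, numbered consecutively from 0.
    public static final byte
        DP_XX = 0,
        DP_AL = 1,
        DP_ZJ = 2,       // newly added value
        LIMIT_DEMO = 3;  // bumped from 2 to 3 when DP_ZJ was added

    // Short and long names, indexed by the constants above.
    static final String[] DEMO = { "XX", "AL", "ZJ" };
    static final String[] LONG_DEMO = { "Unknown", "Alphabetic", "Zero_Width_Joiner" };

    // True when both name arrays have exactly one entry per value.
    static boolean isConsistent() {
        return DEMO.length == LIMIT_DEMO && LONG_DEMO.length == LIMIT_DEMO;
    }

    public static void main(String[] args) {
        System.out.println(DEMO[DP_ZJ] + " = " + LONG_DEMO[DP_ZJ]);
        System.out.println("consistent: " + isConsistent());
    }
}
```

If the constant is added without the names, or the LIMIT is not bumped, a check like isConsistent fails, which is the kind of mismatch to look for when the tools misbehave after an edit.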

Some of this is outdated; see also the New Unicode Properties page. Sometimes you have to make changes in both the old & new mechanisms.

Examples:

// line break
public static final byte
    ...
    LB_ZJ = 39,
    LIMIT_LINE_BREAK = 40,

static final String[] LINE_BREAK = {
  ..., "ZJ"

static final String[] LONG_LINE_BREAK = {
  ...,
  "Zero_Width_Joiner"

Blocks

Blocks are special. You need to modify ShortBlockNames.txt for the short name.

Derived Properties

The derived properties are found in a couple of places (for historical reasons): DerivedProperty.java and ToolUnicodePropertySource.java. The easiest way to find one is to search the /UCD/ directory for a word in the property name.

ToolUnicodePropertySource.java

Most are here. For example, to change WB, look there for word_break. You add a new property value just by adding entries to a unicodeMap.

Remember to remove the character from its old value first, since the map checks for collisions. Example:

unicodeMap.putAll(getProperty("Grapheme_Extend").getSet(UCD_Names.YES).addAll(
                  cat.getSet("Spacing_Mark")).remove(0x200D), "Extend");

unicodeMap.put('\u200D', "Joiner");

unicodeMap.putAll(0x1F1E6, 0x1F1FF, "After_Joiner");

If you forget to remove it, you'll get an exception like:

Exception in thread "main" java.lang.UnsupportedOperationException: Attempt to reset value for 200D when that is disallowed. Old: Extend; New: Joiner
at com.ibm.icu.dev.test.util.UnicodeMap._put(UnicodeMap.java:280)
at com.ibm.icu.dev.test.util.UnicodeMap.put(UnicodeMap.java:376)
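The collision check can be pictured with a small stand-in. The sketch below is not the ICU UnicodeMap; it is a hypothetical NoOverwriteMap built on java.util.TreeMap that refuses to replace an existing value, so the old assignment has to be dropped before the new one is made. In the tool itself the removal happens on the set passed to putAll, as in the example above; here it is simplified to a remove call on the map.

```java
import java.util.TreeMap;

public class NoOverwriteMap {
    private final TreeMap<Integer, String> values = new TreeMap<>();

    // Assigns a value to a code point; refuses to replace a different existing value.
    public void put(int codePoint, String value) {
        String old = values.get(codePoint);
        if (old != null && !old.equals(value)) {
            throw new UnsupportedOperationException(
                "Attempt to reset value for "
                + Integer.toHexString(codePoint).toUpperCase()
                + ". Old: " + old + "; New: " + value);
        }
        values.put(codePoint, value);
    }

    public void remove(int codePoint) {
        values.remove(codePoint);
    }

    public String get(int codePoint) {
        return values.get(codePoint);
    }

    public static void main(String[] args) {
        NoOverwriteMap map = new NoOverwriteMap();
        map.put(0x200D, "Extend");
        try {
            map.put(0x200D, "Joiner");  // collision: U+200D is still Extend
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
        map.remove(0x200D);             // drop the old assignment first
        map.put(0x200D, "Joiner");      // now succeeds
        System.out.println(map.get(0x200D));
    }
}
```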

To add the alias, look further down for the setMain call. You'll find a list of name pairs; just add to it:

{ "Joiner", "J" }, { "After_Joiner", "AJ" },

DerivedProperty.java

Some derived properties are here, such as DefaultIgnorable. If you search for the name, you'll see something like:

dprops[DefaultIgnorable] = new UCDProperty() {
    {
        type = DERIVED_CORE;
        name = "Default_Ignorable_Code_Point";
        hasUnassigned = true;
        shortName = "DI";
    }
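The doubled braces in that snippet are an anonymous subclass with an instance initializer block: the inner block runs at construction time and sets the inherited fields. A minimal, self-contained sketch of the same idiom follows. DemoProperty and makeDefaultIgnorable are hypothetical stand-ins, not the real UCDProperty API.

```java
public class InitializerDemo {
    // Hypothetical stand-in for a property base class with configurable fields.
    static class DemoProperty {
        String name;
        String shortName;
        boolean hasUnassigned;
    }

    static DemoProperty makeDefaultIgnorable() {
        // Anonymous subclass; the inner braces are an instance initializer
        // that runs when the object is constructed.
        return new DemoProperty() {
            {
                name = "Default_Ignorable_Code_Point";
                shortName = "DI";
                hasUnassigned = true;
            }
        };
    }

    public static void main(String[] args) {
        DemoProperty p = makeDefaultIgnorable();
        System.out.println(p.name + " (" + p.shortName + ")");
    }
}
```

To rename a property or change its short name in this style, you edit the assignments inside the initializer block.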

Segmentation Rules

Changing Segmentation rules involves more work. Open Segmenter.java. Scroll to the bottom to see the rules.

2019nov21: This appears to be outdated. The rules are now in two text files.

Add new variables, like:

"$J=\\p{Grapheme_Cluster_Break=Joiner}",

"$AJ=\\p{Grapheme_Cluster_Break=After_Joiner}",

In everything but Grapheme_Clusters, also add a rewrite. For example, for SY there is:

"$SY=\\p{Line_Break=Break_Symbols}",

but then lower down you also need to add:

"$SY=($SY $X)",

Translate the rule into the right syntax:

GB8b. Joiner × After_Joiner

becomes (using the variables defined above):

"8.1) $J \u00D7 $AJ",

It may be useful to look at the history of the file to see the kinds of rule changes that are made!

For example: http://unicode.org/utility/trac/changeset/1009 which became git commit f130761
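To see how the variable definitions and a translated rule fit together, here is a rough sketch of expanding $-variables in a rule string. It only illustrates the idea and makes no claim about how Segmenter actually parses its rules. The class RuleExpansionDemo and the method expand are hypothetical; the $J and $AJ definitions are the ones shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleExpansionDemo {
    // A variable reference: '$' followed by letters or underscores.
    static final Pattern VARIABLE = Pattern.compile("\\$[A-Za-z_]+");

    // Replaces each variable in the rule with its definition;
    // fails loudly if the rule uses a variable that was never defined.
    static String expand(String rule, Map<String, String> variables) {
        Matcher m = VARIABLE.matcher(rule);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String definition = variables.get(m.group());
            if (definition == null) {
                throw new IllegalArgumentException("Undefined variable: " + m.group());
            }
            // quoteReplacement keeps the backslashes in \p{...} literal.
            m.appendReplacement(out, Matcher.quoteReplacement(definition));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = new LinkedHashMap<>();
        vars.put("$J", "\\p{Grapheme_Cluster_Break=Joiner}");
        vars.put("$AJ", "\\p{Grapheme_Cluster_Break=After_Joiner}");
        System.out.println(expand("$J \u00D7 $AJ", vars));
    }
}
```

The point of the sketch is that every variable used in a rule needs a matching definition; a rule that names an undefined variable cannot be expanded.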

Adding Segmentation Sample Strings

To add new sample strings, go to GenerateBreakTest. You'll find code that looks like the following. Add the strings you want as samples to one of the two lists of strings (see below for an example).

static class GenerateGraphemeBreakTest extends XGenerateBreakTest {

    public GenerateGraphemeBreakTest(UCD ucd) {
        super(ucd, Segmenter.make(..., "Grapheme",
            new String[]{
                unicodePropertySource.getSet("GC=Cn").iterator().next(),
                "\uD800",
                // ADD HERE FOR SAMPLES USED IN COMBINATION
            },
            new String[]{
                // ADD HERE FOR SAMPLES USED INDIVIDUALLY
            });
    }
}

The two lists differ in how the samples are used. Strings in the first list are used in combination in the tests, so adding even one string there more than doubles the number of test cases.

To add strings that show up as samples in the .html file, add them to the second list.

Property(Value)Aliases

MakeUnicodeFile.txt lists the properties and formats used to generate the files. Certain files are marked “SPECIAL”, which means that MakeUnicodeFile.java contains special code for them, such as:

public static void generateFile(String filename) throws IOException {
    if (filename.endsWith("Aliases")) {
        if (filename.endsWith("ValueAliases")) {
            generateValueAliasFile(filename);
        } else {
            generateAliasFile(filename);
        }
    } else {

So changing those files requires a change in the code (e.g., in generateValueAliasFile).
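The order of the suffix checks matters, because a name ending in "ValueAliases" also ends in "Aliases". The following sketch mirrors the dispatch above but returns the name of the generator that would be chosen, so the behavior is easy to check. AliasDispatchDemo and generatorFor are hypothetical, and "other" is a placeholder for the branch that is not shown in the excerpt.

```java
public class AliasDispatchDemo {
    // Returns the name of the generator that the suffix checks would select.
    static String generatorFor(String filename) {
        if (filename.endsWith("Aliases")) {
            // "PropertyValueAliases" also ends with "Aliases", so the more
            // specific suffix is tested inside the general one.
            if (filename.endsWith("ValueAliases")) {
                return "generateValueAliasFile";
            }
            return "generateAliasFile";
        }
        return "other"; // placeholder for the branch not shown in the excerpt
    }

    public static void main(String[] args) {
        for (String name : new String[] {"PropertyAliases", "PropertyValueAliases", "LineBreak"}) {
            System.out.println(name + " -> " + generatorFor(name));
        }
    }
}
```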