To add or fix xidmodifications, look at source/removals.txt.
To add or fix confusables, there are multiple source files. Many were machine-generated, then tweaked. They have names like source/confusables-winFonts.txt. The main file is confusables-source.txt.
There is fairly complex processing for the confusables, so carefully diff the results. Sometimes you may get an unexpected union of two equivalence sets. Look at Testing below for help.
Look at the following spreadsheets / bugs to see if there are any additional suggestions.
If so, assess them and, if needed, add them to unicodetools/data/security/{version}/data/source/confusables-source.txt. Then in the spreadsheets, move the “new stuff” line to the end.
There is a brief description of the file format at the top. Each line represents a mapping from a code point or set of code points to a sequence of one or more code points.
For example:
0021 ; 01C3 # ( ! → ǃ) EXCLAMATION MARK → LATIN LETTER RETROFLEX CLICK
The ordering of characters doesn’t matter. So it doesn’t matter whether you have the above line, or
01C3 ; 0021 # ( ǃ → !) LATIN LETTER RETROFLEX CLICK → EXCLAMATION MARK
It also doesn't matter if you have identical lines; the second one will be a no-op.
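As a rough illustration of the line format described above, here is a minimal Python sketch that reads one mapping line. The function name parse_mapping_line is hypothetical and not part of the tools. It assumes both fields are plain space-separated hex code points, and it does not handle any set syntax the real source files may use.

```python
def parse_mapping_line(line):
    """Parse 'HEX... ; HEX... # comment' into (source, target) strings.

    Returns None for blank or comment-only lines. Only plain hex fields are
    handled here (an assumption); any set syntax in the real files is not.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError(f"expected 'source ; target': {line!r}")
    # Each field is a sequence of hex code points; extra fields are ignored.
    source, target = (
        "".join(chr(int(cp, 16)) for cp in field.split())
        for field in fields[:2]
    )
    return source, target
```

Called on the example line starting with 0021 ; 01C3, this returns the pair of EXCLAMATION MARK and LATIN LETTER RETROFLEX CLICK.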
The mappings are used to generate equivalence classes. From each equivalence class, one representative member will be chosen, and in the resulting data file, all the other characters will map to that representative. Because of transitivity, the equivalence class will tend to be somewhat looser than expected.
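The way pairwise mappings collapse into equivalence classes can be sketched with a small union-find. This is an illustrative Python sketch, not the algorithm used by GenerateConfusables; build_classes is a hypothetical name, and choosing the representative is left out.

```python
def build_classes(pairs):
    """Group items into equivalence classes from (a, b) confusable pairs."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union; a repeated pair changes nothing

    classes = {}
    for item in list(parent):
        classes.setdefault(find(item), set()).add(item)
    return sorted(sorted(members) for members in classes.values())
```

Reversing a pair or repeating it yields the same classes, and a ~ b plus b ~ c puts a, b, and c in one class, which is why the classes tend to end up looser than any single mapping suggests.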
We've discussed possible future enhancements:
First, in CLDR, update the script metadata: http://cldr.unicode.org/development/updating-codes/updating-script-metadata
The identifier type & status take this data into account.
Fix the version string (which will appear inside GenerateConfusables.java) and the REVISION (which will match the new directory).
The version/revision strings are shared with other tools; no need to set them separately.
Run GenerateConfusables -c -b to generate the files. They will appear in two places.
The TestSecurity.java test is part of the unit test suite, which is run by GitHub CI. It verifies that the confusable mappings are idempotent.
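For reference, idempotence here means that mapping an already-mapped string changes nothing. The Python sketch below shows one way to check that property on a plain dictionary from single characters to target strings. It is an assumption about the shape of the check, not a port of TestSecurity.java, and both function names are hypothetical.

```python
def apply_mapping(text, mapping):
    """Replace each character by its target; unmapped characters stay as-is."""
    return "".join(mapping.get(ch, ch) for ch in text)


def non_idempotent(mapping):
    """Return the entries whose target is itself changed by the mapping."""
    return {
        source: target
        for source, target in mapping.items()
        if apply_mapping(target, mapping) != target
    }
```

An empty result means every target is already in its final form; any entry returned points at a target that still maps to something else.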
Copy the following from the output directory to the top level of the revision directory, and check in.
Review the mappings to make sure that there are no surprises. The biggest issue is if two equivalence classes are mistakenly joined. For example, if you map b to d, then that will join the equivalence class for b with that of d.
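One way to spot mistakenly joined classes when reviewing is to compare the old and new mappings mechanically. The Python sketch below assumes each mapping has already been loaded as a dictionary from source to target; the function names are hypothetical, and the characters in the usage note are only illustrative.

```python
def classes_by_target(mapping):
    """Group sources under their target; the target is a member of its class."""
    classes = {}
    for source, target in mapping.items():
        classes.setdefault(target, {target}).add(source)
    return classes


def merged_classes(old, new):
    """Report new classes whose members came from more than one old class."""
    old_class_of = {}
    for target, members in classes_by_target(old).items():
        for member in members:
            old_class_of[member] = target
    report = {}
    for target, members in classes_by_target(new).items():
        old_ids = {old_class_of[m] for m in members if m in old_class_of}
        if len(old_ids) > 1:
            report[target] = sorted(old_ids)
    return report
```

For example, if the old data maps U+042C to b and U+0501 to d, and the new data also maps b to d, the report contains the new target d with the old classes b and d listed as merged.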
Markus 2020-feb-07 for Unicode 13.0:
You may see Identifier_Type=Recommended for characters/scripts/blocks that should not be recommended. For example, the initial generation for Unicode 14 “recommended” Znamenny combining marks. Add these to unicodetools/data/security/{version}/data/source/removals.txt. You can use block properties like
\p{block=Znamenny_Musical_Notation} ; technical
We should preserve the target from old versions wherever possible. For example, when the 6.3.0 files were first done, the following mapping came out with its order reversed:
0259 ; 01DD ; MA # ( ə → ǝ ) LATIN SMALL LETTER SCHWA → LATIN SMALL LETTER TURNED E #
That was because the LATIN SMALL LETTER TURNED E changed identifier status (to become better). Since stability of the ordering is important, that was fixed with the following change.
// EXCEPTIONAL CASES
// added to preserve source-target ordering in output.
lowerIsBetter.put('\u0259', MARK_NFC);
where MARK_NFC was the former status. At some point, the code should be modified to read the older version of the file, and favor characters that were there as targets, but for now there are few enough of these that it is simple enough to just add them to this list.
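The idea of favoring characters that were targets in the older file could look roughly like the Python sketch below. This is hypothetical and not existing code in the tools: choose_representative, old_targets, and rank are all assumed names, and rank stands in for whatever ordering the generator applies (lower is better).

```python
def choose_representative(members, old_targets, rank):
    """Pick a class representative, preferring previously used targets.

    members: characters in one equivalence class
    old_targets: characters that were targets in the previous release
    rank: function returning a sort key, lower meaning better
    """
    return min(
        members,
        # False sorts before True, so old targets win before rank is consulted.
        key=lambda member: (member not in old_targets, rank(member), member),
    )
```

With the class {U+0259, U+01DD} and a rank that now prefers U+0259, passing old_targets={U+01DD} still keeps U+01DD as the target, which is the stability the exceptional-case entry achieves by hand.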
After making any changes:
Because of transitive closure, it is sometimes tricky to track down why two items are marked as confusable. The transitive closure not only derives x ~ z from x ~ y and y ~ z, but also handles substrings. So if x ~ y, then ax ~ ay. You can end up with conflicts, like if you have x => "", and someplace else x => y, or if you have x ~ xy (and y !~ "").
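The substring behavior and the x ~ xy conflict can be seen in a toy closure. The Python sketch below repeatedly applies a character-to-string mapping until nothing changes; it is an illustration under simplified assumptions, not the closure code in GenerateConfusables, and close_under_mapping and max_rounds are hypothetical names.

```python
def close_under_mapping(text, mapping, max_rounds=10):
    """Apply the mapping to every character until a fixed point is reached."""
    for _ in range(max_rounds):
        mapped = "".join(mapping.get(ch, ch) for ch in text)
        if mapped == text:
            return text
        text = mapped
    # A mapping such as x => xy keeps growing and never settles.
    raise ValueError(f"no fixed point after {max_rounds} rounds")
```

With x => y and y => z, both "ax" and "ay" close to "az", so they are confusable with each other. With x => "xy" the function raises, which mirrors the kind of conflict described above.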
In confusables.txt, each line that is the product of transitive closure shows you a path after a second #.
248F ; 0038 005F ; SA #* ( ⒏ → 8_ ) DIGIT EIGHT FULL STOP → DIGIT EIGHT, LOW LINE # →8.→
Find the link in the chain that shouldn't be there. Sometimes that is because of a substring mapping. In the above case, it is mapping _ to .
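To pull the intermediate steps out of many lines at once, a small helper can split on the second #. This Python sketch assumes the layout shown in the example line and that no # appears inside the mapped characters themselves; closure_path is a hypothetical name.

```python
def closure_path(line):
    """Return the intermediate steps listed after the second '#', if any."""
    parts = line.split("#")
    if len(parts) < 3:
        return []
    # The path looks like '→8.→'; drop the empty pieces around the arrows.
    return [step for step in parts[2].strip().split("→") if step]
```

For the line above this gives a single step, "8.", so the chain runs from ⒏ through 8. to 8_, and the last hop is the one to look at.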
Then search through the source/ for that character, to see what is happening. Sometimes the formatted-xxx.txt is easier to search, since it has both the hex and the character.
Searching with a regular expression that contains both the literal characters and the hex is useful. For example, if you see the line:
# ර ෮ ( ර ) 0DBB SINHALA LETTER RAYANNA ← ( ෮ ) 0DEE SINHALA LITH DIGIT EIGHT
Then do a regex search in /data/source on [ර෮]|0DBB|0DEE
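Building that kind of pattern by hand is error-prone (a letter O in place of a zero is easy to miss), so it can be generated from the characters themselves. The Python sketch below is only an illustration; search_pattern and matching_lines are hypothetical helpers, and the hex is matched as four or more uppercase digits, which is an assumption about how the source files write code points.

```python
import re


def search_pattern(chars):
    """Build a regex matching any of the characters or their hex code points."""
    literals = "".join(re.escape(ch) for ch in chars)
    hexes = "|".join(f"{ord(ch):04X}" for ch in chars)
    return re.compile(f"[{literals}]|{hexes}")


def matching_lines(lines, chars):
    """Return the lines that mention any of the characters, literally or in hex."""
    pattern = search_pattern(chars)
    return [line for line in lines if pattern.search(line)]
```

Passing the two Sinhala characters from the line above produces a pattern equivalent to the one shown, and matching_lines can then be run over the lines of each file in the source directory.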
Some problems can arise when the NFKC form is very different, like for:
|| cp == '﬩' || cp == '︒'
In those cases, modify getSkipNFKD.
Other problems can arise from:
Writing [1234 1235] instead of [\u1234 \u1235].
Input being interpreted as \u0031\u0061, not as \u001A.
Illegal containment: U+0645 ARABIC LETTER MEEM, U+062D ARABIC LETTER HAH, U+005F LOW LINE overlaps U+0645 ARABIC LETTER MEEM, U+062D ARABIC LETTER HAH
from U+0645 ARABIC LETTER MEEM, U+062D ARABIC LETTER HAH, U+0640 ARABIC TATWEEL
with reason [[arabic]] plus [[arabic]→𞻰, [arabic]]
Once you've resolved all the problems, copy certain generated files to https://www.unicode.org/Public/security/{version}/
Check that the files are copied to https://www.unicode.org/Public/security/{version}/.