Create a new revision directory, such as .../unicodetools/data/security/6.3.0. The folder name will match the version of the UCD being used (perhaps with an incremented third field).
Use git cp to copy the previous directory to the new one. Do not just “mkdir” the new directory and copy the files!

To add or fix xidmodifications, look at source/removals.txt.
To add or fix confusables, there are multiple source files. Many were machine-generated, then tweaked. They have names like source/confusables-winFonts.txt. The main file is confusables-source.txt.
There is fairly complex processing for the confusables, so carefully diff the results. Sometimes you may get an unexpected union of two equivalence sets. Look at Testing below for help.
Look at the following spreadsheets / bugs to see if there are any additional suggestions.
If so, assess them and, if warranted, add entries to unicodetools/data/security/{version}/data/source/confusables-source.txt.
Then in the spreadsheets, move the “new stuff” line to the end.
First, in CLDR, update the script metadata: http://cldr.unicode.org/development/updating-codes/updating-script-metadata
The identifier type & status take this data into account.
Fix the version string (which will appear inside GenerateConfusables.java) and the REVISION (which will match the new directory).
The version/revision strings are shared with other tools; no need to set them separately.
Run GenerateConfusables -c -b to generate the files. They will appear in two places.
Run TestSecurity, with the same VM arguments as the generator, to verify that the confusable mappings are idempotent. Starting in 2021q3, TestSecurity must be run as a JUnit test; it is also part of the unit test suite and runs on GitHub CI.
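The idempotence requirement that TestSecurity checks can be illustrated with a minimal sketch: folding a string through the mapping twice must give the same result as folding it once. The map contents and helper names below are illustrative assumptions, not the real confusables data or the actual TestSecurity code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the idempotence invariant: fold(fold(s)) must equal fold(s).
// The mappings here are made up for illustration.
public class IdempotenceCheck {
    // Fold each code point through the mapping (identity if unmapped).
    static String fold(Map<Integer, String> map, String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            String target = map.get(cp);
            if (target != null) sb.append(target); else sb.appendCodePoint(cp);
        });
        return sb.toString();
    }

    static boolean isIdempotent(Map<Integer, String> map, String sample) {
        String once = fold(map, sample);
        return fold(map, once).equals(once);
    }

    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<>();
        map.put((int) 'ǝ', "ə"); // illustrative single-hop mapping
        System.out.println(isIdempotent(map, "pǝq")); // true: the target maps no further

        map.put((int) 'ə', "e"); // a second hop breaks idempotence: ǝ -> ə -> e
        System.out.println(isIdempotent(map, "pǝq")); // false
    }
}
```

A non-idempotent mapping means some character's target is itself remapped, which is exactly the kind of regression this test catches after source edits.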
Copy the following from the output directory to the top level of the revision directory:
Markus 2020-feb-07 for Unicode 13.0:
You may see Identifier_Type=Recommended for characters/scripts/blocks that should not be recommended. For example, the initial generation for Unicode 14 “recommended” Znamenny combining marks. Add these to unicodetools/data/security/{version}/data/source/removals.txt. You can use block properties like
\p{block=Znamenny_Musical_Notation} ; technical
We should preserve the target from old versions wherever possible. For example, when the 6.3.0 files were first generated, the following line came out in reversed order:
0259 ; 01DD ; MA # ( ə → ǝ ) LATIN SMALL LETTER SCHWA → LATIN SMALL LETTER TURNED E #
That was because LATIN SMALL LETTER TURNED E changed identifier status (it became better). Since stability of the ordering is important, that was fixed with the following change:
// EXCEPTIONAL CASES
// added to preserve source-target ordering in output.
lowerIsBetter.put('\u0259', MARK_NFC);
where MARK_NFC was the former status. At some point the code should be modified to read the older version of the file and favor characters that were already targets there, but for now there are few enough of these that it is simplest to just add them to this list.
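The mechanism above can be pictured as choosing each equivalence set's target as the character with the lowest "badness" score, with ties broken by code point. This is a simplified sketch under assumed scores and tie-break rules, not the actual GenerateConfusables code; it only shows why pinning a character to its former score preserves the historical target.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch: the target of an equivalence set is the member with the lowest
// score in lowerIsBetter (default 100), ties broken by lower code point.
// Scores are illustrative assumptions.
public class TargetChooser {
    static final Map<Integer, Integer> lowerIsBetter = new HashMap<>();

    static int chooseTarget(Set<Integer> equivalenceSet) {
        return equivalenceSet.stream()
            .min(Comparator
                .comparingInt((Integer cp) -> lowerIsBetter.getOrDefault(cp, 100))
                .thenComparingInt(cp -> cp))
            .orElseThrow();
    }

    public static void main(String[] args) {
        Set<Integer> set = Set.of(0x0259, 0x01DD); // ə (schwa), ǝ (turned e)

        // Suppose a status change gave schwa a better (lower) score,
        // flipping the target away from the historical one:
        lowerIsBetter.put(0x0259, 10);
        System.out.println(Integer.toHexString(chooseTarget(set))); // 259

        // Exceptional-case override: pin schwa back to its former, worse
        // score, so the historical target U+01DD is preserved:
        lowerIsBetter.put(0x0259, 100);
        System.out.println(Integer.toHexString(chooseTarget(set))); // 1dd
    }
}
```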
After making any changes, rerun the generator and the tests.
Because of transitive closure, it is sometimes tricky to track down why two items are marked as confusable. The transitive closure not only applies x ~ y, y ~ z, therefore x ~ z; it also handles substrings, so if x ~ y, then ax ~ ay. You can end up with conflicts, for example if you have x => "" and somewhere else x => y, or if you have x ~ xy (and y !~ "").
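The merging behavior described above can be sketched with a plain union-find over declared pairs. The strings here are ASCII stand-ins, not real confusables, and this is a simplification: the real processing also composes substring mappings, which is what makes surprise unions harder to trace.

```java
import java.util.HashMap;
import java.util.Map;

// Union-find sketch of transitive closure over confusable pairs:
// declaring x ~ y and y ~ z silently puts x and z in the same set.
public class Closure {
    static final Map<String, String> parent = new HashMap<>();

    static String find(String x) {
        parent.putIfAbsent(x, x);
        String p = parent.get(x);
        if (!p.equals(x)) {
            p = find(p);           // path compression
            parent.put(x, p);
        }
        return p;
    }

    static void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    public static void main(String[] args) {
        union("x", "y"); // declared pair
        union("y", "z"); // declared pair
        // Transitivity merges all three, even though x ~ z was never declared:
        System.out.println(find("x").equals(find("z"))); // true
    }
}
```

When diffing results, an unexpected union usually means one stray declared pair (or substring mapping) bridged two sets that should have stayed separate.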
In confusables.txt, each line that is the product of transitive closure shows you a path after a second #.
248F ; 0038 005F ; SA #\* ( ⒏ → 8_ ) DIGIT EIGHT FULL STOP → DIGIT EIGHT, LOW LINE # →8.→
Find the link in the chain that shouldn't be there. Sometimes that is because of a substring mapping; in the case above, it is the mapping between _ and . (LOW LINE and FULL STOP).
Then search through the source/ for that character, to see what is happening. Sometimes the formatted-xxx.txt is easier to search, since it has both the hex and the character.
Searching with a regex that contains both the literal characters and the hex is useful. For example, if you see the line:
# ( ර ) 0DBB SINHALA LETTER RAYANNA ← ( ෮ ) 0DEE SINHALA LITH DIGIT EIGHT
then do a regex search in /data/source for [ර෮]|0DBB|0DEE
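A search like that can be scripted. This is a small sketch using java.util.regex; the directory path in main is an assumption, so point it at the data/source folder of the revision you are working on.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: report every line in every file under dir that matches a pattern
// combining literal characters and their hex code points.
public class SourceGrep {
    static List<String> grep(Path dir, Pattern pattern) throws IOException {
        List<String> hits = new ArrayList<>();
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path file : files) {
            for (String line : Files.readAllLines(file)) {
                if (pattern.matcher(line).find()) {
                    hits.add(file.getFileName() + ": " + line);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed location; adjust to your checkout and revision.
        Path dir = Paths.get("unicodetools/data/security/6.3.0/data/source");
        if (Files.isDirectory(dir)) {
            grep(dir, Pattern.compile("[ර෮]|0DBB|0DEE")).forEach(System.out::println);
        }
    }
}
```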
Some problems can arise when the NFKC form is very different, as with:
|| cp == '﬩' || cp == '︒'
In those cases, modify getSkipNFKD.
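A quick check with java.text.Normalizer shows why these two characters are problematic: their NFKC forms are visually very different characters.

```java
import java.text.Normalizer;

// The two characters mentioned above normalize (NFKC) to quite different
// characters, which is why they need special handling in getSkipNFKD.
public class NfkcDemo {
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB29 HEBREW LETTER ALTERNATIVE PLUS SIGN -> U+002B "+"
        System.out.println(nfkc("\uFB29")); // +
        // U+FE12 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP -> U+3002
        System.out.println(nfkc("\uFE12")); // 。
    }
}
```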
Other problems can arise from bare hex in the source files, for example writing:
[1234 1235]
instead of
[\u1234 \u1235]
Bare hex is read as the literal characters, so 1a is treated as \u0031\u0061, not as \u001A.
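The bare-hex pitfall can be demonstrated directly: the literal text "1a" is two code points, not the single control character U+001A, and turning bare hex into a code point requires an explicit parse.

```java
// Demonstrates the bare-hex pitfall: "1a" as literal text is the two
// characters U+0031 U+0061, not the single code point U+001A.
public class HexPitfall {
    public static void main(String[] args) {
        String literal = "1a";     // two characters: \u0031 \u0061
        String escaped = "\u001A"; // one character: the SUBSTITUTE control
        System.out.println(literal.length());        // 2
        System.out.println(escaped.length());        // 1
        System.out.println(literal.equals(escaped)); // false

        // Bare hex must be parsed explicitly to get the intended code point:
        int cp = Integer.parseInt("1a", 16);
        System.out.println(cp == 0x1A); // true
    }
}
```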
Illegal containment: U+0645 ARABIC LETTER MEEM, U+062D ARABIC LETTER HAH, U+005F LOW LINE overlaps U+0645 ARABIC LETTER MEEM, U+062D ARABIC LETTER HAH
from U+0645 ARABIC LETTER MEEM, U+062D ARABIC LETTER HAH, U+0640 ARABIC TATWEEL
with reason [[arabic]] plus [[arabic]→𞻰, [arabic]]
Once you've resolved all the problems, copy certain generated files to https://www.unicode.org/Public/security/{version}/
Check that the files are copied to https://www.unicode.org/Public/security/{version}/.