blob: 102ab84331346aa2f3fabfa47a68a6b90b5ce938 [file] [log] [blame] [view] [edit]
# Data Files Workflow
Starting with Unicode 15.1, the source of truth for not-yet-released versions of
most of the data files is in
https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/ucd/dev
and parallel .../uca/dev etc. folders.
## UCD/UCA/emoji/idna/security data files
First: Files with a special process.
### Unihan data
List of these files (see https://www.unicode.org/Public/UCD/latest/ucd/):
* USourceData.txt
* USourceGlyphs.pdf
* USourceRSChart.pdf
* Unihan.zip
Process:
* The source of truth is the Unihan database maintained by the CJK/Unihan group, including data maintained by Michel.
* The CJK/Unihan group posts data files into an internal location.
* KenW vets these files and posts them to https://www.unicode.org/Public/draft/UCD/ucd/ .
* A unicodetools GitHub contributor fetches these files, preprocesses the contents of Unihan.zip,
and creates a pull request as for regular data files.
(The processed data files go into .../unicodetools/data/ucd/dev/Unihan.)
### NamesList.txt
* Iterate between KenW and Michel.
* Generated from UnicodeData.txt and an annotations file, using some C program.
* Used for generating code charts.
* KenW posts NamesList.txt somewhere.
* A unicodetools GitHub contributor fetches this file
and creates a pull request as for regular data files.
### Folder readmes
The various ReadMe.txt files are checked into the unicodetools repo.
They are templatized, and the publication scripts below replace variables with the
Unicode and emoji versions, copyright year, and publication date (date when the script was run).
### “Regular” data files
Changes are made in a GitHub pull request.
* Contributors who dont use GitHub or dont contribute to the unicodetools project
can send changes (updated files or deltas) to someone who is comfortable using GitHub.
* Updated files could be shared in various ways including via email or via private FTP areas.
* Updated files should be based on the latest (or fairly recent) data in the unicodetools repo.
* Updated files should not be posted directly to https://www.unicode.org/Public/...
Pull request cycle:
* One commit for manual or contributed data changes.
* Another commit for tool code updates, generated/derived data files,
and for fixing test failures (continuous-integration tests or manual tests).
* Peer review of the changes.
* QA may include pushing files to a private area, e.g., for KenW to review.
* More commits as necessary until changes look good, tests pass,
and someone (preferably KenW) approves.
* **Squash**-and-merge.
* Or squash locally, force-push, and rebase-merge.
* Or, if you are comfortable with git, you may locally squash down to two commits,
one with the input changes and one with all of the generated changes and code fixes,
and rebase-merge.
### Notes about emoji data files
One difference here: Multiple stages, generating charts along the way.
Initial draft versions of eg annotations for candidates,
which is in addition to the current data for the next version of Unicode.
Once they get code points, they go into the regular files.
The emoji tools also read annotation data from CLDR and from the candidates file,
and use the CLDR emoji collation data (and interpolates the candidates data).
Another difference:
The charts are not normative; they get updated out of cycle, for example with new vendor images.
https://github.com/unicode-org/unicodetools/tree/main/unicodetools/data/emoji/dev
## Publication
Certain snapshots of the .../dev/ files are copied into https://www.unicode.org/Public/draft/
for Unicode alpha, beta, and final releases, and more as appropriate.
* UCD files go into https://www.unicode.org/Public/draft/UCD/
* UCA files go into https://www.unicode.org/Public/draft/UCA/
* emoji files go into https://www.unicode.org/Public/draft/emoji/
* etc.
* Inside draft there are no folder levels with version numbers.
Make sure to publish exactly the intended set of files.
Skip the NamesList.txt and Unihan data files (see above),
and skip any others that are only for internal use.
### Publish an alpha snapshot
For the alpha review, publish (at least) the UCD and emoji files, and the charts.
Review/edit the pub/*.sh scripts and advance the version numbers and copyright years.
Run the [pub/copy-alpha-to-draft.sh](https://github.com/unicode-org/unicodetools/blob/main/pub/copy-alpha-to-draft.sh)
script from an up-to-date repo workspace.
The script copies the set of the .../dev/ data files for an alpha snapshot
from a unicodetools workspace to a target folder with the layout of https://www.unicode.org/Public/draft/ .
Send the resulting zip file to Rick for posting to https://www.unicode.org/Public/draft/ .
Ask Rick to add other files that are not tracked in the unicodetools repo:
* Unihan.zip to .../draft/UCD/ucd
* alpha charts to .../draft/UCD/charts
Note: No version/delta infixes in names of data files.
We simply use the “draft” folder and the file-internal time stamps for versioning.
### Publish a beta snapshot
For the beta review, publish all of the data files, and the charts.
Run the [pub/copy-beta-to-draft.sh](https://github.com/unicode-org/unicodetools/blob/main/pub/copy-beta-to-draft.sh)
script from an up-to-date repo workspace.
The script copies the set of the .../dev/ data files for a beta snapshot
from a unicodetools workspace to a target folder with the layout of https://www.unicode.org/Public/draft/ .
Send the resulting zip file to Rick for posting to https://www.unicode.org/Public/draft/ .
Ask Rick to add other files that are not tracked in the unicodetools repo:
* Unihan.zip to .../draft/UCD/ucd
* UCDXML files to .../draft/UCD/ucdxml
* beta charts to .../draft/UCD/charts
### Publish a release snapshot
After the last UTC meeting for the release, collect all of the data file updates
(mostly from recently opened action items).
When complete, publish the draft files once more via the beta script.
Verify the final set of files in the draft folder.
Run the [pub/copy-final.sh](https://github.com/unicode-org/unicodetools/blob/main/pub/copy-final.sh)
script from an up-to-date repo workspace.
Send the resulting zip file to Rick for posting to https://www.unicode.org/Public/ (not .../Public/draft/).
Ask Rick to add other files that are not tracked in the unicodetools repo:
* Unihan.zip to .../{version}/ucd
* UCDXML files to .../{version}/ucdxml
* final charts to .../{version}/charts
This script works much like the beta script, except it:
* assembles all of the files for Public/ in their release folder structure,
rather than for Public/draft/
* creates a zipped/{version} folder with UCD.zip
### Before a release
When the data files are supposed to be final, about a week or two before the release:
Verify once more that the unicodetools repo .../dev/ files match the released/published files.
Create a release tag in the repo.
Example, from four days before Unicode 15.1 was released:
https://github.com/unicode-org/unicodetools/releases/tag/final-15.1-20230908
### After a release
Copy a snapshot of the unicodetools repo .../dev/ files to a versioned unicodetools folder;
for example: .../unicodetools/data/ucd/16.0.0/ .
(We no longer append a “-Update” suffix to the folder name.)
List: emoji, idna, security, uca, ucd, ucdxml
Watch for different naming conventions: emoji versions use only two fields, not three.
Edit the pub/*.sh scripts and advance the version numbers.
Change the Unicode Tools code as necessary for the start of work on the next version.
Settings.java lastVersion & latestVersion and more.
Example, Unicode 15.1→16.0: https://github.com/unicode-org/unicodetools/pull/539
Declare “main” to be open for the next version.