Split EMOJI_MODIFIER_BASE and expand emoji presentation coverage (#18)

* Split EMOJI_MODIFIER_BASE into text and emoji presentation variants

Distinguish between modifier bases that default to text presentation
and those that default to emoji presentation. Bare text-default modifier
bases are no longer classified as emoji presentation, aligning with UTS #51.

* Add standalone emoji modifier and regional indicator to emoji presentation

Both EMOJI_MODIFIER and REGIONAL_INDICATOR have Emoji_Presentation=Yes
in Unicode data, so they should be classified as emoji presentation
even when they appear outside of a modifier or flag sequence.

* Add unqualified keycap sequence to emoji presentation

Recognize keycap sequences without VS16 (e.g., # + U+20E3) as emoji
presentation. Previously only the fully qualified form with VS16
between the keycap base and combining enclosing keycap was matched.

* Regenerate emoji_presentation_scanner.c from updated Ragel grammar

* Update README classifier example for modifier base split

* Add test harness with Makefile integration

Introduces emoji_test.c, emoji_test_data.pl, and a make test target.
Test data is generated by the Perl script using Unicode properties to
classify codepoints, then compiled into a C harness that validates
scan_emoji_presentation() output for each sequence.
7 files changed
tree: e4246d4a70e9483ab1bb587e1d3c210694b88fa9
  1. CONTRIBUTING.md
  2. emoji_presentation_scanner.c
  3. emoji_presentation_scanner.rl
  4. emoji_test.c
  5. emoji_test_data.h
  6. emoji_test_data.pl
  7. LICENSE
  8. Makefile
  9. NEWS
  10. README.md
README.md

Emoji Segmenter

This repository contains a Ragel grammar and generated C code for segmenting runs of text into text-presentation and emoji-presentation runs. It is currently used in projects such as Chromium and Pango for deciding which preferred presentation, color or text, a run of text should have.

The goal is to stay very close to the grammar definitions in Unicode Technical Standard #51.

API

By including the emoji_presentation_scanner.c file, you will be able to call the following API:

emoji_text_iter_t scan_emoji_presentation (emoji_text_iter_t p,
    const emoji_text_iter_t pe,
    bool* is_emoji,
    bool* has_vs)

This call scans from p for the next grammar token and returns an iterator that points to the end of that token. An end iterator must be passed as pe so that the scanner knows where to stop. The output parameter is_emoji reports whether the token has emoji presentation (true) or text presentation (false); has_vs is set to true if the token contains a variation selector.

A grammar token is one of the following: an emoji followed by variation selector 15 (text presentation), an emoji presentation sequence (emoji + VS16), an emoji-presentation emoji or emoji sequence, or a single text-presentation character.

emoji_text_iter_t is an iterator type over a buffer of the character classes that are defined at the beginning of the Ragel file, e.g. EMOJI, EMOJI_TEXT_PRESENTATION, REGIONAL_INDICATOR, KEYCAP_BASE, etc.
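As a rough illustration of how a caller might drive the scanner, the sketch below walks a category buffer token by token and counts how many runs of equal presentation it contains. It assumes emoji_text_iter_t is a plain pointer over category bytes. To stay self-contained it uses toy_scan, a hypothetical stand-in with the same signature as scan_emoji_presentation; the stand-in consumes one category per token and treats the made-up category value 1 as emoji. In real use, the generated scanner would take its place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: the iterator is a plain pointer over category bytes. */
typedef const unsigned char *emoji_text_iter_t;

/* Hypothetical stand-in with the same signature as scan_emoji_presentation().
 * It consumes exactly one category per token and reports category value 1
 * (a made-up value) as emoji. It is NOT the generated Ragel scanner. */
static emoji_text_iter_t
toy_scan (emoji_text_iter_t p, const emoji_text_iter_t pe,
          bool *is_emoji, bool *has_vs)
{
  (void) pe; /* the toy never looks ahead, so the end iterator is unused */
  *is_emoji = (*p == 1);
  *has_vs = false;
  return p + 1;
}

/* Walk the buffer token by token and count runs of equal presentation.
 * Adjacent tokens with the same is_emoji value belong to the same run. */
static size_t
count_presentation_runs (emoji_text_iter_t p, const emoji_text_iter_t pe)
{
  size_t runs = 0;
  bool have_prev = false;
  bool prev_is_emoji = false;

  while (p < pe)
    {
      bool is_emoji = false;
      bool has_vs = false;
      emoji_text_iter_t next = toy_scan (p, pe, &is_emoji, &has_vs);

      assert (next > p); /* every token must consume at least one category */

      if (!have_prev || is_emoji != prev_is_emoji)
        runs++;

      have_prev = true;
      prev_is_emoji = is_emoji;
      p = next;
    }
  return runs;
}
```

A caller that needs the run boundaries rather than a count could record p each time the is_emoji value flips.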

By typedef'ing emoji_text_iter_t to your own iterator type, you can implement an adapter class that iterates over an input text buffer in any encoding and, on dereferencing, returns the correct Ragel class. The adapter needs a mapping from Unicode character class to Ragel class similar to the following example, taken from Chromium:

char EmojiSegmentationCategory(UChar32 codepoint) {
  // Specific ones first.
  if (codepoint == kCombiningEnclosingKeycapCharacter)
    return COMBINING_ENCLOSING_KEYCAP;
  if (codepoint == kCombiningEnclosingCircleBackslashCharacter)
    return COMBINING_ENCLOSING_CIRCLE_BACKSLASH;
  if (codepoint == kZeroWidthJoinerCharacter)
    return ZWJ;
  if (codepoint == kVariationSelector15Character)
    return VS15;
  if (codepoint == kVariationSelector16Character)
    return VS16;
  if (codepoint == 0x1F3F4)
    return TAG_BASE;
  if ((codepoint >= 0xE0030 && codepoint <= 0xE0039) ||
      (codepoint >= 0xE0061 && codepoint <= 0xE007A))
    return TAG_SEQUENCE;
  if (codepoint == 0xE007F)
    return TAG_TERM;
  if (Character::IsEmojiModifierBase(codepoint)) {
    if (Character::IsEmojiEmojiDefault(codepoint))
      return EMOJI_MODIFIER_BASE_EMOJI;
    return EMOJI_MODIFIER_BASE_TEXT;
  }
  if (Character::IsModifier(codepoint))
    return EMOJI_MODIFIER;
  if (Character::IsRegionalIndicator(codepoint))
    return REGIONAL_INDICATOR;
  if (Character::IsEmojiKeycapBase(codepoint))
    return KEYCAP_BASE;

  if (Character::IsEmojiEmojiDefault(codepoint))
    return EMOJI_EMOJI_PRESENTATION;
  if (Character::IsEmojiTextDefault(codepoint))
    return EMOJI_TEXT_PRESENTATION;
  if (Character::IsEmoji(codepoint))
    return EMOJI;

  // Ragel state machine will interpret unknown category as "any".
  return kMaxEmojiScannerCategory;
}
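For callers that do not want a custom iterator class, one alternative is to classify the text into a byte buffer up front and let emoji_text_iter_t be a plain pointer into it. The sketch below assumes UTF-32 input and covers only a handful of categories whose code points are fixed. The demo_ names and enum values are hypothetical; the real category constants come from the Ragel file and may differ. A full adapter would also need the property-based checks shown in the Chromium example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical category values for illustration only. The real constants
 * are defined at the top of the Ragel file and may differ. */
enum demo_category
{
  DEMO_OTHER,
  DEMO_ZWJ,
  DEMO_VS15,
  DEMO_VS16,
  DEMO_COMBINING_ENCLOSING_KEYCAP,
  DEMO_KEYCAP_BASE,
  DEMO_REGIONAL_INDICATOR,
  DEMO_EMOJI_MODIFIER
};

/* Map one code point to a demo category, most specific checks first. */
static unsigned char
demo_classify (uint32_t cp)
{
  if (cp == 0x20E3) return DEMO_COMBINING_ENCLOSING_KEYCAP;
  if (cp == 0x200D) return DEMO_ZWJ;
  if (cp == 0xFE0E) return DEMO_VS15;
  if (cp == 0xFE0F) return DEMO_VS16;
  if (cp >= 0x1F3FB && cp <= 0x1F3FF) return DEMO_EMOJI_MODIFIER;     /* skin tones */
  if (cp >= 0x1F1E6 && cp <= 0x1F1FF) return DEMO_REGIONAL_INDICATOR; /* flag letters */
  if (cp == '#' || cp == '*' || (cp >= '0' && cp <= '9'))
    return DEMO_KEYCAP_BASE;
  return DEMO_OTHER;
}

/* Fill out[0..len) with one category byte per input code point. */
static void
demo_classify_buffer (const uint32_t *text, size_t len, unsigned char *out)
{
  assert (len == 0 || (text != NULL && out != NULL));
  for (size_t i = 0; i < len; i++)
    out[i] = demo_classify (text[i]);
}
```

With such a buffer, p and pe are simply out and out + len, at the cost of one extra pass over the text.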

Update/Build requisites

You need to have ragel installed if you want to modify the grammar and generate a new C file as output.

apt-get install ragel

then run

make

to update the emoji_presentation_scanner.c and emoji_presentation_scanner_vs.c output C source files.

Contributing

See the CONTRIBUTING.md file for how to contribute.