Blink-V8 bindings generator (bind_gen package)

What's bind_gen?

The Python package bind_gen is the core of the Blink-V8 bindings code generator. generate_bindings.py is the driver script: it takes a Web IDL database (web_idl_database.pickle, generated by the web_idl_database GN target) as input and produces a set of C++ source files for the Blink-V8 bindings (v8_*.h, v8_*.cc).

Design and code structure

The bindings code generator is implemented as a builder of CodeNode trees, where CodeNode is the fundamental building block. The following subsections describe what CodeNode is and how the code generator builds a tree of CodeNodes.

CodeNode

The code generator produces C++ source files (text files), but the content of each file is represented neither as a single giant string nor as a list of strings. The content of each file is represented as a CodeNode tree.

CodeNode is a fundamental building block that represents a text fragment in the tree structure. A text file is represented as a tree of CodeNodes, each of which represents a corresponding text fragment. The code generator is the CodeNode tree builder.

Here is a simple example that builds a CodeNode tree.

# SequenceNode and TextNode are subclasses of CodeNode.

def make_prologue():
  return SequenceNode([
    TextNode("// Prologue"),
    TextNode("SetUp();"),
  ])

def make_epilogue():
  return SequenceNode([
    TextNode("// Epilogue"),
    TextNode("CleanUp();"),
  ])

def main():
  root_node = SequenceNode([
    make_prologue(),
    TextNode("LOG(INFO) << \"hello, world\";"),
    make_epilogue(),
  ])

The root_node above represents the following text.

// Prologue
SetUp();
LOG(INFO) << "hello, world";
// Epilogue
CleanUp();

The basic features of CodeNode are implemented in code_node.py. Just for convenience, CodeNode subclasses corresponding to C++ constructs are provided in code_node_cxx.py.

CodeNode has an object-oriented design and has internal states (not only the parent / child nodes but also more states to support advanced features).
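
The tree-to-text idea can be sketched in a few lines. This is a minimal illustration only: ToyTextNode and ToySequenceNode are hypothetical stand-ins written for this example, not the real classes in code_node.py, and they omit the internal state that the real CodeNode carries.

```python
# Toy stand-ins for illustration only. The real classes live in code_node.py
# and carry more state than this.
class ToyTextNode:
    def __init__(self, text):
        self.text = text

    def render(self):
        return self.text


class ToySequenceNode:
    def __init__(self, children, separator="\n"):
        self.children = list(children)
        self.separator = separator

    def render(self):
        # Rendering is a depth-first walk that joins the rendered children.
        return self.separator.join(child.render() for child in self.children)


def make_prologue():
    return ToySequenceNode([ToyTextNode("// Prologue"), ToyTextNode("SetUp();")])


def make_epilogue():
    return ToySequenceNode([ToyTextNode("// Epilogue"), ToyTextNode("CleanUp();")])


root_node = ToySequenceNode([
    make_prologue(),
    ToyTextNode('LOG(INFO) << "hello, world";'),
    make_epilogue(),
])
rendered = root_node.render()
print(rendered)
```

Because every subtree knows how to render itself, a builder such as make_prologue can be reused or reordered without touching the text produced by its siblings.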

CodeNode tree builders

The bindings code generator consists of multiple sub code generators. For example, interface.py is a sub code generator of Web IDL interface and enumeration.py is a sub code generator of Web IDL enumeration. Each Web IDL definition has its own sub code generator.

This subsection describes how a sub code generator builds a CodeNode tree and produces C++ source files, using enumeration.py as an example. The example code snippets below are simplified for explanation.

def generate_enumerations(task_queue):
    for enumeration in web_idl_database.enumerations:
        task_queue.post_task(generate_enumeration, enumeration.identifier)

generate_enumerations is the entry point to this sub code generator. To allow parallel processing, it posts one task per enumeration to task_queue rather than generating the files directly. generate_enumeration (singular form) is the function that actually produces a pair of C++ source files (*.h and *.cc).
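
The post-then-run pattern can be sketched as follows. ToyTaskQueue is a hypothetical stand-in built on concurrent.futures for this example. The real task_queue may be implemented differently, and the file names returned here are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyTaskQueue:
    """Hypothetical stand-in for task_queue; the real one may differ."""

    def __init__(self):
        self._tasks = []

    def post_task(self, func, *args):
        # Only record the task. Nothing runs until run() is called.
        self._tasks.append((func, args))

    def run(self, max_workers=4):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(func, *args) for func, args in self._tasks]
            # Collect results in posting order.
            return [future.result() for future in futures]


def generate_enumeration(identifier):
    # Placeholder for building and writing the two trees. Here it only
    # reports the (illustrative) pair of files it would produce.
    name = "v8_" + identifier.lower()
    return (name + ".h", name + ".cc")


task_queue = ToyTaskQueue()
for identifier in ["Color", "Mode"]:
    task_queue.post_task(generate_enumeration, identifier)
results = task_queue.run()
print(results)
```

Each posted task is independent of the others, which is what makes it safe to run them concurrently.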

def generate_enumeration(enumeration_identifier):
    # Filepaths
    header_path = path_manager.api_path(ext="h")
    source_path = path_manager.api_path(ext="cc")

    # Root nodes
    header_node = ListNode(tail="\n")
    source_node = ListNode(tail="\n")

    # ... fill the contents of `header_node` and `source_node` ...

    # Write down to the files.
    write_code_node_to_file(header_node, path_manager.gen_path_to(header_path))
    write_code_node_to_file(source_node, path_manager.gen_path_to(source_path))

The main task of generate_enumeration is to build CodeNode trees and write them to files. A key point is that two trees are built in parallel: one for *.h and the other for *.cc. This lets us add a function declaration to the header file while adding the corresponding function definition to the source file. The following code snippet shows how constructors are added to the header file and the source file.

    # Namespaces
    header_blink_ns = CxxNamespaceNode(name_style.namespace("blink"))
    source_blink_ns = CxxNamespaceNode(name_style.namespace("blink"))
    # {header,source}_blink_ns are added to {header,source}_node (the root
    # nodes) respectively.

    # Class definition
    class_def = CxxClassDefNode(cg_context.class_name,
                                base_class_names=["bindings::EnumerationBase"],
                                final=True,
                                export=component_export(
                                    api_component, for_testing))

    ctor_decls, ctor_defs = make_constructors(cg_context)

    # Define the class in 'blink' namespace.
    header_blink_ns.body.append(class_def)

    # Add constructors to public: section of the class.
    class_def.public_section.append(ctor_decls)
    # Add constructors (function definitions) into 'blink' namespace in the
    # source file.
    source_blink_ns.body.append(ctor_defs)

In the above code snippet, make_constructors creates and returns a CodeNode tree for the header file and another CodeNode tree for the source file. In most cases, functions named make_xxx create and return such a pair of CodeNode trees. These functions are subtree builders of the CodeNode trees.

These subtree builders are written in a functional style, whereas CodeNodes themselves are written in an object-oriented style. Each call creates a new pair of CodeNode trees (the returned code node instances differ per call, so their internal states are separate), but the contents are determined solely by the input arguments. This property is very important when we use closures in advanced use cases.
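
The "fresh objects, same contents" property can be illustrated with a toy builder. This sketch uses plain lists of strings in place of CodeNode trees, and this make_constructors is a simplified stand-in that takes a class name rather than a cg_context.

```python
def make_constructors(class_name):
    """Toy subtree builder: returns (declarations, definitions)."""
    # New lists are created on every call, so callers never share state.
    decls = ["{}();".format(class_name)]
    defs = ["{0}::{0}() = default;".format(class_name)]
    return decls, defs


# Build the header and the source contents side by side.
header_body, source_body = [], []
ctor_decls, ctor_defs = make_constructors("V8Color")
header_body.extend(ctor_decls)
source_body.extend(ctor_defs)

# Two calls with the same input give equal contents but distinct objects.
first = make_constructors("V8Color")
second = make_constructors("V8Color")
print(first == second, first[0] is second[0])
```

Since the output depends only on the arguments, a builder can be called again later (for example from a closure) and is guaranteed to produce the same code.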

So far, the typical code structure of the sub code generators is covered. enumeration.py consists of several make_xxx functions (subtree builders) + generate_enumeration (the top-level tree builder + file writer).

Advanced: Two-step code generation and declarative style

Typical problems of (simple) code generation

Bindings code generation has the following typical problems. Suppose we have the following simple code generator.

# Example of simple code generation

def make_foo():
  return SequenceNode([
    TextNode("HeavyResource* res = HeavyFunc();"),
    TextNode("Foo(res);"),
  ])

def make_bar():
  return SequenceNode([
    TextNode("HeavyResource* res = HeavyFunc();"),
    TextNode("Bar(res);"),
  ])

def main():
  root_node = SequenceNode([
    make_foo(),
    make_bar(),
  ])

This produces the following C++ code, where we have two major problems. The first problem is a symbol conflict: res is defined twice. Even if we gave different names like res1 and res2, we have the second problem: the produced code calls HeavyFunc twice, which is not efficient.

// Output of simple code generation example
HeavyResource* res = HeavyFunc();
Foo(res);
HeavyResource* res = HeavyFunc();
Bar(res);

Ideally we'd like to have the following code, without introducing tight coupling between make_foo and make_bar.

// Ideal generated code
HeavyResource* res = HeavyFunc();
Foo(res);
Bar(res);

Two-step code generation as a solution

To resolve the above problems, the bindings code generator supports two-step code generation. This approach resembles declarative programming.

# Example of two-step code generation

def bind_vars(code_node):
  local_vars = [
    SymbolNode("heavy_resource",
               "HeavyResource* ${heavy_resource} = HeavyFunc(${address}, ${phone_number});"),
    SymbolNode("address",
               "String ${address} = GetAddress();"),
    SymbolNode("phone_number",
               "String ${phone_number} = GetPhoneNumber();"),
  ]
  for symbol_node in local_vars:
    code_node.register_code_symbol(symbol_node)

def make_foo():
  return SequenceNode([
    TextNode("Foo(${heavy_resource});"),
  ])

def make_bar():
  return SequenceNode([
    TextNode("Bar(${heavy_resource});"),
  ])

def main():
  root_node = SymbolScopeNode()
  bind_vars(root_node)
  root_node.extend([
    make_foo(),
    make_bar(),
  ])

The above code generator has two kinds of code generation. One kind is make_foo and make_bar, which are almost the same as before except that they use a template variable (${heavy_resource}). The other kind is bind_vars, which provides a catalogue of symbol definitions. The catalogue keeps the definitions of make_foo and make_bar simple. This code generator produces the following C++ code, without duplicated function calls.

// Output of two-step code generation example
String address = GetAddress();
String phone_number = GetPhoneNumber();
HeavyResource* heavy_resource = HeavyFunc(address, phone_number);
Foo(heavy_resource);
Bar(heavy_resource);

The mechanism of two-step code generation is simple. SymbolNode(name, definition) consists of a symbol name and a code fragment that defines the symbol. When a symbol is referenced as ${symbol_name}, the reference is replaced with symbol_name, and it also triggers insertion of the symbol definition into a surrounding SequenceNode. This step happens recursively, so not only heavy_resource's definition but also the definitions of address and phone_number are inserted.
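
The substitute-and-insert mechanism can be sketched with plain strings. This is a toy model under simplifying assumptions: definitions are always emitted immediately before the first statement that needs them, whereas the real generator chooses positions heuristically. The render function and the unused symbol are made up for this example.

```python
import re

REF = re.compile(r"\$\{(\w+)\}")


def render(statements, symbols):
    """Toy two-step rendering.

    statements: list of templates that may contain ${name} references.
    symbols: dict mapping a symbol name to its definition template.
    """
    emitted = []
    defined = set()

    def ensure_defined(name):
        if name in defined:
            return
        defined.add(name)
        template = symbols[name]
        # Define the dependencies first (recursive step).
        for dep in REF.findall(template):
            if dep != name:
                ensure_defined(dep)
        emitted.append(REF.sub(r"\1", template))

    for statement in statements:
        for name in REF.findall(statement):
            ensure_defined(name)
        emitted.append(REF.sub(r"\1", statement))
    return emitted


symbols = {
    "heavy_resource":
        "HeavyResource* ${heavy_resource} = "
        "HeavyFunc(${address}, ${phone_number});",
    "address": "String ${address} = GetAddress();",
    "phone_number": "String ${phone_number} = GetPhoneNumber();",
    "unused": "int ${unused} = 0;",  # never referenced, so never emitted
}
lines = render(["Foo(${heavy_resource});", "Bar(${heavy_resource});"], symbols)
print("\n".join(lines))
```

Each definition is emitted at most once, and a symbol that is never referenced produces no output at all.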

With the two-step code generation, it's possible (and expected) to write code generators in the declarative programming style, which works better in general than the imperative programming style.

Important subclasses of CodeNode for two-step code generation

SymbolNode consists of a symbol name and its definition. You can reference a symbol as ${symbol_name} in TextNode and FormatNode. It's okay that you never reference a symbol. The symbol definition will be automatically inserted only when you reference the symbol.

For simple use cases, a SymbolNode can be constructed from a pair of a symbol name and a plain text (which can contain references in the form of ${...}) as the definition.

# Example of simple use cases
addr_symbol = SymbolNode("address",
                         "void* ${address} = ${base} + ${offset};")

For more complicated use cases, SymbolNode's definition can be a callable that returns a SymbolDefinitionNode instead. This is useful when the definition has a complex structure of code node tree, since a plain text definition cannot represent a code node tree structure.

# Example of complicated use cases
def create_address(symbol_node):
  node = SymbolDefinitionNode(symbol_node)
  node.extend([
    TextNode("void* ${address} = ${base} + ${offset};"),
    CxxUnlikelyIfNode(
      cond="!${address}",
      attribute=None,
      body=[
        TextNode("${exception_state}.ThrowRangeError(\"...\");"),
        TextNode("return;"),
      ]),
  ])
  return node

addr_symbol = SymbolNode("address",
                         definition_constructor=create_address)

where CxxUnlikelyIfNode represents a C++ if statement with an unlikely condition (defined in code_node_cxx.py). This definition is better than a plain text definition because it lets the code generator insert the definition of ${exception_state} at the best position, depending on how likely ${exception_state} is to be used.

// Output of the example of complicated use cases
void* base = ...;  // ${base}'s definition is automatically inserted.
void* offset = ...;  // ${offset}'s definition is automatically inserted.
// ${exception_state}'s definition may be inserted here if it's used often or
// outside of the following if statement.
// ExceptionState exception_state(...);
void* address = base + offset;
if (!address) {
  // ${exception_state}'s definition may be inserted here if it's not used often
  // and is not used outside of this if statement.
  ExceptionState exception_state(...);
  exception_state.ThrowRangeError("...");
  return;
}

SymbolDefinitionNode represents the code fragment that defines a symbol. The code generator automatically inserts symbol definitions at the best positions heuristically. However, it's hard to determine the best position in a single pass, so the code generator iterates symbol definition insertions/relocations until it finds the heuristically best positions. SymbolDefinitionNode is used to identify a subtree of code nodes that defines its symbol (i.e. to distinguish automatically inserted code nodes from the original code node tree).

SequenceNode represents not only a list of CodeNodes but also insertion points of SymbolDefinitionNode. SymbolDefinitionNodes will be inserted between elements within a SequenceNode.

Compared to SequenceNode, ListNode represents just a list of CodeNodes that does not support automatic insertion of symbol definitions, i.e. ListNode is indivisible. SequenceNode should be used when your code nodes represent a series of C++ statements, otherwise ListNode is preferred over SequenceNode so that nothing will be inserted in between. See the following example.

# Example of SequenceNode vs ListNode
int_array = ListNode([
  TextNode("int int_array[] = {"),
  ListNode([
    TextNode("${foo}"),
    TextNode("${bar}"),
  ], separator=","),
  TextNode("};"),
])

node = SequenceNode([
  int_array,
  TextNode("PrintIntArray(int_array);"),
])

This example produces the following C++ code. Since symbol definitions are inserted only between elements of SequenceNode, ${foo} and ${bar}'s definitions won't be inserted within int_array's definition.

// Output of SequenceNode vs ListNode example
int foo = ...;  // ${foo}'s definition is automatically inserted here.
int bar = ...;  // ${bar}'s definition is automatically inserted here.
int int_array[] = {
  // ${foo}'s definition is _not_ inserted here.
  foo,
  // ${bar}'s definition is _not_ inserted here.
  bar
};
PrintIntArray(int_array);
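
The difference between the two node types can be modelled with plain Python values. In this toy sketch (not the real implementation), the outer list plays the role of a SequenceNode, a nested list plays the role of an indivisible ListNode, and definitions are only ever emitted between outer elements. The initial values 1 and 2 are made up for the example.

```python
import re

REF = re.compile(r"\$\{(\w+)\}")


def render_sequence(elements, symbols):
    """Toy model: definitions may only be inserted between `elements`."""
    out, defined = [], set()
    for element in elements:
        # A nested list stands for an indivisible ListNode.
        lines = element if isinstance(element, list) else [element]
        for line in lines:
            for name in REF.findall(line):
                if name not in defined:
                    defined.add(name)
                    # Emitted before the whole element, never inside it.
                    out.append(REF.sub(r"\1", symbols[name]))
        out.extend(REF.sub(r"\1", line) for line in lines)
    return out


symbols = {"foo": "int ${foo} = 1;", "bar": "int ${bar} = 2;"}
int_array = ["int int_array[] = {", "${foo},", "${bar}", "};"]
output = render_sequence([int_array, "PrintIntArray(int_array);"], symbols)
print("\n".join(output))
```

If int_array were flattened into the outer list instead, the definitions of foo and bar would land inside the array initializer and produce invalid C++.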

You can register SymbolNodes only into a SymbolScopeNode. Registered symbols are effective only inside the SymbolScopeNode. This behavior reflects that C++ variables are effective only inside the closest containing C++ block ({...}).
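
The scoping rule can be pictured with collections.ChainMap from the Python standard library. This is only an analogy, assuming that symbol lookup walks outward through enclosing scopes as C++ name lookup does. It is not how SymbolScopeNode is implemented, and GetBase and kOffset are hypothetical names.

```python
from collections import ChainMap

# Outer scope, comparable to a function body.
outer_scope = ChainMap({"base": "void* ${base} = GetBase();"})

# Inner scope, comparable to a nested { ... } block.
inner_scope = outer_scope.new_child(
    {"address": "void* ${address} = ${base} + kOffset;"})

# The inner scope sees its own symbols and those of the outer scope.
visible_inside = sorted(inner_scope)
# The outer scope does not see symbols registered in the inner scope.
visible_outside = sorted(outer_scope)
print(visible_inside, visible_outside)
```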

Tips for debugging and code reading

The driver script generate_bindings.py supports two useful command line flags: --format_generated_files and --enable_code_generation_tracing.

--format_generated_files runs clang-format on the generated files so that they are easy for developers to read.

--enable_code_generation_tracing outputs code comments (e.g. /* make_wrapper_type_info:6304 */) in addition to the regular output, in order to clarify which line of the code generator produced which line of generated code. This is useful for understanding the correspondence between the code generator and the generated code.

When the tracing comments show functions that are too common and uninteresting to you (e.g. make_blink_to_v8_value), you can exclude such functions on a module-by-module basis by using CodeGenTracing.add_modules_to_be_ignored.
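
The idea behind such tracing comments can be sketched with Python's inspect module. This is a hypothetical illustration of how a function:line tag can be derived from the caller's stack frame. The traced helper is made up for this example and is not the actual CodeGenTracing implementation.

```python
import inspect


def traced(text):
    """Prefix `text` with a comment naming the calling function and line."""
    caller = inspect.currentframe().f_back
    return "/* {}:{} */ {}".format(
        caller.f_code.co_name, caller.f_lineno, text)


def make_greeting():
    return traced("Greet();")


line = make_greeting()
print(line)
```

The printed line starts with a comment of the form /* make_greeting:N */, where N is the line number of the traced call inside make_greeting.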

Here is an example command line to run the script with these flags (working as of May 2024).

# Run generate_bindings.py with --format_generated_files and
# --enable_code_generation_tracing.
#
# web_idl_database.pickle must have already been generated and updated.
# Or, run 'autoninja -C out/Default web_idl_database' in advance.

$ cd out/Default
$ python3 ../../third_party/blink/renderer/bindings/scripts/generate_bindings.py \
async_iterator callback_function callback_interface dictionary enumeration interface namespace observable_array sync_iterator typedef union \
--web_idl_database gen/third_party/blink/renderer/bindings/web_idl_database.pickle \
--root_src_dir=../.. \
--root_gen_dir=gen \
--output_reldir=core=third_party/blink/renderer/bindings/core/v8/ \
--output_reldir=modules=third_party/blink/renderer/bindings/modules/v8/ \
--output_reldir=extensions_chromeos=third_party/blink/renderer/bindings/extensions_chromeos/v8/ \
--format_generated_files \
--enable_code_generation_tracing