Bulk Memory Operations and Conditional Segment Initialization

Motivation for Bulk Memory Operations

Some people have mentioned that memcpy and memmove functions are hot when profiling some WebAssembly benchmarks. Some examples:

I've been looking at perf profiles for wasm unity benchmark a bit recently and see that some of the hottest functions are doing memcpy or memset like things. If this is any indication of normal wasm code patterns, I think we could see significant improvement with an intrinsic so it may be worth prioritizing.

In a number of game engines I've been optimizing and benchmarking, interestingly the performance of memcpy() does show up relatively high in profiles. (~2%-5% of total execution time)

Bulk Memory Operations Prototype

I implemented a prototype of a memory.copy instruction in v8 which just calls out to v8's MemMove function. I compared this to an implementation generated by emscripten and currently used in the Unity demo. That implementation aligns, then performs copies using i32.load and i32.store. I've also included the performance achieved by unrolling this loop manually and increasing the access size to i64.

Each test copies size bytes from one address to another, non-overlapping. This is repeated N times. Each row copies a total of 1 Gib of data, and only touches 1 Mib of memory in the source and destination ranges.

This is the core loop:

  let mask = Mib - 1;
  let start = performance.now();
  for (let i = 0; i < N; ++i) {
    f(dst_base + dst, src_base + src, size);
    dst = (dst + size) & mask;
    src = (src + size) & mask;
  }
  let end = performance.now();
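
As a rough sketch of how N and the reported throughput relate to this loop (the helper names iterationsFor and throughputGibPerSec are hypothetical and are not taken from the benchmark source):

```javascript
// Hypothetical helpers, for illustration only.
const Kib = 1024;
const Mib = 1024 * Kib;
const Gib = 1024 * Mib;

// Number of calls needed so that one row copies 1 Gib in total.
function iterationsFor(size) {
  return Gib / size;
}

// Throughput in Gib/s, given the total bytes copied and the elapsed time
// in milliseconds (the unit returned by performance.now()).
function throughputGibPerSec(totalBytes, elapsedMs) {
  return (totalBytes / Gib) / (elapsedMs / 1000);
}

console.log(iterationsFor(32));             // 33554432
console.log(throughputGibPerSec(Gib, 500)); // 2
```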

The code for the benchmark can be found here. Note that this will not run properly without a WebAssembly implementation of memory.copy. For my tests, I hacked a version of v8 to replace any exported function called memcpy or memmove with a new function with the following contents:

(func (param $dst i32) (param $src i32) (param $size i32) (result i32)
  local.get $dst
  local.get $src
  local.get $size
  memory.copy
  local.get $dst)

Here are the results on my machine (x86_64, 2.9GHz, L1 32k, L2 256k, L3 256k):

| | intrinsic | i64 load/store x 4 | i64 load/store x 2 | i32 load/store x 2 | i32 load/store |
| --- | --- | --- | --- | --- | --- |
| size=32b, N=33554432 | 1.382 Gib/s | 1.565 Gib/s | 1.493 Gib/s | 1.275 Gib/s | 1.166 Gib/s |
| size=64b, N=16777216 | 3.285 Gib/s | 2.669 Gib/s | 2.383 Gib/s | 1.861 Gib/s | 1.639 Gib/s |
| size=128b, N=8388608 | 6.162 Gib/s | 3.993 Gib/s | 3.480 Gib/s | 2.433 Gib/s | 2.060 Gib/s |
| size=256b, N=4194304 | 9.939 Gib/s | 5.323 Gib/s | 4.462 Gib/s | 2.724 Gib/s | 2.213 Gib/s |
| size=512b, N=2097152 | 15.777 Gib/s | 6.377 Gib/s | 4.913 Gib/s | 3.231 Gib/s | 2.457 Gib/s |
| size=1.0Kib, N=1048576 | 17.902 Gib/s | 7.010 Gib/s | 6.112 Gib/s | 3.568 Gib/s | 2.614 Gib/s |
| size=2.0Kib, N=524288 | 19.870 Gib/s | 8.248 Gib/s | 6.915 Gib/s | 3.764 Gib/s | 2.699 Gib/s |
| size=4.0Kib, N=262144 | 20.940 Gib/s | 9.145 Gib/s | 7.400 Gib/s | 3.871 Gib/s | 2.729 Gib/s |
| size=8.0Kib, N=131072 | 21.162 Gib/s | 9.258 Gib/s | 7.672 Gib/s | 3.925 Gib/s | 2.763 Gib/s |
| size=16.0Kib, N=65536 | 20.991 Gib/s | 9.758 Gib/s | 7.756 Gib/s | 3.945 Gib/s | 2.773 Gib/s |
| size=32.0Kib, N=32768 | 22.504 Gib/s | 9.956 Gib/s | 7.861 Gib/s | 3.966 Gib/s | 2.780 Gib/s |
| size=64.0Kib, N=16384 | 22.534 Gib/s | 10.088 Gib/s | 7.931 Gib/s | 3.974 Gib/s | 2.782 Gib/s |
| size=128.0Kib, N=8192 | 29.728 Gib/s | 10.032 Gib/s | 7.934 Gib/s | 3.975 Gib/s | 2.782 Gib/s |
| size=256.0Kib, N=4096 | 29.742 Gib/s | 10.116 Gib/s | 7.625 Gib/s | 3.886 Gib/s | 2.781 Gib/s |
| size=512.0Kib, N=2048 | 29.994 Gib/s | 10.090 Gib/s | 7.627 Gib/s | 3.985 Gib/s | 2.785 Gib/s |
| size=1.0Mib, N=1024 | 11.760 Gib/s | 10.091 Gib/s | 7.959 Gib/s | 3.989 Gib/s | 2.787 Gib/s |

Motivation for Conditional Segment Initialization

Under the current threading proposal, to share a module between multiple agents, the module must be instantiated multiple times: once per agent. Instantiation initializes linear memory with the contents in the module's data segments. If the memory is shared between multiple agents, it will be initialized multiple times, potentially overwriting stores that occurred after the previous initializations.

For example:

;; The module.
(module
  (memory (export "memory") 1)

  ;; Some value used as a counter.
  (data (i32.const 0) "\00")

  ;; Add one to the counter.
  (func (export "addOne")
    (i32.store8
      (i32.const 0)
      (i32.add
        (i32.load8_u (i32.const 0))
        (i32.const 1)))
  )
)

// main.js
let moduleBytes = ...;

WebAssembly.instantiate(moduleBytes).then(
  ({module, instance}) => {
    // Increment our counter.
    instance.exports.addOne();

    // Spawn a new Worker.
    let worker = new Worker('worker.js');

    // Send the module to the new Worker.
    worker.postMessage(module);
  });

// worker.js

function onmessage(event) {
  let module = event.data;

  // Use the module to create another instance.
  WebAssembly.instantiate(module).then(
    (instance) => {
      // Oops, our counter has been clobbered.
    });
}

This can be worked around by storing the data segments in a separate module which is only instantiated once, then exporting this memory to be used by another module that contains only code. This works, but it is cumbersome since it requires two modules where one should be enough.

Motivation for combining Bulk Memory Operations + Conditional Segment Initialization Proposals

When discussing the design of Conditional Segment Initialization, we found that programmatic memory initialization from a read-only data segment (via the memory.init instruction, described below) has similar behavior to the proposed instruction to copy memory regions from linear memory (memory.copy, also described below.)

Design

Copying between regions in linear memory or a table is accomplished with the new *.copy instructions:

  • memory.copy: copy from one region of linear memory to another
  • table.copy: copy from one region of a table to another

Filling a memory region can be accomplished with memory.fill:

  • memory.fill: fill a region of linear memory with a given byte value

The binary format for the data section currently has a collection of segments, each of which has a memory index, an initializer expression for its offset, and its raw data.

Since WebAssembly currently does not allow for multiple memories, the memory index of each segment must be zero. We can repurpose this 32-bit integer as a flags field where new meaning is attached to nonzero values.

When the low bit of the new flags field is 1, this segment is passive. A passive segment will not be automatically copied into the memory or table on instantiation, and must instead be applied manually using the following new instructions:

  • memory.init: copy a region from a data segment
  • table.init: copy a region from an element segment

A passive segment has no initializer expression, since it will be specified as an operand to memory.init or table.init.

Segments can also be shrunk to size zero by using the following new instructions:

  • data.drop: discard the data in a data segment
  • elem.drop: discard the data in an element segment

An active segment is equivalent to a passive segment, but with an implicit memory.init followed by a data.drop (or table.init followed by an elem.drop) that is prepended to the module's start function.

Additionally, the reference-types proposal introduces the notion of a function reference (a function whose address is a program value). To support this, element segments can have several encodings, and can also be used to forward-declare functions whose address will be taken; see below.

The reference-types proposal also introduces the bulk instructions table.fill and table.grow, both of which take a function reference as an initializer argument.

Data segments

The meaning of the bits of the flag field (a varuint32) for data segments is:

| Bit | Meaning |
| --- | --- |
| 0 | 0=is active, 1=is passive |
| 1 | if bit 0 clear: 0=memory 0, 1=has memory index |

which yields this view, with the fields carried by each flag value:

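| Flags | Meaning | Memory index | Offset in memory | Count | Payload |
| --- | --- | --- | --- | --- | --- |
| 0 | Active | | init_expr | varuint32 | u8* |
| 1 | Passive | | | varuint32 | u8* |
| 2 | Active with memory index | varuint32 | init_expr | varuint32 | u8* |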

All other flag values are illegal. At present the memory index must be zero, but the upcoming multi-memory proposal changes that.
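
To make the three legal encodings concrete, here is a minimal sketch of a decoder for the data segment flags field. The function name and the shape of the returned object are hypothetical, chosen only for this example:

```javascript
// Hypothetical helper: reports which fields follow a data segment's flags.
function describeDataSegmentFlags(flags) {
  switch (flags) {
    case 0:
      return { mode: 'active', hasMemoryIndex: false, hasOffset: true };
    case 1:
      return { mode: 'passive', hasMemoryIndex: false, hasOffset: false };
    case 2:
      return { mode: 'active', hasMemoryIndex: true, hasOffset: true };
    default:
      // All other flag values are illegal.
      throw new Error(`illegal data segment flags: ${flags}`);
  }
}

console.log(describeDataSegmentFlags(1).mode); // passive
```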

Element segments

The meaning of the bits of the flag field (a varuint32) for element segments is:

| Bit | Meaning |
| --- | --- |
| 0 | 0=is active, 1=is passive |
| 1 | if bit 0 clear: 0=table 0, 1=has table index |
| | if bit 0 set: 0=passive, 1=declared |
| 2 | 0=carries indices; 1=carries elemexprs |

which yields this view, with the fields carried by each flag value:

| Flag | Meaning | Table index | Offset in table | Encoding | Count | Payload |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Legacy active, funcref externval | | init_expr | | varuint32 | idx* |
| 1 | Passive, externval | | | extern_kind | varuint32 | idx* |
| 2 | Active, externval | varuint32 | init_expr | extern_kind | varuint32 | idx* |
| 3 | Declared, externval | | | extern_kind | varuint32 | idx* |
| 4 | Legacy active, funcref elemexpr | | init_expr | | varuint32 | elem_expr* |
| 5 | Passive, elemexpr | | | elem_type | varuint32 | elem_expr* |
| 6 | Active, elemexpr | varuint32 | init_expr | elem_type | varuint32 | elem_expr* |
| 7 | Declared, elemexpr | | | elem_type | varuint32 | elem_expr* |

All other flag values are illegal. Note that the “declared” attribute is not used by this proposal, but is used by the reference-types proposal.

The extern_kind must be zero, signifying a function definition. An idx is a varuint32 that references an entity in the module, currently only its function table.

At present the table index must be zero, but the reference-types proposal introduces a notion of multiple tables.
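
The element segment flags can be decoded bit by bit. The following is a minimal sketch that reproduces the eight rows listed above; the function name and the returned field names are hypothetical:

```javascript
// Hypothetical helper: reports which fields follow an element segment's flags.
function describeElemSegmentFlags(flags) {
  if (!Number.isInteger(flags) || flags < 0 || flags > 7) {
    throw new Error(`illegal element segment flags: ${flags}`);
  }
  const bit0 = flags & 1;
  const bit1 = (flags >> 1) & 1;
  const bit2 = (flags >> 2) & 1;

  let mode;
  if (bit0 === 0) {
    mode = 'active';
  } else {
    mode = bit1 === 1 ? 'declared' : 'passive';
  }

  return {
    mode,
    // Only active segments can name a table and carry an offset.
    hasTableIndex: bit0 === 0 && bit1 === 1,
    hasOffset: bit0 === 0,
    // The legacy active forms (flags 0 and 4) have no encoding field.
    hasEncoding: !(bit0 === 0 && bit1 === 0),
    payload: bit2 === 1 ? 'elem_expr' : 'idx',
  };
}

console.log(describeElemSegmentFlags(5).mode);    // passive
console.log(describeElemSegmentFlags(5).payload); // elem_expr
```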

An elem_expr is like an init_expr, but can only consist of one of the following instruction sequences:

| Binary | Text | Description |
| --- | --- | --- |
| 0xd0 0x0b | ref.null end | Returns a null reference |
| 0xd2 varuint32 0x0b | ref.func $funcidx end | Returns a reference to function $funcidx |
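
As an illustration, the two elem_expr forms could be recognized as follows. This is only a sketch: decodeElemExpr is a hypothetical name, and it assumes the varuint32 is encoded as unsigned LEB128, as elsewhere in the binary format.

```javascript
// Hypothetical helper: decodes one elem_expr from the start of `bytes`.
// Returns the decoded form and the number of bytes consumed.
function decodeElemExpr(bytes) {
  if (bytes[0] === 0xd0 && bytes[1] === 0x0b) {
    return { kind: 'ref.null', length: 2 };
  }
  if (bytes[0] === 0xd2) {
    // Decode the varuint32 function index (unsigned LEB128).
    let funcidx = 0;
    let shift = 0;
    let pos = 1;
    let byte;
    do {
      byte = bytes[pos++];
      funcidx |= (byte & 0x7f) << shift;
      shift += 7;
    } while ((byte & 0x80) !== 0 && shift < 35);
    // The expression must be terminated by the `end` opcode.
    if (bytes[pos] === 0x0b) {
      return { kind: 'ref.func', funcidx: funcidx >>> 0, length: pos + 1 };
    }
  }
  throw new Error('invalid elem_expr');
}

console.log(decodeElemExpr([0xd2, 0xac, 0x02, 0x0b]).funcidx); // 300
```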

Segment Initialization

In the MVP, segments are initialized during module instantiation. If any segment would be initialized out-of-bounds, then the memory or table instance is not modified.

This behavior is changed in the bulk memory proposal:

Each active segment is initialized in module-definition order. For each segment, if reading the source or writing the destination would go out of bounds, then instantiation fails at that point. Data that had already been written for previous (in-bounds) segments stays written.
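
The new behavior can be modeled in a few lines. This is a simplified sketch, not an engine implementation: memory is a plain Uint8Array, each segment is a hypothetical { offset, bytes } record, and a thrown RangeError stands in for instantiation failure.

```javascript
// Simplified model of active segment initialization under this proposal.
function initActiveSegments(memory, segments) {
  for (const { offset, bytes } of segments) {
    // Each segment is bounds-checked on its own, in definition order.
    if (offset + bytes.length > memory.length) {
      throw new RangeError('out of bounds data segment');
    }
    memory.set(bytes, offset);
  }
}

const memory = new Uint8Array(8);
try {
  initActiveSegments(memory, [
    { offset: 0, bytes: Uint8Array.of(1, 2) }, // in bounds: written
    { offset: 7, bytes: Uint8Array.of(3, 4) }, // out of bounds: fails here
    { offset: 4, bytes: Uint8Array.of(5) },    // never reached
  ]);
} catch (e) {
  console.log(e.message); // out of bounds data segment
}

// The first segment's bytes stay written.
console.log(memory.join(',')); // 1,2,0,0,0,0,0,0
```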

memory.init instruction

The memory.init instruction copies data from a given passive segment into a target memory. The target memory and source segment are given as immediates.

The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:

  • top-2: destination address
  • top-1: offset into the source segment
  • top-0: size of memory region in bytes

It is a validation error to use memory.init with an out-of-bounds segment index.

A trap occurs if:

  • the source offset plus size is greater than the length of the source data segment; this includes the case that the segment has been dropped via data.drop
  • the destination offset plus size is greater than the length of the target memory

The order of writing is unspecified, though this is currently unobservable.

Note that it is allowed to use memory.init on the same data segment more than once.

data.drop instruction

The data.drop instruction shrinks the size of the segment to zero. After a data segment has been dropped, it can still be used in a memory.init instruction, but only a zero-length access at offset zero will not trap. This instruction is intended to be used as an optimization hint to the WebAssembly implementation. After a data segment is dropped its data can no longer be retrieved, so the memory used by this segment may be freed.

It is a validation error to use data.drop with an out-of-bounds segment index.
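
The interaction between memory.init and data.drop can be sketched with a small model. The DataSegment class and memoryInit function are hypothetical names; addresses are plain JavaScript numbers, so i32 wraparound is not modeled, and a thrown RangeError stands in for a trap.

```javascript
// Hypothetical model of a passive data segment.
class DataSegment {
  constructor(bytes) {
    this.bytes = Uint8Array.from(bytes);
  }
  // data.drop: shrink the segment to size zero.
  drop() {
    this.bytes = new Uint8Array(0);
  }
}

// memory.init: operands in stack order (destination, source offset, size).
function memoryInit(memory, segment, dst, src, size) {
  if (src + size > segment.bytes.length || dst + size > memory.length) {
    throw new RangeError('out of bounds memory access');
  }
  memory.set(segment.bytes.subarray(src, src + size), dst);
}

const memory = new Uint8Array(32);
const segment = new DataSegment([1, 2, 3, 4, 5, 6, 7]);

memoryInit(memory, segment, 16, 0, 7); // copy the whole segment to address 16
memoryInit(memory, segment, 0, 4, 3);  // the same segment can be used again
segment.drop();
memoryInit(memory, segment, 0, 0, 0);  // zero-length access at offset zero: ok
// memoryInit(memory, segment, 0, 0, 1) would now throw.
```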

memory.copy instruction

Copy data from a source memory region to a destination region. The regions are said to overlap if they are in the same memory and the start address of one region is one of the addresses that's read or written (by the copy operation) in the other region.

This instruction has two immediate arguments: the source and destination memory indices. They currently both must be zero.

Copying takes place as if an intermediate buffer were used, allowing the destination and source to overlap.

The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:

  • top-2: destination address
  • top-1: source address
  • top-0: size of memory region in bytes

A trap occurs if:

  • the source offset plus size is greater than the length of the source memory
  • the destination offset plus size is greater than the length of the target memory

The bounds check is performed before any data are written.
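
These semantics can be modeled on a Uint8Array. The sketch below uses a hypothetical memoryCopy function and relies on TypedArray.prototype.copyWithin, which handles overlapping ranges as if an intermediate buffer were used; i32 wraparound is not modeled, and a thrown RangeError stands in for a trap.

```javascript
// memory.copy: operands in stack order (destination, source, size).
function memoryCopy(memory, dst, src, size) {
  // The bounds check happens before any byte is written.
  if (src + size > memory.length || dst + size > memory.length) {
    throw new RangeError('out of bounds memory access');
  }
  memory.copyWithin(dst, src, src + size);
}

const memory = Uint8Array.of(1, 2, 3, 4, 5, 0, 0, 0);
memoryCopy(memory, 2, 0, 5);   // source and destination overlap
console.log(memory.join(',')); // 1,2,1,2,3,4,5,0
```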

memory.fill instruction

Set all bytes in a memory region to a given byte. This instruction has an immediate argument of which memory to operate on, and it must be zero for now.

The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:

  • top-2: destination address
  • top-1: byte value to set
  • top-0: size of memory region in bytes

A trap occurs if:

  • the destination offset plus size is greater than the length of the target memory

The bounds check is performed before any data are written.
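
A corresponding sketch for memory.fill follows. The memoryFill name is hypothetical, and the model assumes that only the low 8 bits of the i32 value operand are stored.

```javascript
// memory.fill: operands in stack order (destination, byte value, size).
function memoryFill(memory, dst, value, size) {
  // The bounds check happens before any byte is written.
  if (dst + size > memory.length) {
    throw new RangeError('out of bounds memory access');
  }
  // Assumption: the value operand is truncated to a single byte.
  memory.fill(value & 0xff, dst, dst + size);
}

const memory = new Uint8Array(6);
memoryFill(memory, 1, 0x1ff, 3);
console.log(memory.join(',')); // 0,255,255,255,0,0
```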

table.init, elem.drop, and table.copy instructions

The table.* instructions behave similarly to the memory.* instructions, with the difference that they operate on element segments and tables, instead of data segments and memories. The offset and length operands of table.init and table.copy have element units instead of bytes as well.

Passive Segment Initialization Example

Consider a module with two data segments: the first is always active, and the second is conditionally applied only if global 0 has a non-zero value. This could be implemented as follows:

(import "a" "global" (global i32))  ;; global 0
(memory 1)
(data (i32.const 0) "hello")   ;; data segment 0, is active so always copied
(data "goodbye")               ;; data segment 1, is passive

(func $start
  (if (global.get 0)
    (then
      ;; copy data segment 1 into memory 0 (the 0 is implicit)
      (memory.init 1
        (i32.const 16)    ;; target offset
        (i32.const 0)     ;; source offset
        (i32.const 7))    ;; length

      ;; The memory used by this segment is no longer needed, so this segment can
      ;; be dropped.
      (data.drop 1)))
)
(start $start)

Instruction encoding

All bulk memory instructions are encoded as a 0xfc prefix byte, followed by another opcode, optionally followed by more immediates:

instr ::= ...
        | 0xfc operation:uint8 ...

| Name | Opcode | Immediate | Description |
| --- | --- | --- | --- |
| memory.init | 0xfc 0x08 | segment:varuint32, memory:0x00 | copy from a passive data segment to linear memory |
| data.drop | 0xfc 0x09 | segment:varuint32 | prevent further use of passive data segment |
| memory.copy | 0xfc 0x0a | memory_dst:0x00 memory_src:0x00 | copy from one region of linear memory to another region |
| memory.fill | 0xfc 0x0b | memory:0x00 | fill a region of linear memory with a given byte value |
| table.init | 0xfc 0x0c | segment:varuint32, table:0x00 | copy from a passive element segment to a table |
| elem.drop | 0xfc 0x0d | segment:varuint32 | prevent further use of a passive element segment |
| table.copy | 0xfc 0x0e | table_dst:0x00 table_src:0x00 | copy from one region of a table to another region |
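
For illustration, the memory instructions could be emitted as byte sequences like this. The encoder function names are hypothetical, and varuint32 is assumed to be unsigned LEB128, as in the rest of the binary format.

```javascript
// Unsigned LEB128 encoding, used for varuint32 immediates.
function encodeVaruint32(value) {
  const bytes = [];
  do {
    let byte = value & 0x7f;
    value >>>= 7;
    if (value !== 0) byte |= 0x80; // more bytes follow
    bytes.push(byte);
  } while (value !== 0);
  return bytes;
}

// Hypothetical encoders: 0xfc prefix, opcode, then the immediates.
const encodeMemoryInit = (segment) =>
  [0xfc, 0x08, ...encodeVaruint32(segment), 0x00];
const encodeDataDrop = (segment) =>
  [0xfc, 0x09, ...encodeVaruint32(segment)];
const encodeMemoryCopy = () => [0xfc, 0x0a, 0x00, 0x00];
const encodeMemoryFill = () => [0xfc, 0x0b, 0x00];

console.log(encodeMemoryInit(1).join(',')); // 252,8,1,0
console.log(encodeDataDrop(300).join(','));  // 252,9,172,2
```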

DataCount section

The WebAssembly binary format is designed to be validated in a single pass. If a section requires information to validate, it is guaranteed that this information will be present in a previous section.

The memory.init and data.drop instructions break this guarantee. Both of these instructions are used in the Code section. They each have a data segment index immediate, but the vector of data segments is not available until the Data section is parsed, which occurs after the Code section.

To keep single-pass validation, the number of data segments defined in the Data section must be available before the Code section. This information is provided in a new DataCount section with the code 12.

Like all sections, the DataCount section is optional. If present, it must appear in the following order:

| Section Name | Code | Description |
| --- | --- | --- |
| Type | 1 | Function signature declarations |
| Import | 2 | Import declarations |
| Function | 3 | Function declarations |
| Table | 4 | Indirect function table and other tables |
| Memory | 5 | Memory attributes |
| Global | 6 | Global declarations |
| Export | 7 | Exports |
| Start | 8 | Start function declaration |
| Element | 9 | Elements section |
| DataCount | 12 | Data segment count |
| Code | 10 | Function bodies (code) |
| Data | 11 | Data segments |

The DataCount section has just one field that specifies the number of data segments in the Data section:

| Field | Type | Description |
| --- | --- | --- |
| count | varuint32 | count of data segments in Data section |

The binary is malformed if count is not equal to the number of data segments in the Data section. The binary is also malformed if the DataCount section is omitted and a memory.init or data.drop instruction is used.
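
The ordering and malformedness rules above can be expressed as two small checks. This is a sketch under stated assumptions: both function names are hypothetical, custom sections (id 0) are simply skipped, and a missing DataCount section is represented by undefined.

```javascript
// Section ids in the required order; DataCount (12) sits before Code (10).
const SECTION_ORDER = [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 10, 11];

// Returns true if the given section ids appear in the required order.
function sectionsAreOrdered(ids) {
  let last = -1;
  for (const id of ids) {
    if (id === 0) continue; // assumption: custom sections may appear anywhere
    const pos = SECTION_ORDER.indexOf(id);
    if (pos <= last) return false; // unknown, repeated, or out of order
    last = pos;
  }
  return true;
}

// dataCount is undefined when the DataCount section is omitted.
function dataCountIsWellFormed(dataCount, numDataSegments, usesInitOrDrop) {
  if (dataCount === undefined) return !usesInitOrDrop;
  return dataCount === numDataSegments;
}

console.log(sectionsAreOrdered([1, 3, 12, 10, 11]));  // true
console.log(sectionsAreOrdered([1, 3, 10, 12, 11]));  // false
console.log(dataCountIsWellFormed(undefined, 2, true)); // false
```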