Bulk Memory Operations and Conditional Segment Initialization

Motivation for Bulk Memory Operations

Some people have mentioned that memcpy and memmove functions are hot when profiling some WebAssembly benchmarks. Some examples:

https://github.com/WebAssembly/design/issues/236#issuecomment-283279499

I've been looking at perf profiles for wasm unity benchmark a bit recently and see that some of the hottest functions are doing memcpy or memset like things. If this is any indication of normal wasm code patterns, I think we could see significant improvement with an intrinsic so it may be worth prioritizing.

https://github.com/WebAssembly/design/issues/977#issue-204960079

In a number of game engines I've been optimizing and benchmarking, interestingly the performance of memcpy() does show up relatively high in profiles. (~2%-5% of total execution time)

Bulk Memory Operations Prototype

I implemented a prototype implementation of a memory.copy instruction in v8 which just calls out to v8‘s MemMove function. I compared this to an implementation generated by emscripten and currently used in the Unity demo. This implementation aligns then performs copies using i32.load and i32.store. I’ve also included performance achieved by unrolling this loop manually and increasing the size to i64.

Each test copies size bytes from one address to another, non-overlapping. This is repeated N times. Each row copies a total of 1 Gib of data, and only touches 1 Mib of memory in the source and destination ranges.

This is the core loop:

  let mask = Mib - 1;
  let start = performance.now();
  for (let i = 0; i < N; ++i) {
    f(dst_base + dst, src_base + src, size);
    dst = (dst + size) & mask;
    src = (src + size) & mask;
  }
  let end = performance.now();

The code for the benchmark can be found here. Note that this will not run properly without a WebAssembly implementation of memory.copy. For my tests, I hacked a version of v8 to replace any exported function called memcpy or memmove with a new function with the following contents:

(func (param $dst i32) (param $src i32) (param $size i32) (result i32)
  local.get $dst
  local.get $src
  local.get $size
  memory.copy
  local.get $dst)

Here are the results on my machine (x86_64, 2.9GHz, L1 32k, L2 256k, L3 256k):

	intrinsic	i64 load/store x 4	i64 load/store x 2	i32 load/store x 2	i32 load/store
size=32b, N=33554432	1.382 Gib/s	1.565 Gib/s	1.493 Gib/s	1.275 Gib/s	1.166 Gib/s
size=64b, N=16777216	3.285 Gib/s	2.669 Gib/s	2.383 Gib/s	1.861 Gib/s	1.639 Gib/s
size=128b, N=8388608	6.162 Gib/s	3.993 Gib/s	3.480 Gib/s	2.433 Gib/s	2.060 Gib/s
size=256b, N=4194304	9.939 Gib/s	5.323 Gib/s	4.462 Gib/s	2.724 Gib/s	2.213 Gib/s
size=512b, N=2097152	15.777 Gib/s	6.377 Gib/s	4.913 Gib/s	3.231 Gib/s	2.457 Gib/s
size=1.0Kib, N=1048576	17.902 Gib/s	7.010 Gib/s	6.112 Gib/s	3.568 Gib/s	2.614 Gib/s
size=2.0Kib, N=524288	19.870 Gib/s	8.248 Gib/s	6.915 Gib/s	3.764 Gib/s	2.699 Gib/s
size=4.0Kib, N=262144	20.940 Gib/s	9.145 Gib/s	7.400 Gib/s	3.871 Gib/s	2.729 Gib/s
size=8.0Kib, N=131072	21.162 Gib/s	9.258 Gib/s	7.672 Gib/s	3.925 Gib/s	2.763 Gib/s
size=16.0Kib, N=65536	20.991 Gib/s	9.758 Gib/s	7.756 Gib/s	3.945 Gib/s	2.773 Gib/s
size=32.0Kib, N=32768	22.504 Gib/s	9.956 Gib/s	7.861 Gib/s	3.966 Gib/s	2.780 Gib/s
size=64.0Kib, N=16384	22.534 Gib/s	10.088 Gib/s	7.931 Gib/s	3.974 Gib/s	2.782 Gib/s
size=128.0Kib, N=8192	29.728 Gib/s	10.032 Gib/s	7.934 Gib/s	3.975 Gib/s	2.782 Gib/s
size=256.0Kib, N=4096	29.742 Gib/s	10.116 Gib/s	7.625 Gib/s	3.886 Gib/s	2.781 Gib/s
size=512.0Kib, N=2048	29.994 Gib/s	10.090 Gib/s	7.627 Gib/s	3.985 Gib/s	2.785 Gib/s
size=1.0Mib, N=1024	11.760 Gib/s	10.091 Gib/s	7.959 Gib/s	3.989 Gib/s	2.787 Gib/s

Motivation for Conditional Segment Initialization

Under the current threading proposal, to share a module between multiple agents, the module must be instantiated multiple times: once per agent. Instantiation initializes linear memory with the contents in the module's data segments. If the memory is shared between multiple agents, it will be initialized multiple times, potentially overwriting stores that occurred after the previous initializations.

For example:

;; The module.
(module
  (memory (export "memory") 1)

  ;; Some value used as a counter.
  (data (i32.const 0) "\0")

  ;; Add one to the counter.
  (func (export "addOne")
    (i32.store8
      (i32.const 0)
      (i32.add
        (i32.load8_u (i32.const 0))
        (i32.const 1)))
  )
)

// main.js
let moduleBytes = ...;

WebAssembly.instantiate(moduleBytes).then(
  ({module, instance}) => {
    // Increment our counter.
    instance.exports.addOne();

    // Spawn a new Worker.
    let worker = new Worker('worker.js');

    // Send the module to the new Worker.
    worker.postMessage(module);
  });

// worker.js

function onmessage(event) {
  let module = event.data;

  // Use the module to create another instance.
  WebAssembly.instantiate(module).then(
    (instance) => {
      // Oops, our counter has been clobbered.
    });
}

This can be worked around by storing the data segments in a separate module which is only instantiated once, then exporting this memory to be used by another module that contains only code. This works, but it cumbersome since it requires two modules where one should be enough.

Motivation for combining Bulk Memory Operations + Conditional Segment Initialization Proposals

When discussing the design of Conditional Segment Initialization, we found that programmatic memory initialization from a read-only data segment (via the memory.init instruction, described below) has similar behavior to the proposed instruction to copy memory regions from linear memory (memory.copy, also described below.)

Design

Copying between regions in linear memory or a table is accomplished with the new *.copy instructions:

memory.copy: copy from one region of linear memory to another
table.copy: copy from one region of a table to another

Filling a memory region can be accomplished with memory.fill:

memory.fill: fill a region of linear memory with a given byte value

The binary format for the data section currently has a collection of segments, each of which has a memory index, an initializer expression for its offset, and its raw data.

Since WebAssembly currently does not allow for multiple memories, the memory index of each segment must be zero. We can repurpose this 32-bit integer as a flags field where new meaning is attached to nonzero values.

When the low bit of the new flags field is 1, this segment is passive. A passive segment will not be automatically copied into the memory or table on instantiation, and must instead be applied manually using the following new instructions:

memory.init: copy a region from a data segment
table.init: copy a region from an element segment

A passive segment has no initializer expression, since it will be specified as an operand to memory.init or table.init.

Segments can also be shrunk to size zero by using the following new instructions:

data.drop: discard the data in an data segment
elem.drop: discard the data in an element segment

An active segment is equivalent to a passive segment, but with an implicit memory.init followed by a data.drop (or table.init followed by a elem.drop) that is prepended to the module's start function.

Additionally, the reference-types proposal introduces the notion of a function reference (a function whose address is a program value). To support this, element segments can have several encodings, and can also be used to forward-declare functions whose address will be taken; see below.

The reference-types proposal also introduces the bulk instructions table.fill and table.grow, both of which take a function reference as an initializer argument.

Data segments

The meaning of the bits of the flag field (a varuint32) for data segments is:

Bit	Meaning
0	0=is active, 1=is passive
1	if bit 0 clear: 0=memory 0, 1=has memory index

which yields this view, with the fields carried by each flag value:

Flags	Meaning	Memory index	Offset in memory	Count	Payload
0	Active		`init_expr`	`varuint32`	`u8`*
1	Passive			`varuint32`	`u8`*
2	Active with memory index	`varuint32`	`init_expr`	`varuint32`	`u8`*

All other flag values are illegal. At present the memory index must be zero, but the upcoming multi-memory proposal changes that.

Element segments

The meaning of the bits of the flag field (a varuint32) for element segments is:

Bit	Meaning
0	0=is active, 1=is passive
1	if bit 0 clear: 0=table 0, 1=has table index
	if bit 0 set: 0=active, 1=declared
2	0=carries indices; 1=carries elemexprs

which yields this view, with the fields carried by each flag value:

Flag	Meaning	Table index	Offset in table	Encoding	Count	Payload
0	Legacy active, funcref externval		`init_expr`		`varuint32`	`idx`*
1	Passive, externval			`extern_kind`	`varuint32`	`idx`*
2	Active, externval	`varuint32`	`init_expr`	`extern_kind`	`varuint32`	`idx`*
3	Declared, externval			`extern_kind`	`varuint32`	`idx`*
4	Legacy active, funcref elemexpr		`init_expr`		`varuint32`	`elem_expr`*
5	Passive, elemexpr			`elem_type`	`varuint32`	`elem_expr`*
6	Active, elemexpr	`varuint32`	`init_expr`	`elem_type`	`varuint32`	`elem_expr`*
7	Declared, elemexpr			`elem_type`	`varuint32`	`elem_expr`*

All other flag values are illegal. Note that the “declared” attribute is not used by this proposal, but is used by the reference-types proposal.

The extern_kind must be zero, signifying a function definition. An idx is a varuint32 that references an entity in the module, currently only its function table.

At present the table index must be zero, but the reference-types proposal introduces a notion of multiple tables.

An elem_expr is like an init_expr, but can only contain expressions of the following sequences:

Binary	Text	Description
`0xd0 0x0b`	`ref.null end`	Returns a null reference
`0xd2 varuint32 0x0b`	`ref.func $funcidx end`	Returns a reference to function `$funcidx`

Segment Initialization

In the MVP, segments are initialized during module instantiation. If any segment would be initialized out-of-bounds, then the memory or table instance is not modified.

This behavior is changed in the bulk memory proposal:

Each active segment is initialized in module-definition order. For each segment, if reading the source or writing the destination would go out of bounds, then instantiation fails at that point. Data that had already been written for previous (in-bounds) segments stays written.

`memory.init` instruction

The memory.init instruction copies data from a given passive segment into a target memory. The target memory and source segment are given as immediates.

The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:

top-2: destination address
top-1: offset into the source segment
top-0: size of memory region in bytes

It is a validation error to use memory.init with an out-of-bounds segment index.

A trap occurs if:

the source offset plus size is greater than the length of the source data segment; this includes the case that the segment has been dropped via data.drop
the destination offset plus size is greater than the length of the target memory

The order of writing is unspecified, though this is currently unobservable.

Note that it is allowed to use memory.init on the same data segment more than once.

`data.drop` instruction

The data.drop instruction shrinks the size of the segment to zero. After a data segment has been dropped, it can still be used in a memory.init instruction, but only a zero-length access at offset zero will not trap. This instruction is intended to be used as an optimization hint to the WebAssembly implementation. After a memory segment is dropped its data can no longer be retrieved, so the memory used by this segment may be freed.

It is a validation error to use data.drop with an out-of-bounds segment index.

`memory.copy` instruction

Copy data from a source memory region to destination region. The regions are said to overlap if they are in the same memory and the start address of one region is one of the addresses that's read or written (by the copy operation) in the other region.

This instruction has two immediate arguments: the source and destination memory indices. They currently both must be zero.

Copying takes place as if an intermediate buffer were used, allowing the destination and source to overlap.

The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:

top-2: destination address
top-1: source address
top-0: size of memory region in bytes

A trap occurs if:

the source offset plus size is greater than the length of the source memory
the destination offset plus size is greater than the length of the target memory

The bounds check is performed before any data are written.

`memory.fill` instruction

Set all bytes in a memory region to a given byte. This instruction has an immediate argument of which memory to operate on, and it must be zero for now.

The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:

top-2: destination address
top-1: byte value to set
top-0: size of memory region in bytes

A trap occurs if:

the destination offset plus size is greater than the length of the target memory

The bounds check is performed before any data are written.

`table.init`, `elem.drop`, and `table.copy` instructions

The table.* instructions behave similarly to the memory.* instructions, with the difference that they operate on element segments and tables, instead of data segments and memories. The offset and length operands of table.init and table.copy have element units instead of bytes as well.

Passive Segment Initialization Example

Consider if there are two data sections, the first is always active and the second is conditionally active if global 0 has a non-zero value. This could be implemented as follows:

(import "a" "global" (global i32))  ;; global 0
(memory 1)
(data (i32.const 0) "hello")   ;; data segment 0, is active so always copied
(data "goodbye")               ;; data segment 1, is passive

(func $start
  (if (global.get 0)

    ;; copy data segment 1 into memory 0 (the 0 is implicit)
    (memory.init 1
      (i32.const 16)    ;; target offset
      (i32.const 0)     ;; source offset
      (i32.const 7))    ;; length

    ;; The memory used by this segment is no longer needed, so this segment can
    ;; be dropped.
    (data.drop 1))
)
(start $start)

Instruction encoding

All bulk memory instructions are encoded as a 0xfc prefix byte, followed by another opcode, optionally followed by more immediates:

instr ::= ...
        | 0xfc operation:uint8 ...

Name	Opcode	Immediate	Description
`memory.init`	`0xfc 0x08`	`segment:varuint32`, `memory:0x00`	copy from a passive data segment to linear memory
`data.drop`	`0xfc 0x09`	`segment:varuint32`	prevent further use of passive data segment
`memory.copy`	`0xfc 0x0a`	`memory_dst:0x00` `memory_src:0x00`	copy from one region of linear memory to another region
`memory.fill`	`0xfc 0x0b`	`memory:0x00`	fill a region of linear memory with a given byte value
`table.init`	`0xfc 0x0c`	`segment:varuint32`, `table:0x00`	copy from a passive element segment to a table
`elem.drop`	`0xfc 0x0d`	`segment:varuint32`	prevent further use of a passive element segment
`table.copy`	`0xfc 0x0e`	`table_dst:0x00` `table_src:0x00`	copy from one region of a table to another region

`DataCount` section

The WebAssembly binary format is designed to be validated in a single pass. If a section requires information to validate, it is guaranteed that this information will be present in a previous section.

The memory.{init,drop} instructions break this guarantee. Both of these instructions are used in the Code section. They each have a data segment index immediate, but the vector of data segments is not available until the Data section is parsed, which occurs after the Code section.

To keep single-pass validation, the number of data segments defined in the Data section must be available before the Code section. This information is provided in a new DataCount section with the code 12.

Like all sections, the DataCount section is optional. If present, it must appear in the following order:

Section Name	Code	Description
Type	`1`	Function signature declarations
Import	`2`	Import declarations
Function	`3`	Function declarations
Table	`4`	Indirect function table and other tables
Memory	`5`	Memory attributes
Global	`6`	Global declarations
Export	`7`	Exports
Start	`8`	Start function declaration
Element	`9`	Elements section
DataCount	`12`	Data segment count
Code	`10`	Function bodies (code)
Data	`11`	Data segments

The DataCount section has just one field that specifies the number of data segments in the Data section:

Field	Type	Description
count	`varuint32`	count of data segments in `Data` section

The binary is malformed if count is not equal to the number of data segments in the Data section. The binary is also malformed if the DataCount section is omitted and a memory.init or data.drop instruction is used.

Bulk Memory Operations and Conditional Segment Initialization

Motivation for Bulk Memory Operations

Bulk Memory Operations Prototype

Motivation for Conditional Segment Initialization

Motivation for combining Bulk Memory Operations + Conditional Segment Initialization Proposals

Design

Data segments

Element segments

Segment Initialization

memory.init instruction

data.drop instruction

memory.copy instruction

memory.fill instruction

table.init, elem.drop, and table.copy instructions

Passive Segment Initialization Example

Instruction encoding

DataCount section

`memory.init` instruction

`data.drop` instruction

`memory.copy` instruction

`memory.fill` instruction

`table.init`, `elem.drop`, and `table.copy` instructions

`DataCount` section