Some people have mentioned that memcpy and memmove functions are hot when profiling some WebAssembly benchmarks. Some examples:
I've been looking at perf profiles for wasm unity benchmark a bit recently and see that some of the hottest functions are doing memcpy or memset like things. If this is any indication of normal wasm code patterns, I think we could see significant improvement with an intrinsic so it may be worth prioritizing.
In a number of game engines I've been optimizing and benchmarking, interestingly the performance of memcpy() does show up relatively high in profiles. (~2%-5% of total execution time)
I implemented a prototype implementation of a memory.copy instruction in v8 which just calls out to v8‘s MemMove function. I compared this to an implementation generated by emscripten and currently used in the Unity demo. This implementation aligns then performs copies using i32.load and i32.store. I’ve also included performance achieved by unrolling this loop manually and increasing the size to i64.
Each test copies size bytes from one address to another, non-overlapping. This is repeated N times. Each row copies a total of 1 Gib of data, and only touches 1 Mib of memory in the source and destination ranges.
This is the core loop:
let mask = Mib - 1; let start = performance.now(); for (let i = 0; i < N; ++i) { f(dst_base + dst, src_base + src, size); dst = (dst + size) & mask; src = (src + size) & mask; } let end = performance.now();
The code for the benchmark can be found here. Note that this will not run properly without a WebAssembly implementation of memory.copy. For my tests, I hacked a version of v8 to replace any exported function called memcpy or memmove with a new function with the following contents:
(func (param $dst i32) (param $src i32) (param $size i32) (result i32) local.get $dst local.get $src local.get $size memory.copy local.get $dst)
Here are the results on my machine (x86_64, 2.9GHz, L1 32k, L2 256k, L3 256k):
| intrinsic | i64 load/store x 4 | i64 load/store x 2 | i32 load/store x 2 | i32 load/store | |
|---|---|---|---|---|---|
| size=32b, N=33554432 | 1.382 Gib/s | 1.565 Gib/s | 1.493 Gib/s | 1.275 Gib/s | 1.166 Gib/s |
| size=64b, N=16777216 | 3.285 Gib/s | 2.669 Gib/s | 2.383 Gib/s | 1.861 Gib/s | 1.639 Gib/s |
| size=128b, N=8388608 | 6.162 Gib/s | 3.993 Gib/s | 3.480 Gib/s | 2.433 Gib/s | 2.060 Gib/s |
| size=256b, N=4194304 | 9.939 Gib/s | 5.323 Gib/s | 4.462 Gib/s | 2.724 Gib/s | 2.213 Gib/s |
| size=512b, N=2097152 | 15.777 Gib/s | 6.377 Gib/s | 4.913 Gib/s | 3.231 Gib/s | 2.457 Gib/s |
| size=1.0Kib, N=1048576 | 17.902 Gib/s | 7.010 Gib/s | 6.112 Gib/s | 3.568 Gib/s | 2.614 Gib/s |
| size=2.0Kib, N=524288 | 19.870 Gib/s | 8.248 Gib/s | 6.915 Gib/s | 3.764 Gib/s | 2.699 Gib/s |
| size=4.0Kib, N=262144 | 20.940 Gib/s | 9.145 Gib/s | 7.400 Gib/s | 3.871 Gib/s | 2.729 Gib/s |
| size=8.0Kib, N=131072 | 21.162 Gib/s | 9.258 Gib/s | 7.672 Gib/s | 3.925 Gib/s | 2.763 Gib/s |
| size=16.0Kib, N=65536 | 20.991 Gib/s | 9.758 Gib/s | 7.756 Gib/s | 3.945 Gib/s | 2.773 Gib/s |
| size=32.0Kib, N=32768 | 22.504 Gib/s | 9.956 Gib/s | 7.861 Gib/s | 3.966 Gib/s | 2.780 Gib/s |
| size=64.0Kib, N=16384 | 22.534 Gib/s | 10.088 Gib/s | 7.931 Gib/s | 3.974 Gib/s | 2.782 Gib/s |
| size=128.0Kib, N=8192 | 29.728 Gib/s | 10.032 Gib/s | 7.934 Gib/s | 3.975 Gib/s | 2.782 Gib/s |
| size=256.0Kib, N=4096 | 29.742 Gib/s | 10.116 Gib/s | 7.625 Gib/s | 3.886 Gib/s | 2.781 Gib/s |
| size=512.0Kib, N=2048 | 29.994 Gib/s | 10.090 Gib/s | 7.627 Gib/s | 3.985 Gib/s | 2.785 Gib/s |
| size=1.0Mib, N=1024 | 11.760 Gib/s | 10.091 Gib/s | 7.959 Gib/s | 3.989 Gib/s | 2.787 Gib/s |
Under the current threading proposal, to share a module between multiple agents, the module must be instantiated multiple times: once per agent. Instantiation initializes linear memory with the contents in the module's data segments. If the memory is shared between multiple agents, it will be initialized multiple times, potentially overwriting stores that occurred after the previous initializations.
For example:
;; The module. (module (memory (export "memory") 1) ;; Some value used as a counter. (data (i32.const 0) "\0") ;; Add one to the counter. (func (export "addOne") (i32.store8 (i32.const 0) (i32.add (i32.load8_u (i32.const 0)) (i32.const 1))) ) )
// main.js let moduleBytes = ...; WebAssembly.instantiate(moduleBytes).then( ({module, instance}) => { // Increment our counter. instance.exports.addOne(); // Spawn a new Worker. let worker = new Worker('worker.js'); // Send the module to the new Worker. worker.postMessage(module); }); // worker.js function onmessage(event) { let module = event.data; // Use the module to create another instance. WebAssembly.instantiate(module).then( (instance) => { // Oops, our counter has been clobbered. }); }
This can be worked around by storing the data segments in a separate module which is only instantiated once, then exporting this memory to be used by another module that contains only code. This works, but it cumbersome since it requires two modules where one should be enough.
When discussing the design of Conditional Segment Initialization, we found that programmatic memory initialization from a read-only data segment (via the memory.init instruction, described below) has similar behavior to the proposed instruction to copy memory regions from linear memory (memory.copy, also described below.)
Copying between regions in linear memory or a table is accomplished with the new *.copy instructions:
memory.copy: copy from one region of linear memory to anothertable.copy: copy from one region of a table to anotherFilling a memory region can be accomplished with memory.fill:
memory.fill: fill a region of linear memory with a given byte valueThe binary format for the data section currently has a collection of segments, each of which has a memory index, an initializer expression for its offset, and its raw data.
Since WebAssembly currently does not allow for multiple memories, the memory index of each segment must be zero. We can repurpose this 32-bit integer as a flags field where new meaning is attached to nonzero values.
When the low bit of the new flags field is 1, this segment is passive. A passive segment will not be automatically copied into the memory or table on instantiation, and must instead be applied manually using the following new instructions:
memory.init: copy a region from a data segmenttable.init: copy a region from an element segmentA passive segment has no initializer expression, since it will be specified as an operand to memory.init or table.init.
Segments can also be shrunk to size zero by using the following new instructions:
data.drop: discard the data in an data segmentelem.drop: discard the data in an element segmentAn active segment is equivalent to a passive segment, but with an implicit memory.init followed by a data.drop (or table.init followed by a elem.drop) that is prepended to the module's start function.
Additionally, the reference-types proposal introduces the notion of a function reference (a function whose address is a program value). To support this, element segments can have several encodings, and can also be used to forward-declare functions whose address will be taken; see below.
The reference-types proposal also introduces the bulk instructions table.fill and table.grow, both of which take a function reference as an initializer argument.
The meaning of the bits of the flag field (a varuint32) for data segments is:
| Bit | Meaning |
|---|---|
| 0 | 0=is active, 1=is passive |
| 1 | if bit 0 clear: 0=memory 0, 1=has memory index |
which yields this view, with the fields carried by each flag value:
| Flags | Meaning | Memory index | Offset in memory | Count | Payload |
|---|---|---|---|---|---|
| 0 | Active | init_expr | varuint32 | u8* | |
| 1 | Passive | varuint32 | u8* | ||
| 2 | Active with memory index | varuint32 | init_expr | varuint32 | u8* |
All other flag values are illegal. At present the memory index must be zero, but the upcoming multi-memory proposal changes that.
The meaning of the bits of the flag field (a varuint32) for element segments is:
| Bit | Meaning |
|---|---|
| 0 | 0=is active, 1=is passive |
| 1 | if bit 0 clear: 0=table 0, 1=has table index |
| if bit 0 set: 0=active, 1=declared | |
| 2 | 0=carries indices; 1=carries elemexprs |
which yields this view, with the fields carried by each flag value:
| Flag | Meaning | Table index | Offset in table | Encoding | Count | Payload |
|---|---|---|---|---|---|---|
| 0 | Legacy active, funcref externval | init_expr | varuint32 | idx* | ||
| 1 | Passive, externval | extern_kind | varuint32 | idx* | ||
| 2 | Active, externval | varuint32 | init_expr | extern_kind | varuint32 | idx* |
| 3 | Declared, externval | extern_kind | varuint32 | idx* | ||
| 4 | Legacy active, funcref elemexpr | init_expr | varuint32 | elem_expr* | ||
| 5 | Passive, elemexpr | elem_type | varuint32 | elem_expr* | ||
| 6 | Active, elemexpr | varuint32 | init_expr | elem_type | varuint32 | elem_expr* |
| 7 | Declared, elemexpr | elem_type | varuint32 | elem_expr* |
All other flag values are illegal. Note that the “declared” attribute is not used by this proposal, but is used by the reference-types proposal.
The extern_kind must be zero, signifying a function definition. An idx is a varuint32 that references an entity in the module, currently only its function table.
At present the table index must be zero, but the reference-types proposal introduces a notion of multiple tables.
An elem_expr is like an init_expr, but can only contain expressions of the following sequences:
| Binary | Text | Description |
|---|---|---|
0xd0 0x0b | ref.null end | Returns a null reference |
0xd2 varuint32 0x0b | ref.func $funcidx end | Returns a reference to function $funcidx |
In the MVP, segments are initialized during module instantiation. If any segment would be initialized out-of-bounds, then the memory or table instance is not modified.
This behavior is changed in the bulk memory proposal:
Each active segment is initialized in module-definition order. For each segment, if reading the source or writing the destination would go out of bounds, then instantiation fails at that point. Data that had already been written for previous (in-bounds) segments stays written.
memory.init instructionThe memory.init instruction copies data from a given passive segment into a target memory. The target memory and source segment are given as immediates.
The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:
It is a validation error to use memory.init with an out-of-bounds segment index.
A trap occurs if:
data.dropThe order of writing is unspecified, though this is currently unobservable.
Note that it is allowed to use memory.init on the same data segment more than once.
data.drop instructionThe data.drop instruction shrinks the size of the segment to zero. After a data segment has been dropped, it can still be used in a memory.init instruction, but only a zero-length access at offset zero will not trap. This instruction is intended to be used as an optimization hint to the WebAssembly implementation. After a memory segment is dropped its data can no longer be retrieved, so the memory used by this segment may be freed.
It is a validation error to use data.drop with an out-of-bounds segment index.
memory.copy instructionCopy data from a source memory region to destination region. The regions are said to overlap if they are in the same memory and the start address of one region is one of the addresses that's read or written (by the copy operation) in the other region.
This instruction has two immediate arguments: the source and destination memory indices. They currently both must be zero.
Copying takes place as if an intermediate buffer were used, allowing the destination and source to overlap.
The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:
A trap occurs if:
The bounds check is performed before any data are written.
memory.fill instructionSet all bytes in a memory region to a given byte. This instruction has an immediate argument of which memory to operate on, and it must be zero for now.
The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:
A trap occurs if:
The bounds check is performed before any data are written.
table.init, elem.drop, and table.copy instructionsThe table.* instructions behave similarly to the memory.* instructions, with the difference that they operate on element segments and tables, instead of data segments and memories. The offset and length operands of table.init and table.copy have element units instead of bytes as well.
Consider if there are two data sections, the first is always active and the second is conditionally active if global 0 has a non-zero value. This could be implemented as follows:
(import "a" "global" (global i32)) ;; global 0 (memory 1) (data (i32.const 0) "hello") ;; data segment 0, is active so always copied (data "goodbye") ;; data segment 1, is passive (func $start (if (global.get 0) ;; copy data segment 1 into memory 0 (the 0 is implicit) (memory.init 1 (i32.const 16) ;; target offset (i32.const 0) ;; source offset (i32.const 7)) ;; length ;; The memory used by this segment is no longer needed, so this segment can ;; be dropped. (data.drop 1)) ) (start $start)
All bulk memory instructions are encoded as a 0xfc prefix byte, followed by another opcode, optionally followed by more immediates:
instr ::= ...
| 0xfc operation:uint8 ...
| Name | Opcode | Immediate | Description |
|---|---|---|---|
memory.init | 0xfc 0x08 | segment:varuint32, memory:0x00 | copy from a passive data segment to linear memory |
data.drop | 0xfc 0x09 | segment:varuint32 | prevent further use of passive data segment |
memory.copy | 0xfc 0x0a | memory_dst:0x00 memory_src:0x00 | copy from one region of linear memory to another region |
memory.fill | 0xfc 0x0b | memory:0x00 | fill a region of linear memory with a given byte value |
table.init | 0xfc 0x0c | segment:varuint32, table:0x00 | copy from a passive element segment to a table |
elem.drop | 0xfc 0x0d | segment:varuint32 | prevent further use of a passive element segment |
table.copy | 0xfc 0x0e | table_dst:0x00 table_src:0x00 | copy from one region of a table to another region |
DataCount sectionThe WebAssembly binary format is designed to be validated in a single pass. If a section requires information to validate, it is guaranteed that this information will be present in a previous section.
The memory.{init,drop} instructions break this guarantee. Both of these instructions are used in the Code section. They each have a data segment index immediate, but the vector of data segments is not available until the Data section is parsed, which occurs after the Code section.
To keep single-pass validation, the number of data segments defined in the Data section must be available before the Code section. This information is provided in a new DataCount section with the code 12.
Like all sections, the DataCount section is optional. If present, it must appear in the following order:
| Section Name | Code | Description |
|---|---|---|
| Type | 1 | Function signature declarations |
| Import | 2 | Import declarations |
| Function | 3 | Function declarations |
| Table | 4 | Indirect function table and other tables |
| Memory | 5 | Memory attributes |
| Global | 6 | Global declarations |
| Export | 7 | Exports |
| Start | 8 | Start function declaration |
| Element | 9 | Elements section |
| DataCount | 12 | Data segment count |
| Code | 10 | Function bodies (code) |
| Data | 11 | Data segments |
The DataCount section has just one field that specifies the number of data segments in the Data section:
| Field | Type | Description |
|---|---|---|
| count | varuint32 | count of data segments in Data section |
The binary is malformed if count is not equal to the number of data segments in the Data section. The binary is also malformed if the DataCount section is omitted and a memory.init or data.drop instruction is used.