This document describes the proposal to add aligned instruction bundle support (“bundling” for short) to LLVM and its implementation in the MC module.
For the purpose of supporting the Software Fault Isolation (SFI) mechanisms required by Native Client, the following directives are added to the LLVM assembler:
.bundle_align_mode <num>
.bundle_lock <option>
.bundle_unlock
With the following semantics:
When aligned instruction bundle mode (“bundling” for short) is enabled (.bundle_align_mode was encountered with an argument > 0; the bundle size is 2 raised to the power of that argument), single instructions and groups of instructions between .bundle_lock and .bundle_unlock directives cannot cross a bundle boundary.
Furthermore, the .bundle_lock directive supports the align_to_end option, which means that the group has to end at a bundle boundary.
For example, consider the following:
.bundle_align_mode 4
mov1
mov2
mov3
Assuming that each of the mov instructions is 7 bytes long and mov1 is aligned to a 16-byte boundary, two bytes of NOP padding will be inserted between mov2 and mov3 to make sure that mov3 does not cross a 16-byte bundle boundary.
A slightly modified example:
.bundle_align_mode 4
mov1
.bundle_lock
mov2
mov3
.bundle_unlock
Here, since the bundle-locked sequence mov2 mov3 cannot cross a bundle boundary, 9 bytes of NOP padding will be inserted between mov1 and mov2.
An example to demonstrate the align_to_end option:
.bundle_align_mode 4
mov1
mov2
.bundle_lock align_to_end
mov3
mov4
.bundle_unlock
Normally, only two bytes of NOP padding would be required between mov2 and mov3 to ensure that the bundle-locked sequence does not cross a bundle boundary. However, since align_to_end was provided, an additional two bytes of NOP padding will be inserted so that the sequence ends at a boundary.
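The padding amounts in the three examples above can be reproduced with a small calculation. The following Python sketch is illustrative only and is not LLVM's code: the function name `bundle_padding` and its parameters are hypothetical, and it assumes that padding is always inserted immediately before the instruction or locked group and that the group is no larger than a bundle.

```python
def bundle_padding(offset, size, bundle_size, align_to_end=False):
    """Return the NOP bytes needed before a group of `size` bytes at `offset`.

    Hypothetical helper for illustration; assumes size <= bundle_size and
    that bundle_size is a power of two.
    """
    offset_in_bundle = offset % bundle_size
    end = offset_in_bundle + size  # end position relative to the bundle start
    if align_to_end:
        if end <= bundle_size:
            # Fits in the current bundle: push it so it ends at the boundary.
            return bundle_size - end
        # Does not fit: move to the next bundle and end at its boundary.
        return 2 * bundle_size - end
    if end > bundle_size:
        # Would cross the boundary: pad up to the start of the next bundle.
        return bundle_size - offset_in_bundle
    return 0


BUNDLE = 1 << 4  # .bundle_align_mode 4 gives 16-byte bundles

print(bundle_padding(14, 7, BUNDLE))         # mov3 after mov1 and mov2
print(bundle_padding(7, 14, BUNDLE))         # locked mov2 mov3 after mov1
print(bundle_padding(14, 14, BUNDLE, True))  # align_to_end mov3 mov4
```

With 7-byte instructions these calls give 2, 9 and 4 bytes respectively. The last value is the two bytes needed to avoid crossing a boundary plus the two additional bytes required by align_to_end.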
For information on how this ability is used for software fault isolation by Native Client, see the following resources:
As proposed, bundling is a feature of the assembler. Therefore, it is implemented in the MC module of LLVM. Specifically, the following parts are affected:
The following description will focus on the path: Text assembly -> ELF object streamer -> Assembly -> Object file emission. This path can be roughly divided into three stages:
MCSectionData) and fragments (MCFragment derivatives) that represent the input.
In order to implement bundling, we use the existing assembly parsing facilities in MC, adding support for the new directives. The existing section and fragment abstractions are used, with some flags added to keep the state of the bundling directives encountered. Specifically:
BundleAlignSize field is added to MCAssembler. When the .bundle_align_mode directive is parsed, this field is populated with the bundle alignment size (2 to the power of the argument of .bundle_align_mode). Setting this field is currently allowed once per assembly file. Subsequent .bundle_align_mode directives will be rejected as errors.
.bundle_lock and .bundle_unlock:
BundleLockState: keeps track of whether the currently parsed code is in a bundle-locked group (between .bundle_lock and .bundle_unlock directives) and whether the group has to be aligned to bundle end.
BundleGroupBeforeFirstInst: keeps track of whether the next instruction parsed will be the first in a bundle-locked group.
.bundle_lock is encountered
When bundling mode is turned on in the assembler (BundleAlignSize > 0), the following rules apply to emitting fragments:
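The directive bookkeeping described above (a bundle alignment size that can be set once, plus per-group lock state) can be sketched as a small state tracker. This is a minimal Python model under stated assumptions, not the MC implementation: the class `BundlingState`, its method names, and the enum member names are hypothetical, and only the behaviour stated in the text is modelled.

```python
from enum import Enum


class BundleLockState(Enum):
    NOT_LOCKED = 0
    LOCKED = 1
    LOCKED_ALIGN_TO_END = 2


class BundlingState:
    """Hypothetical stand-in for the state kept while parsing directives."""

    def __init__(self):
        self.bundle_align_size = 0  # 0 means bundling is disabled
        self.lock_state = BundleLockState.NOT_LOCKED
        self.group_before_first_inst = False

    def bundle_align_mode(self, power):
        if self.bundle_align_size != 0:
            # Only one .bundle_align_mode is accepted per assembly file.
            raise ValueError(".bundle_align_mode may be set only once")
        self.bundle_align_size = 1 << power if power > 0 else 0

    def bundle_lock(self, align_to_end=False):
        self.lock_state = (BundleLockState.LOCKED_ALIGN_TO_END
                           if align_to_end else BundleLockState.LOCKED)
        # The next instruction parsed will be the first one in the group.
        self.group_before_first_inst = True

    def instruction(self):
        # Once an instruction is seen, the group is no longer empty.
        self.group_before_first_inst = False

    def bundle_unlock(self):
        self.lock_state = BundleLockState.NOT_LOCKED
        self.group_before_first_inst = False
```

For example, calling `bundle_align_mode(4)` on a fresh `BundlingState` sets `bundle_align_size` to 16, and a second call raises an error, mirroring the once-per-file restriction.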
Bundling blends well into the existing layout mechanism in the MC assembler, since its effects are somewhat similar to relaxation. Some fragments may need to grow due to padding, which may require re-layout of subsequent fragments and recomputation of fixups. Therefore, the MC assembler employs an iterative layout algorithm. The following diagram will help explain the layout of fragments w.r.t. bundling:
hasInstructions is added to MCFragment
BundlePadding field of the fragment. This field is restricted to uint8_t, in order to save space in the mainstream case (bundling disabled). In practice, fragments of larger size will not be encountered (bundling is done at most on small groups of a few instructions).
fragment offset + fragment size = offset of next fragment
align_to_end option is provided to a bundle-locked group.
Note that since we create a single data fragment for a bundle-locked group, the above applies to such groups as well as single instructions.
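A single layout pass over fragments of fixed size can be sketched as follows. This Python model is an illustration rather than the MC layout code: the `Fragment` class, `required_padding`, and `layout` are hypothetical names, the iterative re-layout triggered by relaxation is not modelled, and it assumes that padding is placed before a fragment's contents and counted toward its laid-out size so that offset + size gives the next fragment's offset.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    size: int                     # encoded bytes, excluding padding
    has_instructions: bool = True
    align_to_end: bool = False
    offset: int = 0               # filled in by layout()
    bundle_padding: int = 0       # filled in by layout(), must fit in uint8_t


def required_padding(offset, frag, bundle_size):
    """Padding needed so that frag does not cross a bundle boundary."""
    offset_in_bundle = offset % bundle_size
    end = offset_in_bundle + frag.size
    if frag.align_to_end:
        return bundle_size - end if end <= bundle_size else 2 * bundle_size - end
    return bundle_size - offset_in_bundle if end > bundle_size else 0


def layout(fragments, bundle_size):
    """Assign offsets and padding in order; return the total laid-out size."""
    offset = 0
    for frag in fragments:
        frag.offset = offset
        pad = 0
        if bundle_size > 0 and frag.has_instructions:
            pad = required_padding(offset, frag, bundle_size)
        if pad > 0xFF:
            raise ValueError("padding does not fit in a uint8_t field")
        frag.bundle_padding = pad
        # Assumption: padding counts toward the fragment's laid-out size.
        offset += pad + frag.size
    return offset
```

Laying out three 7-byte fragments with a 16-byte bundle assigns a padding of 2 to the third fragment, while passing a bundle size of 0 (bundling disabled) leaves every padding at 0.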
Note: the write target of the assembler is abstracted into a stream object, which can also write into memory. “Object file” here implies this stream.
In the last step, the assembler writes the list of fragments into sections according to their layout order. In this step, the BundlePadding field created during layout is used to add NOP padding (by calling writeNopData on the target-specific assembler backend) of appropriate size before fragments that require it.
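The write step can be pictured with a short sketch. This Python model makes several assumptions: `write_nop_data` is a hypothetical stand-in for the backend's writeNopData, each fragment is reduced to a (padding, contents) pair, and padding is written as repeated single-byte x86 NOPs (0x90), whereas a real backend may choose longer NOP encodings.

```python
NOP = b"\x90"  # single-byte x86 NOP, used here only for illustration


def write_nop_data(count):
    """Hypothetical stand-in for the backend's writeNopData."""
    return NOP * count


def write_fragments(fragments):
    """Write (bundle_padding, contents) pairs in layout order."""
    out = bytearray()
    for bundle_padding, contents in fragments:
        out += write_nop_data(bundle_padding)  # padding precedes the fragment
        out += contents
    return bytes(out)


# Three 7-byte placeholder instructions; the third needs 2 bytes of padding.
image = write_fragments([(0, b"A" * 7), (0, b"B" * 7), (2, b"C" * 7)])
print(len(image), image.index(b"C"))
```

The third fragment's contents start at offset 16, so it no longer straddles the 16-byte boundary.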
It's interesting to study the performance impact of adding the bundling feature on the MC assembler's normal operation (without actual bundling directives).
Number of bytes needed for the various MCFragment objects on x86-64:
Explanation: the single-byte BundlePadding field in MCEncodedFragment was placed in space reserved for alignment earlier. The same was done for the boolean HasInstructions in MCDataFragment.
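The effect of placing small fields in alignment slack can be demonstrated with ctypes, which follows the platform's C struct layout rules. The field lists below are hypothetical and do not reflect the real MCFragment classes. They only show that a one-byte field and a boolean can fit into tail padding without growing the structure, assuming a typical x86-64 ABI.

```python
import ctypes


class Before(ctypes.Structure):
    # Hypothetical fields, not the real MCFragment layout.
    _fields_ = [("offset", ctypes.c_uint64),
                ("layout_order", ctypes.c_uint32)]


class After(ctypes.Structure):
    # Same fields plus the two small additions described in the text.
    _fields_ = [("offset", ctypes.c_uint64),
                ("layout_order", ctypes.c_uint32),
                ("bundle_padding", ctypes.c_uint8),
                ("has_instructions", ctypes.c_bool)]


print(ctypes.sizeof(Before), ctypes.sizeof(After))
```

On x86-64 the two sizes are equal, because the 8-byte alignment of the first field already forces four bytes of tail padding that the new fields can occupy.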
llvm-mc was run on a large assembly file (produced by compiling gcc 3.5 into a single assembly file) as follows:
sudo nice -n -20 perf stat -r 10 llvm-mc -filetype=obj gcc.s -o gcc.o
There was no noticeable difference in the runtime of llvm-mc with and without the bundling patch.
Alternative implementation
Bundling is similar to relaxation in many respects, which simplifies the implementation. While the performance impact of bundling is negligible, as shown, the interaction of these two mechanisms can have a negative effect on memory consumption. When bundling is used, most instructions need to be put in their own fragment. That's because before we have decided the sizes of jumps, we don't know the relative positions of other instructions, as we might need to insert bundle padding NOPs between them. The only exceptions are bundle-locked superinstruction sequences, which can be stored in a single fragment.
Since there's a memory overhead for storing a fragment, the use of bundling with relaxation significantly increases memory overhead. When translating a pexe, pnacl-llc uses K * nexe size memory. The experimental results show that the value of K is ~17, with a fixed cost of ~50MB. Given that, with the maximum pexe size currently limited to 64MB, pnacl-llc would need over 1.1GB of address space, which is significantly more than the 768MB that will be available when pnacl-llc starts using IRT.
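The estimate above follows from a short calculation. The snippet below only restates the figures quoted in the text, and it assumes, as the text does, that the 64MB limit is the size being multiplied by K.

```python
K = 17              # measured multiplier quoted in the text
FIXED_MB = 50       # measured fixed cost quoted in the text
MAX_SIZE_MB = 64    # current size limit quoted in the text
AVAILABLE_MB = 768  # address space available once IRT is used

estimate_mb = K * MAX_SIZE_MB + FIXED_MB
print(estimate_mb, round(estimate_mb / 1024, 2), estimate_mb > AVAILABLE_MB)
```

This gives 1138MB, or roughly 1.11GB, which exceeds the 768MB budget.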
Therefore, one way to reduce the memory consumption is to disable jump relaxation. In that case, we know the size of all instructions from the beginning and we can write the bundle padding NOPs directly into the fragments while emitting the instructions, thus reducing the number of fragments needed.
We have implemented the alternative bundle padding scheme in LLVM MC. Currently, this implementation is only used when the -mc-relax-all flag is passed. The bulk of the implementation is in the MCELFStreamer::mergeFragment method. We reuse the existing code and emit instructions into their own fragment. We also reuse the existing logic for calculating bundle boundaries and the necessary bundle padding. However, when jump relaxation is disabled, instead of adding each fragment to the list of fragments held by MCSectionData, we merge it with the current fragment.
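The idea of merging can be sketched with a toy model in which every instruction size is known up front, so padding is written straight into one growing buffer instead of being recorded on per-instruction fragments. This Python sketch is not the MCELFStreamer::mergeFragment code: `emit_merged` is a hypothetical name, instructions are placeholder bytes, padding uses single-byte x86 NOPs, and bundle-locked groups and align_to_end are left out.

```python
def emit_merged(inst_sizes, bundle_size):
    """Emit fixed-size instructions into one buffer, padding inline.

    The single buffer stands in for the one fragment that every
    instruction is merged into (hypothetical model).
    """
    buf = bytearray()
    for size in inst_sizes:
        offset_in_bundle = len(buf) % bundle_size
        if offset_in_bundle + size > bundle_size:
            # The instruction would cross a boundary: write NOPs inline.
            buf += b"\x90" * (bundle_size - offset_in_bundle)
        buf += b"I" * size  # placeholder instruction bytes
    return bytes(buf)


merged = emit_merged([7, 7, 7], 16)
print(len(merged), merged[14:16])
```

Three 7-byte instructions end up in a single 23-byte buffer with the two NOP bytes at offsets 14 and 15, rather than in three separate fragments.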