spectec/doc/Overview.md

Specifying Wasm with SpecTec

Motivation

A language specification is the contract between users and implementations of a language. Precision and correctness of this contract are critical when portability and safety matter. Many programming languages therefore have elaborate definitions.

Wasm goes further in that it incorporates a formal specification, meaning that it is defined with mathematical rigor, and its correctness is amenable to formal and computer-verified proof. At the same time, the Wasm specification is supposed to be accessible to a diverse audience that is not necessarily trained in formal semantics. Hence, the language standard also contains a prose specification, which is written in an algorithmic form of stylised English.

Both formulations are presented side by side in the Wasm standard, with the intention that they are equivalent. In practice, most of the prose is a manual transliteration of the formal rules into English. The downside of this approach is that producing the specification or any extension to it (such as feature proposals) requires duplicate work, in a process that is both laborious and error-prone. This isn't helped by the input format of the spec — reStructuredText and Latex, both of which are tedious to produce and challenging to review.

(Even more manual work is necessary to produce a version of the formal semantics and its correctness proofs that is understandable for computer-aided theorem provers. Such a so-called mechanisation of the Wasm language specification has been produced by the WasmCert project — in fact, two of them, for different theorem provers.)

SpecTec is an authoring tool for the Wasm spec that reduces the effort for producing these multiple artefacts. Instead of multiple formulations, spec authors maintain only a succinct single source of truth, while the other artefacts are automatically generated by SpecTec, including most of the English prose. At the same time, SpecTec can detect various common errors in the input, increasing the degree of trust in the artefacts' consistency and correctness.

SpecTec takes a specification of Wasm's syntax and semantics in its domain-specific language and from that can generate different outputs through multiple backends:

formal rules in Latex,
prose rules in reStructuredText,
(ongoing work) formulations in theorem provers such as Coq.

Additional backends are possible in the future, e.g., for generating the OCaml of the reference interpreter, or for generating .wast tests automatically derived from the specification.

Introduction

SpecTec is designed to closely resemble the formal style used in the Wasm specification — which in turn is closely based on pen-and-paper notation used widely in literature and textbooks on programming language semantics. It is a declarative formulation that is concise and precise, enabling fast turnaround and effective processing. The goal is to approximate a WYSIWYG user experience, where authors write formal rules (and reviewers see them) in the same notation as will later appear on screen, but in plain and readable (and diff'able) ASCII.

To that end, SpecTec builds in much of the meta-level notation used by the Wasm specification, such as basic arithmetics, meta-variables, the notion of records, iterated sequences. On the other hand, SpecTec does not hard-code any Wasm-level concepts (though its backends specialize for it to a varying degree). Instead, there are a few generic mechanisms for describing the language and its semantics:

Syntax definitions, which can capture abstract syntax of a language or auxiliary constructs;
Grammars, which express binary or text syntax and its translation into abstract syntax;
Relations and respective rules, which can specify typing, evaluation, or other predicates;
Functions, which enable auxiliary meta-level definitions.

This is best seen in an example.

NanoWasm in SpecTec

Let us consider a small, simplified excerpt from the Wasm specification.

Abstract Syntax

The following SpecTec snippet defines the abstract syntax of indices, types and instructions:

syntax localidx = nat
syntax globalidx = nat

syntax mut = MUT
syntax valtype = I32 | I64 | F32 | F64
syntax functype = valtype* -> valtype*
syntax globaltype = mut? valtype

syntax const = nat

syntax instr =
    | NOP
    | DROP
    | SELECT
    | CONST valtype const   hint(show %.CONST %)
    | LOCAL.GET localidx
    | LOCAL.SET localidx
    | GLOBAL.GET globalidx
    | GLOBAL.SET globalidx

First, this specifies the representation of local and global indices as natural numbers, expressed by the built-in syntax type nat.

It then defines the representation of a few simple types: A value type is one of four numeric types, represented by user-space keywords. Such keywords are expressed by so-called atoms written in all caps, distinguishing them from meta-level variables.

Function types consist of a sequence of value types for their parameters and another for their results. The arrow is just custom meta-level notation, that is, a symbolic atom that can be used freely in the definition of syntax. (However, unlike alpha-numeric atoms, the set of recognised symbolic atoms is fixed in SpecTec; it roughly consists of all the notational symbols that are used in the Wasm spec.)

Global types are value types that are optionally marked with a mutability attribute, expressed by another atom.

An instruction can be one of many forms, some of which have additional parameters representing the instruction's “immediates”. They are defined by giving their SpecTec syntax type.

These definitions closely reflect the abstract syntax given in Section 2 of the Wasm spec. And indeed, the Latex backend can generate that exact style of definitions from it, as we will see later.

The show hint in the definition of the CONST instruction customises how this particular case is rendered in Latex to match the concrete instruction syntax; each % will be substituted by the value of one of its immediates. This way, (CONST I32 7) renders as (I32.CONST 7), for example.

For simplicity, we assume that all NanoWasm constants are represented as natural numbers.

Validation

Next, we want to specify validation, that is, typing rules for instructions. For that purpose, we first need to define a suitable form of typing contexts. This again is a syntax definition, but this time describing a record for the different namespaces:

syntax context = { GLOBALS globaltype*, LOCALS valtype* }

var C : context

Each namespace simply contains a list of respective types, one for each known index in that space.

We further declare that the meta-variable named C has type context everywhere. Usually, the types of meta-variables can be inferred by SpecTec, but sometimes it is useful for documenting naming conventions and achieve additional error checking to declare them upfront. In this particular case, the declaration also is necessary because an upper-case identifier like C would otherwise be parsed as an atom by SpecTec, not a variable.

Typing itself is given by a relation that we name Instr_ok, and which is defined over a triple of a context, instruction, and a function type describing the instruction‘s effect on Wasm’s operand stack. We can declare this relation and its notation as follows:

relation Instr_ok: context |- instr : functype

The stylized turnstile and colon used as separators are again just atoms, i.e., user-defined notation within SpecTec, chosen to make the relation more readable and reminiscent of standard paper notation.

The relation is then defined by individual rules, each one describing how to type one individual instruction:

rule Instr_ok/nop:
    C |- NOP : eps -> eps

rule Instr_ok/drop:
    C |- DROP : t -> eps

rule Instr_ok/select:
    C |- SELECT : t t I32 -> t

rule Instr_ok/const:
    C |- CONST t c : eps -> t

rule Instr_ok/local.get:
    C |- LOCAL.GET x : eps -> t
    -- if C.LOCALS[x] = t

rule Instr_ok/local.set:
    C |- LOCAL.SET x : t -> eps
    -- if C.LOCALS[x] = t

rule Instr_ok/global.get:
    C |- GLOBAL.GET x : eps -> t
    -- if C.GLOBALS[x] = MUT? t

rule Instr_ok/global.set:
    C |- GLOBAL.GET x : t -> eps
    -- if C.GLOBALS[x] = MUT t

Each rule definition is headed by the name of its respective relation, plus a symbolic name for the rule itself, used later to reference individual rules. The notation of a rule's body has to match the notation declared for the relation.

The rule for NOP specifies that both input and output types of the instruction are empty — eps is a SpecTec keyword representing the empty sequence (short for “epsilon” and rendered later as ε).

The DROP instruction consumes a single value of arbitrary type. It is arbitrary because the meta-variable t is unrestricted by the rule, other than it having to stand for a value type, which is inferred from the types of the relation and the syntax of function types.

In contrast, SELECT is specified to consume three values, two of which are of arbitrary but same type t, the other an I32. The result is a single t.

The rules for accessing local and global variables are a bit more interesting, as they have side conditions that look up the types of the respective variables in the context C. Side conditions represent premises of a rule.

These rules demonstrate several other features of SpecTec:

Records can be accessed by dot notation: C.LOCALS returns the field LOCALS of record C, in this case a sequence of value types.
Elements of sequences can be indexed by the notation e[i]; here that allows us to select the type of the respective local.
Equations can use patterns: the right-hand side of the GLOBAL.GET rule checks that the looked up global type is of the form MUT? t, i.e., an optional MUT flag and then some value type, which is named t in the rule. In contrast, GLOBAL.SET requires MUT to be present.

Execution

Execution is defined as a small-step reduction relation. It is defined over a configuration, which is a pair of the current state and the instruction sequence to reduce. The state, in turn, consists of the global store and the local function call frame; besides locals, the latter holds a reference to the current module instance:

syntax addr = nat
syntax moduleinst = { GLOBALS addr* }

syntax val = CONST valtype const

syntax store = { GLOBALS val* }
syntax frame = { LOCALS val*, MODULE moduleinst }
syntax state = store; frame
syntax config = state; instr*

var f : frame
var s : store
var z : state

Locals are allocated in the current function frame, such that their values can be accessed there directly. Globals on the other are allocated in the store and referenced via global addresses. The module instance in a frame is used to map a global‘s index as seen inside that module to its actual runtime address in the store. This indirection is necessary in Wasm because a module can be instantiated multiple times, yielding multiple copies of its locals at different addresses. (Module instantiation isn’t part of NanoWasm, but we follow the structure of the proper Wasm state anyway.)

The notation for the state and configuration pairs again is customised, with a semicolon as a symbolic atom separating their two components.

We also define values as a syntactic subset of instructions, which are the ones that cannot be reduced any further: SpecTec automatically recognises that the type val defines a subtype of instr because it contains a subset of its cases. Hence, values can be used in place of instructions, which we make use of below.

In order to ease notation in the reduction rules to come, we further define a couple of short-hands for accessing locals and globals as auxiliary meta-functions:

def $local(state, localidx) : val
def $local((s; f), x) = f.LOCALS[x]

def $global(state, globalidx) : val
def $global((s; f), x) = s.GLOBALS[f.MODULE.GLOBALS[x]]

def $update_local(state, localidx, val) : state
def $update_local((s; f), x, v) = s; f[.LOCALS[x] = v]

def $update_global(state, globalidx, val) : state
def $update_global((s; f), x, v) = s[.GLOBALS[f.MODULE.GLOBALS[x]] = v]; f

Meta-definitions are distinguished from meta-variables by a preceding $ in their name. They are given by first declaring their type, and then giving their definition in the form of one or multiple equational clauses. Function arguments on the left-hand side of these clauses can be patterns that bind variables, like the patterns for the state parameter in the definitions above.

The definitions $update_local and $update_global modify the state of a respective variable. They use another piece of SpecTec notation: f[.LOCALS[x] = v], for example, returns a copy of f in which the value of f.LOCALS[x] is replaced with v. This notation extends to arbitrary deep paths of field selection and indexing.

Definitions can also have zero parameters, in which case they define constants and the parentheses can be omitted. For example:

def $zero32 : val
def $zero32 = CONST I32 0

The actual reduction is defined by two stepping relations, declared as follows:

relation Step: config ~> config
relation Step_pure: instr* ~> instr*

The first is the main relation that defines a single step on configurations, while the second is a simpler variant used to define pure reductions that do not access the state. There is one rule that connects the two relations:

rule Step/pure:
    z; instr*  ~>  z; instr'*
    -- Step_pure: instr* ~> instr'*

This rule expresses that a configuration can take step if it can take a pure step according to the second relation, which results in the state z remaining unchanged. The line starting with -- is another premise, but unlike the Boolean side conditions we saw above, it inductively invokes another relation, which has to be named explicitly.

The NOP, DROP, and SELECT instructions are pure, such that they can be formulated with this simpler relation:

rule Step_pure/nop:
    NOP  ~>  eps

rule Step_pure/drop:
    val DROP  ~>  eps

rule Step_pure/select-true:
    val_1 val_2 (CONST I32 c) SELECT  ~>  val_1  -- if c =/= 0

rule Step_pure/select-false:
    val_1 val_2 (CONST I32 c) SELECT  ~>  val_2  -- otherwise

NOP does nothing, and consequently, is simply reduced to the empty instruction sequence.

DROP likewise, but it also eliminates a value instruction before it, which amounts to popping that value from the stack. By naming the meta-variable for the preceding instruction val, it is constrained to be of type val, hence it only matches values.

SELECT is defined by two cases with different side conditions inspecting the selector value c; where =/= expresses inequality. The side condition otherwise is a short-hand negating all premises for all previous rules with a left-hand side of the same shape. In this case, it amounts to the condition if c = 0. The SELECT instruction pops two values from the stack, but pushes one back, depending on the condition. Note how we can extend the names of the value variables with subscripts, which still constrains them to type val.

The rules for LOCAL.GET and GLOBAL.GET need to access the state, hence they are not pure and need to be defined via the general relation:

rule Step/local.get:
    z; (LOCAL.GET x)  ~>  z; val
    -- if val = $local(z, x)

rule Step/local.set:
    z; val (LOCAL.SET x)  ~>  z'; eps
    -- if z' = $update_local(z, x, val)

rule Step/global.get:
    z; (GLOBAL.GET x)  ~>  z; val
    -- if val = $global(z, x)

rule Step/global.get:
    z; val (GLOBAL.SET x)  ~>  z'; eps
    -- if z' = $update_global(z, x, val)

Here we are invoking the auxiliary meta-functions previously defined to read and write the value of the respective variable in the state.

Finally, to allow SpecTec to produce adequate output, we add the following hints to the script:

relation Step hint(tabular)
relation Step_pure hint(tabular)

These instruct the Latex backend to render these relations in clausal form, rather than as inference rules. The backend would reject the use of otherwise conditions if this format wasn't selected.

Binary Format

Finally, we want to define the binary format of NanoWasm‘s instruction set. That’s the purpose of SpecTec's last feature, namely attribute grammars.

Let's start simple with the binary grammars for types:

grammar Bvaltype : valtype =
    | 0x7F => I32
    | 0x7E => I64
    | 0x7D => F32
    | 0x7C => F64

grammar Bmut : mut? =
    | 0x00 => eps
    | 0x01 => MUT

grammar Bglobaltype : globaltype =
    | t:Bvaltype mut:Bmut => mut t

grammar Bresulttype : valtype* =
    | n:Bu32 (t:Bvaltype)^n => t^n

grammar Bfunctype : functype =
    | 0x60 t_1*:Bresulttype t_2*:Bresulttype => t_1* -> t_2*

Each definition specifies the grammar for one syntactic category as declared. It consists of one or several productions, each specifying one possible binary representation on the left-hand side, and the abstract syntax it yields on the right. For example, the grammar named Bvaltype specifies the binary format for value types, where the binary representation just consists of fixed byte values that map to the abstract representation for valtype that we defined earlier.

Global types are just value types followed by their mutability flag, which is defined in an auxiliary definition. The notation “t:” binds t to the abstract syntax produced by parsing the Bvaltype, and likewise for mut.

Function types are encoded by two sequences of value types, headed by a single byte of value 0x60. An auxiliary grammar defines result types as an u32 value, followed by a sequence of value types of that length. (The definition of Bu32 is a bit more involved so let us defer it for a moment.) Each value type in the sequence Bvaltype^n is bound to a t, and because t appears inside the iteration ^n, we end up with a sequence of t^n value types that is ultimately produced by this grammar.

Getting to instructions, they are encoded by their respective byte code, followed by the necessary immediates:

grammar Bglobalidx : globalidx = x:Bu32 => x
grammar Blocalidx : localidx = x:Bu32 => x

grammar Binstr : instr =
  | 0x01 => NOP
  | 0x1A => DROP
  | 0x1B => SELECT
  | 0x20 x:Blocalidx => LOCAL.GET x
  | 0x21 x:Blocalidx => LOCAL.SET x
  | 0x23 x:Bglobalidx => GLOBAL.GET x
  | 0x24 x:Bglobalidx => GLOBAL.SET x
  | 0x41 n:Bu32 => CONST I32 n
  | 0x42 n:Bu64 => CONST I64 n
  | 0x43 p:Bf32 => CONST F32 p
  | 0x44 p:Bf64 => CONST F64 p

At this point, the only piece missing is the promised definition of Bu32 and the other number representations:

grammar Bbyte : nat = b:0x00 | ... | b:0xFF => b

grammar Bu(N : nat) : nat =
  | n:Bbyte              => n                       -- if $(n < 2^7 /\ n < 2^N)
  | n:Bbyte m:Bu($(N-7)) => $(2^7 * m + (n - 2^7))  -- if $(n >= 2^7 /\ N > 7)

grammar Bu32 : const = n:Bu(32) => n
grammar Bu64 : const = n:Bu(64) => n

First, an auxiliary grammar for generic byte values is declared. The “x | ... | y” defines a numeric range, in this case, the whole value range of a byte.

The actual number representations use LEB128 encodings with a size-dependent length limit. In order to avoid repeating it for different lenghts, the definition uses a grammar that is parameterised over the bit-width N. The productions of the grammar have side conditions that restrict when they apply. The single byte representation is only applicable if its value is smaller than 128, meaning that its most-significant bit isn't set. This condition is part of the LEB128 format. Furthermore, we require that the value also must be smaller than the value range of number with the requested bitwidth. The notation $( ... ) is just SpecTec syntax enclosing an arithmetic expression. Such an explicit escape is necessary, because the use of operators like * and ^ would otherwise be syntactically ambiguous between their meanings as arithmetic operators and that as regular expressions.

Larger numbers require multiple bytes. This case is defined recursively: first pass one byte, whose value must be larger than or equal to 128, that is, its most-significant bit is set. It is also required that N is large enough to allow multiple bytes. The remainder then is parsed as a binary number of the remaining bitwidth N-7, and in the end, both parts of the result value are combined together to form the overall result.

Finally, the grammar for floating point values is also parameterised over its bit-width, but is just a fixed-length sequence of bytes:

grammar Bf(N : nat) : nat =
  | b*:Bbyte^(N/8) => $float(N, b*)

grammar Bf32 : const = p:Bf(32) => p
grammar Bf64 : const = p:Bf(64) => p

This assumes the presence of the meta-function $float, which would compute the float value represented by these bytes. Currently, the SpecTec specification of Wasm does not yet include numeric primitives, so this function (which is called $invfbytes in the actual spec) isn't actually defined via SpecTec yet. It is only declared and then implemented manually inside the tool.

Splicing Generated Definitions

With the specification of NanoWasm above, we could now create a document that includes the necessary definitions in both formal notation and prose. SpecTec supports this by applying it to document templates that contain splices of individual definitions.

Such documents are assumed to be in reStructuredText, since that is the format that the Wasm spec is currently using.

For example, assuming the above definition of NanoWasm is collected in a file NanoWasm.spectec, then we can use the command

spectec NanoWasm.spectec --splice-sphinx -p NanoWasm.rst.in -o NanoWasm.rst

to feed the following .rst.in file to SpecTec and produce proper .rst output:

NanoWasm
========

*NanoWasm* is a small language with simple types and instructions.

Abstract Syntax
---------------

The *abstract syntax* of types is as follows:

$${syntax: mut valtype functype globaltype}

Instructions take the following form:

$${syntax: const instr}

The instruction ${:NOP} does nothing,
${:DROP} removes an operand from the stack,
${:SELECT} picks one of two operands depending on a condition value.
The instruction ${instr: CONST t c} pushes the constant ${:c} to the stack.
The remaining instructions access local and global variables.

Validation
----------

NanoWasm instructions are *type-checked* under a context that assigns types to indices:

$${syntax: context}

${:NOP}
.......

$${rule-prose: Instr_ok/nop}
$${rule: Instr_ok/nop}

${:DROP}
........

$${rule-prose: Instr_ok/drop}
$${rule: Instr_ok/drop}

${:SELECT}
..........

$${rule-prose: Instr_ok/select}
$${rule: Instr_ok/select}

[...and so on...]

Two forms of splices occur in this example:

Expression splices (${typ: exp}). The body of these splices are SpecTec expressions. They are rendered into Latex formulas and inserted inline in place. An expression is prefixed by the name of its syntax type. This type can be omitted if it can be inferred or does not matter, but then SpecTec renders the expression as is, without attention to show or macro hints that might be attached to a type.
Definition splices ($${sort: name*}). These splices render and insert (a set of) SpecTec definitions in separate lines or paragraphs. The definitions are identified by their sort (e.g., syntax, rule) and their names, and rendered accordingly. The names may contain wildcards, which expand to all known rule names matching it.

Rule definitions can be spliced in two different forms, either as formal math, using the sort rule, or as English prose, using the sort rule-prose. In the above example, like in the actual Wasm spec, we insert both forms next to each other.

When multiple (non-prose) definitions are spliced together, then their formulas are laid out to align.

Note: A complete version of the example with a Makefile for running Sphinx and producing both HTML and PDF from it can be found in the example directory). The generated .rst file with can be seen here, as well as the final output in HTML or PDF format.