docs: Add coroutine blog post

Change-Id: I473b693ce30e0b3a8f0032f0d63cfadbd76f0185
Reviewed-on: https://pigweed-review.googlesource.com/c/pigweed/pigweed/+/216111
Reviewed-by: Kayce Basques <kayce@google.com>
Commit-Queue: Taylor Cramer <cramertj@google.com>
Reviewed-by: Keir Mierle <keir@google.com>
Lint: Lint 🤖 <android-build-ayeaye@system.gserviceaccount.com>
diff --git a/docs/BUILD.gn b/docs/BUILD.gn
index 72a9036..a143a78 100644
--- a/docs/BUILD.gn
+++ b/docs/BUILD.gn
@@ -32,6 +32,8 @@
 # split out from the overall docgen target below.
 pw_doc_group("core_docs") {
   inputs = [
+    "$dir_pw_async2/examples/coro_blinky_loop.cc",
+    "$dir_pw_async2/public/pw_async2/coro.h",
     "$dir_pw_build/pigweed.bazelrc",
     "$dir_pw_build/py/gn_tests/BUILD.gn",
     "$dir_pw_build/py/pw_build/generate_python_wheel_cache.py",
@@ -46,6 +48,7 @@
     "blog/02-bazel-feature-flags.rst",
     "blog/03-pigweed-sdk.rst",
     "blog/04-fixed-point.rst",
+    "blog/05-coroutines.rst",
     "blog/index.rst",
     "blog/rss.xml",
     "code_of_conduct.rst",
diff --git a/docs/blog/05-coroutines.rst b/docs/blog/05-coroutines.rst
new file mode 100644
index 0000000..33a5ffb
--- /dev/null
+++ b/docs/blog/05-coroutines.rst
@@ -0,0 +1,499 @@
+.. _docs-blog-05-coroutines:
+
+=============================================================
+Pigweed Eng Blog #5: C++20 coroutines without heap allocation
+=============================================================
+*Published on 2024-10-07 by Taylor Cramer*
+
+Pigweed now provides support for coroutines on embedded
+targets!
+
+.. literalinclude:: ../../pw_async2/examples/coro_blinky_loop.cc
+   :language: cpp
+   :linenos:
+   :start-after: [pw_async2-examples-coro-blinky-loop]
+   :end-before: [pw_async2-examples-coro-blinky-loop]
+
+`Click here for more examples. <https://pigweed.googlesource.com/pigweed/pigweed/+/refs/heads/main/pw_async2/examples>`_
+
+Notably, Pigweed's :cpp:class:`pw::async2::Coro` API does not require heap
+allocation. This post details how we accompilished this, what challenges
+we encountered, and how you can get started using coroutines on embedded.
+
+If you're:
+
+* a C++ embedded developer looking for an ergonomic way to write async code
+* a Rust ``async`` / ``await`` user curious about C++
+* a C++ committee member who wants to make embedded developers happy
+
+read on!
+
+--------------------------------
+Why use coroutines for embedded?
+--------------------------------
+Coroutines allow for multiple functions to run concurrently.
+Compared to other concurrency options, coroutines have:
+
+* No need for thread stacks
+* No OS/RTOS-dependent context switching
+* No complex, manually written state machines
+
+The `coroutines API introduced in C++20
+<https://en.cppreference.com/w/cpp/language/coroutines>`_
+allows you to write
+`straight-line code <https://en.wikipedia.org/wiki/Straight-line_program>`_
+that can yield to the caller and resume at some later point. This can be used
+similarly to ``async`` / ``await`` from languages like Rust, C#, JavaScript,
+Python, and Swift.
+
+When a typical blocking function waits for a long-running operation to
+complete, the whole thread (and any resources associated with it) become
+blocked until the operation is done. By contrast, coroutine functions can
+yield to their caller and only resume once an asynchronous operation
+completes. The caller, often some kind of scheduler, is then free to run
+other code while the coroutine function waits for its result.
+
+This allows for executing many operations concurrently without paying the
+performance cost of threading or the cognitive cost of callbacks or manual
+state machines.
+
+---------------------------------------------
+Why C++ coroutines require dynamic allocation
+---------------------------------------------
+Unfortunately for embedded developers, the C++ coroutine API uses heap
+allocation by default. *Heap allocation* (e.g. ``new`` and ``delete`` or
+``malloc`` and ``free``) isn't available on many embedded platforms.
+Fortunately, heap allocation is not necessary to use C++ coroutines.
+
+However, C++20 coroutines do require *dynamic allocation*
+(memory that is allocated at runtime). Pigweed's
+:cpp:class:`pw::async2::Coro` API allocates memory using an
+:cpp:class:`pw::Allocator`. This avoids the need for a platform-provided
+heap allocator, but users may still encounter some of the costs associated
+with dynamic allocation, such as:
+
+* Heap fragmentation
+* Runtime memory management size and performance overhead
+* Runtime out-of-memory conditions
+* The need to "right-size" a heap
+
+This dynamic allocation is done in order to store the "coroutine frame", an
+object which holds the coroutine's state.
+
+The coroutine frame
+===================
+Despite not using a native thread stack, C++20 coroutines do need somewhere to
+store the "stack" of paused functions. For example, in our ``Blinky`` function
+from above:
+
+.. literalinclude:: ../../pw_async2/examples/coro_blinky_loop.cc
+   :language: cpp
+   :linenos:
+   :start-after: [pw_async2-examples-coro-blinky-loop]
+   :end-before: [pw_async2-examples-coro-blinky-loop]
+
+When we reach a ``co_await`` statement, the function is paused and control
+is returned to the scheduler until ``500ms`` passes. When paused, we still
+need to store the function's state, including the loop variable
+``i``, the function arguments, and a pointer to where we were in the function
+when it paused.
+
+This coroutine state (hereafter referred to as a "coroutine frame") typically
+requires much less memory than a full thread stack, as it can allocate exactly
+the amount of memory the function needs for variables that are "live" across
+``co_await`` points.
+
+.. admonition:: Layout of coroutine frames
+
+   The actual structure of this object is a compiler implementation
+   detail, but you can read more about how Clang structures it in
+   `The structure of coroutine frames <https://clang.llvm.org/docs/DebuggingCoroutines.html#coroutine-frame>`_.
+
+Static vs. dynamic allocation of the coroutine frame
+====================================================
+When a coroutine is first created, the C++ coroutine API dynamically allocates
+space to store the coroutine frame and returns a pointer to it called a
+`coroutine handle <https://en.cppreference.com/w/cpp/coroutine/coroutine_handle>`_.
+By contrast, `Rust's async functions
+<https://rust-lang.github.io/async-book/01_getting_started/04_async_await_primer.html>`_
+directly return coroutine frames inline, allowing for static allocation.
+Why did C++ make the choice to dynamically allocate? Some key reasons are that
+returning a pointer to dynamically allocated memory keeps the coroutine
+function's return value small, movable, and most importantly ABI-stable.
+
+ABI stability
+=============
+ABI stability is a particular challenge when designing coroutines. The coroutine
+frame contains the stack of the coroutine function, so the coroutine frame's
+size is dependent upon the number and size of the variables used in the function
+implementation.
+
+For example, two functions with this declaration can have differently sized
+coroutine frames:
+
+.. code-block:: c++
+
+   Coro<int> GimmeANumber();
+
+An implementation like this would have a very small coroutine frame:
+
+.. code-block:: c++
+
+   Coro<int> GimmeANumber() {
+     co_return 5;
+   }
+
+While this function would need a very large coroutine frame in order
+to store its temporary state:
+
+.. code-block:: c++
+
+   Coro<int> GimmeANumber() {
+     ABigValue a = co_await DownloadAllOfReddit();
+     AnotherOne b = co_await DownloadAllOfHackerNews();
+     co_return a.size() + b.size();
+   }
+
+If we were to inline the coroutine frame in the resulting ``Coro<int>`` object,
+that would mean that ``Coro<int>`` would have to be a different size depending
+on the function implementation. The two functions would therefore have different
+ABIs, and we would no longer be able to refer to ``Coro<int>`` as a single
+statically sized type.
+
+This would also make it impossible to call a coroutine function from a different
+translation unit without knowing the details of its implementation. Without
+additional compiler features, inlined coroutine frames would require that public
+coroutine functions be implemented in headers (similar to templated or
+``constexpr`` functions).
+
+Similarly, inlined coroutine frames would prevent coroutine functions from
+being ``virtual``, as different overrides would have different ABIs.
+
+Rust bites the bullet
+=====================
+Rust inlines the coroutine frame and runs headfirst into these challenges.
+The coroutine-style object returned by a Rust ``async fn`` is:
+
+* Immovable once the coroutine has started. This is necessary in order to
+  ensure that pointers to variables in the coroutine stack are not invalidated.
+* Dependent upon the implementation of the ``async`` function. Adding a new
+  variable in the body of the ``async`` function will require more "stack"
+  space, so the size and structure of the returned state machine will change
+  as the implementation of the ``async`` function changes.
+* Not usable with dynamic dispatch (Rust's ``virtual``) without an
+  intermediate layer which dynamically allocates the coroutine frame
+  (see the `async_trait <https://docs.rs/async-trait/latest/async_trait/>`_
+  library for one implementation of this pattern).
+* Challenging to compile. Since the size of the coroutine object is
+  dependent on the function implementation, code which uses the coroutine
+  function can't be compiled until the compiler has first determined the
+  size of the coroutine object.
+
+However, this tradeoff also means that Rust can statically allocate space
+for coroutines, avoiding all the pitfalls of dynamic allocation. The current
+C++ coroutine API doesn't allow for this.
+
+---------------------
+What Pigweed provides
+---------------------
+Despite the need for dynamic allocation, it is still possible to use coroutines
+on embedded devices. Today, Pigweed is using coroutines in the :ref:`showcase-sense`
+showcase for the Raspberry Pi RP2040 and (new!) RP2350 microcontrollers.
+
+In order to accomplish this, we needed the Pigweed :cpp:class:`pw::async2::Coro`
+API to:
+
+* Avoid heap allocation.
+* Allow recovering from allocation failure without using exceptions.
+* Allow using custom, context-specific
+  :cpp:class:`pw::Allocator` implementations on a per-coroutine basis.
+
+If you want to skip ahead, you can `see the Coro implementation here.
+<https://cs.opensource.google/pigweed/pigweed/+/main:pw_async2/public/pw_async2/coro.h>`_
+
+--------------------
+Eliminating the heap
+--------------------
+As I said before, the C++ coroutine API defaults to heap allocation.
+Fortunately for us embedded devs, the coroutine ``promise_type`` can customize
+the behavior of this allocation by implementing the following functions:
+
+* ``static void* operator new(std::size_t, Args...) noexcept``
+* ``static MyReturnObjectType get_return_object_on_allocation_failure()``
+* ``static void operator delete(void*)``
+
+Allocating memory
+=================
+Implementing ``operator new`` lets us control how coroutine memory is allocated.
+Our custom ``operator new`` takes two parameters: a ``std::size_t`` and an
+``Args...`` parameter pack. Let's explore how these are handled.
+
+The ``size`` parameter
+----------------------
+Unlike regular ``operator new`` or `placement new
+<https://en.cppreference.com/w/cpp/language/new#Placement_new>`_,
+the ``promise_type`` ``operator new`` function allocates space not just for
+the ``promise_type`` itself, but also for the compiler-generated coroutine
+stack. The size of this coroutine stack is determined by the compiler and is
+optimization-dependent, so it is not statically observable and is only available
+at runtime.
+
+The ``Args...`` parameter pack
+------------------------------
+The arguments to coroutine functions are also passed into the ``operator new``
+of the corresponding ``promise_type``. This allows ``operator new`` to "intercept"
+arguments passed to the coroutine function.
+
+This means that if we have some
+coroutine function:
+
+.. code-block:: c++
+
+   MyCoroutineType some_func(int a)
+
+and ``MyCoroutineType::promise_type`` has an ``operator new`` like:
+
+.. code-block:: c++
+
+   static void* operator new(std::size_t n, int a) noexcept
+
+then the argument ``a`` will be copied into the invocation of ``operator new``.
+
+:cpp:class:`pw::async2::Coro` uses this tool to allow the caller of a coroutine
+function to specify which memory allocator to use for the coroutine frame.
+
+Different coroutines may want to use memory provided by different allocators
+with various memory locations or allocation strategies. When allocating a
+regular type ``T``, we'd typically achieve this by pre-allocating ``sizeof(T)``
+with our custom allocator and then passing the resulting memory into the
+`placement new
+<https://en.cppreference.com/w/cpp/language/new#Placement_new>`_
+function of ``T``.
+
+However, unlike regular types, coroutine constructors or ``operator new`` are not
+directly available to the caller. Instead, the coroutine types are constructed
+by the compiler when the coroutine function is called.
+
+Furthermore, C++20 doesn't provide a way to know ahead of time how much memory
+needs to be allocated to store the coroutine state—that information is only
+available once our custom ``operator new`` is invoked with a ``size`` value.
+
+Instead, :cpp:class:`pw::async2::Coro` allows custom per-coroutine ``Allocator``
+values to be provided by intercepting an ``Allocator`` argument to the coroutine
+function. One can pass an allocator as the first argument of all coroutine
+functions. We can then use an ``Args...`` variadic to discard all other
+arguments to the coroutine function:
+
+.. code-block:: c++
+
+   template<typename... Args>
+   static void* operator new(std::size_t n,
+                             Allocator& alloc,
+                             const Args&...) noexcept {
+      return alloc.Allocate(n);
+   }
+
+The user's coroutine functions will then look like this:
+
+.. code-block:: c++
+
+   MyCoroutineType some_func(Allocator&, int a) { ... }
+
+The ``operator new`` implementation will then receive the caller-provided ``Allocator``
+reference and can use it to allocate the coroutine state.
+
+.. admonition:: ``CoroContext``
+
+   The allocator argument to :cpp:class:`pw::async2::Coro` is actually packaged inside a
+   :cpp:class:`pw::async2::CoroContext` in order to allow for greater evolvability, but
+   today this type is a simple wrapper over a :cpp:class:`pw::Allocator`.
+
+Handling allocation failure
+===========================
+Many embedded systems both disable exceptions and wish to recover gracefully
+from allocation failure. If this is the case, ``promise_type::operator new`` must
+be marked ``noexcept`` and return ``nullptr`` on allocation failure.
+
+Note that our coroutine function, however, doesn't have an obvious way to
+return ``nullptr``:
+
+.. code-block:: c++
+
+   MyCoroutineType some_func(Allocator&, int a) { ... }
+
+It *must* produce a value of type ``MyCoroutineType``.
+
+This is made possible by
+``promise_type::get_return_object_on_allocation_failure``. When ``operator new``
+returns ``nullptr``, C++ will invoke ``get_return_object_on_allocation_failure``
+to produce an "empty" ``MyCoroutineType``. Coroutine return types must therefore
+decide on a sentinel representation to use in order to indicate allocation
+failure.
+
+Fortunately, most coroutine return types are simple wrappers around
+``std::coroutine_handle<my_promise_type>``, which is nullable, so we can use the
+``nullptr`` representation to indicate allocation failure.
+
+The ``delete`` function is missing critical pieces. What now?
+=============================================================
+Just as we can control allocation with our custom ``promise_type::operator new``,
+we can control deallocation with ``promise_type::operator delete``!
+
+But unlike ``operator new``, ``delete`` cannot intercept function arguments,
+nor can it access properties of the coroutine frame. By the time ``delete``
+is called, the coroutine frame has already been destroyed. Other parts of
+C++ use the ``destroying_delete`` API to allow accessing an object as part
+of its deletion, but the coroutine ``promise_type`` doesn't include such an
+API.
+
+This means that ``delete`` has no way to get access to the ``Allocator``
+instance from which the memory was allocated. Pigweed's
+:cpp:class:`pw::Allocator` API does not provide a way to free memory from only
+a pointer to the allocated memory; a reference to the ``Allocator`` object is
+required (particular ``Allocator`` instances may support this, but the generic
+interface does not). :cpp:class:`pw::async2::Coro` solves this by specifying
+``operator delete`` to be a no-op. Instead, the coroutine memory is freed by
+the coroutine handle wrapper type (``Coro``) *after* running
+``operator delete``:
+
+.. literalinclude:: ../../pw_async2/public/pw_async2/coro.h
+   :language: cpp
+   :linenos:
+   :start-after: [pw_async2-coro-release]
+   :end-before: [pw_async2-coro-release]
+
+.. admonition:: Promise handle address
+
+   The standard doesn't appear to clarify whether or not this usage is
+   supported. I couldn't find any documentation specifying that the promise
+   handle address must point to the *start* of the allocated memory (rather
+   than, say, the allocated memory plus eight bytes or something).
+
+   In practice, this seems to work just fine: the Pigweed API works in both
+   GCC and Clang, even with sanitizers enabled.
+
+   In the future, a guarantee here would be super useful!
+
+----------------------------------------------------------------
+Still to go: improvements needed from C++ standard and compilers
+----------------------------------------------------------------
+One concern that may hamper adoption of coroutines in embedded applications is
+the total size of coroutine frames. While these frames are frequently much
+smaller than the stack sizes of threads, there may be many more of them.
+
+Especially for users who would otherwise write manual state-machine-based or
+callback-based code, the current size overhead of coroutines may be a limiting
+factor. Some opportunities for improvement here are discussed below.
+
+Reusing stack space of expired temporaries
+==========================================
+Currently, Clang does not use stack space of temporaries after their
+storage duration ends. This means that all variables and temporaries
+in a coroutine function must have dedicated storage space in the coroutine frame.
+This is unnecessary: many variables are not alive at the same time, and others
+may not ever be alive across a ``co_await`` point! In the former case, storage
+for these variables could be overlapped, and in the latter, they need not be
+stored in the coroutine frame at all.
+
+This results in having a coroutine frame of 88 bytes:
+
+.. code-block:: c++
+
+   Coro<Status> SimplestCoroWithAwait(CoroContext& cx) {
+     co_return co_await SimplestCoro(cx);
+   }
+
+And this code having a coroutine frame of 408 bytes:
+
+.. code-block:: c++
+
+   Coro<Status> SimplestCoroWithTwoAwaits(CoroContext& cx) {
+     co_await SimplestCoro(cx);
+     co_await SimplestCoro(cx);
+     co_await SimplestCoro(cx);
+     co_await SimplestCoro(cx);
+     co_await SimplestCoro(cx);
+     co_await SimplestCoro(cx);
+     co_await SimplestCoro(cx);
+     co_await SimplestCoro(cx);
+     co_await SimplestCoro(cx);
+     co_await SimplestCoro(cx);
+     co_return co_await SimplestCoro(cx);
+   }
+
+Despite both requiring the same amount of temporary storage.
+
+Discussion on this topic continues in
+`this LLVM issue regarding the missing lifetime markers on temporaries <https://github.com/llvm/llvm-project/issues/43598>`_.
+
+A similar issue in Rust was solved thanks to tmandry's efforts culminating
+in `this PR <https://github.com/rust-lang/rust/pull/60187>`_.
+
+Passing values in and out of coroutines
+=======================================
+Once a coroutine function has been invoked, it can only be advanced
+(and completed) via a series of calls to
+`resume <https://en.cppreference.com/w/cpp/coroutine/coroutine_handle/resume>`_.
+Annoyingly, ``resume`` does not accept any arguments and returns ``void``,
+so it does not offer a built-in way to pass arguments into our coroutine
+function, nor does it give us a way to return values from our coroutine function.
+
+:cpp:class:`pw::async2::Coro` solves this by storing a pointer to input
+and output storage inside the ``promise_type`` before invoking ``resume``.
+
+.. literalinclude:: ../../pw_async2/public/pw_async2/coro.h
+   :language: cpp
+   :linenos:
+   :start-after: [pw_async2-coro-resume]
+   :end-before: [pw_async2-coro-resume]
+
+This means that every coroutine frame must additionally include space for this
+input/output pointer. This wasted overhead seems fairly easy to eliminate if
+C++ evolved to allow ``promise_type`` to have associated ``resume_arg_type``
+and ``return_arg_type`` values.
+
+Avoiding dynamic allocation for statically known coroutines
+===========================================================
+Although the steps described in this blog will allow you to avoid heap
+allocation, they still rely on *dynamic* allocation via a
+:cpp:class:`pw::Allocator`. This means that our application can potentially
+fail at runtime by running out of memory for our coroutines.
+
+For code which only instantiates a single known coroutine or fixed set of
+known coroutines defined within the same translation unit, this is an
+unnecessary cost. Dynamic allocation creates both runtime inefficiency,
+developer uncertainty, and risk: it's no longer possible to statically ensure
+that your program will be able to run on-device.
+
+It would be fantastic if future C++ standards would make it possible to access
+the coroutine frame size of specific, known coroutines so that space for them
+could be statically allocated.
+
+.. admonition:: Heap allocation elision optimization (HALO)
+
+   `Heap allocation elision optimization <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0981r0.html>`_
+   can omit the calls to ``new`` and ``delete`` entirely in some cases.
+   However, this relies on the inlining of several independent functions
+   to the top-level scope which encompasses the full lifetime of the
+   coroutine. This isn't possible in the vast majority of the relevant
+   embedded use cases where it is common to store a coroutine as part of a
+   long-lived (often ``static``) object.
+
+   Furthermore, relying on such an optimization would be quite fragile, as
+   the compiler's choice to inline (or not) is dependent upon a variety of
+   heuristics outside the programmer's control.
+
+   Note, too, that when using custom allocator implementations one has
+   to take care or elision will not be possible: the allocation and
+   deallocation functions must be tagged with
+   `the appropriate attributes <https://discourse.llvm.org/t/optimize-away-memory-allocations/65587/4>`_
+   to tell the compiler that they are safe to optimize away.
+
+-----------------------------------
+Try out Pigweed's coroutine API now
+-----------------------------------
+To experiment with Pigweed's :cpp:class:`pw::async2::Coro` API, you can get
+started by `cloning our Quickstart repo <https://cs.opensource.google/pigweed/quickstart/bazel>`_
+for a bare-bones example
+or trying out the :ref:`Sense tutorial <showcase-sense-tutorial-intro>`
+for a full tour of a Pigweed product codebase.
diff --git a/docs/blog/index.rst b/docs/blog/index.rst
index cf2902c..d1a0ac2 100644
--- a/docs/blog/index.rst
+++ b/docs/blog/index.rst
@@ -19,4 +19,5 @@
    #2: Feature flags in Bazel <02-bazel-feature-flags>
    #3: Pigweed SDK launch <03-pigweed-sdk>
    #4: Fixed Point Arithmetic as a Replacement for Soft Floats <04-fixed-point>
+   #5: C++20 coroutines without heap allocation <05-coroutines>
    RSS feed <https://pigweed.dev/rss.xml>
diff --git a/docs/blog/rss.xml b/docs/blog/rss.xml
index 8c99db1..8b2c211 100644
--- a/docs/blog/rss.xml
+++ b/docs/blog/rss.xml
@@ -8,9 +8,20 @@
       share ideas and projects related to Pigweed.
     </description>
     <language>en-us</language>
-    <pubDate>Mon, 30 Sep 2024 13:47:00 -0700</pubDate>
+    <pubDate>Mon, 7 Oct 2024 08:00:00 -0700</pubDate>
     <atom:link href="https://pigweed.dev/rss.xml" rel="self" type="application/rss+xml"/>
     <item>
+      <title>C++20 coroutines without heap allocation</title>
+      <link>https://pigweed.dev/docs/blog/05-coroutines.html</link>
+      <description>
+        This blog post describes how you can use C++20 coroutines to create
+        embedded-friendly async code without writing manual state machines or
+        callbacks.
+      </description>
+      <pubDate>Mon, 7 Oct 2024 08:00:00 -0700</pubDate>
+      <guid>https://pigweed.dev/docs/blog/05-coroutines.html</guid>
+    </item>
+    <item>
       <title>Fixed-point arithmetic as a replacement for soft floats</title>
       <link>https://pigweed.dev/docs/blog/04-fixed-point.html</link>
       <description>
diff --git a/pw_async2/examples/BUILD.bazel b/pw_async2/examples/BUILD.bazel
index ab84f92..71895f7 100644
--- a/pw_async2/examples/BUILD.bazel
+++ b/pw_async2/examples/BUILD.bazel
@@ -32,6 +32,19 @@
 )
 
 pw_cc_test(
+    name = "coro_blinky_loop",
+    srcs = ["coro_blinky_loop.cc"],
+    deps = [
+        "//pw_allocator:testing",
+        "//pw_async2:coro",
+        "//pw_async2:dispatcher",
+        "//pw_async2:simulated_time_provider",
+        "//pw_async2:time_provider",
+        "//pw_chrono:system_clock",
+    ],
+)
+
+pw_cc_test(
     name = "once_send_recv",
     srcs = ["once_send_recv.cc"],
     deps = [
diff --git a/pw_async2/examples/BUILD.gn b/pw_async2/examples/BUILD.gn
index 9a48b5d..b06ef6f 100644
--- a/pw_async2/examples/BUILD.gn
+++ b/pw_async2/examples/BUILD.gn
@@ -23,19 +23,32 @@
     enable_if = pw_async2_DISPATCHER_BACKEND != ""
     deps = [
       "$dir_pw_allocator:testing",
-      "//pw_async2:coro",
-      "//pw_async2:dispatcher",
+      "$dir_pw_async2:coro",
+      "$dir_pw_async2:dispatcher",
     ]
     sources = [ "basic.cc" ]
   }
 
+  pw_test("coro_blinky_loop") {
+    enable_if = pw_async2_DISPATCHER_BACKEND != ""
+    sources = [ "coro_blinky_loop.cc" ]
+    deps = [
+      "$dir_pw_allocator:testing",
+      "$dir_pw_async2:coro",
+      "$dir_pw_async2:dispatcher",
+      "$dir_pw_async2:simulated_time_provider",
+      "$dir_pw_async2:time_provider",
+      "$dir_pw_chrono:system_clock",
+    ]
+  }
+
   pw_test("once_send_recv") {
     enable_if = pw_async2_DISPATCHER_BACKEND != ""
     deps = [
       "$dir_pw_allocator:testing",
-      "//pw_async2:coro",
-      "//pw_async2:dispatcher",
-      "//pw_async2:once_sender",
+      "$dir_pw_async2:coro",
+      "$dir_pw_async2:dispatcher",
+      "$dir_pw_async2:once_sender",
     ]
     sources = [ "once_send_recv.cc" ]
   }
@@ -46,6 +59,7 @@
   if (pw_toolchain_CXX_STANDARD >= pw_toolchain_STANDARD.CXX20) {
     tests += [
       ":basic",
+      ":coro_blinky_loop",
       ":once_send_recv",
     ]
   }
diff --git a/pw_async2/examples/CMakeLists.txt b/pw_async2/examples/CMakeLists.txt
index 9e2d73d..222eb69 100644
--- a/pw_async2/examples/CMakeLists.txt
+++ b/pw_async2/examples/CMakeLists.txt
@@ -23,6 +23,18 @@
       pw_async2.coro
   )
 
+  pw_add_test(pw_async2.examples.coro_blinky_loop
+    SOURCES
+      coro_blinky_loop.cc
+    PRIVATE_DEPS
+      pw_allocator.testing
+      pw_async2.coro
+      pw_async2.dispatcher
+      pw_async2.time_provider
+      pw_async2.simulated_time_provider
+      pw_chrono.system_clock
+  )
+
   pw_add_test(pw_async2.examples.once_send_recv
     SOURCES
       once_send_recv.cc
diff --git a/pw_async2/examples/basic.cc b/pw_async2/examples/basic.cc
index 948e792..5d79962 100644
--- a/pw_async2/examples/basic.cc
+++ b/pw_async2/examples/basic.cc
@@ -21,7 +21,6 @@
 using ::pw::OkStatus;
 using ::pw::Result;
 using ::pw::Status;
-using ::pw::allocator::Allocator;
 using ::pw::async2::Context;
 using ::pw::async2::Poll;
 
@@ -135,7 +134,6 @@
 using ::pw::OkStatus;
 using ::pw::Result;
 using ::pw::Status;
-using ::pw::allocator::Allocator;
 using ::pw::async2::Coro;
 using ::pw::async2::CoroContext;
 
@@ -145,18 +143,18 @@
 /// Create a coroutine which asynchronously receives a value from
 /// ``receiver`` and forwards it to ``sender``.
 ///
-/// Note: the ``Allocator`` argument is used by the ``Coro<T>`` internals to
+/// Note: the ``CoroContext`` argument is used by the ``Coro<T>`` internals to
 /// allocate the coroutine state. If this allocation fails, ``Coro<Status>``
 /// will return ``Status::Internal()``.
 Coro<Status> ReceiveAndSendCoro(CoroContext&,
                                 MyReceiver receiver,
                                 MySender sender) {
-  pw::Result<MyData> data = co_await receiver.Receive();
+  Result<MyData> data = co_await receiver.Receive();
   if (!data.ok()) {
     PW_LOG_ERROR("Receiving failed: %s", data.status().str());
     co_return Status::Unavailable();
   }
-  pw::Status sent = co_await sender.Send(std::move(*data));
+  Status sent = co_await sender.Send(std::move(*data));
   if (!sent.ok()) {
     PW_LOG_ERROR("Sending failed: %s", sent.str());
     co_return Status::Unavailable();
diff --git a/pw_async2/examples/coro_blinky_loop.cc b/pw_async2/examples/coro_blinky_loop.cc
new file mode 100644
index 0000000..b52081e
--- /dev/null
+++ b/pw_async2/examples/coro_blinky_loop.cc
@@ -0,0 +1,102 @@
+// Copyright 2024 The Pigweed Authors
+//
+// Licensed under the Apache License, Version 2.0 (the "License"); you may not
+// use this file except in compliance with the License. You may obtain a copy of
+// the License at
+//
+//     https://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+// WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+// License for the specific language governing permissions and limitations under
+// the License.
+
+#include "pw_async2/coro.h"
+#include "pw_async2/dispatcher.h"
+#include "pw_async2/simulated_time_provider.h"
+#include "pw_chrono/system_clock.h"
+#include "pw_result/result.h"
+#include "pw_status/status.h"
+
+namespace {
+
+using ::pw::OkStatus;
+using ::pw::Result;
+using ::pw::Status;
+using ::pw::async2::Coro;
+using ::pw::async2::CoroContext;
+using ::pw::async2::TimeProvider;
+using ::pw::chrono::SystemClock;
+using ::std::chrono_literals::operator""ms;
+
+class Led {
+ public:
+  void TurnOn() {}
+  void TurnOff() {}
+};
+
+// DOCSTAG: [pw_async2-examples-coro-blinky-loop]
+Coro<Status> Blink(CoroContext&,
+                   TimeProvider<SystemClock>& time,
+                   Led& led,
+                   int times) {
+  for (int i = 0; i < times; ++i) {
+    led.TurnOn();
+    co_await time.WaitFor(500ms);
+    led.TurnOff();
+    co_await time.WaitFor(500ms);
+  }
+  led.TurnOff();
+  co_return OkStatus();
+}
+// DOCSTAG: [pw_async2-examples-coro-blinky-loop]
+
+}  // namespace
+
+#include "pw_allocator/testing.h"
+
+namespace {
+
+using ::pw::OkStatus;
+using ::pw::allocator::test::AllocatorForTest;
+using ::pw::async2::Context;
+using ::pw::async2::Coro;
+using ::pw::async2::CoroContext;
+using ::pw::async2::Dispatcher;
+using ::pw::async2::Pending;
+using ::pw::async2::Poll;
+using ::pw::async2::Ready;
+using ::pw::async2::SimulatedTimeProvider;
+using ::pw::async2::Task;
+
+class ExpectCoroTask final : public Task {
+ public:
+  ExpectCoroTask(Coro<pw::Status>&& coro) : coro_(std::move(coro)) {}
+
+ private:
+  Poll<> DoPend(Context& cx) final {
+    Poll<Status> result = coro_.Pend(cx);
+    if (result.IsPending()) {
+      return Pending();
+    }
+    EXPECT_EQ(*result, OkStatus());
+    return Ready();
+  }
+  Coro<pw::Status> coro_;
+};
+
+TEST(CoroExample, ReturnsOk) {
+  AllocatorForTest<512> alloc;
+  CoroContext coro_cx(alloc);
+  SimulatedTimeProvider<SystemClock> time;
+  Led led;
+  ExpectCoroTask task = Blink(coro_cx, time, led, /*times=*/3);
+  Dispatcher dispatcher;
+  dispatcher.Post(task);
+  while (dispatcher.RunUntilStalled().IsPending()) {
+    time.AdvanceUntilNextExpiration();
+  }
+}
+
+}  // namespace
diff --git a/pw_async2/public/pw_async2/coro.h b/pw_async2/public/pw_async2/coro.h
index 160c461..e5ebf53 100644
--- a/pw_async2/public/pw_async2/coro.h
+++ b/pw_async2/public/pw_async2/coro.h
@@ -51,6 +51,31 @@
 
 void LogCoroAllocationFailure(size_t requested_size);
 
+template <typename T>
+class OptionalWrapper final {
+ public:
+  // Create an empty container for a to-be-provided value.
+  OptionalWrapper() : value_() {}
+
+  // Assign a value.
+  template <typename U>
+  OptionalWrapper& operator=(U&& value) {
+    value_ = std::forward<U>(value);
+    return *this;
+  }
+
+  // Retrieve the inner value.
+  //
+  // This operation will fail if no value was assigned.
+  operator T() {
+    PW_ASSERT(value_.has_value());
+    return *value_;
+  }
+
+ private:
+  std::optional<T> value_;
+};
+
 // A container for a to-be-produced value of type `T`.
 //
 // This is designed to allow avoiding the overhead of `std::optional` when
@@ -60,55 +85,10 @@
 // - a default-initialized `T` if `T` is default-initializable or
 // - `std::nullopt`
 template <typename T>
-class OptionalOrDefault final {
- public:
-  // Create an empty container for a to-be-provided value.
-  OptionalOrDefault() : value_() {}
-
-  // Assign a value.
-  template <typename U>
-  OptionalOrDefault& operator=(U&& value) {
-    value_ = std::forward<U>(value);
-    return *this;
-  }
-
-  // Retrieve the inner value.
-  //
-  // This operation will fail if no value was assigned.
-  T& operator*() {
-    PW_ASSERT(value_.has_value());
-    return *value_;
-  }
-
- private:
-  std::optional<T> value_;
-};
-
-// A specialization of `OptionalOrDefault<T>` for `default_initializable`
-// types.
-template <std::default_initializable T>
-class OptionalOrDefault<T> final {
- public:
-  // Create a container for a to-be-provided value by default-initializing.
-  OptionalOrDefault() : value_() {}
-
-  // Assign a value.
-  template <typename U>
-  OptionalOrDefault& operator=(U&& value) {
-    value_ = std::forward<U>(value);
-    return *this;
-  }
-
-  // Retrieve the inner value.
-  //
-  // This operation will return a default-constructed `T` if no value was
-  // assigned. Typical users should not rely on this, and should instead
-  // only retrieve values assigned using `operator=`.
-  T& operator*() { return value_; }
-
- private:
-  T value_;
-};
+using OptionalOrDefault =
+    std::conditional<std::is_default_constructible<T>::value,
+                     T,
+                     OptionalWrapper<T>>::type;
 
 // A wrapper for `std::coroutine_handle` that assumes unique ownership of the
 // underlying `PromiseType`.
@@ -173,6 +153,7 @@
   // Invokes `destroy()` on the underlying promise and deallocates its
   // associated storage.
   void Release() {
+    // DOCSTAG: [pw_async2-coro-release]
     void* address = promise_handle_.address();
     if (address != nullptr) {
       pw::allocator::Deallocator& dealloc = promise_handle_.promise().dealloc_;
@@ -180,6 +161,7 @@
       promise_handle_ = nullptr;
       dealloc.Deallocate(address);
     }
+    // DOCSTAG: [pw_async2-coro-release]
   }
 
  private:
@@ -478,6 +460,7 @@
         promise_handle_.promise().currently_pending_(cx).IsPending()) {
       return Pending();
     }
+    // DOCSTAG: [pw_async2-coro-resume]
     // Create the arguments (and output storage) for the coroutine.
     internal::InOut<T> in_out;
     internal::OptionalOrDefault<T> return_value;
@@ -498,7 +481,8 @@
 
     // When the coroutine completed in `resume()` above, it stored its
     // `co_return` value into `return_value`. This retrieves that value.
-    return std::move(*return_value);
+    return return_value;
+    // DOCSTAG: [pw_async2-coro-resume]
   }
 
   /// Used by the compiler in order to create a `Coro<T>` from a coroutine