Implementation

A simple trick, the Duff device

It is possible to implement a very restricted set of coroutine functionality entirely within the language rules, using a variation of the duff device. See for exampleCoroutines in C. While this is a cunning hack, it has some severe limitations that makes it unsuitable for a general purpose library like Boost.Coroutine:

Does only support semi-coroutines: cannot yield form inside nested functions.
There can only be one instance of a coroutine for each coroutine body.
Local variables are static. This means that they are not destroyed when exceptions are thrown. In general the use of statics as locals is questionable.

The stack switching model

Boost.Coroutine is implemented around a stack switching model; that is, every time a coroutine is entered, the following actions are carried:

The caller instruction pointer and callee clobbered registers are saved on the caller stack (this is done automatically as part of the call to the context switching routine respectively by the compiler and by the CPU).
the set of callee save registers are saved by the context switching routine on the caller stack.
The caller stack pointer is saved inside the caller context structure. This structure is stored on the coroutine context if a yield_to or yield is being performed, else it is also on the caller stack.
The callee stack pointer is retrieved from the callee context structure and set as current stack pointer.
Callee save registers are popped from the callee stack.
The context switching routine returns, restoring automatically from the new stack the called coroutine instruction pointer and clobbered registers.

This process is inherently non portable, requires intimate knowledge of the target CPU and environment, and some assembly code. Also it assumes the existence of both a stack and a real CPU, neither of witch are required by the standard.

In practice the library should be portable to most platforms that have a C++ implementation, with only minimal changes. CPUs with registers windows shouldn't be a problem, nor should be systems with multiple stack. For example HP provides assembler source for coroutine stack switching for the Itanium architecture, that has both registers windows and multiple stacks [Saboff03].

Implementing Boost.Coroutine on an C++ interpreter would require support form the interpreter.

If all else fail, there is still the possibility of implementing Boost.Coroutine in top of threads, albeit at a very high performance penalty.

The myth of a portable coroutine library

Writing a portable coroutine library has been the subject of numerous studies. A part from the limited Duff device-based trick, the most promising has been the stack copying model. The lower and highest addresses of the current stack are calculated by taking the address of appropriately placed local objects. Then, when the stack switching is required, the current stack is copied to some backup location and from another location a target stack is copied in place of the old stack. The actual context switch is performed using setjmp and longjmp. As an optimization, the address of the stack pointer is found, in a non portable manner, in the jmp_buf and the stack is switched by modifying this pointer instead of performing an expensive copy.

This trick has meet a moderate success in the C world, even if in practice libraries using this trick need to take special per-platform actions (like working around standard libraries bugs, eager optimizers or simply identifying the position of the stack pointer in jmp_buf). Notice that nowhere the C standard guarantees that the stack can be switched with impunity. In fact, it doesn't even guarantees that a stack exists at all (so does the C++ standard).

In the C++ world things are complicated by the fact that the standard permits an implementation of longjmp to unwind the stack and call destructors. Usually the compiler determines how much to unwind the stack by comparing stack pointers. Modifying this pointer will wreak havoc. While most implementations do not take advantage of this possibilities, at least the OpenMP (that by no mean is restricted to this platform: for example is supported by the GCC compiler on most platforms it runs on) requires that. Also the _LLVM_ compiler provides a __builtin_longjmp builtin that is documented to unwind the stack (even if a standard library is not required to implement longjmp in term of this builtin, it is likely to do so).

Also some compilers and some operating systems requires some work to switch exceptions context. This is true at least on Win32 and on GCC setjmp/longjmp based exception handling. This is very system specific and a library that doesn't take this into account is likely to be defective.

Finally the interaction between coroutines and threads must be taken into account and any incompatibilities must be documented or fixed.

In conclusion, the backend of a C++ coroutine library cannot be oblivious of the system it runs on, thus it might as well be system specific. Similar platforms common code may still be shared.

Currently the library uses fibers on Win32 systems, custom assembler code on the very specific "GCC-linux-x86 using frame-unwind-tables based exception handling and a generic makecontext/setcontext based implementation on POSIX 2001 compliant systems. Note that, as the POSIX standard knows nothing about C++, this is not guaranteed to work on all platforms. Also the interaction between threads and user contexts is not specified by the standard.

Extensibility

The library has been designed to be very easily ported to other environments. All classes that need access to the non portable context switching code have a (currently undocumented) extra template parameter that permit the selection of the context switching implementation at compile time. Thus is technically possible to mix different context implementations in the same library.

The context switching support is very simple and tightly contained. It only requires a context structure and two function for context setup and context swap (the interface is modeled around the POSIX makecontext API). At this time the actual extension interface is not documented and a private detail, because it is likely to be further simplified (the current requirements of the Context concept have been complicated to explore potential performance optimizations).

A wild dream: Compiler support for coroutines

The stack switching model, while good enough for generalized cooperative multitasking, prevents some useful compiler optimizations. For example a compiler cannot optimize across a coroutine call, nor can inline the coroutine in the caller. This is expected in when the target is not statically known, for example in an event dispatching loop, but in usages where the control flow can be statically determined is a loss. For example generators function objects can usually be inlined while coroutine based function objects cannot.

A compiler that knows about coroutines could apply the same optimizations to coroutine based code, as long as a context switch target is known. This should not require much more power than the ability to inline function pointers and convert a coroutine body to callback based code. Even dynamic coroutine code could be rewritten it to callback based code, but in this case an indirect jump is required anyways and is not necessarily a win.

Compilers capable of inlining coroutines already exist. For example, compilers for languages with support for continuations often transform the code to be complied in the so called continuation passing style. Coroutines in these languages can be trivially implemented as continuations. These compilers can then optimize the code in CPS form, potentially inlining some continuation calls. Thus potentially coroutines are optimized too.

Unfortunately the CPS form is not suitable for C and C++ because it does not match well the execution model of these languages and even if possible it could impose some overhead.

The C# 2.0 language requires compilers to be capable of transforming semi-coroutine based code to callback based code. These compilers are not much different from C++ compilers. While the limitation to optimize semi-coroutine can seem major, in practice a coroutine can be converted to a semi-coroutine if all nested functions that call yield can be inlined in the coroutine itself. A compiler could fall back on stack switching if it cannot inline a yield (or it doesn't know at all if a function can yield).