In this section we will show an example of an assembly implementation of Boost.Coroutine's low level context switching. While the example is x86 and Linux specific, it can easily be generalized to other operating systems and CPUs. It is believed that the same code should work unmodified on BSD-derived systems.
Notice that the examples here use the AT&T assembler syntax instead of the more common Intel syntax. There are not many differences, except that the order of source and destination operands is reversed and the opcode name encodes the length of the operands.
The exploration of a possible stack switching implementation started from an analysis of the GNU glibc swapcontext implementation. We do not include the actual code here because of license issues, but we will comment on it. The actual code can be found in the file swapcontext.S of the glibc source archive.
swapcontext proceeds roughly as follows:

1. It first loads the address of the buffer where the context will be saved from the stack, where it has been pushed as part of the call setup. This buffer will be called the destination buffer.
2. movl is used to copy the content of all general purpose registers to the destination buffer. For EAX a dummy value is saved, because it will be clobbered by makecontext.
3. The instruction pointer is saved in the destination buffer. The value of this register is retrieved from the stack, where it had been pushed by the call instruction.
4. The ESP stack pointer is saved in the destination buffer.
5. The FS segment register is saved in the destination buffer. Originally swapcontext also saved the GS register, but it was found that this conflicted with threading.
6. The floating point environment is saved with fnstenv in the destination buffer. This includes the control word, status word, tag word, instruction pointer, data pointer and last opcode, but excludes the floating point register stack. This is about 28 bytes of data.
7. The sigprocmask system call is invoked to restore the signal mask.
8. The GS register is restored from the source buffer.
9. The instruction pointer (EIP) is restored from the source buffer and pushed on the stack.
10. A ret is used to pop the instruction pointer and jump to it.
The above implementation suffers from various inefficiencies. The most glaring one is the call to sigprocmask, which alone wastes thousands of cycles. Unfortunately the POSIX standard requires it. Boost.Coroutine does not deal with the signal mask and considers it like any other shared resource: it is the responsibility of the user to guard against unsafe access to it. By simply removing the call, the function can be sped up by three orders of magnitude.
We can do better. Saving and restoring a segment register is an expensive operation, because not only the register content but also the segment descriptor entry from the segment table must be reloaded. The Linux operating system prevents the user from changing the FS register, thus we should be able to safely omit saving and restoring it.
We also do not need to save the floating point environment, which should be considered shared state. This saves many cycles, as saving it is an expensive operation too.
Finally, we do not need to save all general purpose registers. The
Linux calling conventions state that EAX, ECX and EDX are callee
clobbered, and the caller should not expect these to be preserved. This
is also true of the floating point stack, which is required to be empty
when calling a function (and in fact makecontext acknowledges this
by not saving the floating point register stack).
Here we will present the swapcontext implementation used by
Boost.Coroutine on Linux x86 systems. Note that this implementation is
not derived from glibc and has been independently developed. Also
note that this is not a drop-in replacement for the glibc swapcontext. The
C++ prototype for this function is:
extern "C" void swapcontext_stack (void***, void**) throw() __attribute__((regparm(2)));
__attribute__((regparm(2))) is a GCC extension that requires the first
two parameters to be passed in registers. The first parameter is a pointer to a pointer
to the destination stack (here identified as an array of void pointers for
simplicity), while the second is a pointer to the source stack. In
practice the first is a pointer to the memory area where the
destination stack pointer is stored, and the second is the stack
pointer that will be restored.
This is the body of swapcontext_stack:

    pushl %ebp          # save the callee-saved registers
    pushl %ebx          #  on the old stack
    pushl %esi
    pushl %edi
    movl  %esp, (%eax)  # store the old stack pointer
    movl  %edx, %esp    # load the new stack pointer
    popl  %edi          # restore the callee-saved registers
    popl  %esi          #  from the new stack
    popl  %ebx
    popl  %ebp
    ret                 # pop the return address and jump to it
This function requires EAX to point to the location where the
destination stack pointer will be stored, while EDX holds the new
stack pointer. It first saves all callee-saved registers on the old
stack, then saves the stack pointer in the location pointed to by
EAX, then loads EDX as the new stack pointer and restores the
callee-saved registers from the new stack. The final ret pops the
return address and jumps to it.
The number of instructions in this implementation is close to optimal,
and there are almost no register dependencies between them (all the pop
instructions depend on the ESP load, but substituting them with loads
of the form offset(%ecx) did not increase performance).
Still this function is not optimal. The last ret will always be
mispredicted by most CPUs. On NetBurst architectures (i.e. Pentium 4
and derived) this implies an overhead of at least 25 cycles (but very
often more than 50) to flush the pipeline.
Considering an unrealistic worst case of one instruction
per cycle for the previous function, the misprediction alone costs more
than two times the cycle count of the function itself.
Before showing how swapcontext_stack can be further optimized, we
need to understand a little how branch prediction works on modern
CPUs.
Most CPUs have special circuitry to predict complex patterns of conditional jumps, but they usually can only predict an indirect jump (i.e. a jump through a pointer) to go to the location the same instruction jumped to the last time (the CPU keeps a table that associates the address of a jump instruction with the address it jumped to the last time). Thus a jump that always goes to the same place is always predicted, while a jump that alternates between two different targets is always mispredicted.
swapcontext_stack is used both to call a coroutine and
to return from it. Consider a loop that repeatedly invokes a coroutine
and returns from it (for example a generator invoked by
std::generate): the indirect call will always be mispredicted.
ret instructions are usually treated specially by CPUs: instead
of being predicted to jump where they jumped the last time, a return
stack buffer is used to try to predict where the jump will
return. When a call is made, the caller address is pushed on the
return stack buffer, and when a ret is performed, the address at the
top of the stack is used to predict where the ret will go. This
means that the ret in swapcontext_stack will always be
mispredicted, because it never jumps back to the caller of
swapcontext_stack.
Finally, it seems that newer generations of processors could have more
advanced indirect branch prediction functionality. At least the
Pentium M seems to be able to predict simple patterns of indirect
jumps.
We have seen that the first step to optimize swapcontext_stack is to
replace the ret with a popl %ecx; jmp *%ecx pair. This gives the
CPU a chance to predict the jump, but it is not enough. As a CPU will
predict the jump to go where it did the last time, we need to have
different jumps for each target. This is obviously not possible for
dynamic code, where at any point any coroutine could be invoked or a
coroutine could yield to any other. But when the same coroutine is
always called in a loop, the pattern is static and can be
optimized. If we use two different jumps to invoke and yield from the
coroutine, they will always be predicted. The simplest way to do that is
to duplicate the code of swapcontext_stack in two copies, the second
named swapcontext_swap_down. The first is used for the invocation, the
second for the yield. Other than that, the code is exactly the
same. Measurements show a performance increase of at least 50% in this
case.
In a dispatcher based scenario, the jump in the invoking function will
always be mispredicted, while the one in the yielding function will
always be predicted correctly to return to the dispatcher; thus, while
the win is smaller, it is still better than mispredicting every time. This
is why an "invoke + yield + invoke + yield" sequence is not necessarily slower
than "invoke + yield_to + yield".
If the compiler could inline swapcontext, we would have many more
jumps and a much bigger chance of their being predicted. Boost.Coroutine
contains experimental code to do that, but it is currently disabled
because the inline assembler code used is not yet believed to be
reliable.
The code is believed to work correctly with exceptions on systems that use the zero overhead exception handling model (as most GCC targets do today). In this model there are no pointers to exception chains that must be manipulated and restored on context switch.
We have seen one possible assembler implementation of
swapcontext. While the code is very system specific, it can easily
be ported to many more systems following a similar model. Also, the
analysis of the branch prediction functionality is by no means limited
to this specific architecture.
|Copyright © 2006 Giovanni P. Deretta|