In this section we will show an example of an assembly implementation of Boost.Coroutine's low level context switching. While the example is x86 and Linux specific, it can easily be generalized to other operating systems and CPUs. It is believed that the same code should work unmodified on BSD-derived systems.
Notice that the examples here use the AT&T assembler syntax instead of the more common Intel syntax. There are not many differences, except that the order of source and destination operands is reversed and the opcode name encodes the length of the operands.
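For example, the same two instructions read as follows in both conventions (a small illustration, not taken from the library sources):

movl %edx, %esp      # AT&T: source first, "l" suffix for 32 bit operands ("mov esp, edx" in Intel syntax)
movl %esp, (%eax)    # AT&T: store ESP at the address held in EAX ("mov [eax], esp" in Intel syntax)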
libc swapcontext implementation
The exploration of a possible stack switching implementation started from an analysis of the GNU glibc swapcontext implementation. We do not include the actual code here because of license issues, but we will comment on it. The actual code can be found in the file swapcontext.S of the glibc source archive.
swapcontext first loads the address of the buffer where the context will be saved; this address is taken from the stack, where it has been pushed as part of the call setup. This buffer will be called the destination buffer.
movl is used to copy the contents of all general purpose registers to the destination buffer. For EAX a dummy value is saved, because it will be clobbered by swapcontext.
The instruction pointer (EIP) is saved in the destination buffer. The value of this register is retrieved from the stack, where it had been pushed by the call swapcontext instruction.
The ESP stack pointer is saved in the destination buffer.
The FS segment register is saved in the destination buffer. Originally swapcontext also saved the GS register, but it was found that this conflicted with threading.
The floating point environment is saved with fnstenv in the destination buffer. This includes the control word, status word, tag word, instruction pointer, data pointer and last opcode, but excludes the floating point register stack. This is about 28 bytes of data.
The context is then restored from the source buffer, undoing the above operations in LIFO order.
The sigprocmask system call is invoked to restore the signal mask.
The FS segment register is restored from the source buffer.
The instruction pointer (EIP) is restored from the source buffer and pushed on the stack.
Finally, ret is used to pop the instruction pointer and jump to it.
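To give a feel for the shape of such a routine, the following is a rough, hedged sketch of a register-save sequence of this kind; it is not the glibc code, and the buffer offsets are invented for illustration:

movl 4(%esp), %eax      # address of the destination buffer, pushed by the caller
movl %ebx, 0(%eax)      # general purpose registers stored at fixed offsets
movl %esi, 4(%eax)
movl %edi, 8(%eax)
movl %ebp, 12(%eax)
movl %esp, 16(%eax)     # the stack pointer
movl (%esp), %ecx       # the return address pushed by the call instruction...
movl %ecx, 20(%eax)     # ...saved as the instruction pointer to resume from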
Optimizing swapcontext
The above implementation suffers from various inefficiencies. The most glaring one is the call to sigprocmask, which alone wastes thousands of cycles. Unfortunately the POSIX standard requires it. Boost.Coroutine does not deal with the signal mask and considers it like any other shared resource: it is the responsibility of the user to guard against unsafe access to it. By simply removing the call, the function can be sped up by three orders of magnitude.
We can do better. Saving and restoring a segment register is an expensive operation, because it requires not only the register contents to be reloaded but also the segment descriptor entry from the segment table. The Linux operating system prevents user code from changing the FS register, so we should be able to safely omit saving and restoring it.
We also do not need to save the floating point environment, which should be considered shared state. This saves many cycles, as saving and restoring it is an expensive operation too.
Finally, we do not need to save all general purpose registers. The Linux calling convention states that EAX, ECX and EDX are callee-clobbered, and the caller should not expect them to be preserved. The same is true of the floating point register stack, which is required to be empty when calling a function (and in fact swapcontext acknowledges this by not saving the floating point register stack).
swapcontext_stack
Here we will present the swapcontext
implementation used by
Boost.Coroutine on Linux x86 systems. Note that this implementation is
not derived from glibc
and has been independently developed. Also
note that this is not a drop-in replacement for swapcontext
.
The C++
prototype for this function is:
extern "C" void swapcontext_stack (void***, void**) throw()
__attribute((regparm(2)));
Where __attribute((regparm(2))) is a GCC extension that requires the parameters to be passed in registers. The first parameter is a pointer to a pointer
to the destination stack (here identified as an array of void pointers for
simplicity), while the second is a pointer to the source stack. In
practice the first is a pointer to the memory area where the
destination stack pointer is stored and the second is the stack
pointer that will be restored.
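With regparm(2), GCC passes the first two integer arguments in EAX and EDX, so a call to swapcontext_stack could look roughly like this (a hedged sketch; the operand names are invented for illustration):

movl $saved_sp_slot, %eax    # first argument: where the current stack pointer will be stored
movl new_sp, %edx            # second argument: the stack pointer to switch to
call swapcontext_stack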
This is the body of swapcontext_stack:
pushl %ebp           # save the callee-saved registers on the current stack
pushl %ebx
pushl %esi
pushl %edi
movl %esp, (%eax)    # store the old stack pointer where EAX points
movl %edx, %esp      # switch to the new stack
popl %edi            # restore the callee-saved registers from the new stack
popl %esi
popl %ebx
popl %ebp
ret                  # pop the return address and jump to it
This function requires EAX to point to the destination stack pointer, while EDX is the new stack pointer. swapcontext_stack first saves the callee-saved registers on the old stack, then saves the stack pointer in the location pointed to by EAX, then loads EDX as the new stack pointer and restores the callee-saved registers from the new stack. The final ret pops the return address and jumps to it.
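For the very first switch into a newly created coroutine there is no previously saved frame, so one has to be manufactured. The following is a hedged sketch of how such a stack could be prepared; it is not Boost.Coroutine's actual bootstrap code, coroutine_entry is a hypothetical startup routine, and EAX is assumed to hold the top of the freshly allocated stack area:

leal -20(%eax), %eax               # room for four register slots plus a return address
movl $coroutine_entry, 16(%eax)    # consumed by the final ret on the first switch
movl $0, 12(%eax)                  # dummy EBP
movl $0, 8(%eax)                   # dummy EBX
movl $0, 4(%eax)                   # dummy ESI
movl $0, 0(%eax)                   # dummy EDI

After this, EAX holds the value to use as the new stack pointer: the four pops will load the dummy values and the ret will enter coroutine_entry.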
The number of instructions in swapcontext_stack is close to optimal, and there are almost no register dependencies between them (all popl instructions depend on the ESP load, but substituting them with movl offset(%ecx) loads did not increase performance).
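For reference, that rejected alternative would look roughly like the following hedged sketch, which assumes the saved stack pointer is first copied into ECX as the base register:

movl %edx, %ecx          # base register for the displaced loads
movl 0(%ecx), %edi       # restore the callee-saved registers without popping
movl 4(%ecx), %esi
movl 8(%ecx), %ebx
movl 12(%ecx), %ebp
leal 16(%ecx), %esp      # point ESP at the saved return address
ret                      # pop it and jump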
Still, this function is not optimal. The last ret will always be mispredicted by most CPUs. On NetBurst architectures (i.e. Pentium 4 and derived) this implies an overhead of at least 25 cycles (but very often more than 50) to flush the pipeline.
Considering an unrealistic worst case of one instruction
per cycle for the previous function, the misprediction alone is more
than two times the cycle count of swapcontext_stack
itself.
Before showing how swapcontext_stack can be further optimized, we need to understand a little about how branch prediction works on modern architectures.
Most CPUs have special circuitry to predict complex patterns of conditional jumps, but they can usually only predict an indirect jump (i.e. a jump through a pointer) to go to the same location that instruction jumped to the last time: the CPU keeps a table that associates the address of a jump instruction with the address it last jumped to. Thus a jump that always goes to the same place is always predicted correctly, while a jump that alternates between two different targets is always mispredicted.
For example, swapcontext_stack is used both to call a coroutine and to return from it. Consider a loop that repeatedly invokes a coroutine and returns from it (for example a generator invoked by std::generate): the indirect call will always be mispredicted.
ret instructions are usually treated specially by CPUs: instead of being predicted to jump where they jumped the last time, a return stack buffer is used to predict where the ret will go. When a call is made, the return address is pushed on the return stack buffer, and when a ret is performed, the address at the top of this buffer is used as the predicted target. This means that the ret in swapcontext_stack will always be mispredicted, because it never jumps back to the caller of swapcontext_stack.
Finally, it seems that new generations of processors could have more
advanced indirect branch prediction functionality. At least the
Pentium M
seems to be able to predict simple patterns of indirect
jumps.
For reference see [Intel06] and [Fog06].
Getting rid of ret
We have seen that the first step to optimize swapcontext_stack is to substitute ret with a popl %ecx; jmp *%ecx pair. This gives the CPU a chance to predict the jump, but it is not enough. As a CPU will predict an indirect jump to go where it went the last time, we need a different jump for each target. This is obviously not possible for dynamic code, where at any point any coroutine could be invoked or a coroutine could yield to any other. But when the same coroutine is always called in a loop, the pattern is static and can be optimized. If we use two different jumps to invoke and to yield from the coroutine, both will always be predicted. The simplest way to do that is to duplicate the code of swapcontext_stack into swapcontext_stack_up and swapcontext_stack_down. The first is used for the invocation, the second for the yield. Other than that, the code is exactly the same. Measurements show a performance increase of at least 50% in the previous scenario.
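A hedged sketch of what one of the two copies could look like with the ret replaced (the label follows the naming above; the actual library sources may differ):

swapcontext_stack_up:
pushl %ebp
pushl %ebx
pushl %esi
pushl %edi
movl %esp, (%eax)
movl %edx, %esp
popl %edi
popl %esi
popl %ebx
popl %ebp
popl %ecx            # pop the return address explicitly (ECX is callee-clobbered)
jmp *%ecx            # indirect jump, predicted per jump site

swapcontext_stack_down is identical except for its name, so each direction of the switch gets its own jump instruction for the predictor to track.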
In a dispatcher based scenario, the jump in swapcontext_stack_up will always be mispredicted, while the one in swapcontext_stack_down will always be correctly predicted to return to the dispatcher; thus, while the win is smaller, it is still better than mispredicting every time. This is why an "invoke + yield + invoke + yield" sequence is not necessarily slower than "invoke + yield_to + yield".
Inlining swapcontext
If the compiler could inline swapcontext, we would have many more jumps and a much bigger chance of them being predicted. Boost.Coroutine contains experimental code to do that, but it is currently disabled because the inline assembler code used is not yet believed to be completely safe.
The code is believed to work correctly with exceptions on systems that use the zero overhead exception handling model (as do most GCC targets today). In this model there are no pointers to exception chains to be manipulated and restored on context switch.
We have seen one possible assembler implementation of swapcontext. While the code is very system specific, it could easily be ported to many more systems following a similar model. Also, the analysis of the branch prediction functionality is by no means limited to IA32 CPUs.
Copyright © 2006 Giovanni P. Deretta