

Case study 2: Linux-x86-GCC


In this section we will show an example of an assembly implementation of Boost.Coroutine's low level context switching. While the example is specific to x86 and Linux, it can easily be generalized to other operating systems and CPUs. The same code is believed to work unmodified on BSD-derived systems.

Notice that the examples here use the AT&T assembler syntax instead of the more common Intel syntax. There are few differences, the main ones being that the order of the source and destination operands is reversed and that the opcode name encodes the length of the operands. For example, the AT&T instruction movl %eax, %ebx corresponds to the Intel mov ebx, eax.

Initial code, libc swapcontext implementation

The exploration of a possible stack switching implementation started from an analysis of the GNU glibc swapcontext implementation. We do not include the actual code here because of license issues, but we will comment on it; the actual code can be found in the file swapcontext.S of the glibc source archive. The routine performs the following steps (a minimal usage sketch of the API it implements follows the list):

  1. swapcontext first loads from the stack the address of the buffer where the context will be saved; it was pushed there as part of the call setup. This buffer will be called the destination buffer.
  2. Then movl is used to copy the contents of all general purpose registers to the destination buffer. For EAX a dummy value is saved, because it will be clobbered by swapcontext anyway.
  3. The value of the instruction pointer at the time of the call to swapcontext is saved in the destination buffer. The value is retrieved from the stack, where it had been pushed by the call swapcontext instruction.
  4. The ESP stack pointer is saved in the destination buffer.
  5. Then the FS segment register is saved in the destination buffer. Originally swapcontext also saved the GS register, but it was found that this conflicted with threading.
  6. The floating point environment is saved in the destination buffer with the fnstenv instruction. This includes the control word, status word, tag word, instruction pointer, data pointer and last opcode, but excludes the floating point register stack. This amounts to about 28 bytes of data.
  7. The address of the structure that will be restored is loaded from the stack. This will be called the source buffer. The operations above are now performed in reverse.
  8. The current signal mask is saved in the destination buffer. The signal mask to be restored is loaded from the source buffer.
  9. The sigprocmask system call is invoked to restore the signal mask.
  10. The floating point environment is restored from the source buffer.
  11. The FS register is restored from the source buffer.
  12. The stack pointer is restored from the source buffer. In practice this switches stacks.
  13. The return address (EIP) is restored from the source buffer and pushed onto the stack.
  14. All general purpose registers are restored from the source buffer.
  15. ret is used to pop the instruction pointer and jump to it.
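
For reference, this is how the ucontext API that the steps above implement is used. The sketch below is a minimal, self-contained illustration (not library code): it creates a context for a coroutine on a private stack, switches to it and back.

#include <ucontext.h>
#include <cstdio>

static ucontext_t main_ctx, coro_ctx;
static char coro_stack[64 * 1024];

static void coro_body() {
    std::puts("in the coroutine");
    swapcontext(&coro_ctx, &main_ctx);   /* save here, resume main (steps 1-15 above) */
}

int main() {
    getcontext(&coro_ctx);                     /* initialize the context structure */
    coro_ctx.uc_stack.ss_sp = coro_stack;      /* give the coroutine its own stack */
    coro_ctx.uc_stack.ss_size = sizeof coro_stack;
    coro_ctx.uc_link = &main_ctx;              /* context to resume if coro_body returns */
    makecontext(&coro_ctx, coro_body, 0);
    swapcontext(&main_ctx, &coro_ctx);         /* save main's context, run the coroutine */
    std::puts("back in main");
}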

Optimizing swapcontext

The above implementation suffers from various inefficiencies. The most glaring one is the call to sigprocmask, which alone wastes thousands of cycles; unfortunately the POSIX standard requires it. Boost.Coroutine does not deal with the signal mask and treats it like any other shared resource: it is the responsibility of the user to guard against unsafe access to it. Simply removing the call speeds the function up by three orders of magnitude.
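
To get a feel for the cost of the mandated system call, one can time a loop of no-op sigprocmask invocations. The following sketch is purely illustrative (it is not part of the library) and simply restores the mask that is already in place:

#include <signal.h>
#include <time.h>
#include <cstdio>

int main() {
    sigset_t mask;
    sigprocmask(SIG_SETMASK, 0, &mask);      /* read the current signal mask */

    const long N = 100000;
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; ++i)
        sigprocmask(SIG_SETMASK, &mask, 0);  /* a no-op, but still a system call */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    std::printf("about %.0f ns per sigprocmask call\n", ns / N);
}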

We can do better. Saving and restoring a segment register is an expensive operation, because not only the register content but also the segment descriptor entry from the segment table must be reloaded. The Linux operating system prevents user code from changing the FS register, so we should be able to safely omit saving and restoring it.

We also do not need to save the floating point environment, which should likewise be considered shared state. This saves many cycles, as it too is an expensive operation.

Finally, we do not need to save all general purpose registers. The Linux calling convention states that EAX, ECX and EDX may be clobbered by the callee, so the caller cannot expect them to be preserved. The same is true of the floating point register stack, which is required to be empty when calling a function (and in fact swapcontext acknowledges this by not saving the floating point register stack).


The Boost.Coroutine swapcontext

Here we will present the swapcontext implementation used by Boost.Coroutine on Linux x86 systems. Note that this implementation is not derived from glibc and has been independently developed. Also note that this is not a drop-in replacement for swapcontext.

The C++ prototype for this function is:

extern "C" void swapcontext_stack (void***, void**) throw() __attribute__((regparm(2)));

Where __attribute__((regparm(2))) is a GCC extension that requires the parameters to be passed in registers. The first parameter is a pointer to a pointer to the destination stack (represented here as an array of void pointers for simplicity), while the second is a pointer to the source stack. In practice the first is a pointer to the memory area where the destination stack pointer will be stored, and the second is the stack pointer that will be restored.
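
To make the parameter convention concrete, here is a hypothetical caller-side sketch; the context structure and the names sp and switch_to are illustrative, not the library's actual types:

extern "C" void swapcontext_stack(void***, void**) throw()
    __attribute__((regparm(2)));

struct context {
    void** sp;   /* stack pointer last saved by swapcontext_stack */
};

/* Switch from `from` to `to`: EAX receives &from.sp (where the outgoing
   stack pointer will be stored), EDX receives to.sp (the stack pointer
   that will be restored). */
inline void switch_to(context& from, context& to) {
    swapcontext_stack(&from.sp, to.sp);
}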

This is the body of swapcontext_stack:

        pushl %ebp              /* save the callee-saved registers on the old stack */
        pushl %ebx
        pushl %esi
        pushl %edi
        movl  %esp, (%eax)      /* store the old stack pointer where EAX points */
        movl  %edx, %esp        /* load the new stack pointer */
        popl  %edi              /* restore the callee-saved registers from the new stack */
        popl  %esi
        popl  %ebx
        popl  %ebp
        ret                     /* pop the return address and jump to it */

This function requires EAX to point to the destination stack pointer, while EDX holds the new stack pointer. swapcontext_stack first saves all callee-saved registers on the old stack, then stores the old stack pointer in the location pointed to by EAX, loads EDX as the new stack pointer, and restores the callee-saved registers from the new stack. The final ret pops the return address and jumps to it.
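
For a brand new coroutine the stack must be pre-formatted so that this restore sequence finds sensible values to pop. A hypothetical setup helper (the name prepare_stack, and the use of a plain entry pointer rather than the library's actual trampoline, are assumptions for illustration) could look like this:

typedef void (*entry_point)();

/* Illustrative only: lay out a fresh stack so that switching to it pops
   four dummy register values and then "returns" into `entry`. */
void** prepare_stack(void** stack_top, entry_point entry) {
    *--stack_top = reinterpret_cast<void*>(entry); /* popped by ret: execution starts at entry */
    *--stack_top = 0;                              /* initial EBP */
    *--stack_top = 0;                              /* initial EBX */
    *--stack_top = 0;                              /* initial ESI */
    *--stack_top = 0;                              /* initial EDI */
    return stack_top;  /* store this as the coroutine's saved stack pointer */
}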

The number of instructions in this implementation is close to optimal, and there are no register dependencies between them (all the popl instructions depend on the load of ESP, but substituting them with movl instructions of the form movl offset(%ecx) did not increase performance).

Still, this function is not optimal: the last ret will always be mispredicted by most CPUs. On NetBurst architectures (i.e. the Pentium 4 and its derivatives) this implies an overhead of at least 25 cycles (and very often more than 50) to flush the pipeline. Even under the unrealistic assumption that the eleven instructions above execute in one cycle each, the misprediction penalty alone is more than twice the cycle count of swapcontext_stack itself.

Jump prediction

Before showing how swapcontext_stack can be further optimized, we need to understand a little about how branch prediction works on modern architectures.

Most CPUs have special circuitry to predict complex patterns of conditional jumps, but they can usually only predict an indirect jump (i.e. a jump through a pointer) to go to the location the same instruction jumped to the last time (the CPU keeps a table that associates the address of each jump instruction with the address it most recently jumped to). Thus a jump that always goes to the same place is always predicted correctly, while a jump that alternates between two different targets is always mispredicted.

For example, swapcontext_stack is used both to call a coroutine and to return from it. Consider a loop that repeatedly invokes a coroutine and returns from it (for example a generator invoked by std::generate): the indirect jump alternates between the coroutine body and its caller, so it will always be mispredicted.

ret instructions are usually treated specially by CPUs: instead of being predicted to jump where they jumped the last time, a return stack buffer is used to predict where the return will go. When a call is made, the return address is pushed onto the return stack buffer, and when a ret is executed, the address on top of the buffer is used as the prediction. This means that the ret in swapcontext_stack will always be mispredicted, because it never jumps back to the caller of swapcontext_stack.

Finally, it seems that newer generations of processors have more advanced indirect branch prediction: at least the Pentium M appears able to predict simple patterns of indirect jumps.

For reference see [Intel06] and [Fog06].

Optimizing the ret

We have seen that the first step in optimizing swapcontext_stack is to substitute the ret with a popl %ecx; jmp *%ecx pair. This gives the CPU a chance to predict the jump, but it is not enough: as the CPU predicts an indirect jump to go where it went the last time, we need a different jump instruction for each target. This is obviously not possible for dynamic code, where at any point any coroutine could be invoked or a coroutine could yield to any other. But when the same coroutine is always called in a loop, the pattern is static and can be optimized. If we use two different jumps, one to invoke the coroutine and one to yield from it, each will always be predicted correctly. The simplest way to do that is to duplicate the code of swapcontext_stack into swapcontext_stack_up and swapcontext_stack_down. The first is used for the invocation, the second for the yield; other than that, the code is exactly the same. Measurements show a performance increase of at least 50% in the previous scenario.
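
Sketching the result, reusing the hypothetical context type from the earlier sketch (the routine names follow the text, and the two bodies are byte-for-byte copies of swapcontext_stack):

extern "C" void swapcontext_stack_up(void***, void**) throw()
    __attribute__((regparm(2)));
extern "C" void swapcontext_stack_down(void***, void**) throw()
    __attribute__((regparm(2)));

/* Invocation always goes through the _up copy and yielding always through
   the _down copy, so in a tight invoke/yield loop each indirect jump has a
   single, stable target and is predicted. */
inline void invoke(context& caller, context& coro) {
    swapcontext_stack_up(&caller.sp, coro.sp);
}

inline void yield(context& coro, context& caller) {
    swapcontext_stack_down(&coro.sp, caller.sp);
}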

In a dispatcher based scenario, the jump in swapcontext_stack_up will always be mispredicted, while the one in swapcontext_stack_down will always be correctly predicted to return to the dispatcher; thus, while the win is smaller, it is still better than mispredicting every time. This is why an "invoke + yield + invoke + yield" sequence is not necessarily slower than "invoke + yield_to + yield".

Inlining swapcontext

If the compiler could inline swapcontext, there would be many more jump instructions, each with a much better chance of being predicted. Boost.Coroutine contains experimental code to do this, but it is currently disabled because the inline assembler code used is not yet believed to be completely safe.

Handling exceptions

The code is believed to work correctly with exceptions on systems that use the zero overhead exception handling model (as most GCC targets do today). In this model there are no pointers to exception chains that must be manipulated and restored on a context switch.


Conclusions

We have seen one possible assembler implementation of swapcontext. While the code is very system specific, it can easily be ported to many other systems following a similar model. Also, the analysis of the branch prediction machinery is by no means limited to IA32 CPUs.

Copyright 2006 Giovanni P. Deretta