Interaction between coroutines and threads

Introduction

During the planning stage of the library, it was believed that the ability to migrate a coroutine from one thread to another was a desirable property, and even a necessary one to take full advantage of the completion port abstraction provided by the Win32 API. During the implementation stage it became apparent that guaranteeing this property was going to be a considerable challenge.

In the end, the decision was taken to prohibit migration. This section shows why allowing coroutine migration is infeasible with current compilers and standard libraries.

The problems

One of the problems with migrating coroutines is the handling of thread local storage. When a thread local object is accessed, the copy that belongs to the current thread is the one accessed. Consider the following code (it is plain C to simplify the generated assembler output, but the problem is by no means restricted to C):

#include <stdio.h>

__thread int test;

void bar();

int foo () {
  while(1) {
    bar();
    printf("%p", (void *)&test);
  }
}

The __thread storage class is a GCC extension that marks a global object as having thread specific storage. Most compilers that support threaded applications have similar facilities, albeit with slightly different syntax. Let us suppose that every time bar() is invoked, foo() is suspended and then resumed in another thread. We would expect printf() to print a different address for test at every iteration, as every thread has its own instance. For this function GCC generates the following assembler output (irrelevant parts have been omitted):

.L2:
        call    bar
        movl    %gs:0, %eax
        leal    test@NTPOFF(%eax), %eax
        pushl   %eax
        pushl   $.LC0
        call    printf
        popl    %eax
        popl    %edx
        jmp     .L2

This is straightforward. The first instruction calls bar. The second loads the address of the TLS area from the thread register (GCC uses the GS segment register as a thread register), and the third loads the address of the current thread's instance of test into EAX. The fourth and fifth instructions push the parameters for printf onto the stack ($.LC0 is the symbol that contains the string "%p"), and the sixth calls it. The seventh and eighth instructions pop the arguments from the stack, and the last one jumps back to the first.

This code does the right thing: at every iteration it prints a new value for the address of test. If we compile at a higher optimization level, things are no longer fine:

        movl    %gs:0, %eax
        leal    test@NTPOFF(%eax), %ebx
.L2:
        call    bar
        pushl   %ebx
        pushl   $.LC0
        call    printf
        popl    %eax
        popl    %edx
        jmp     .L2

Even at an optimization level as low as -O1 (usually considered safe), the compiler hoists the computation of the address of test out of the loop. Now the loop will always print the same value.

Unfortunately this specific compiler provides no switch to disable this specific optimization. Other compilers might do the same thing. The only compiler we know of that provides a switch to explicitly disable this optimization is Visual C++, as that compiler is often used with code that uses fibers.
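The consequence of a hoisted address can be reproduced without coroutines. The following is a minimal sketch, assuming a POSIX threads environment and GCC's __thread extension; the names observation, resume_elsewhere and observe_stale_tls_pointer are hypothetical and exist only for this illustration. An address taken on one thread stands in for the load hoisted out of the loop, and a second thread stands in for the thread on which the coroutine would be resumed.

```c
#include <pthread.h>
#include <stddef.h>

static __thread int test;   /* one instance per thread, as in the example above */

/* Hypothetical record of what the second thread sees. */
struct observation {
    int *cached;        /* address of test taken on the first thread */
    int via_cached;     /* value read through the cached address */
    int via_name;       /* value read by naming test directly */
    int same_address;   /* 1 if the cached address equals &test on the second thread */
};

/* Runs on the second thread; stands in for code resumed after a migration. */
static void *resume_elsewhere(void *arg) {
    struct observation *obs = arg;
    test = 2;                        /* writes this thread's own instance */
    obs->via_cached = *obs->cached;  /* still reads the first thread's instance */
    obs->via_name = test;
    obs->same_address = (obs->cached == &test);
    return NULL;
}

/* Returns 0 on success, -1 if the second thread could not be run. */
int observe_stale_tls_pointer(struct observation *obs) {
    pthread_t second;
    test = 1;
    obs->cached = &test;  /* stands in for the address hoisted out of the loop */
    if (pthread_create(&second, NULL, resume_elsewhere, obs) != 0)
        return -1;
    if (pthread_join(second, NULL) != 0)
        return -1;
    return 0;
}
```

When built with something like gcc -pthread, the second thread reads 1 through the cached address and 2 through the name test, and the two addresses differ. A migrated coroutine that kept using a hoisted address would silently touch the instance owned by the thread it left.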

It might be argued that __thread is not part of the C++ standard, so its handling is undefined anyway. Putting aside the fact that something similar to __thread is likely to be part of the next release of the standard, abstaining from using it is not a solution. For example, on many systems the errno macro expands to a symbol declared with the equivalent of __thread. Thread local variables might also be used in standard library facilities (memory allocation is a very likely candidate), and an optimizer capable of inlining library functions might hoist loads of those variables out of loops, or at least move them across yield points.
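The point about errno can be checked with a short sketch, assuming a POSIX system where each thread has its own errno; store_errno_address and errno_is_per_thread are hypothetical helper names used only here. The code never mentions __thread, yet it depends on thread specific storage all the same.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Runs on the worker thread: records where that thread's errno lives. */
static void *store_errno_address(void *out) {
    *(uintptr_t *)out = (uintptr_t)&errno;
    return NULL;
}

/* Returns 1 if the worker thread's errno is a different object from the
   caller's errno, 0 if it is the same object, and -1 if the worker thread
   could not be run. */
int errno_is_per_thread(void) {
    pthread_t worker;
    uintptr_t worker_addr = 0;

    if (pthread_create(&worker, NULL, store_errno_address, &worker_addr) != 0)
        return -1;
    if (pthread_join(worker, NULL) != 0)
        return -1;

    /* The addresses are compared as integers because the worker's errno
       no longer exists once the worker has finished. */
    return worker_addr != (uintptr_t)&errno;
}
```

On such a system the function returns 1, so any code that reads or writes errno around a yield point is exposed to the same hazard as the explicit __thread example.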

Fixing compilers is unfortunately not enough. Operating systems might need to be fixed too; consider the following code:

mutex mtx;

void bar();

void foo() {
  lock(mtx);
  bar();
  unlock(mtx);
}

Here mutex is some synchronization primitive, and bar() is a function that may migrate the current coroutine to another thread. Aside from the fact that it is bad practice to hold a lock across a yield point, many operating systems require a mutex to be unlocked by the same thread that locked it, which breaks the code above.
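The ownership requirement can be observed with POSIX threads. This is a minimal sketch, assuming a POSIX system and an error-checking mutex (PTHREAD_MUTEX_ERRORCHECK), for which the unlock call reports an error instead of leaving the behavior undefined; unlock_from_other_thread and unlock_after_migration are hypothetical names. The second thread stands in for the thread on which foo() would continue after bar() migrated the coroutine.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t mtx;

/* Runs on the second thread: attempts the unlock that foo() would perform
   after being resumed on a different thread. */
static void *unlock_from_other_thread(void *result) {
    *(int *)result = pthread_mutex_unlock(&mtx);
    return NULL;
}

/* Locks mtx on the calling thread and tries to unlock it from another one.
   Returns the value of that unlock attempt, or -1 on a setup failure. */
int unlock_after_migration(void) {
    pthread_mutexattr_t attr;
    pthread_t other;
    int result = -1;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK) != 0 ||
        pthread_mutex_init(&mtx, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&mtx);                 /* lock(mtx) on the first thread */
    if (pthread_create(&other, NULL, unlock_from_other_thread, &result) == 0)
        pthread_join(other, NULL);            /* unlock(mtx) tried elsewhere */
    pthread_mutex_unlock(&mtx);               /* the owner releases the lock */
    pthread_mutex_destroy(&mtx);
    return result;
}
```

With an error-checking mutex the attempt fails with EPERM and the mutex stays locked by its original owner. With a default mutex POSIX leaves the same attempt undefined, which is worse for a migrated coroutine than a reported error.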

Conclusion

The above scenarios are just two examples. There are many other ways in which coroutine migration could break otherwise perfectly fine code. For reference, see this blog post about using fibers in .NET code and this MSDN article about the perils of fiber mode in SQL Server.

In the end Boost.Coroutine provides the only thread safety guarantees that are believed to be safe on all systems. Note that, as coroutines are not to be shared between threads, internal reference counting is not thread safe (it doesn't necessarily use atomic operations).