Really excited to see how virtual threads are taken up by developers, and if the...

tomp · on March 22, 2023

The fact that they’re cooperatively scheduled makes them quite problematic.

Basically a leaky abstraction, you have to think whether your code will block or not.

Go went through the same process, before eventually figuring out that they do need to make their lightweight threads (goroutines) preemptive.

cozos · on March 22, 2023

Can you expand why its a leaky abstraction? Is it because virtual threads can only yield when its doing blocking operations?

tomp · on March 22, 2023

Yes.

When you're in a "tight loop" (e.g. a matrix multiplication, which is basically 3 nested loops that only load data, do math, write data), Java's virtual threads just won't yield. So if you write your app in the "wrong" way, you lose concurrency.

There's a lot of discussion about this from the Go side. The original issue was this one: runtime: tight loops should be preemptible https://github.com/golang/go/issues/10958

> it's possible to write a tight loop (e.g., a numerical kernel or a spin on an atomic) with no calls or allocation that arbitrarily delays preemption. This can result in arbitrarily long pause times as the GC waits for all goroutines to stop.

The proposed (and ultimately accepted, AFAIK) solution is described here: https://github.com/golang/go/issues/24543

> has put significant effort into prototyping cooperative preemption points in loops, which is one way to solve this problem. However, even sophisticated approaches to this led to unacceptable slow-downs in tight loops (where slow-downs are generally least acceptable).

> I propose that the Go implementation switch to non-cooperative preemption using stack and register maps at (essentially) every instruction. This would allow goroutines to be preempted without explicit preemption checks. This approach will solve the problem of delayed preemption with zero run-time overhead and have side benefits for debugger function calls

I 100% expect Java will have to do through the same evolution. But first they'll probably try to deny reality for a few years. Funny enough, same as has happened with Go and generics.

BlackFly · on March 22, 2023

I doubt it, since they already did that journey long ago and are now adding virtual threads next to ordinary threads that replaced the original green threads. If you put your long running work into the virtual kind of threadpool, a timing sentinel can easily warn you and after noticing that you easily use the normal threadpool instead.

VertX framework had such a sentinel, but migrating code from the async futures to a normal threadpool can be a bit tedious if your design is poor.

cesarb · on March 21, 2023

I'm personally not so sure. We already had something like that before (M:N threads), and the world moved away from it, towards letting the kernel manage everything (1:1 threads). So I'd expect instead that operating system kernels gain whatever features are missing for scaling to a higher number of threads, and everything once again goes back to each programming language thread corresponding to one kernel thread.

mike_hearn · on March 21, 2023

The problem of scaling threads up further is fundamental and not really solvable by more kernel features. JVM virtual threads can be efficient because the runtime has complete knowledge of the executing code and stack layouts, how the heap is laid out, how the GC works and it can control how code is compiled. The kernel can't do any of these things - it has to assume a process is a black box that could do anything with its stacks, could be compiled by anything and so on.

Note that this advantage obviously goes away the moment you call into native code. Then the JVM is in the same position as the kernel. It doesn't control the compiler or the stack any more, and so that's why a virtual thread becomes "pinned" at that point and you lose the efficiency (the JVM needs to acquire more kernel threads). Fortunately though the JVM ecosystem doesn't rely on native code all that much, so it should be rare in practice.

This advantage can be brought to other non-Java languages too via Truffle. Truffle languages are reimplemented on top of Java and when programs call into native code they have the option of calling into JIT compiled LLVM bitcode instead of real native code (or indeed any JVM bytecode library). In that situation the JVM remains in control and so things should in theory still be Loom-able. Not sure if that's currently true in practice, but it could be.

cesarb · on March 22, 2023

> JVM virtual threads can be efficient because the runtime has complete knowledge of the executing code and stack layouts, how the heap is laid out, how the GC works and it can control how code is compiled.

Forgive me for staying doubtful, but I recall hearing this same "the JVM can be very fast and efficient because its JIT has complete knowledge and control" spiel back in the 90s, and back then, anyone could clearly see that the JVM was not as fast compared to pre-compiled native code as it was being promised.

> The kernel can't do any of these things - it has to assume a process is a black box that could do anything with its stacks, could be compiled by anything and so on.

The kernel has to assume nothing; it can dictate how userspace processes behave. As an example, a process which plays too many games with its stacks, without kernel cooperation, will quickly find out that signals share the same stack unless the kernel is told to use an alternate stack. A process which uses a register declared in the platform ABI as being for kernel use will find out that it can be unpredictably overwritten on a context switch. There are things like shadow stacks and segment register bases which can only be manipulated when the kernel allows it. And so on.

Of course, for compatibility reasons, the current ABI allows userspace processes to do a lot of unpredictable things, but nothing prevents a new "highly scalable threads" process ABI, with stricter rules, from being developed if necessary. Or it could be that only a few cooperative additions to the userspace to kernel ABI are necessary; we already have things like the many options to the clone() system calls, the futex system call, restartable sequences, etc.

mike_hearn · on March 22, 2023

> the JVM was not as fast compared to pre-compiled native code as it was being promised

Well head-for-head Java will still lose to C++ in many benchmarks, but that's not really due to compiled code quality, it's more about language semantics. Java is very fast for the sort of language it currently is. The big wins for C++ are that Java doesn't have value types or support for vector operations. Both are under development, actually vector ops is basically done but it's waiting for support for value types (see discussion elsewhere).

Also GCd languages trend towards a functional style without much in-place mutation, whereas C++ trends in the opposite direction, so C++ will sometimes use the CPU cache more effectively just due to prevailing habits amongst programmers.

> The kernel has to assume nothing; it can dictate how userspace processes behave.

Yes in theory you could fuse the language VM with the kernel and research operating systems like MSR Singularity did that. But a normal kernel like NT, Linux or Darwin can't do this and not only for backwards compatibility. The JVM will do things like move a virtual thread stack back and forth from the garbage collected heap and do so on the fly. Unless the kernel contains a JIT compiler, GC and injects lots of runtime code into the app's process it's going to find it tricky to do the same. By the time you've done the same you haven't implemented better kernel threads, you've made the JVM run in the kernel.

cozos · on March 22, 2023

It's been a while since I read up on this, but my understanding is that with OS threads, during a context switch it has to pop the entire process stack, which in Java is 1MB by default. This is expensive. Virtual threads "context switches" have much more lightweight stacks because the JVM knows exactly what kind of state needs to be associated with the virtual thread and thats where the difference lies.

re-thc · on March 22, 2023

The JVM is comparable to pre-compiled native code these days. It has evolved and improved since the 90s.

kaba0 · on March 22, 2023

But the kernel doesn’t know the semantics — can that blocking call be turned into a non-blocking, or is blocking the wanted functionality here?

kaba0 · on March 21, 2023

M:N is not the interesting aspect of virtual threads at all, automagically turning blocking operations into non-blocking is - which has not really been tried before (with erlang and go being the first).

cryptonector · on March 21, 2023

GNU Pth had "automagically turning blocking operations into non-blocking" ages ago, and it wasn't the first.

I think what you probably had in mind is that C libraries of the 90s that did M:N threading didn't turn blocking operations into non-blocking?

Using blocking operations to switch contexts is really nothing new. Heck, the cooperative multi-tasking systems of the 80s (Mac, Amiga) all essentially did that for processes (not threads), and so did Unix in the 70s.

kaba0 · on March 21, 2023

> I think what you probably had in mind is that C libraries of the 90s that did M:N threading didn't turn blocking operations into non-blocking

Yes, mostly, though my history knowledge is definitely lacking so do correct me if I’m wrong.

But you are right, there was nothing fundamentally missing, probably just no good OS support for non-blocking IO calls in the early days? Though probably the IO-CPU ratio was also different, so the benefits were not as big?

cryptonector · on March 22, 2023

There were bad experiences with the M:N threading of the 90s in Solaris' and others' C libraries. Making those libraries make every file descriptor non-blocking behind the programmer's back was a tricky thing. Think about inheritance of file descriptors via fork() and exec() -- you could have one threaded process sharing an FD with a non-threaded process, and now even non-threaded processes' C library would have to poll(), and now add static linking with older C libraries to the mix and it just couldn't be done. So it wasn't done.

cryptonector · on March 22, 2023

Which makes me wonder why this can be done in Java or Erlang, and the answer is that those tend to be walled gardens from which one does not fork/exec.

ghostwriter · on March 21, 2023

> automagically turning blocking operations into non-blocking is - which has not really been tried before (with erlang and go being the first).

sorry, but Haskell has had it way before Go, with proper STM too. Neither Erlang nor Go are offering the same level of ergonomics for compile-time checked M:N threading.

severino · on March 22, 2023

> M:N is not the interesting aspect of virtual threads at all, automagically turning blocking operations into non-blocking is

I'm not very into this Loom virtual threads thing, but... what's the difference between this automagically conversion of blocking into non-blocking in a M:N model and a 1:1 one? I mean, couldn't the same be done with normal threads too?

kaba0 · on March 22, 2023

Well, to a degree this is also done by the OS, IO syscalls are frequent locations where the OS scheduler might decide to schedule another thread, but this is a very slow context switch (flushing caches, including TLB, the switch to kernel mode and back, and since heartbleed and alia it is even more expensive).

Loom implements every IO on top of a more modern async OS calls, and these virtual thread context switches are on the order of function calls, so the overhead and number of switches that can happen are much much lower.

severino · on March 22, 2023

Thanks!

pkolaczk · on March 21, 2023

> low resource usage of async/await

Will calling a coroutine do zero heap allocations like async in Rust?

> with the ease-of-use experience of threads

That's highly subjective. Threads usually require locking which is often hard to get performant and correct at the same time. Async/await allows to write concurrent code with no synchronization.

mike_hearn · on March 21, 2023

> Async/await allows to write concurrent code with no synchronization

Well you still need some sort of synchronization, because an "await" allows arbitrary other actions to occur. If an await is introduced in code you transitively call then you might find that some invariant you were expecting to hold has now changed across a call when it previously didn't. Fundamentally, locks are about making invariants atomic and that's independent of exactly how code is scheduled and when.

jayd16 · on March 22, 2023

This is impossible because of *gasp function coloring. You can't just make a sync function async without dealing with the fallout explicitly in c#.

mike_hearn · on March 22, 2023

You can make an async function contain an await at more places than it previously had. When writing code in colored languages your code tends towards lots of stuff being marked async, so more 'await' points being introduced can change behavior.

jayd16 · on March 22, 2023

You're saying something is bad because it could affect a race condition you had? Doesn't everything fall under that issue? That's not changing from invariant to variant.

mike_hearn · on March 22, 2023

I'm not saying it's bad, I'm saying that you can still have races and thus still need some form of synchronization even when using async/await. Whilst in simple cases you can preserve invariants just by carefully choosing where an await is done, as things get more complex you constantly run the risk that someone will introduce another await somewhere else (and maybe more async marked functions to enable that), without understanding that the 'await' can now run code that violates some invariants. Locks and other such mechanisms let you mark certain code as executing atomically regardless of scheduling.

matsemann · on March 21, 2023

> Async/await allows to write concurrent code with no synchronization.

Hmm,I don't see how async/awaits makes a difference. Care to explain?

Like, if you have multiple sources that can add or read from a queue, unless there is a single thread running all your async loops (ala python), you still need some synchronization. At least that's my experience using coroutines heavily in kotlin.

pkolaczk · on March 22, 2023

I'm talking from perspective of Rust's async/await implementation, I'm not sure if the same holds for other languages with async like C# or Kotlin. Nevertheless I can do a loop like this:

    let mut buffer = ... // create buffer for holding data
    let mut input: TcpStream = ... // connect to remote endpoint
    let mut output: TcpStream = ... // connect to remote endpoint
    loop {
       select! {
          _ = input.readable() => {
              input.try_read(&mut buffer)?;
          }
          _ = output.writable(), if !buffer.is_empty() {
              output.try_write(&mut buffer)?;
          }
       }
    }

The mutable buffer is shared between the part that reads from input and the part that writes to output, and reads/writes happen concurrently and independently. Yet there is no explicit locking anywhere!

Synchronization is achieved implicitly by the fact that sequential code executes in only one place at once, so when it executes the reading branch, it does not execute the writing branch. You can apply exactly same reasoning as with any single-threaded, sequential code.

You cannot model this easily with threads. If it was a single thread with blocking I/O, then it could block forever in one branch and stop reacting to events on other branches. If it were multiple threads, then they would somehow need to synchronize accesses explicitly to the shared buffer.

jayd16 · on March 22, 2023

The synchronization is handled by the runtime. While you could still have concurrent access issues, async tasks will not deadlock in c#, unlike lock mechanisms. You don't get concurrency for free, but the finer details of lock management are avoided.

rr808 · on March 22, 2023

Anything is better than asynchronous.