Understanding the Go Scheduler: How Goroutines Are Managed
Go is known for its simplicity and its powerful built-in concurrency model. At the heart of this concurrency is the concept of goroutines and the Go scheduler. In this post, we’ll explore how the Go scheduler works, what makes goroutines so efficient, and how they are mapped to system threads using Go’s N:M scheduling model.
Goroutines: Lightweight Threads
Goroutines are lightweight threads managed by the Go runtime. They allow developers to write concurrent programs easily with a simple API:
go func() {
    // concurrent execution
}()
You can start a goroutine by prefixing a function call with the go keyword. Here’s an example:
package main

import (
    "fmt"
    "time"
)

func say(s string) {
    for i := 0; i < 5; i++ {
        time.Sleep(100 * time.Millisecond)
        fmt.Println(s)
    }
}

func main() {
    go say("world")
    say("hello")
}
Go Runtime and Binary Structure
When we compile a Go program, the resulting binary consists of two logical parts:
- Application code: Your own Go code
- Go runtime: Manages goroutines, memory, system calls, and scheduling
For example, when we use the go statement, the compiler translates it into a call into the runtime:
runtime.newproc(...)
This is where a new goroutine is created. The Go runtime then works with the operating system, which ultimately runs the threads on physical CPU cores.
Mapping Goroutines to OS Threads
+------------------+
| Goroutines (G) | <- User space
+------------------+
↓
+------------------+
| OS Threads (M) | <- Kernel space
+------------------+
↓
+------------------+
| CPU Cores (C) | <- Hardware
+------------------+
Key points:
- Goroutines live in user space
- OS threads (M) live in kernel space
- CPUs execute OS threads
- You can have more goroutines than OS threads, and more OS threads than CPU cores
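To get a feel for how lightweight goroutines are, here is a small, self-contained experiment (the 100,000 count is arbitrary): it starts far more goroutines than any machine has OS threads or CPU cores, and all of them coexist comfortably.

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    start := make(chan struct{})

    for i := 0; i < 100_000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            <-start // park here so all goroutines exist at the same time
        }()
    }

    fmt.Println("goroutines alive:", runtime.NumGoroutine()) // ~100,001
    fmt.Println("logical CPU cores:", runtime.NumCPU())

    close(start) // release them all
    wg.Wait()
}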
N:M Scheduling Model
Go uses an N:M scheduler, meaning:
- N goroutines
- M OS threads
- Goroutines are multiplexed over OS threads
Why is this useful?
Suppose a goroutine blocks on a system call (e.g., network I/O). The OS thread it’s on also blocks. But with N:M scheduling, the Go scheduler can move other goroutines to another available thread so they can continue running.
Scheduling with Runqueues
To manage goroutines, Go uses runqueues.
- Global Runqueue: A shared queue of goroutines
- Local Runqueue: Each OS thread (technically, each P or “Processor” in Go) has a local runqueue of up to 256 goroutines
Here’s a simplified diagram of scheduling:
Global Runqueue
+------------------+
| G1, G2, G3, ... |
+------------------+
|
| +-----------------+
+--> | OS Thread M1 | --> CPU Core 1
| Local: G4, G5 |
+-----------------+
| +-----------------+
+--> | OS Thread M2 | --> CPU Core 2
| Local: G6, G7 |
+-----------------+
The scheduler prefers local runqueues (faster access) but will fall back to the global runqueue when needed.
GOMAXPROCS and Execution Limits
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit.
In other words, GOMAXPROCS limits the number of threads that are actively executing Go code. Threads blocked in system calls or otherwise idle are not counted, which means you can have a small number of executing threads and many others blocked in system calls.
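A quick way to inspect these settings from a program: runtime.GOMAXPROCS reads (and optionally changes) the limit, and runtime.NumCPU reports the number of logical CPUs. A minimal sketch:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Calling GOMAXPROCS with 0 reports the current value without changing it.
    fmt.Println("GOMAXPROCS:  ", runtime.GOMAXPROCS(0))
    fmt.Println("logical CPUs:", runtime.NumCPU())

    // Setting it to 1 means at most one thread executes Go code at a time;
    // threads blocked in system calls still don't count against this limit.
    prev := runtime.GOMAXPROCS(1)
    fmt.Println("previous value:", prev)
}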
Each executing thread maintains per-thread state, including a local runqueue. Blocked threads don't need a runqueue, because the goroutines associated with them can be picked up and run by other threads. Maintaining that state for idle or blocked threads wastes memory and limits how well the scheduler scales.
Introducing Processors (P)
To solve these inefficiencies, Go introduces a new abstraction: the Processor (P).
A P is a heap-allocated structure responsible for executing Go code. It is associated with an OS thread (M) and owns the local runqueue. The state that was previously maintained per thread is now bound to a processor instead.
The Go scheduler becomes work-stealing: if one processor runs out of goroutines, it can steal work from another processor’s runqueue. This is now scalable, because:
- You only need to check a bounded number of processors (equal to GOMAXPROCS)
- You don’t check all OS threads (which can be unbounded)
So now the architecture looks like this:
+---------------------------+
| Global Runqueue (G) |
+---------------------------+
|
+----+----+----+----+
| | |
+---v--+ +--v---+ +--v---+
| P1 | | P2 | | P3 | ... (GOMAXPROCS)
| RunQ | | RunQ | | RunQ |
+--+---+ +--+---+ +--+---+
| | |
+--v--+ +--v--+ +--v--+
| M1 | | M2 | | M3 |
+-----+ +-----+ +-----+
Fairness and the Convoy Effect
Before moving on, let’s discuss fairness in scheduling.
Imagine a supermarket with a single cashier and a line of customers. Each customer has items to check out. If one customer has many items, the rest will wait longer. This is known as the convoy effect—one long task delays everyone else.
+--------+ +--------+ +--------+
| Cust 1 | ---> | Cust 2 | ---> | Cust 3 |
| 20 itm | | 2 itm | | 5 itm |
+--------+ +--------+ +--------+
|
Cashier (Resource)
In FIFO scheduling, this can lead to unfair resource distribution.
Preemption: Breaking the Convoy
To combat this, we use preemption—temporarily stopping a task so others can run. There are two kinds:
- Cooperative Preemption: The task voluntarily yields the resource.
- Efficient, but not reliable if tasks misbehave.
- Non-Cooperative Preemption: The scheduler forcefully stops a task after a fixed time slice.
- Ensures fairness, though it introduces context-switching overhead.
Go's scheduler relies on non-cooperative preemption (alongside cooperative scheduling points) to ensure that long-running goroutines don't hog the CPU. The scheduler cuts them off after a time slice and reschedules them, allowing other goroutines to proceed.
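As a point of contrast, Go does expose an explicit cooperative yield, runtime.Gosched(), which a goroutine can call to voluntarily give up its processor. A tiny illustration (the yield is a hint to the scheduler, not a synchronization guarantee):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    go fmt.Println("world")

    // Voluntarily yield so the freshly spawned goroutine gets a chance to run.
    runtime.Gosched()

    fmt.Println("hello")
}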
How the Scheduler Picks a Goroutine to Run
Now that we have enough context, a natural question arises: how does the scheduler choose which goroutine to run?
Each processor (P) has a local runqueue, and there is also a global runqueue shared by all processors. Let’s say the currently executing goroutine on a processor completes or exits by calling into the runtime. At this point, the processor must find a new goroutine to execute.
The first place it checks is its local runqueue. Accessing this queue does not involve any locking, so it’s fast and contention-free. If goroutines are available there, one is dequeued and begins execution. This path is the fastest and most efficient.
But what happens if the local runqueue is empty?
Global Runqueue and Load Sharing
If the processor’s local runqueue is empty, it next checks the global runqueue. Because this queue is shared across processors, accessing it requires acquiring a lock, which introduces contention. To minimize this, Go employs some clever strategies.
When a processor pulls work from the global runqueue, it doesn't take just one goroutine. Instead, it takes a batch whose size is proportional to:
len(globalRunQueue) / GOMAXPROCS
This strategy ensures fairness across processors and reduces frequent locking by allowing each processor to grab a fair share of work.
Once the batch is moved into the local runqueue, the processor resumes by executing from there.
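Roughly, the batch size is computed along these lines (a simplified sketch, not the actual runtime code; globalBatchSize and its parameters are illustrative names):

// globalBatchSize sketches how many goroutines a processor moves from the
// global runqueue into its local one in a single locked access.
// localCap is the local runqueue capacity (256 in the current runtime).
func globalBatchSize(globalLen, gomaxprocs, localCap int) int {
    n := globalLen/gomaxprocs + 1
    if n > globalLen {
        n = globalLen // never take more than what is actually queued
    }
    if n > localCap/2 {
        n = localCap / 2 // leave room in the local runqueue
    }
    return n
}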
Checking the Netpoller for Asynchronous I/O
If the global runqueue is also empty, the scheduler looks for another source of goroutines: the netpoller.
The netpoller is responsible for asynchronous I/O, especially network operations. Instead of blocking an OS thread while waiting for data, goroutines register with the netpoller. When the I/O is ready, the netpoller wakes the goroutine.
So, if the global and local runqueues are empty, the processor checks the netpoller. If any goroutines are ready due to I/O events, they’re picked up and executed.
A similar non-blocking strategy is used for timers and channel operations: goroutines waiting on them are parked and later woken, without tying up OS threads.
Work Stealing from Other Processors
If the netpoller also yields no work, the scheduler moves to its last resort: stealing work from other processors.
Here’s how it works:
- A processor randomly selects another processor.
- If the selected processor has goroutines in its local runqueue, the idle processor steals half of them.
- If the chosen processor has no work (or the picker chooses itself), the process retries up to 4 times.
- Once goroutines are stolen, they are added to the local runqueue and execution resumes.
This bounded and randomized stealing avoids inefficiencies seen in unbounded thread pools and helps balance work across all available processors.
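Putting the whole search order together, the logic looks roughly like the sketch below. All identifiers are hypothetical stand-ins for the runtime's internals, reduced to show only the order of the checks (the periodic global-runqueue poll for fairness is covered later in this post).

package sched

// G is a stand-in for a runnable goroutine.
type G struct{}

// Stubs standing in for the real runtime operations.
func localRunqGet() *G     { return nil } // lock-free pop from this P's queue
func globalRunqGet() *G    { return nil } // pop from the shared queue (locked)
func netpollReady() *G     { return nil } // a goroutine whose I/O completed
func stealFromRandomP() *G { return nil } // take half of a random P's queue

func findRunnable() *G {
    // 1. Fast path: the processor's local runqueue, no locking needed.
    if g := localRunqGet(); g != nil {
        return g
    }
    // 2. The shared global runqueue (requires the scheduler lock).
    if g := globalRunqGet(); g != nil {
        return g
    }
    // 3. The netpoller: pick up goroutines whose network I/O is ready.
    if g := netpollReady(); g != nil {
        return g
    }
    // 4. Last resort: steal from other processors, retrying up to 4 times.
    for i := 0; i < 4; i++ {
        if g := stealFromRandomP(); g != nil {
            return g
        }
    }
    return nil // no work anywhere: the thread will be parked
}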
Preemption and Go 1.14: Time-Sliced Scheduling
A significant enhancement came in Go 1.14 with the introduction of non-cooperative (asynchronous) preemption.
Before Go 1.14, goroutines were only preempted cooperatively, meaning they had to call into the runtime (e.g., via function calls) to yield control. Long-running goroutines without such calls could starve other goroutines.
With Go 1.14, each goroutine is granted a 10ms time slice. If it exceeds this without yielding, the runtime forcefully issues a preemption request.
Preemption is implemented via a signal (SIGURG) that the runtime sends to the OS thread running the goroutine. When the signal is received, the goroutine is paused and made to yield execution.
This ensures fair CPU time for all goroutines, improves responsiveness, and mitigates starvation.
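A classic way to see the difference (assuming Go 1.14 or newer): with a single processor, a goroutine spinning in a tight loop with no function calls used to starve main forever; with asynchronous preemption, main still gets CPU time.

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    runtime.GOMAXPROCS(1) // a single processor makes the effect easy to see

    go func() {
        for {
            // A busy loop with no function calls: before Go 1.14 this
            // offered no preemption point, so it could hog the processor.
        }
    }()

    time.Sleep(100 * time.Millisecond)
    fmt.Println("main still ran, despite the busy goroutine")
}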
Another important question is: who actually sends the preemption signal, and where does it originate?
This responsibility falls on a long-running daemon in the Go runtime called sysmon. It runs on its own OS thread, without a processor (P), so it never competes with user-level Go code for execution. It doesn't need a processor because it performs critical background tasks for the runtime, two of which are particularly important.
The first responsibility is managing preemption. When sysmon observes that a goroutine has been running continuously for more than 10 milliseconds, it takes action: it sends a SIGURG signal to the OS thread executing the long-running goroutine. This signal serves as a preemption request, prompting the runtime to forcefully yield that goroutine.
But what happens to a preempted goroutine, especially one that’s running a tight or infinite loop?
Once it’s preempted, the scheduler must decide where to place it for future execution. If we simply put it back into the local runqueue of the current processor, it could cause problems. For instance, imagine the other goroutines in the local runqueue are all short-lived. If the preempted goroutine is inserted back into this queue, it will likely be picked up immediately again—leading to repeated preemption and starvation of shorter tasks.
To prevent this, Go moves the preempted goroutine to the global runqueue instead. This ensures that it is scheduled only after other goroutines have had their chance to run, maintaining system fairness and improving responsiveness. It helps avoid a situation where one CPU-bound goroutine can monopolize execution time while others are left waiting.
This design choice reinforces the Go scheduler’s emphasis on efficiency and fairness, making it resilient against long-running or poorly-behaved goroutines.
Goroutine Spawn Placement: FIFO vs. LIFO
Another important question to consider is: when a goroutine spawns other goroutines, where do the newly spawned goroutines go?
The answer is that these spawned goroutines are placed into the local runqueue of the processor (P) that spawned them. At this point, a design decision arises: should these new goroutines be added to the tail of the queue in a FIFO (First In, First Out) manner, or can something more efficient be done?
FIFO: Fairness at the Cost of Locality
A pure FIFO approach ensures fairness by allowing older goroutines to run first. However, it performs poorly in terms of locality: newly spawned goroutines that are likely to interact soon (e.g., a sender and a receiver) would be pushed to the back, incurring unnecessary delay.
LIFO: Locality at the Cost of Fairness
Conversely, LIFO (Last In, First Out) enhances locality by running the newest goroutines first, but this comes at the cost of fairness—older goroutines can starve.
Go’s Hybrid Strategy: Time Slice Inheritance
Go addresses this trade-off through practical optimizations informed by common usage patterns. A frequent pattern involves goroutines communicating over channels, such as a sender and a receiver exchanging data.
Let’s consider a case with unbuffered channels. If the receiver runs first and blocks waiting for data, the sender must run next to unblock it. However, in a strict FIFO queue, the receiver would be pushed to the back and might wait a long time, especially if the runqueue is filled with long-running goroutines.
To avoid this, Go allows new goroutines to be added to the head of the queue, improving responsiveness and locality. But this raises a risk: the newly spawned goroutines could continually preempt older ones, causing starvation.
To mitigate this, Go uses time slice inheritance. When a goroutine spawns another, the new goroutine inherits the remaining time slice of its parent. For example, if a goroutine spawns another at the 3ms mark of its 10ms slice, the child gets 7ms.
This allows both parent and child to share the same 10ms time slice, balancing locality and fairness. After this slice is exhausted, other goroutines get a chance to run.
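In ordinary code, the sender/receiver pattern the scheduler optimizes for looks like this minimal sketch: the receiver blocks on an unbuffered channel, so the newly spawned sender must run promptly for anything to make progress, which is exactly what running the newest goroutine first achieves.

package main

import "fmt"

func main() {
    ch := make(chan int) // unbuffered: sender and receiver must rendezvous

    go func() {
        ch <- 42 // the freshly spawned sender unblocks the receiver
    }()

    fmt.Println(<-ch) // the receiver blocks here until the sender runs
}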
Global Runqueue Starvation: Periodic Polling
Even with time slice inheritance, goroutines in the global runqueue might starve if processors don’t periodically check the global queue.
To solve this, Go occasionally polls the global runqueue. But how often should this polling occur?
The polling frequency is controlled by a counter called schedtick, which increments every time a runnable goroutine is found and executed. Whenever schedtick is a multiple of 61, the global runqueue is checked.
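In sketch form (illustrative identifiers only, not the runtime's actual code), the periodic check looks something like this:

package sched

// G is a stand-in for a runnable goroutine.
type G struct{}

var schedtick uint64 // incremented once per scheduling decision

func globalRunqGet() *G { return nil } // stub: pop from the global runqueue
func localRunqGet() *G  { return nil } // stub: pop from this P's local queue

func pickNext() *G {
    schedtick++
    // Every 61st tick, look at the global runqueue before the local one,
    // so goroutines parked there cannot starve behind busy local queues.
    if schedtick%61 == 0 {
        if g := globalRunqGet(); g != nil {
            return g
        }
    }
    return localRunqGet()
}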
Blocking System Calls and Handoff
What happens when a goroutine is blocked on a system call? In this case, the thread is unavailable to execute other goroutines—even though there’s work in the local or global runqueue.
To handle this, Go uses handoff:
- The processor and thread are disassociated using releasep().
- A new or existing thread is assigned to the processor.
- The processor resumes scheduling other goroutines.
This handoff is expensive, particularly if new threads must be created frequently. Therefore, Go avoids immediate handoff unless it knows the system call will block for a long time (e.g., cgo calls).
For short system calls, the runtime marks the processor as being in a system call state and waits. If the call returns quickly, no thread switching is needed. If not, sysmon eventually detects the prolonged block and initiates the handoff.
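Here is a small experiment that makes the handoff visible (Unix-only, since it uses a raw pipe read; the scenario is contrived for illustration). The goroutine below blocks its OS thread inside a real read(2) system call, yet with GOMAXPROCS set to 1 the main goroutine keeps running, because the runtime detaches the processor from the blocked thread.

package main

import (
    "fmt"
    "runtime"
    "syscall"
    "time"
)

func main() {
    runtime.GOMAXPROCS(1)

    var fds [2]int
    if err := syscall.Pipe(fds[:]); err != nil {
        panic(err)
    }

    go func() {
        buf := make([]byte, 1)
        // Nothing is ever written to the pipe, so this goroutine's thread
        // stays blocked inside the read(2) system call.
        syscall.Read(fds[0], buf)
    }()

    // If the blocked thread still owned the only processor, this loop
    // could not make progress.
    for i := 1; i <= 3; i++ {
        time.Sleep(50 * time.Millisecond)
        fmt.Println("still running:", i)
    }
}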
Returning from System Calls
When a goroutine returns from a system call:
- The scheduler tries to assign it to its original processor to preserve locality.
- If the original processor is busy, it tries an idle processor.
- If none are available, the goroutine is put back into the global runqueue for other processors to pick up.
This strategy optimizes for cache locality and reduces unnecessary context switching.
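In sketch form, the placement preference when a system call returns looks roughly like this (hypothetical names, reduced to the decision order only):

package sched

// G and P are stand-ins for a goroutine and a processor.
type G struct{}
type P struct{}

func (p *P) isIdle() bool { return false } // stub: is this P free to run code?
func acquireIdleP() *P    { return nil }   // stub: grab any idle processor
func runOn(p *P, g *G)    {}               // stub: resume g on p
func globalRunqPut(g *G)  {}               // stub: queue g on the global runqueue

func exitSyscall(g *G, oldP *P) {
    // 1. Prefer the processor the goroutine ran on before the call,
    //    so its data is still warm in that core's caches.
    if oldP != nil && oldP.isIdle() {
        runOn(oldP, g)
        return
    }
    // 2. Otherwise, any idle processor will do.
    if p := acquireIdleP(); p != nil {
        runOn(p, g)
        return
    }
    // 3. No processor available: park the goroutine on the global runqueue
    //    for some processor to pick up later.
    globalRunqPut(g)
}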
Conclusion
The Go scheduler is a finely tuned, hybrid system designed to balance performance, fairness, and responsiveness. By blending FIFO and LIFO strategies, leveraging time slice inheritance, and periodically polling the global runqueue, it ensures that goroutines are scheduled efficiently without starving others.
Additionally, features like the sysmon daemon, preemption signals, and the handoff mechanism for system calls demonstrate the scheduler’s adaptability in real-world scenarios, especially in handling concurrency bottlenecks and maximizing CPU utilization.
Understanding these internal mechanisms provides a deep appreciation for Go’s concurrency model and offers insights into writing more performant and predictable concurrent programs. The elegance of Go’s scheduler lies not just in its algorithms, but in its practical optimizations rooted in common usage patterns, achieving a delicate balance between theoretical soundness and engineering pragmatism.