This post collects some notes on thread-level context switches and how to estimate/measure their cost.

In computing, a context switch is the process of storing the state of a process or thread, so that it can be restored and resume execution at a later point, and then restoring a different, previously saved, state.

In C/C++, one can use getrusage to retrieve the number of context switches a process/thread has gone through (a small demo follows the list below). The two context-switch counters it reports are:

  • voluntary switch: the current thread blocks itself by calling a blocking operation, e.g. this_thread::sleep_for or cv.wait
  • involuntary switch: the kernel decides that the current thread should be paused in favor of another thread, possibly because the current thread has used up its time slice or another thread has higher priority

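For a quick illustration of the voluntary counter, here is a minimal sketch (assuming Linux): each this_thread::sleep_for blocks the calling thread, so ru_nvcsw should grow by at least one per iteration (exact numbers depend on the kernel and scheduler). The program is single-threaded, so RUSAGE_SELF covers the only thread; on Linux one can also pass RUSAGE_THREAD to restrict the counters to the calling thread.

#include <cstdio>
#include <chrono>
#include <thread>
#include <sys/resource.h>

// Print this process's voluntary/involuntary context-switch counters.
void print_switches(const char* tag) {
  struct rusage ru;
  if (getrusage(RUSAGE_SELF, &ru) == 0) {
    printf("%s: voluntary = %ld, involuntary = %ld\n", tag, ru.ru_nvcsw, ru.ru_nivcsw);
  }
}

int main() {
  print_switches("before");
  for (int i = 0; i < 10; ++i) {
    // Each sleep blocks the calling thread -> at least one voluntary switch.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  print_switches("after"); // expect "voluntary" to have grown by roughly 10
  return 0;
}
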
Some of the work involved in a context switch:

  • user-to-kernel mode transition
  • save CPU registers (e.g. the stack pointer) for the current thread
  • if the current thread becomes blocked (e.g. in a voluntary switch), update its state and remove it from the queue of runnable threads
  • load CPU registers for the next thread

(Here I am focusing only on things inside a single process, ignoring context switches between processes.)

§Measurement

The following is a simple ping-pong example using mutex + condition variable.

#include <iostream>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <sys/resource.h>

using namespace std;

// Whose turn it is to send the next message.
enum Turn {
  PING,
  PONG,
};

mutex m;
condition_variable cv;
Turn flag = PING;

constexpr int max_count = 20000;

void send_msg(string msg) {
  // Sending is stubbed out so that only the switching cost is measured.
  // cout << msg << endl;
}

void pong_runnable() {
  unique_lock lk(m);
  for (auto i = 0; i < max_count; ++i) {
    // Each wait that actually blocks costs one voluntary context switch.
    cv.wait(lk, []{ return flag == PONG; });

    send_msg("Pong");
    flag = PING;

    cv.notify_one();
  }
}

int main()
{
  thread pong_worker(pong_runnable);
  auto start = chrono::steady_clock::now();

  {
    unique_lock lk(m);
    for (auto i = 0; i < max_count; ++i) {
      cv.wait(lk, []{ return flag == PING; });

      send_msg("Ping");
      flag = PONG;

      cv.notify_one();
    }
  }

  // wait for final PONG msg; must be inside the measurement window
  pong_worker.join();

  auto end = chrono::steady_clock::now();

  auto time_ms = chrono::duration_cast<chrono::microseconds>(end - start).count() / 1000.0;

  printf("Elapsed time: %.3f ms\n", time_ms);

  struct rusage ru;
  if (getrusage(RUSAGE_SELF, &ru)) {
    perror("getrusage");
  } else {
    printf(" voluntary switches = %ld\n", ru.ru_nvcsw);
    printf(" involuntary switches = %ld\n", ru.ru_nivcsw);
  }
  return 0;
}
$ clang++ -O -std=c++20 context_switch.cpp && taskset -c 0 ./a.out

Elapsed time: 107.993 ms
voluntary switches = 39999
involuntary switches = 40000

As a crude approximation, one can assume both kinds of switches have the same cost, which gives roughly 1.35 microseconds per context switch. Additionally, the elapsed time also includes user-mode operations (the for-loop, accessing flag, etc.), so this is an overestimate.
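
Spelling out the arithmetic, using the numbers from the run above:

107.993 ms / (39999 + 40000) switches ≈ 107993 µs / 79999 ≈ 1.35 µs per switch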

One can use perf to get a quick glance at how the time is split between user and kernel mode, reported as user and sys, respectively.

$ clang++ -O -std=c++20 context_switch.cpp && perf stat taskset -c 0 ./a.out

Elapsed time: 132.224 ms

Performance counter stats for 'taskset -c 0 ./a.out':

            133.81 msec task-clock                #    0.992 CPUs utilized
             79999      context-switches          #  597.877 K/sec
                 1      cpu-migrations            #    7.474 /sec
               199      page-faults               #    1.487 K/sec
         327280691      cycles                    #    2.446 GHz
         479283260      instructions              #    1.46  insn per cycle
         112246105      branches                  #  838.877 M/sec
           1114687      branch-misses             #    0.99% of all branches

       0.134864839 seconds time elapsed

       0.012264000 seconds user
       0.122640000 seconds sys

We can see that user time is ~10% of sys time, so roughly 10% of that per-switch figure (about 0.1 microseconds) is user-mode work rather than switching cost. Note that “Elapsed time” has increased to 132 ms, so the overhead of perf may have skewed these statistics a bit. Nonetheless, one can use ~1 microsecond as the latency of a context switch for back-of-the-envelope calculations.

§ENV

Clang       : 14
Linux       : 5.10
OS          : Debian 11.6
#CPU        : 12 (6 cores)
CPU         : Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz
Turbo boost : off

§References

https://medium.com/geekculture/linux-cpu-context-switch-deep-dive-764bfdae4f01

https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/