This post collects some notes on thread-level context switches and how to estimate/measure their cost.

In computing, a context switch is the process of storing the state of a process or thread, so that it can be restored and resume execution at a later point, and then restoring a different, previously saved, state.

In C/C++, one can use getrusage to retrieve the number of context switches a process/thread has gone through (a small demo follows the list below). The two context-switch counters it reports are:

  • voluntary switch: the current thread blocks itself by calling a blocking operation, e.g. this_thread::sleep_for or cv.wait
  • involuntary switch: the kernel decides that the current thread should be paused in favor of another thread, possibly because the current thread has used up its time slice or another thread has higher priority

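For a quick illustration of the voluntary counter, here is a minimal sketch (assuming Linux): each this_thread::sleep_for blocks the calling thread, so ru_nvcsw should grow by at least one per iteration (exact numbers depend on the kernel and scheduler). The program is single-threaded, so RUSAGE_SELF covers the only thread; on Linux one can also pass RUSAGE_THREAD to restrict the counters to the calling thread.

#include <cstdio>
#include <chrono>
#include <thread>
#include <sys/resource.h>

// Print this process's voluntary/involuntary context-switch counters.
void print_switches(const char* tag) {
  struct rusage ru;
  if (getrusage(RUSAGE_SELF, &ru) == 0) {
    printf("%s: voluntary = %ld, involuntary = %ld\n", tag, ru.ru_nvcsw, ru.ru_nivcsw);
  }
}

int main() {
  print_switches("before");
  for (int i = 0; i < 10; ++i) {
    // Each sleep blocks the calling thread -> at least one voluntary switch.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  print_switches("after"); // expect "voluntary" to have grown by roughly 10
  return 0;
}
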
Some of the work involved in a context switch:

  • user-to-kernel mode transition
  • save CPU registers (e.g. the stack pointer) for the current thread
  • if the current thread becomes blocked (e.g. in a voluntary switch), update its state and remove it from the queue of runnable threads
  • load CPU registers for the next thread

(Here I am focusing only on things inside a single process, ignoring context switches between processes.)

§Measurement

The following is a simple ping-pong example using mutex + condition variable.

#include <iostream>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <chrono>
#include <sys/resource.h>

using namespace std;

// Whose turn it is to send the next message.
enum Turn {
  PING,
  PONG,
};

mutex m;
condition_variable cv;
Turn flag = PING;

constexpr int max_count = 20000;

void send_msg(string msg) {
  // Sending is stubbed out so that only the switching cost is measured.
  // cout << msg << endl;
}

void pong_runnable() {
  unique_lock lk(m);
  for (auto i = 0; i < max_count; ++i) {
    // Each wait that actually blocks costs one voluntary context switch.
    cv.wait(lk, []{ return flag == PONG; });

    send_msg("Pong");
    flag = PING;

    cv.notify_one();
  }
}

int main()
{
  thread pong_worker(pong_runnable);
  auto start = chrono::steady_clock::now();

  {
    unique_lock lk(m);
    for (auto i = 0; i < max_count; ++i) {
      cv.wait(lk, []{ return flag == PING; });

      send_msg("Ping");
      flag = PONG;

      cv.notify_one();
    }
  }

  // wait for final PONG msg; must be inside the measurement window
  pong_worker.join();

  auto end = chrono::steady_clock::now();

  auto time_ms = chrono::duration_cast<chrono::microseconds>(end - start).count() / 1000.0;

  printf("Elapsed time: %.3f ms\n", time_ms);

  struct rusage ru;
  if (getrusage(RUSAGE_SELF, &ru)) {
    perror("getrusage");
  } else {
    printf(" voluntary switches = %ld\n", ru.ru_nvcsw);
    printf(" involuntary switches = %ld\n", ru.ru_nivcsw);
  }
  return 0;
}
$ clang++ -O -std=c++20 context_switch.cpp && taskset -c 0 ./a.out

Elapsed time: 107.993 ms
voluntary switches = 39999
involuntary switches = 40000

As a crude approximation, one can assume both kinds of switches have the same cost, which gives roughly 1.35 microseconds per context switch. Additionally, the elapsed time also includes user-mode operations (the for-loop, accessing flag, etc.), so this is an overestimate.
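
Spelling out the arithmetic, using the numbers from the run above:

107.993 ms / (39999 + 40000) switches ≈ 107993 µs / 79999 ≈ 1.35 µs per switch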

One can use perf to get a quick glance at how the time is split between user and kernel mode, reported as user and sys, respectively.

$ clang++ -O -std=c++20 context_switch.cpp && perf stat taskset -c 0 ./a.out

Elapsed time: 132.224 ms

Performance counter stats for 'taskset -c 0 ./a.out':

            133.81 msec task-clock                #    0.992 CPUs utilized
             79999      context-switches          #  597.877 K/sec
                 1      cpu-migrations            #    7.474 /sec
               199      page-faults               #    1.487 K/sec
         327280691      cycles                    #    2.446 GHz
         479283260      instructions              #    1.46  insn per cycle
         112246105      branches                  #  838.877 M/sec
           1114687      branch-misses             #    0.99% of all branches

       0.134864839 seconds time elapsed

       0.012264000 seconds user
       0.122640000 seconds sys

We can see that user time is ~10% of sys time, so roughly 10% of that per-switch figure (about 0.1 microseconds) is user-mode work rather than switching cost. Note that “Elapsed time” has increased to 132 ms, so the overhead of perf may have skewed these statistics a bit. Nonetheless, one can use ~1 microsecond as the latency of a context switch for back-of-the-envelope calculations.

§ENV

Clang       : 14
Linux       : 5.10
OS          : Debian 11.6
#CPU        : 12 (6 cores)
CPU         : Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz
Turbo boost : off

§References

https://medium.com/geekculture/linux-cpu-context-switch-deep-dive-764bfdae4f01

https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/