This post covers two kinds of programs, CPU-bound and IO-bound, and how to identify them using perf. Using two instructions, nop and pause, we can construct representatives easily.

  • nop: No operation.
  • pause: Hints that this is a spin-wait loop; the thread stalls for a number of cycles, freeing shared execution resources so the sibling SMT thread can proceed.

§CPU-bound: nop instruction

int main() {
  while (1) {
    __asm__(
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop");
  }
  return 0;
}
$ clang -O nop.c -o nop && perf stat -- timeout 3s ./nop

Performance counter stats for 'timeout 3s ./nop':

2998.88 msec task-clock # 0.998 CPUs utilized
198 context-switches # 66.025 /sec
5 cpu-migrations # 1.667 /sec
159 page-faults # 53.020 /sec
7711317484 cycles # 2.571 GHz
30628182516 instructions # 3.97 insn per cycle
6125658478 branches # 2.043 G/sec
36600 branch-misses # 0.00% of all branches

3.004968603 seconds time elapsed

2.993446000 seconds user
0.008983000 seconds sys

So, the highest IPC on my box is 4 (“3.97 insn per cycle”).

§IO-bound: pause instruction

int main() {
  while (1) {
    __asm__(
        "pause\n\t"
        "pause\n\t"
        "pause\n\t"
        "pause");
  }
  return 0;
}
$ clang -O pause.c -o pause && perf stat -- timeout 3s ./pause

Performance counter stats for 'timeout 3s ./pause':

3001.66 msec task-clock # 0.999 CPUs utilized
11 context-switches # 3.665 /sec
1 cpu-migrations # 0.333 /sec
167 page-faults # 55.636 /sec
7773558381 cycles # 2.590 GHz
69779529 instructions # 0.01 insn per cycle
14045100 branches # 4.679 M/sec
39164 branch-misses # 0.28% of all branches

3.005058428 seconds time elapsed

3.000096000 seconds user
0.005039000 seconds sys

The IPC is only 0.01, i.e., the latency of the pause instruction is on the order of 100 cycles.

§Hyperthreading (SMT)

Some commands to identify and control hyperthreading:

$ # print the total number of logical CPUs
$ nproc
12

$ # print logical CPU sibling pairs
$ grep -H . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -d: -f2 | sort | uniq
0,6
1,7
2,8
3,9
4,10
5,11

$ # enable/disable hyperthreading / smt
$ echo on | sudo tee /sys/devices/system/cpu/smt/control
$ echo off | sudo tee /sys/devices/system/cpu/smt/control

§Experiment

The following is a series of experiments to uncover the interaction between CPU/IO-bound programs and SMT.

§1. Running two instances of nop on a single CPU:

1
2
(perf stat -o nop.00.1.txt -- taskset -c 0 timeout 3s ./nop) &
(perf stat -o nop.00.2.txt -- taskset -c 0 timeout 3s ./nop) &

CPU = 0.5 and IPC = 4

Each thread runs for 50% of the time, and whichever one is on the CPU achieves the maximum throughput.

§2. Running two instances of nop on two CPUs belonging to the same core:

1
2
(perf stat -o nop.06.1.txt -- taskset -c 0 timeout 3s ./nop) &
(perf stat -o nop.06.2.txt -- taskset -c 6 timeout 3s ./nop) &

CPU = 1 and IPC = 2

Each thread has exclusive access to its assigned CPU, but the two CPUs share execution resources due to hyperthreading, cutting the throughput of each in half.

Lesson from scenarios 1 & 2: for CPU-bound programs, hyperthreading is useless; the per-process throughput is the same either way, CPU * IPC = 2.

§3. Running two instances of pause on a single CPU:

1
2
(perf stat -o pause.00.1.txt -- taskset -c 0 timeout 3s ./pause) &
(perf stat -o pause.00.2.txt -- taskset -c 0 timeout 3s ./pause) &

CPU = 0.5 and IPC = 0.01

The same as scenario 1.

§4. Running two instances of pause on two CPUs belonging to the same core:

1
2
(perf stat -o pause.06.1.txt -- taskset -c 0 timeout 3s ./pause) &
(perf stat -o pause.06.2.txt -- taskset -c 6 timeout 3s ./pause) &

CPU = 1 and IPC = 0.01

IPC stays the same because pause barely consumes the shared execution resources.

Lesson from scenarios 3 & 4: for IO-bound programs, hyperthreading is very useful; the overall throughput doubles.

§5. Running nop and pause on two CPUs belonging to the same core:

1
2
(perf stat -o nop_pause.06.1.txt -- taskset -c 0 timeout 3s ./nop) &
(perf stat -o nop_pause.06.2.txt -- taskset -c 6 timeout 3s ./pause) &

CPU = 1 and IPC = 2.5 (nop) vs 0.01 (pause)

Compared with scenario 2, pause returns some execution resources to its sibling CPU, so nop's IPC rises from 2 to 2.5.

In summary, IPC can be used to quickly identify the bottleneck of a program.

§ENV

Clang       : 14
Linux       : 5.10
OS          : Debian 11.6
#CPU        : 12 (6 cores)
CPU         : Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz
Turbo boost : off

§References

https://www.baeldung.com/linux/disable-hyperthreading

https://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html

https://www.reddit.com/r/hardware/comments/8s011f/skylakex_cpus_have_140cycle_pause_latency_with/