This post covers two kinds of programs, CPU-bound and IO-bound, and how to tell them apart using perf. Two instructions, nop and pause, make it easy to construct a representative of each.

  • nop: No operation.
  • pause: Suspends execution of the thread for a number of cycles to free resources for the sibling SMT thread to proceed.

§CPU-bound: nop instruction

int main() {
    while (1) {
        __asm__ (
            "nop\n\t"
            "nop\n\t"
            "nop\n\t"
            "nop");
    }
    return 0;
}

$ clang -O nop.c -o nop && perf stat -- timeout 3s ./nop

Performance counter stats for 'timeout 3s ./nop':

    2998.88 msec task-clock        # 0.998 CPUs utilized
        198      context-switches  # 66.025 /sec
          5      cpu-migrations    # 1.667 /sec
        159      page-faults       # 53.020 /sec
 7711317484      cycles            # 2.571 GHz
30628182516      instructions      # 3.97 insn per cycle
 6125658478      branches          # 2.043 G/sec
      36600      branch-misses     # 0.00% of all branches

3.004968603 seconds time elapsed

2.993446000 seconds user
0.008983000 seconds sys

So the peak IPC on my box is about 4 (“3.97 insn per cycle”).
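
The IPC that perf prints is simply the instruction count divided by the cycle count. As a quick sanity check, the small shell helper below recomputes it from the two raw counters in the output above. The name `ipc` is made up for this sketch.

```shell
# ipc INSTRUCTIONS CYCLES
# Print instructions per cycle, rounded to two decimals.
ipc() {
    awk -v insn="$1" -v cyc="$2" 'BEGIN { printf "%.2f\n", insn / cyc }'
}

# Counters taken from the nop run.
ipc 30628182516 7711317484   # prints 3.97
```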

§IO-bound: pause instruction

int main() {
    while (1) {
        __asm__ (
            "pause\n\t"
            "pause\n\t"
            "pause\n\t"
            "pause");
    }
    return 0;
}

$ clang -O pause.c -o pause && perf stat -- timeout 3s ./pause

Performance counter stats for 'timeout 3s ./pause':

    3001.66 msec task-clock        # 0.999 CPUs utilized
         11      context-switches  # 3.665 /sec
          1      cpu-migrations    # 0.333 /sec
        167      page-faults       # 55.636 /sec
 7773558381      cycles            # 2.590 GHz
   69779529      instructions      # 0.01 insn per cycle
   14045100      branches          # 4.679 M/sec
      39164      branch-misses     # 0.28% of all branches

3.005058428 seconds time elapsed

3.000096000 seconds user
0.005039000 seconds sys

The IPC is only 0.01, i.e., an instruction retires roughly once every 100 cycles, and nearly all of that time is pause latency.
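
A slightly finer estimate is possible because each loop iteration executes four pause instructions and one jump (the instruction count above is indeed about five times the branch count). Assuming the branch counter is dominated by that jump, with startup code in timeout and the loader adding only a little noise, the number of iterations is roughly the branch count. The helper name `cycles_per_pause` is hypothetical.

```shell
# cycles_per_pause CYCLES BRANCHES [PAUSES_PER_ITERATION]
# Estimate the cost of one pause, assuming one branch per loop iteration
# and (by default) four pause instructions per iteration.
cycles_per_pause() {
    awk -v cyc="$1" -v br="$2" -v n="${3:-4}" \
        'BEGIN { printf "%.0f\n", cyc / (br * n) }'
}

# Counters taken from the pause run.
cycles_per_pause 7773558381 14045100   # prints 138
```

The result is somewhat higher than the IPC-based estimate because the cheap jump no longer dilutes the average.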

§Hyperthreading (SMT)

Some commands to identify and control hyperthreading:

$ # print the total number of logical CPUs
$ nproc
12

$ # print the logical CPUs that share a core, one sibling pair per line
$ grep -H . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | cut -d: -f2 | sort | uniq
0,6
1,7
2,8
3,9
4,10
5,11

$ # enable/disable hyperthreading / smt
$ echo on | sudo tee /sys/devices/system/cpu/smt/control
$ echo off | sudo tee /sys/devices/system/cpu/smt/control
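
For the experiments it helps to look up the sibling of one particular logical CPU. The sketch below reads the same sysfs file as the grep command above; the function name `siblings_of` and its optional second argument (an alternative sysfs directory, handy for testing) are assumptions of this example.

```shell
# siblings_of CPU [SYSFS_CPU_DIR]
# Print the logical CPUs that share a physical core with CPU.
siblings_of() {
    cat "${2:-/sys/devices/system/cpu}/cpu$1/topology/thread_siblings_list"
}

# Example (on the machine used in this post, CPU 0 is paired with CPU 6):
#   siblings_of 0
```

Depending on the machine, the kernel may print the list as a range such as 0-1 rather than a comma-separated pair.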

§Experiment

The following is a series of experiments to uncover the interaction between CPU/IO-bound programs and SMT.

§1. Running two instances of nop on a single CPU:

(perf stat -o nop.00.1.txt -- taskset -c 0 timeout 3s ./nop) &
(perf stat -o nop.00.2.txt -- taskset -c 0 timeout 3s ./nop) &

CPU = 0.5 and IPC = 4

Each instance gets about 50% of the CPU time, and while it is on the CPU it still achieves the peak IPC.
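
The CPU figure quoted in these experiments is perf's "CPUs utilized", which is the task-clock divided by the wall-clock time. A small sketch (the helper name `cpus_utilized` is hypothetical) recomputes it from the single-instance nop run shown earlier:

```shell
# cpus_utilized TASK_CLOCK_MSEC ELAPSED_SECONDS
# Print the fraction of one CPU that the task kept busy.
cpus_utilized() {
    awk -v ms="$1" -v s="$2" 'BEGIN { printf "%.3f\n", ms / (s * 1000) }'
}

# task-clock and elapsed time from the single nop run.
cpus_utilized 2998.88 3.004968603   # prints 0.998
```

With two instances pinned to the same CPU, each one accumulates only about half as much task-clock over the same 3 seconds, which is where the 0.5 comes from.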

§2. Running two instances of nop on two CPUs belonging to the same core:

(perf stat -o nop.06.1.txt -- taskset -c 0 timeout 3s ./nop) &
(perf stat -o nop.06.2.txt -- taskset -c 6 timeout 3s ./nop) &

CPU = 1 and IPC = 2

Each instance has a logical CPU to itself, but the two logical CPUs are hyperthreads of the same core and share its execution resources, which halves the throughput of each.

§3. Running two instances of pause on a single CPU:

(perf stat -o pause.00.1.txt -- taskset -c 0 timeout 3s ./pause) &
(perf stat -o pause.00.2.txt -- taskset -c 0 timeout 3s ./pause) &

CPU = 0.5 and IPC = 0.01

The same as scenario 1: the two instances time-share the CPU, and the IPC of each is unchanged.

§4. Running two instances of pause on two CPUs belonging to the same core:

(perf stat -o pause.06.1.txt -- taskset -c 0 timeout 3s ./pause) &
(perf stat -o pause.06.2.txt -- taskset -c 6 timeout 3s ./pause) &

CPU = 1 and IPC = 0.01

The IPC stays the same because pause consumes very few of the core's execution resources, so the two siblings barely compete.

§5. Running nop and pause on two CPUs belonging to the same core:

(perf stat -o nop_pause.06.1.txt -- taskset -c 0 timeout 3s ./nop) &
(perf stat -o nop_pause.06.2.txt -- taskset -c 6 timeout 3s ./pause) &

CPU = 1 and IPC = 2.5 vs 0.01

Compared with scenario 2, the IPC of nop rises from 2 to 2.5 because pause returns some of the core's resources to the sibling CPU.

In summary, IPC can be used to quickly check the bottleneck of a program.
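
To compare the scenarios without opening each report, the IPC can be pulled out of the files written by perf stat -o. This is only a sketch: it assumes the human-readable report format shown above, in which the IPC is the field right before "insn per cycle" (some perf versions print "insns"), and the function name `ipc_from_report` is made up.

```shell
# ipc_from_report FILE
# Print the IPC value found in a perf stat report.
ipc_from_report() {
    awk '/insns? per cycle/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^insns?$/) { print $(i - 1); exit }
    }' "$1"
}

# Example: list the IPC of every report in the current directory.
#   for f in *.txt; do printf '%s\t%s\n' "$f" "$(ipc_from_report "$f")"; done
```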

§ENV

Clang       : 14
Linux       : 5.10
OS          : Debian 11.6
#CPU        : 12 (6 cores)
CPU         : Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz
Turbo boost : off

§References

https://www.baeldung.com/linux/disable-hyperthreading

https://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html

https://www.reddit.com/r/hardware/comments/8s011f/skylakex_cpus_have_140cycle_pause_latency_with/