CPU-bound vs IO-bound
This post covers two kinds of programs, CPU-bound and IO-bound, and how to identify them using perf. Using two instructions, nop and pause, we can construct representatives easily.
- nop: No operation.
- pause: Suspends execution of the thread for a number of cycles to free resources for the sibling SMT thread to proceed.
§CPU-bound: nop instruction
int main() {
$ clang -O nop.c -o nop && perf stat -- timeout 3s ./nop
So the peak IPC on my box is about 4 (perf reports “3.97 insn per cycle”).
§IO-bound: pause instruction
int main() {
clang -O pause.c -o pause && perf stat -- timeout 3s ./pause
The IPC is only 0.01, i.e., each pause instruction takes roughly 100 cycles (1 / 0.01 = 100 cycles per instruction).
§Hyperthreading (SMT)
Some commands to identify and control hyperthreading:
# print the total number of logical CPUs
§Experiment
The following is a series of experiments to uncover the interaction between CPU/IO-bound programs and SMT.
§1. Running two instances of nop on a single CPU:
(perf stat -o nop.00.1.txt -- taskset -c 0 timeout 3s ./nop) &
CPU = 0.5 and IPC = 4
Each instance gets 50% of the CPU time, and whenever one of them is running it achieves the peak throughput.
§2. Running two instances of nop on two CPUs belonging to the same core:
(perf stat -o nop.06.1.txt -- taskset -c 0 timeout 3s ./nop) &
CPU = 1 and IPC = 2
Each instance has exclusive access to its assigned CPU, but the two CPUs share the execution resources of one physical core due to hyperthreading, which halves the throughput of each.
§3. Running two instances of pause on a single CPU:
(perf stat -o pause.00.1.txt -- taskset -c 0 timeout 3s ./pause) &
CPU = 0.5 and IPC = 0.01
The same as scenario 1: the two instances time-share the CPU, and the IPC of each is unchanged while it runs.
§4. Running two instances of pause on two CPUs belonging to the same core:
(perf stat -o pause.06.1.txt -- taskset -c 0 timeout 3s ./pause) &
CPU = 1 and IPC = 0.01
IPC stays the same because pause does not consume much of the core's shared resources.
§5. Running nop and pause on two CPUs belonging to the same core:
(perf stat -o nop_pause.06.1.txt -- taskset -c 0 timeout 3s ./nop) &
CPU = 1 and IPC = 2.5 (nop) vs 0.01 (pause)
pause returns some CPU resources to its sibling CPU: nop reaches an IPC of 2.5 here, compared with 2 in scenario 2.
In summary, IPC can be used to quickly check the bottleneck of a program.
§ENV
Clang : 14
§References
https://www.baeldung.com/linux/disable-hyperthreading
https://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html
https://www.reddit.com/r/hardware/comments/8s011f/skylakex_cpus_have_140cycle_pause_latency_with/