CPU-bound vs IO-bound
This post covers two kinds of programs, CPU-bound and IO-bound, and how to identify them using perf. Using two instructions, nop and pause, we can construct representatives easily.
- nop: No operation.
- pause: Suspends execution of the thread for a number of cycles to free resources for the sibling SMT thread to proceed.
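As a rough illustration of how such representatives might be built, the sketch below wraps each instruction in a small loop using GCC/Clang inline assembly. This is not the post's own nop.c or pause.c, which are not fully shown here. The names spin_nop and spin_pause are hypothetical, and the non-x86 fallback exists only so the code compiles on other architectures.

```c
#include <stddef.h>

/* Hypothetical helpers; NOT the post's actual nop.c / pause.c. */

/* Run n loop iterations, each containing one nop. Returns the iteration count. */
size_t spin_nop(size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
#if defined(__x86_64__) || defined(__i386__)
        __asm__ volatile("nop");
#else
        __asm__ volatile("" ::: "memory"); /* fallback: compiler barrier only */
#endif
    }
    return i;
}

/* Run n loop iterations, each containing one pause. Returns the iteration count. */
size_t spin_pause(size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
#if defined(__x86_64__) || defined(__i386__)
        __asm__ volatile("pause");
#else
        __asm__ volatile("" ::: "memory"); /* fallback: compiler barrier only */
#endif
    }
    return i;
}
```

A test program could call one of these in an endless loop and be measured with perf stat. The loop's own counter and branch instructions also retire, so the measured IPC reflects the whole loop rather than the single instruction.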
§CPU-bound: nop instruction
int main() {
clang -O nop.c -o nop && perf stat -- timeout 3s ./nop
So the peak IPC on my box is about 4 (perf reports “3.97 insn per cycle”).
§IO-bound: pause instruction
int main() {
clang -O pause.c -o pause && perf stat -- timeout 3s ./pause
The IPC is only 0.01, i.e., each pause instruction takes roughly 100 cycles (1 / 0.01).
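The arithmetic behind that estimate can be written down as a small sketch. The helper names ipc and cycles_per_insn are made up for illustration; the inputs would be the "instructions" and "cycles" counts that perf stat prints.

```c
/* Hypothetical helpers for the IPC arithmetic; inputs come from perf stat. */

/* Instructions per cycle; returns 0 when no cycles were counted. */
double ipc(unsigned long long instructions, unsigned long long cycles) {
    return cycles ? (double)instructions / (double)cycles : 0.0;
}

/* Average cycles spent per instruction, the inverse of IPC. */
double cycles_per_insn(double ipc_value) {
    return ipc_value > 0.0 ? 1.0 / ipc_value : 0.0;
}
```

With an IPC of 0.01, cycles_per_insn gives about 100, which is where the ~100 cycle figure for pause comes from.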
§Hyperthreading (SMT)
Some commands to identify and control hyperthreading:
# print the total number of logical CPUs
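The same kind of information can also be queried from a program. The sketch below, written under the assumption of a Linux system with sysfs mounted, counts the online logical CPUs with sysconf and reads the SMT siblings of a given CPU from the kernel's topology files. The function names are hypothetical and this is not a reproduction of the commands above.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Number of logical CPUs currently online. */
long logical_cpu_count(void) {
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Copy the SMT sibling list of `cpu` (a string such as "0,6" or "0-1") into buf.
 * Returns 0 on success, -1 if the sysfs file is missing or unreadable. */
int smt_siblings(int cpu, char *buf, size_t len) {
    char path[128];
    FILE *f;

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0'; /* strip the trailing newline */
    return 0;
}
```

Two logical CPUs that appear in each other's sibling list belong to the same physical core, which is what the experiments below rely on when they pick two CPUs of one core.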
§Experiment
The following is a series of experiments to uncover the interaction between CPU/IO-bound programs and SMT.
§1. Running two instances of nop on a single CPU:
(perf stat -o nop.00.1.txt -- taskset -c 0 timeout 3s ./nop) &
CPU = 0.5 and IPC = 4
Each instance runs 50% of the time, and whichever one is on the CPU achieves the peak throughput.
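In these experiments, taskset -c 0 is what pins a process to logical CPU 0. For reference, a program can pin itself in a similar way with the Linux-specific sched_setaffinity call. The sketch below assumes Linux with glibc, and the wrapper name pin_to_cpu is hypothetical.

```c
#define _GNU_SOURCE /* needed for cpu_set_t and sched_setaffinity in glibc */
#include <sched.h>

/* Restrict the calling thread to one logical CPU.
 * Returns 0 on success, -1 on failure (e.g. the CPU is not allowed). */
int pin_to_cpu(int cpu) {
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set); /* pid 0 = calling thread */
}
```

Calling pin_to_cpu(0) at the start of main would have roughly the effect of launching the program under taskset -c 0.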
§2. Running two instances of nop on two CPUs belonging to the same core:
(perf stat -o nop.06.1.txt -- taskset -c 0 timeout 3s ./nop) &
CPU = 1 and IPC = 2
Each thread has exclusive access to its assigned CPU, but the two CPUs share the core's execution resources because of hyperthreading, which halves the throughput of each.
Lesson from scenarios 1 and 2: for CPU-bound programs, hyperthreading does not help. The throughput per instance is the same either way: CPU * IPC = 2 (0.5 * 4 in scenario 1, 1 * 2 in scenario 2).
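The CPU * IPC comparison can be checked with a few lines of code. This is only a sketch of the arithmetic: the struct run type and the function names are invented for illustration, and the numbers are the ones reported for scenarios 1 and 2.

```c
#include <stddef.h>

/* One measured instance: CPUs utilized and instructions per cycle. */
struct run {
    double cpu;
    double ipc;
};

/* Throughput of one instance, in instructions per wall-clock cycle. */
double per_instance(struct run r) {
    return r.cpu * r.ipc;
}

/* Combined throughput of all instances in a scenario. */
double total(const struct run *runs, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += per_instance(runs[i]);
    return sum;
}
```

For scenario 1, two runs of {0.5, 4} give 2 each and 4 in total; for scenario 2, two runs of {1, 2} give the same 2 each and 4 in total, so the second hardware thread adds nothing.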
§3. Running two instances of pause on a single CPU:
(perf stat -o pause.00.1.txt -- taskset -c 0 timeout 3s ./pause) &
CPU = 0.5 and IPC = 0.01
As in scenario 1, the two instances time-share the CPU, so each gets half of it and IPC is unchanged.
§4. Running two instances of pause on two CPUs belonging to the same core:
(perf stat -o pause.06.1.txt -- taskset -c 0 timeout 3s ./pause) &
CPU = 1 and IPC = 0.01
IPC stays the same because pause consumes few of the core's shared resources.
Lesson from scenarios 3 and 4: for IO-bound programs, hyperthreading is very useful. The throughput doubles, because each instance now gets a full CPU at the same IPC.
§5. Running nop and pause on two CPUs belonging to the same core:
(perf stat -o nop_pause.06.1.txt -- taskset -c 0 timeout 3s ./nop) &
CPU = 1 and IPC = 2.5 (nop) vs 0.01 (pause)
Compared with scenario 2, pause returns some CPU resources to its sibling CPU, so nop reaches an IPC of 2.5 instead of 2.
In summary, IPC can be used to quickly check the bottleneck of a program.
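When running perf stat from the outside is inconvenient, IPC can also be sampled from inside a program through the Linux perf_event_open system call. The sketch below is a minimal, hedged example: it assumes Linux with kernel headers installed and a PMU that is accessible to the process. The names open_counter, measure_ipc and sample_work are hypothetical, and the function returns -1 wherever counters are unavailable (for example in many VMs and containers).

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open one user-space hardware counter for the calling thread.
 * group_fd == -1 creates a group leader, which starts disabled. */
int open_counter(uint64_t config, int group_fd) {
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = config;
    attr.disabled = (group_fd == -1);
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    attr.read_format = PERF_FORMAT_GROUP;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
}

/* Placeholder workload used only to have something to measure. */
void sample_work(void) {
    for (volatile unsigned long i = 0; i < 1000000UL; i++) { }
}

/* Run work() and return instructions / cycles, or -1.0 on any failure. */
double measure_ipc(void (*work)(void)) {
    struct { uint64_t nr; uint64_t values[2]; } data;
    ssize_t n;
    int cycles_fd, insn_fd;

    cycles_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES, -1);
    if (cycles_fd < 0)
        return -1.0;
    insn_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS, cycles_fd);
    if (insn_fd < 0) {
        close(cycles_fd);
        return -1.0;
    }

    ioctl(cycles_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(cycles_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
    work();
    ioctl(cycles_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    /* Group read: count of events, then cycles, then instructions. */
    n = read(cycles_fd, &data, sizeof data);
    close(insn_fd);
    close(cycles_fd);
    if (n != (ssize_t)sizeof data || data.nr != 2 || data.values[0] == 0)
        return -1.0;
    return (double)data.values[1] / (double)data.values[0];
}
```

Calling measure_ipc(sample_work) yields a single IPC figure for that function. A value near the machine's peak suggests a CPU-bound region, while a value far below 1 suggests the code is mostly waiting.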
§ENV
Clang : 14
§References
https://www.baeldung.com/linux/disable-hyperthreading
https://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html
https://www.reddit.com/r/hardware/comments/8s011f/skylakex_cpus_have_140cycle_pause_latency_with/