CPU-bound vs IO-bound

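This post covers two kinds of programs, CPU-bound and IO-bound, and how to tell them apart using perf. Two instructions, nop and pause, make it easy to construct a representative of each: a nop loop keeps the core busy retiring instructions, while a pause loop spends nearly all of its cycles stalled, standing in for a program that mostly waits.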

nop: No operation.
pause: Suspends execution of the thread for a number of cycles to free resources for the sibling SMT thread to proceed.

§CPU-bound: nop instruction

int main() {
  while(1) {
    __asm__ (
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop");
  }
  return 0;
}

$ clang -O nop.c -o nop && perf stat -- timeout 3s ./nop

 Performance counter stats for 'timeout 3s ./nop':

           2998.88 msec task-clock                       #    0.998 CPUs utilized
               198      context-switches                 #   66.025 /sec
                 5      cpu-migrations                   #    1.667 /sec
               159      page-faults                      #   53.020 /sec
        7711317484      cycles                           #    2.571 GHz
       30628182516      instructions                     #    3.97  insn per cycle
        6125658478      branches                         #    2.043 G/sec
             36600      branch-misses                    #    0.00% of all branches

       3.004968603 seconds time elapsed

       2.993446000 seconds user
       0.008983000 seconds sys

So the nop loop retires about 4 instructions per cycle (“3.97 insn per cycle”), which I take as the peak IPC on my box.

§IO-bound: pause instruction

int main() {
  while(1) {
    __asm__ (
        "pause\n\t"
        "pause\n\t"
        "pause\n\t"
        "pause");
  }
  return 0;
}

$ clang -O pause.c -o pause && perf stat -- timeout 3s ./pause

 Performance counter stats for 'timeout 3s ./pause':

           3001.66 msec task-clock                       #    0.999 CPUs utilized
                11      context-switches                 #    3.665 /sec
                 1      cpu-migrations                   #    0.333 /sec
               167      page-faults                      #   55.636 /sec
        7773558381      cycles                           #    2.590 GHz
          69779529      instructions                     #    0.01  insn per cycle
          14045100      branches                         #    4.679 M/sec
             39164      branch-misses                    #    0.28% of all branches

       3.005058428 seconds time elapsed

       3.000096000 seconds user
       0.005039000 seconds sys

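The IPC is only 0.01, i.e., each instruction takes roughly 100 cycles on average. Every loop iteration is four pause instructions plus one jump (the branch count is about one fifth of the instruction count), so this works out to roughly 140 cycles per pause.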
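The latency estimate can be reproduced from the two counters in the listing above. This is a minimal sketch: the function names are made up for illustration, and the 4-in-5 ratio of pause to total instructions is an assumption about the compiled loop (four pause plus one jump), not something perf reports.

```shell
# Counter values copied from the perf stat output for ./pause.
cycles=7773558381
instructions=69779529

# Average cycles per retired instruction (the inverse of IPC).
cycles_per_insn() {
  awk -v c="$1" -v i="$2" 'BEGIN { printf "%.1f\n", c / i }'
}

# Cycles per pause, ASSUMING 4 of every 5 retired instructions are pause
# and the remaining one (the loop jump) costs next to nothing.
cycles_per_pause() {
  awk -v c="$1" -v i="$2" 'BEGIN { printf "%.1f\n", c / (i * 4 / 5) }'
}

cycles_per_insn  "$cycles" "$instructions"
cycles_per_pause "$cycles" "$instructions"
```

With the numbers above this prints 111.4 and 139.3, close to the 140-cycle figure reported for Skylake-family CPUs.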

§Hyperthreading (SMT)

Some commands to identify and control hyperthreading:

$ # print the total number of logical CPUs
$ nproc
12

$ # print logical CPUs grouped by physical core (SMT sibling pairs)
$ grep -H . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list  | cut -d: -f2 | sort | uniq
0,6
1,7
2,8
3,9
4,10
5,11

$ # enable/disable hyperthreading / smt
$ echo on  | sudo tee /sys/devices/system/cpu/smt/control
$ echo off | sudo tee /sys/devices/system/cpu/smt/control
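Before and after toggling, it helps to confirm what state the kernel is actually in. The sketch below reads the same sysfs tree as the commands above; the function names are hypothetical, and the optional directory argument exists only so the functions can be pointed at a copy of the tree.

```shell
# Print the SMT control setting and whether SMT is currently active.
# $1 (optional): sysfs CPU directory, defaults to the real one.
smt_status() {
  dir=${1:-/sys/devices/system/cpu}
  if [ -r "$dir/smt/control" ]; then
    printf 'control=%s active=%s\n' \
      "$(cat "$dir/smt/control")" "$(cat "$dir/smt/active")"
  else
    echo "smt control not available"
  fi
}

# Print the logical CPUs that share a physical core with CPU $1.
# $2 (optional): sysfs CPU directory, defaults to the real one.
smt_siblings() {
  cpu=$1
  dir=${2:-/sys/devices/system/cpu}
  cat "$dir/cpu$cpu/topology/thread_siblings_list"
}
```

On the machine used in this post, `smt_siblings 0` would be expected to print `0,6`, matching the pair list above.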

§Experiment

The following is a series of experiments to uncover the interaction between CPU/IO-bound programs and SMT.
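Every scenario follows the same pattern: two perf stat runs pinned with taskset and started in the background. A possible wrapper is sketched here; `run_pair` and its argument order are hypothetical, and the 3-second timeout is simply carried over from the commands in this post.

```shell
# Run two programs concurrently, each pinned to a logical CPU and measured
# by its own perf stat, then wait for both to finish.
# Usage: run_pair TAG CPU_A PROG_A CPU_B PROG_B
# Results are written to TAG.1.txt and TAG.2.txt.
run_pair() {
  tag=$1 cpu_a=$2 prog_a=$3 cpu_b=$4 prog_b=$5
  perf stat -o "$tag.1.txt" -- taskset -c "$cpu_a" timeout 3s "$prog_a" &
  perf stat -o "$tag.2.txt" -- taskset -c "$cpu_b" timeout 3s "$prog_b" &
  wait
}
```

For example, `run_pair nop.06 0 ./nop 6 ./nop` corresponds to scenario 2. The trailing `wait` keeps the shell from returning before both output files are complete.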

§1. Running two instances of `nop` on a single CPU:

(perf stat -o nop.00.1.txt -- taskset -c 0 timeout 3s ./nop) &
(perf stat -o nop.00.2.txt -- taskset -c 0 timeout 3s ./nop) &

CPU = 0.5 and IPC = 4

Each process gets 50% of the CPU time, and whichever one is currently scheduled runs at the peak IPC.

§2. Running two instances of `nop` on two CPUs belonging to the same core:

(perf stat -o nop.06.1.txt -- taskset -c 0 timeout 3s ./nop) &
(perf stat -o nop.06.2.txt -- taskset -c 6 timeout 3s ./nop) &

CPU = 1 and IPC = 2

Each process has exclusive access to its assigned logical CPU, but the two logical CPUs share the core's execution resources due to hyperthreading, which halves the IPC of each.

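Lesson from scenarios 1 & 2: for CPU-bound programs, hyperthreading is useless. The per-process throughput is the same either way, CPU * IPC = 2 (0.5 * 4 in scenario 1, 1 * 2 in scenario 2), for a combined total of 4 in both cases.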
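The comparison can be written out as a small calculation. This is only a sketch: `throughput` is a made-up helper, and "throughput" here means CPU share multiplied by IPC, as in the lesson above.

```shell
# Per-process and combined throughput for N identical processes,
# where throughput is taken to be (share of CPU time) * IPC.
# Usage: throughput CPU_SHARE IPC N
throughput() {
  awk -v cpu="$1" -v ipc="$2" -v n="$3" \
    'BEGIN { t = cpu * ipc; printf "per-process=%.2f total=%.2f\n", t, t * n }'
}

throughput 0.5 4 2   # scenario 1: two processes time-sharing one logical CPU
throughput 1   2 2   # scenario 2: one process on each sibling logical CPU
```

Both calls print `per-process=2.00 total=4.00`, which is why the second logical CPU buys nothing for the nop loop.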

§3. Running two instances of `pause` on a single CPU:

(perf stat -o pause.00.1.txt -- taskset -c 0 timeout 3s ./pause) &
(perf stat -o pause.00.2.txt -- taskset -c 0 timeout 3s ./pause) &

CPU = 0.5 and IPC = 0.01

Same as scenario 1: each process gets 50% of the CPU time and runs at its solo IPC while scheduled.

§4. Running two instances of `pause` on two CPUs belonging to the same core:

(perf stat -o pause.06.1.txt -- taskset -c 0 timeout 3s ./pause) &
(perf stat -o pause.06.2.txt -- taskset -c 6 timeout 3s ./pause) &

CPU = 1 and IPC = 0.01

IPC stays the same because pause consumes very little of the core's shared resources.

Lesson from scenarios 3 & 4: for IO-bound programs, hyperthreading is very useful. Each process keeps its IPC while its CPU share goes from 0.5 to 1, so the overall throughput doubles.

§5. Running `nop` and `pause` on two CPUs belonging to the same core:

(perf stat -o nop_pause.06.1.txt -- taskset -c 0 timeout 3s ./nop) &
(perf stat -o nop_pause.06.2.txt -- taskset -c 6 timeout 3s ./pause) &

CPU = 1 and IPC = 2.5 (nop) vs 0.01 (pause)

Compared with scenario 2, pause hands some of the core's resources back to the sibling CPU, so nop reaches an IPC of 2.5 instead of 2.

In summary, IPC can be used to quickly check the bottleneck of a program.
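To read the two numbers used throughout this post off a saved perf stat file, something like the following could be used. It is a sketch that assumes the output layout shown in the listings above (value right after the `#` on the task-clock and instructions lines); `perf_summary` is a hypothetical name.

```shell
# Print "CPUs utilized" and "insn per cycle" from a perf stat output file.
# Usage: perf_summary FILE
perf_summary() {
  awk '
    /task-clock/   { for (i = 1; i < NF; i++) if ($i == "#") cpu = $(i + 1) }
    /instructions/ { for (i = 1; i < NF; i++) if ($i == "#") ipc = $(i + 1) }
    END { printf "CPU=%s IPC=%s\n", cpu, ipc }
  ' "$1"
}
```

For instance, `perf_summary nop.00.1.txt` summarizes the first process of scenario 1. An IPC near the machine's peak suggests a CPU-bound program, while an IPC far below it suggests the program is mostly stalled.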

§ENV

Clang        : 14
Linux        : 5.10
OS           : Debian 11.6
#CPU         : 12 (6 cores)
CPU          : Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz
Turbo boost  : off

§References

https://www.baeldung.com/linux/disable-hyperthreading

https://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html

https://www.reddit.com/r/hardware/comments/8s011f/skylakex_cpus_have_140cycle_pause_latency_with/
