This post is pretty good read on cache effect. Even though C# is used in that article,
the author claims that language choice has little effect on the conclusion. Since I don’t have access to C# compilers on my Linux box, I converted the
examples to C to reproduce it. Surprisingly, the result is not 100% consistent with his observation.
There are seven examples in the original post, but ex1 is just a special case of ex2, so I would only discuss ex[2..7].
$ clang -O test.c -lpthread ; ./a.out == ex2 == step is 1: Elapsed time: 0.044202 seconds step is 2: Elapsed time: 0.041431 seconds step is 4: Elapsed time: 0.040507 seconds step is 8: Elapsed time: 0.040617 seconds step is 16: Elapsed time: 0.041350 seconds step is 32: Elapsed time: 0.026930 seconds step is 64: Elapsed time: 0.016239 seconds step is 128: Elapsed time: 0.006211 seconds step is 256: Elapsed time: 0.003147 seconds step is 512: Elapsed time: 0.001748 seconds
For step in [1..16], the access is within the cache line, so exec time is not much different. Starting from 32, the number of cache lines used
drops by half compared with next step size, which accounts for the diminishing of exec time.
The second case is ~3x slower, which I believe is a [bug]https://llvm.org/bugs/show_bug.cgi?id=28128() in auto-vectorizatoin of clang. The third and
forth case are ~2x slower, because the consecutive multiplication in the first case could be collapsed into one, while infeasible for latter cases.
voidex4() { puts("== ex4 =="); static uint acc[2] = {1, 1}; tik(); for (uint i = 0; i < i_max; ++i) { acc[0] *= 3; acc[0] *= 3; } tik(); acc[0] = acc[1] = 1; tik(); for (uint i = 0; i < i_max; ++i) { acc[0] *= 3; acc[1] *= 3; } tik(); acc[0] = acc[1] = 1; tik(); #pragma clang loop vectorize(disable) for (uint i = 0; i < i_max; ++i) { acc[0] *= 3; acc[1] *= 3; } tik(); acc[0] = acc[1] = 1; tik(); for (uint i = 0; i < i_max; ++i) { acc[0] *= 3; } for (uint i = 0; i < i_max; ++i) { acc[1] *= 3; } tik(); }
voidex5() { staticconst uint arr_len = 4*1024*1024; staticunsignedchar arr[arr_len]; for (uint K = 1; K <= 512; ++K) { tik(); uint p = 0; for (uint i = 0; i < i_max; ++i) { arr[p]++; p += K; if (p >= arr_len) { p = 0; } } tik(); } }
staticvoidinc_counter(uint p) { for (uint i = 0; i < i_max; ++i) { s_counter[p] += 3; } }
voidex6() { puts("== ex6 =="); staticconstuintptr_t n_threads = 4; pthread_t tasks[n_threads]; tik(); for (uintptr_t i = 0; i < n_threads; ++i) { pthread_create(&tasks[i], NULL, (void*)&inc_counter, (void*)i); } for (uintptr_t i = 0; i < n_threads; ++i) { pthread_join(tasks[i], NULL); } tik(); tik(); for (uintptr_t i = 0; i < n_threads; ++i) { pthread_create(&tasks[i], NULL, (void*)&inc_counter, (void*)(16+i*16)); } for (uintptr_t i = 0; i < n_threads; ++i) { pthread_join(tasks[i], NULL); } tik(); }
voidex7() { puts("== ex7 =="); A = B = C = D = E = F = G = 1;
tik(); for (uint i = 0; i < i_max; ++i) { A *= 3; B *= 3; C *= 3; D *= 3; } tik(); tik(); for (uint i = 0; i < i_max; ++i) { A *= 3; C *= 3; E *= 3; G *= 3; } tik(); tik(); for (uint i = 0; i < i_max; ++i) { A *= 3; C *= 3; } tik(); }