Deviant/Abstraction

Fast CPU vs Cache

Published on

I’m trying to understand how a contemporary CPU works and sharing my work here.

Let’s start with something simple: compare computing a value vs. fetching it from RAM. Then, let’s try to understand if this result is “permanent” or context-dependent.

This is my CPU, and I’m running popOS:

Accessing a value

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

// Function to get current time in nanoseconds using clock_gettime
long long get_time_ns() {
struct timespec tp;
clock_gettime(CLOCK_MONOTONIC, &tp);
return (long long)tp.tv_sec * 1000000000LL + tp.tv_nsec;
}

int main() {
// Now declare an integer and a pointer to it
int myInt = 123;
int* myIntPtr = &myInt;

// Start timing
long long start_time = get_time_ns();

// Access the integer here, to measure the time it might take (potentially including a cache miss penalty)
printf("Integer value: %d\n", *myIntPtr);

// Stop timing
long long end_time = get_time_ns();

// Calculate and print the elapsed time in nanoseconds
printf("Time taken to access the integer: %lld ns\n", end_time - start_time);


return 0;
}

Running it:

Integer value: 123
Time taken to access the integer: 39355 ns

Computing a value

We only add an addition (122 + 1); the rest of the code is roughly the same.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>


long get_time_ns() {
struct timespec tp;
clock_gettime(CLOCK_MONOTONIC, &tp);
return (long long)tp.tv_sec * 1000000000LL + tp.tv_nsec;
}


int main() {

// Now declare an integer and a pointer to it
int myInt = 0;

int* myIntPtr = &myInt;

// Start timing
long long start_time = get_time_ns();
myInt = 1 + 122;

// Access the integer here, to measure the time it might take (potentially including a cache miss penalty)
printf("Integer value: %d\n", *myIntPtr);

// Stop timing
long long end_time = get_time_ns();

// Calculate and print the elapsed time in nanoseconds
printf("Time taken to access the integer: %lld ns\n", end_time - start_time);

return 0;
}

I compiled and checked the assembly code to ensure the addition stayed in the code.

Integer value: 123
Time taken to access the integer: 39695 ns

NB. I executed these results over 100 times to confirm the results.

Conclusion: it’s slightly longer to compute a value than accessing it.

VOIDING THE PIPELINE

CPU has a pipeline. The pipeline predicts which instructions will be needed and prefetch it and its data. Is the performance assumption still valid in this case?

We can’t easily reset the instruction cache, but we can reset the data cache fairly easily by writing random data in the CPU. For instance:

    // Allocate a large array to fill the CPU cache
char* cacheFiller = (char*)malloc(CACHE_SIZE);
if (!cacheFiller) {
perror("Failed to allocate memory");
return 1;
}

// Access the array to ensure it's loaded into the cache, potentially evicting other data
for (int i = 0; i < CACHE_SIZE; i++) {
cacheFiller[i] = (char)(i & 0xFF);
}

If I reset the cache before fetching the value:

Integer value: 123
Time taken to access the integer: 28574 ns

It’s faster than the version with no cache! And if we add:

Integer value: 123
Time taken to access the integer: 26350 ns

This time, computing is faster than fetching the value.

Conclusions

We’ve unearthed a few interesting questions.

  1. Why is the same computation significantly faster once the CPU cache is reset?
  2. Why computing the value is faster than fetching it from RAM
  3. Is it specific to my machine or is it the same on all machines?
  4. What happens when the computations are not trivial? Do these results hold?

Discover more from Deviant/Abstraction

Subscribe to get the latest posts sent to your email.

Leave a Reply

Discover more from Deviant/Abstraction

Subscribe now to keep reading and get access to the full archive.

Continue reading