Cache: Why Keeping Data Close to Where It's Needed Is a Universal Principle

In the previous article of this series, we saw how a system can collapse when the database receives more requests than it can process, and how cache was the intervention that restored operational stability. Cache was presented as a layer that stores the results of costly operations to avoid repeating them. But that description, while correct, omits something important. Why does cache work? Why does storing data somewhere else speed up the system? And why does this principle appear at virtually every level of modern computing, from the transistor inside the processor to edge servers distributed across continents?

The answer begins with a problem hardware engineers faced more than sixty years ago, one that has never stopped being relevant.

The problem that was never solved, only managed

Since the beginning of modern computing, speed and storage capacity have moved in opposite directions. Fast memory is expensive and small. Cheap memory is large and slow. This antagonism has physical and economic roots that have not changed since the 1960s.

In practice, this means a modern processor can execute billions of operations per second, but access to main memory can be hundreds of times slower than the instruction execution speed. When the processor needs a piece of data that is not immediately available, it stops and waits. That wait time is called memory latency, and it is the bottleneck that defines the real-world performance of most programs in existence.

The orders of magnitude that govern this problem were catalogued by Jeff Dean and later updated by Peter Norvig on his personal website, and are a reference in distributed systems literature. In the previous article of this series we already cited some of those numbers in the context of databases. What matters here is the proportion within the hardware itself. A main memory access takes around 100 nanoseconds, while an access to the processor's L1 cache takes around 0.5 nanoseconds, two hundred times faster. These numbers vary across hardware generations, but the proportions have remained stable for decades.

graph LR
    A["Registers
~0.3 ns"] --> B["L1 Cache
~1 ns"]
    B --> C["L2 Cache
~4 ns"]
    C --> D["L3 Cache
~10 ns"]
    D --> E["RAM
~100 ns"]
    E --> F["NVMe SSD
~100 µs"]
    F --> G["Magnetic disk
~10 ms"]
    G --> H["Network / CDN
~50-150 ms"]

    style A fill:#1a472a,color:#fff
    style B fill:#1e5631,color:#fff
    style C fill:#2d7a3e,color:#fff
    style D fill:#3a9e52,color:#fff
    style E fill:#f0a500,color:#000
    style F fill:#e05c00,color:#fff
    style G fill:#c0392b,color:#fff
    style H fill:#922b21,color:#fff

The answer engineering found was to create intermediate layers. Instead of choosing between fast memory or large memory, modern systems use both at the same time, in a hierarchy. The most-used data stays at the fastest levels. The least-used data stays at the slower levels. And movement between levels happens automatically, guided by a principle observed empirically in computer programs.

The principle that made everything possible

In 1966, Les Belady of IBM published in the IBM Systems Journal a study on page replacement algorithms in systems with virtual memory. The central problem Belady investigated was deciding which page to evict when the cache is full and a new page needs to enter. To solve it, he proposed what became known as the Belady algorithm, which always evicts the page that will be referenced furthest in the future. This algorithm is theoretically optimal, but it presupposes knowledge of the future, which makes it impractical. Its value lies in establishing the performance ceiling that any real algorithm can aspire to.

What Belady's work left implicit, Peter Denning formalized two years later. Denning realized that the problem of replacement algorithms only existed because programs did not access memory randomly. If access were random, any piece of data in the cache would have the same probability of being requested again, and the question of which to evict would be irrelevant. The fact that replacement algorithms make a real difference in practice is evidence that memory access follows a structured pattern, which became known as locality of reference.

In 1968, Denning published in the Communications of the ACM his working set model paper, which received the ACM Best Paper Award for systems that year. There, he showed that every running program, over any reasonable time interval, actively uses only a small fraction of all memory pages it owns. He called that active set the working set.

"Locality was adopted almost immediately as an idea by operating system, database, and hardware architects."

— Peter Denning, The Locality Principle, Communications of the ACM, 2005

Locality of reference has two dimensions. The first is temporal locality. A piece of data that was recently accessed has a high probability of being accessed again soon. When a program reads the value of a variable, there is a good chance it will need that value again a few cycles later. The second is spatial locality. When a piece of data at a given memory address is accessed, data at nearby addresses has a high probability of being accessed next. This happens because programs process arrays, lists, and data structures stored in contiguous blocks of memory.

These two properties are what makes cache effective. If data followed a random access distribution, storing some of it in a faster location would not help much, because the chance of a hit would be low. But because access is concentrated and pattern-predictable, even a relatively small cache can absorb a large fraction of accesses.

The working set concept is not specific to processor memory. In a web system with Redis, the working set is the set of data that active users are accessing at that moment. This includes the most visited products, the profiles of users with open sessions, and the configurations loaded by each application instance. That set tends to be a small fraction of the total database, but it concentrates the vast majority of read traffic. That is why a Redis instance with a few gigabytes can absorb most queries from a database with terabytes of data.

How the processor implements cache in hardware

Maurice Wilkes, a British scientist at the University of Cambridge, was the first to formally propose the idea of an intermediate memory between the processor and main memory. In his 1965 paper, published in the IEEE Transactions on Electronic Computers, he described the concept in terms that still serve as a reference.

"It is proposed to use a fast core store of, say, 32,000 words as a slave to a slower core store of, say, one million words in such a way that in practice the effective access time is nearer that of the fast store than that of the slow store."

— Maurice Wilkes, Slave Memories and Dynamic Storage Allocation, IEEE Transactions on Electronic Computers, 1965

The first commercial implementation appeared in the IBM System/360 Model 85, in 1969, with 16 KB of cache. It was a minuscule amount of fast memory, but enough to demonstrate that the principle worked in practice.

Today, a general-purpose modern processor has three or four levels of cache integrated into the chip. The L1 cache is closest to the processing cores, smaller and faster, typically between 32 KB and 64 KB per core. The L2 cache is larger and slightly slower, in the range of 256 KB to a few MB per core. The L3 cache is shared among cores, with tens of MB, and serves as the last line before needing to go to main memory.

When the processor needs a piece of data, it traverses this hierarchy in increasing order of latency. If the data is in L1, access takes around 4 clock cycles. If not, it goes to L2, which takes around 12 cycles. If not there, it goes to L3, with around 30 to 40 cycles. If not in L3, it goes to main memory, with a latency of 200 to 300 cycles. The difference between finding data in L1 and needing to go to main memory is 50 to 75 times in terms of latency.

sequenceDiagram
    participant CPU as Core
    participant L1 as L1 Cache
(~4 cycles)
    participant L2 as L2 Cache
(~12 cycles)
    participant L3 as L3 Cache
(~35 cycles)
    participant RAM as RAM
(~250 cycles)

    CPU->>L1: Request data
    alt Hit in L1
        L1-->>CPU: Return data
    else Miss in L1
        L1->>L2: Fetch data
        alt Hit in L2
            L2-->>L1: Return data
            L1-->>CPU: Return data
        else Miss in L2
            L2->>L3: Fetch data
            alt Hit in L3
                L3-->>L2: Return data
                L2-->>L1: Return data
                L1-->>CPU: Return data
            else Miss in L3
                L3->>RAM: Fetch data
                RAM-->>L3: Return data
                L3-->>L2: Return data
                L2-->>L1: Return data
                L1-->>CPU: Return data
            end
        end
    end

When a miss occurs in L1, it is not the L1 cache that autonomously goes to fetch data from L2. What detects the miss and fires the request is the cache controller, a hardware circuit embedded in the processor. The core stays blocked waiting. The data climbs the hierarchy in reverse. If it was in RAM, it passes through L3, then L2, and finally reaches L1, filling each level it passes through. This means that, after a complete miss all the way to RAM, all intermediate levels hold a copy of the data, which speeds up future accesses to the same content by other cores or by the same code sequence.

When L1 is full and needs to accommodate a newly arrived cache line, it must evict another. That decision is made by the replacement policy. This is where the Belady algorithm enters, not as a real mechanism, but as a theoretical reference. Belady proved that the optimal algorithm is to evict the line that will be referenced furthest in the future, but this requires knowledge of the future, which makes it impractical. What real processors use is LRU (Least Recently Used), which evicts the data that has not been accessed for the longest time. LRU is an approximation of the Belady optimum using only past information. If a piece of data has not been requested recently, the probability of it being requested soon is low.

In practice, L1 and L2 use pseudo-LRU variants, because implementing true LRU in hardware at the speed of those levels would be too costly in silicon area. The entire process is transparent to the programmer. The hardware makes all the fetch and eviction decisions automatically, without any line of code needing to worry about which level of the hierarchy is being accessed.

What makes a workload cacheable

The effectiveness of cache depends on a metric called hit ratio, which is the proportion of accesses that find the data in the cache without needing to fetch it from the source. A hit ratio of 90% means that only 10% of accesses need to go to the slower level. A hit ratio of 50% means half of all accesses still pay the full cost.

The hit ratio of a system is determined primarily by the distribution of its accesses. Systems whose traffic follows a power law distribution, where few items concentrate the majority of accesses, are naturally favorable to cache. A small set of data covers a large fraction of total traffic, so a small cache can achieve a high hit ratio.

Systems with a uniform distribution, where all data has the same probability of being accessed, are unfavorable to cache. To maintain a high hit ratio in this type of workload, the cache would need capacity close to the total size of the data, which eliminates the economic advantage of the hierarchy.

In practice, the majority of web application and database workloads are closer to the power law distribution than to the uniform distribution. Popular products, recent articles, active user profiles, and system configurations all tend to be accessed with disproportionate frequency relative to the total set. That is what makes cache effective in most scenarios where it is applied.

There is also a temporal dimension beyond frequency distribution. The hit ratio depends on how long data remains valid in the cache relative to the rate of updates at the source. Data that rarely changes, like system configurations or product category lists, can stay in cache for long periods without risk of going stale. Data that changes frequently, like account balances or real-time inventory, requires short expiration times (TTL) or active invalidation strategies, which reduces the effective hit ratio.

The hierarchy beyond hardware

The same principle that justifies the L1 cache inside the processor justifies every cache layer along the path between a server and a user. The logic is identical in all cases. Data that needs to be accessed frequently must be kept as close as possible to whoever uses it, to minimize latency and access cost.

At the operating system level, there is the page cache, which keeps in RAM the pages of files recently read from disk. When a program reads a file, the operating system stores the content in memory. On the next read of the same file, the data is already available without a new disk read. Linux, for example, uses all available RAM not being used by processes for the page cache, and releases it automatically when a process needs more memory.

At the application level, query caching against the database works the same way. The result of a query is stored in memory, and subsequent requests for the same result are served directly, without executing the query again. That is exactly what we described in the previous article with Redis.

At the network level, CDNs — Content Delivery Networks — apply the same principle at global scale. Akamai, founded in 1998 by Tom Leighton and Daniel Lewin from MIT research, was the first company to commercialize this concept at scale. The central insight of Leighton and Lewin was that content stored close to the end user can be accessed without traveling a long distance across the internet. Instead of serving all users worldwide from a single origin server, content is replicated across hundreds or thousands of geographically distributed servers. When a user in Brazil accesses a video hosted by an American company, the video is served from a server in Brazil or South America, not from the origin in the United States.

The latency difference between fetching data from a server 20 km away versus fetching from the other side of the planet is measured in milliseconds, but that value is perceptible to the user and significant for systems that make many chained requests. A modern web page can make dozens of requests to different resources before it is fully loaded. If each of those requests adds 100 ms of intercontinental latency, the total load time grows quickly.

The relationship between latency, throughput, and contention

These three concepts describe distinct aspects of the behavior of a system under load, and understanding them separately helps diagnose problems with greater precision.

Latency is the time it takes for an individual operation to complete, measured from the perspective of the requester. Throughput is the number of operations a system can complete per unit of time. Contention, the phenomenon that ran through the previous article in this series, is the competition among multiple processes for the same limited resource.

The relationship between the three is captured by Little's Law, formulated by John Little in 1961 and published in Operations Research. The law states that the average number of items in a system equals the average throughput multiplied by the average latency. In other words, if latency increases and throughput remains constant, the number of requests accumulated inside the system grows proportionally.

This explains the progressive degradation pattern described in the previous article. When the latency of database queries increased due to higher volume, connections stayed open longer. More connections open at the same time means more contention for the connection pool. More contention increases latency further. The cycle feeds itself.

The cache breaks this cycle by reducing the latency of a fraction of requests. When 80% of requests are served in under 1 millisecond by Redis, instead of 100 to 200 milliseconds by the database, the average number of requests being processed simultaneously drops drastically. Contention decreases. The connection pool frees up. The system recovers the capacity to absorb the incoming volume.

"The long-run average number of items in a system equals the long-run average rate of arrival multiplied by the average time an item spends in the system."

— John D. C. Little, A Proof for the Queuing Formula: L = λW, Operations Research, 1961

Why cache exists at every layer

The cache hierarchy described here, from L1 in the processor to the edge server in the CDN, is not a coincidence of design. It is the application of the same principle in contexts with different orders of magnitude of latency and different units of data.

In the processor, the unit is a cache line of 64 bytes, and the latency being avoided is on the order of hundreds of nanoseconds. In the web application, the unit is the result of a query or a user session, and the latency being avoided is on the order of tens of milliseconds. In the CDN, the unit is an image file, a JavaScript script, or a video segment, and the latency being avoided is intercontinental, on the order of hundreds of milliseconds.

In all cases, the structure is the same. There is an origin that has all the data but is slow or distant. There is a cache that has a subset of the data but is fast and close. The system tries to serve requests from cache when possible, and goes to the origin only when necessary. And the effectiveness of this arrangement depends on how much the data access pattern exhibits temporal or spatial locality, or both.

What changes between levels is the proportion between the cache size and the total size of the data. In the processor, L1 has tens of kilobytes to cover an address space of gigabytes, a ratio of one to millions. In a web application, a Redis instance with tens of gigabytes covers a database with terabytes, a ratio of one to hundreds. In a CDN, edge servers store terabytes to cover an origin volume that can be much larger. At every level, cache needs only a small fraction of the origin to be effective. This is only possible because the access pattern is not uniform. The working set, at any scale, is always smaller than the total set.

Denning captured this universality in a later article, revisiting decades of application of the principle he had formalized.

"Locality is not specific to programs. It appears in file systems, databases, networks, and human behavior. It is a principle of nature."

— Peter Denning, The Locality Principle, Communications of the ACM, 2005

What this principle means for system design

Understanding the memory hierarchy and the locality principle changes the way one thinks about system design. A system that ignores these principles will end up paying costs that could have been avoided. A system that consciously exploits them can achieve far superior performance with the same hardware resources.

A concrete example is a system that serializes a user's complete object into a single Redis key, including history, preferences, and profile data. That design ignores spatial locality. If the application frequently needs only the name and avatar to render a header, it loads and deserializes the entire blob on every request. The cache is in use, but the data design works against the principle the cache exploits. The cost shows up as deserialization latency and memory consumption, not at the database, which makes the diagnosis harder than in a scenario with no cache at all.

In practice, this translates into a few questions worth asking when designing any system that handles data. What is the expected access pattern? Is the data distribution concentrated or uniform? How frequently are the same pieces of data requested in sequence? When one piece of data is requested, what other data tends to be requested alongside it?

These questions have no universal answers. They depend on the problem domain, the user volume, and the specific behavior of the application. Without these questions, any cache implementation is a bet, not an architectural decision.

In the next article in this series, the focus shifts from theory to the concrete architectural patterns that determine how cache integrates with the rest of the system. Cache-aside, read-through, write-through, and write-behind are patterns with distinct trade-offs, more or less suitable depending on the consistency and availability requirements of the system. Understanding why cache works, as we have seen here, is what makes it possible to choose the right pattern for each context.

Search articles