
Random Access Memory Latency

by Bob Day (bobday.nh@verizon.net)

Copyright (C) July, 2004 by Bob Day. All rights reserved.

Introduction

Memory latency.  Never in all my surfing of the Internet have I encountered so much confusion and misinformation on a subject.  Even the order of the timings quoted for SDRAM memory is a source of confusion.  Take "3-2-2".  These are the numbers of memory clock cycles that must be allowed for various events inside the memory to complete.  Many sources give the order of these timings as CAS Latency (tCAC, or tCL), Row-to-Column Delay (tRCD, also called RAS-to-CAS delay), and then Row Precharge time (tRP).  Many other sources reverse the latter two items.  Confusion.  "tRAC" is variously quoted as meaning either "row access time" or "random access time".  Some define memory latency simply as the CAS latency (tCAC), others define it as tRAC, still others add tRCD to tCAC and call that the latency, and that may not be all.  Few seem to include tRP, which must be included in the latency when the memory has to close the currently open row and switch to a different one.

The major problem with all of these definitions is that they are concerned only with the internal timings of the memory module itself and leave out the interactions between the memory and the CPU.  These interactions can include the L1 and L2 cache miss latencies and the translation lookaside buffer (TLB) latency.  Consequently, all of them yield latencies (around 90 ns or less) that fall far short of the actual random access round-trip latency, measured from a request for data by the CPU until the CPU can use the data.
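
To make the cycle counts concrete, here is a minimal sketch that converts a timing triple into nanoseconds.  It assumes that "3-2-2" is read as CAS latency, tRCD, tRP (the first of the two orderings mentioned above) and that the memory clock is 133 MHz; the function name and the three case labels are illustrative only.  It covers just the module's internal timings up to the first word of data, so burst transfer, memory controller, bus, cache, and TLB costs are all left out.

```python
def first_word_latency_ns(cl, trcd, trp, clock_mhz):
    """Convert SDRAM timing cycle counts into nanoseconds for three cases.

    cl, trcd, trp are cycle counts; clock_mhz is the memory clock (assumed).
    """
    cycle_ns = 1000.0 / clock_mhz  # one clock period in nanoseconds
    cycles = {
        "row already open": cl,                   # column access only
        "bank idle": trcd + cl,                   # activate row, then column access
        "different row open": trp + trcd + cl,    # precharge, activate, column access
    }
    return {case: n * cycle_ns for case, n in cycles.items()}


if __name__ == "__main__":
    # Assumed reading of "3-2-2": CL=3, tRCD=2, tRP=2, on a 133 MHz memory clock.
    for case, ns in first_word_latency_ns(3, 2, 2, 133.0).items():
        print(f"{case}: {ns:.1f} ns")
```

Even the worst of these three cases is only 7 cycles, which is why definitions built solely from the module timings come out so much smaller than a measured round trip.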

Here's the latency I wanted to know: The interval from the issuing of an instruction by the CPU to read a 4-byte integer from a random location in memory until the time the CPU can use that data (say, in an addition operation).  A simple definition, but a few things about it should be noted.  First, it's intended to be a somewhat real-world model of the pattern of memory access that would be encountered in accessing a hashed database, such as the tree of possible chess positions that debouch from the evaluation of the current position.  Second, I don't exclude the possibility that the data may already be in cache, or in the same row in main memory -- these possibilities are part of the real world.  Third, the latency as defined above is somewhat system-dependent.  It depends on the amount of main memory under consideration, the sizes of the L1, L2, and TLB caches, and other factors.  So in the results, I'll be careful to state the details of the systems on which the memory is installed.

The best source I was able to find on the Internet that examines something close to the kind of latency I wanted to study is given by this link:

http://www.ece.mtu.edu/faculty/btdavis/papers/tc_si.pdf


The Program

I wrote a program called MemLatency to compute the memory latency I defined above.  I'll describe how the program works in this section, and give some results in the next section.

First, the user enters a block size, a memory buffer size, and a number of iterations.  The block size is the size of a unit (record or structure) the user wishes to find the random access time for, and can be anywhere from 4 bytes to the size of the memory buffer.  The memory buffer size is the number of megabytes of memory the user wishes to assign to the program. In my runs of the program, I've usually entered 128 megabytes.  The number of iterations is the number of memory reads of blocks the program is to perform.  I usually run the program for at least 200,000 iterations.

The program divides the memory buffer into blocks of the size entered by the user, and then shuffles their order, so that in each iteration a random block in the memory buffer is read.  Not only does this randomization implement our model, it also prevents the CPU from gaining efficiency by speculative execution of instructions or by being able to predict a pattern of memory access.  Consequently, the memory latency the program measures is less dependent on the CPU, and depends more closely on the characteristics of the memory system.
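
The shuffle-then-read scheme described above can be sketched roughly as follows.  This is not MemLatency's source code; the function names, the seed parameter, and the checksum are assumptions made for illustration.  In Python the interpreter's own overhead swamps the memory latency, so the sketch shows the structure of the measurement rather than producing comparable numbers.

```python
import random
import time


def shuffled_block_offsets(buffer_bytes, block_bytes, seed=None):
    """Return the start offset of every whole block, in shuffled order."""
    if block_bytes < 4 or block_bytes > buffer_bytes:
        raise ValueError("block size must be between 4 bytes and the buffer size")
    offsets = list(range(0, buffer_bytes - block_bytes + 1, block_bytes))
    random.Random(seed).shuffle(offsets)
    return offsets


def time_random_reads(buf, offsets, block_bytes, iterations):
    """Read `iterations` blocks in shuffled order.

    Returns (average nanoseconds per block, checksum of the bytes read).
    The checksum exists only so that every byte of each block is touched.
    """
    view = memoryview(buf)
    checksum = 0
    count = len(offsets)
    start = time.perf_counter_ns()
    for i in range(iterations):
        off = offsets[i % count]  # wrap around if iterations > number of blocks
        checksum += sum(view[off:off + block_bytes])
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations, checksum


if __name__ == "__main__":
    # Small sizes keep the demonstration quick; they are not the sizes used above.
    buffer_bytes, block_bytes = 1 << 20, 1024
    buf = bytearray(buffer_bytes)
    offsets = shuffled_block_offsets(buffer_bytes, block_bytes, seed=1)
    ns_per_block, _ = time_random_reads(buf, offsets, block_bytes, 2000)
    print(f"average time per {block_bytes}-byte block: {ns_per_block:.1f} ns")
```

Shuffling the offsets once, before the timed loop, keeps the cost of generating random numbers out of the measurement.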

While running, the program raises its priority to the highest level to prevent other programs from running and contaminating the results it computes.  Raising the priority does not lock out hardware interrupts, but I've found these to have a negligible effect on the results.  A consequence of raising the priority is that the computer on which the program is run will be "frozen" while the program is running.  For example, the mouse pointer won't move, a clock display won't advance, and keyboard input will be ignored.
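
As a rough analogue of the priority step, the sketch below asks a POSIX system for the highest "nice" priority.  This is an assumption about one way to do it, not the mechanism MemLatency itself uses, and the function name is hypothetical.  A nice value of -20 normally requires elevated privileges and is weaker than a real-time priority class, so it will not freeze the machine in the way described above.

```python
import os


def raise_priority():
    """Try to raise this process's scheduling priority; return True on success."""
    if not hasattr(os, "setpriority"):
        return False  # platform without os.setpriority (for example, Windows)
    try:
        # who=0 means the calling process; -20 is the highest nice priority.
        os.setpriority(os.PRIO_PROCESS, 0, -20)
    except OSError:
        return False  # typically a permission error when not run as root
    return True


if __name__ == "__main__":
    print("priority raised" if raise_priority() else "priority unchanged")
```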


Results

MemLatency computes the average time to read an entire random block of the specified size.
Here are the results for my two computers:

First computer: 1 GHz Celeron PIII CPU, 100 MHz FSB, 256 KB cache, 133 MHz memory bus, 512 MB PC133 CL2 ECC SDRAM.

MemLatency arguments:
       Memory buffer size: 128 megabytes.
       Number of iterations: 200,000

Results:
       Block size: 4 bytes.
       Random (4 byte) read latency: 189.2 nanoseconds.

       Block size: 1024 bytes.
       Random (1024 byte) block read latency: 2659.4 nanoseconds.


Second computer: 2.4 GHz Pentium 4, 533 MHz FSB, 512 KB cache, 1066 MHz effective memory bus speed, 256 MB PC1066-34 ECC RDRAM.

MemLatency arguments:
       Memory buffer size: 128 megabytes.
       Number of iterations: 200,000

Results:
       Block size: 4 bytes.
       Random (4 byte) read latency: 146.9 nanoseconds.

       Block size: 1024 bytes.
       Random (1024 byte) block read latency: 759.4 nanoseconds.
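
To put these figures in perspective, the following sketch restates them in two derived forms: the 4-byte latency expressed in CPU clock cycles, and the effective throughput implied by the 1024-byte block reads.  The measured values are copied from the results above; the dictionary layout and function names are illustrative only, and MB/s is taken as 10^6 bytes per second.

```python
# Measured values copied from the results above.
RESULTS = {
    "1 GHz Celeron, PC133 SDRAM": {"cpu_ghz": 1.0, "ns_4": 189.2, "ns_1024": 2659.4},
    "2.4 GHz Pentium 4, PC1066 RDRAM": {"cpu_ghz": 2.4, "ns_4": 146.9, "ns_1024": 759.4},
}


def cpu_cycles(latency_ns, cpu_ghz):
    """CPU clock cycles that elapse during latency_ns (GHz = cycles per ns)."""
    return latency_ns * cpu_ghz


def block_throughput_mb_per_s(block_bytes, latency_ns):
    """Effective throughput of one block read, in MB/s (10^6 bytes per second)."""
    return block_bytes / latency_ns * 1000.0  # bytes per ns is GB/s


if __name__ == "__main__":
    for name, r in RESULTS.items():
        cycles = cpu_cycles(r["ns_4"], r["cpu_ghz"])
        mb_s = block_throughput_mb_per_s(1024, r["ns_1024"])
        print(f"{name}: {cycles:.1f} CPU cycles per 4-byte read, "
              f"{mb_s:.0f} MB/s for 1024-byte blocks")
```

Seen this way, the faster machine waits for fewer nanoseconds per random 4-byte read but for more CPU clock cycles.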


To Obtain a Copy

To obtain a copy of MemLatency, download it from this website.
