RandomAccess
GUPS (Giga UPdates Per Second)
Random memory performance
often maps directly to application performance and application development
time. A small percentage of random memory accesses (cache misses) in an
application can significantly affect the overall performance of that
application. Current processor trends favor longer cache lines and stride-1
performance at the expense of random stride access performance. As random
memory access becomes increasingly more expensive relative to processor
operations, the need arises for a benchmark that discriminates on the basis of
random memory access performance.
The effects of even a small percentage of random
memory accesses are illustrated in figures 1 and 2, where outputs from the MAPS
(Memory Access Patterns) architecture profiling routine show measured memory
bandwidth for stride-1 load, random memory load, and various percentages of
random memory access for a Compaq Alphaserver (Lemieux at the Pittsburgh Supercomputing Center) and an IBM
SP-3 (Blue Horizon at the San Diego Supercomputing Center (SDSC))
respectively. These charts are from A.
Snavely, et. al., at the SDSC Performance Modeling and Characterization (PMaC) Laboratory.

Figure 1 MAPS on a Compaq Alphaserver

Figure 2 MAPS on an IBM SP-3
GUPS (Giga UPdates per Second) is a measurement that profiles the
memory architecture of a system and is a measure of performance similar to MFLOPS. The HPCS HPCchallenge RandomAccess benchmark is
intended to exercise the GUPS capability of a system, much like the LINPACK
benchmark is intended to exercise the MFLOPS capability of a computer. In each case, we would expect these
benchmarks to achieve close to the "peak" capability of the memory
system. The extent of the similarities between RandomAccess and LINPACK are
limited to both benchmarks attempting to calculate a peak system
capability. We are interested in knowing
the GUPS performance of both entire systems and system subcomponents --- e.g.,
the GUPS rating of a distributed memory multiprocessor the GUPS rating of an
SMP node, and the GUPS rating of a single processor. While there is typically a scaling of FLOPS
with processor count, a similar phenomenon may not always occur for GUPS.
GUPS is calculated by
identifying the number of memory locations that can be randomly updated in one
second, divided by 1 billion (1e9). The term “randomly” means that there is
little relationship between one address to be updated and the next, except that
they occur in the space of ½ the total system memory. An update is a read-modify-write operation on
a table of 64-bit words. An address is
generated, the value at that address read from memory, modified by an integer
operation (add, and, or, xor) with a literal value, and that new value is
written back to memory.
Select the memory size to be
the power of two such that 2n ≤½ of the
total memory. Each CPU operates on its
own address stream, and the single table may be distributed among nodes. The
distribution of memory to nodes is left to the implementer. A uniform data distribution may help balance
the workload, while non-uniform data distributions may simplify the
calculations that identify processor location by eliminating the requirement
for integer divides. A small (less than 1%) percentage of missed updates are
permitted.
When implementing a benchmark
that measures GUPS on a distributed memory multiprocessor system, it may be
required to define constraints as to how far in the random address stream each
node is permitted to "look ahead".
Likewise, it may be required to define a constraint as to the number of
update messages that can be stored before processing to permit multi-level
parallelism for those systems that support such a paradigm. The limits on “look ahead” and “stored
updates” are being implemented to assure that the benchmark meets the intent to
profile memory architecture and not induce significant artificial data
locality. For the purpose of measuring GUPS, we will stipulate that each thread
is permitted to look ahead no more than 1024 random address stream samples with
the same number of update messages stored before processing.
Basic RandomAccess Benchmark
Definition
Let T[ ] be a table of size 2n
Let {Ai} be a stream of 64-bit integers of length
2n+2 generated by the primitive polynomial over GF(2)[1], X63+ X3+ X+1.
For each ai, set T[ai <63,
64-n>] = T[ai <63, 64-n>] + ai
Where
+ denotes
addition in GF(2) i.e. exclusive “or”
ai<j, i> denotes the sequence of bits within ai e.g.
<63, 64-n> are the highest n bits
The parameter n
is defined such that:
n is the largest power of 2
that is less than or equal to half of main memory
Constraints on the look ahead and storage before processing on
distributed memory multi-processor systems is limited to 1024 per “node”. Operations performed in parallel (either due
to look ahead or node parallelization) are allowed to be “unlocked” and
therefore may result in a small percentage of updates being left unperformed
due to race conditions. (If two parallel
operations read the same location, update the value, and then each store the
resulting values — one update is lost.)
This percentage may not exceed 1%.
The
basic RandomAccess definition is described graphically in figure 3.

Figure
3 RandomAccess Description
HPCchallenge RandomAccess
Benchmarks
RandomAccess
will be run in three variants:
We are supplying two versions of RandomAccess in the HPCchallenge
Benchmark code distribution:
1. A local version
that can be run either on a node or single processor. Figures 4 and 5 describe two potential
implementations:
§
Sequential RandomAccess (figure 4)
§
Vector or multi-threaded RandomAccess (figure 5)
2. A global version
that runs using MPI-1 that is mandatory to run.
Figures 6 and 7 describe the implementations for power of two and non-power of
two numbers of processors.
§
MPIRandomAccess
Power of Two Case (figure 6)
§
MPIRandomAccess
Non-Power of Two Case (figure 7)
We are also supplying an additional
version written in Unified Parallel C (UPC)

Figure 4 Sequential RandomAccess Implementation

Figure 5 Vector or Multi-Threaded RandomAccess Implementation

Figure 6 MPIRandomAccess
Power of Two Case Implementation

Figure 7 MPIRandomAccess
Non-Power of Two Case Implementation
The Supplied MPI-1
RandomAccess Code
The
supplied MPI-1 code generates the input stream {Ai} on all processors and the global table has been
distributed as uniformly as possible to balance the workload and minimize any
Amdahl fraction. This code does not exploit “look-ahead”. Addresses are sent to the appropriate
processor where the table entry resides as soon as each address is
calculated. Updates are performed as
addresses are received. Each message is
limited to a single 64 bit long integer containing element ai from {Ai}.
Local offsets for T[ ] are extracted by the destination processor
If
the number of processors is equal to a power of two, then the global table can be
distributed equally over the processors.
In addition, the processor number can be determined from that portion of
the input stream that identifies the address into the global table by masking
off log2(p)
bits in the address ― ai
<63, (64-log2(p)), > .
If the number of processors is not equal to a power of
two, then the global table cannot be equally distributed between
processors. In the MPI-1 implementation
provided, there has been an attempt to minimize the differences in workloads
and the largest difference in elements of T[ ] is one. The number of values in the input stream
generated by each processor will be related to the number of global table
entries on each processor.
The MPI-1 version of RandomAccess treats the potential
instance where the number of processors is a power of two as a special case,
because of the significant simplifications possible because processor location
and local offset can be determined by applying masks to the input stream
values. The non power of two case uses
an integer division to determine the processor location. The integer division will be more costly in
terms of machine cycles to perform than the bit masking operations.
The Supplied UPC
RandomAccess Code
HPCchallenge benchmark users are encouraged to run the
UPC code — in addition to the portable HPCchallenge
MPI-1 RandomAccess distribution code — if systems are capable of supporting
this language in an efficient manner.
These codes may be the basis for architecture-dependent optimized
RandomAccess runs. The UPC code provides a data view that is similar to the
sequential program. In the UPC code,
table T[ ] is shared and distributed among threads in a round-robin fashion
using a block size of 1, which distributes data evenly. No power-of-two distinctions are
required.
Optimizing
RandomAccess
While it will be mandatory to run the HPCchallenge
Benchmark code distribution ― including the MPI-1 RandomAccess version ― we encourage users to modify any of the kernels to
optimize them for your architecture, software, middleware, and interprocessor
features. The supplied UPC codes may be the basis for architecture-dependent optimized RandomAccess
runs. Potential modifications
to RandomAccess include, but are not limited to:
There are many other modifications possible. It is important that modified codes remain
true to the intent of the basic requirements of the RandomAccess benchmark
description with (1) size of the table T[ ] being a power of two and
approximately ½ the global memory and with (2) look-ahead and storage
constraints being followed. Specifically,
attempts to improve memory access characteristics by reordering the
address stream — so that all updates are “local” — violates the rules of the
RandomAccess benchmark.
We would appreciate if users would share their optimized
codes so that we can evaluate the innovation expressed in the implementation
and we can determine whether the version of the code maintains the spirit of
the RandomAccess benchmark. It is not
mandatory to provide a version of the code, although we will require at least a
description of the modifications to post the benchmark performance times. We would also encourage users to estimate the
staff hours required to make the changes to the optimized version of RandomAccess. This helps gauge the "how hard was it to
get better performance". Even order
of magnitude measures would be useful to help us examine issues relating to
tradeoffs in productivity and performance.
We hope that active involvement by the HPC community will develop highly
optimized versions of RandomAccess similar to the highly optimized version of
High Performance LINPACK (HPL).
Contacts
RandomAccess is one of the DARPA HPCS Discrete Math
Benchmarks. Contact David Koester
<dkoester@mitre.org> or Bob Lucas <rflucas@isi.edu> if you have
questions.
[1] GF(2) (Galois Field of order 2) The elements of GF(2)
can be represented using the integers 0 and 1, i.e., binary operands