Cache Performance for SPEC CPU2000 Benchmarks

Version 2.0

January 2002

Jason F. Cantin
Department of Electrical and Computer Engineering
1415 Engineering Drive
University of Wisconsin-Madison
Madison, WI 53706-1691
jcantin@ece.wisc.edu
http://www.jfred.org

Mark D. Hill
Department of Computer Science
1210 West Dayton Street
University of Wisconsin-Madison
Madison, WI 53706-1685
markhill@cs.wisc.edu
http://www.cs.wisc.edu/~markhill

http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data


Abstract

The SPEC CPU2000 benchmark suite (http://www.spec.org/osg/cpu2000) is a collection of 26 compute-intensive, non-trivial programs used to evaluate the performance of a computer's CPU, memory system, and compilers. The benchmarks in this suite were chosen to represent real-world applications, and thus exhibit a wide range of runtime behaviors. On this webpage, we present functional cache miss ratios and related statistics for the SPEC CPU2000 suite. In particular, we report results for split L1 caches ranging in size from 1KB to 1MB, with 64B blocks and associativities of 1, 2, 4, 8, and full. Most of this data was collected at the University of Wisconsin-Madison with the aid of the Simplescalar toolset (http://www.simplescalar.org).


Methodology

All functional data was collected with simulators from the Alpha version of the Simplescalar toolset, version 3.0. These include sim-cache, sim-cheetah, and sim-outorder. Some of these simulators were modified for the task (for example, sim-cheetah was modified to handle programs longer than 2 billion instructions). For interval cache data, the simulators were modified to print statistics every 100 million executed instructions. A combination of Perl and Tcsh scripts was used to launch, manage, and process the results of these simulations.
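
The interval statistics described above amount to differencing cumulative counters at fixed instruction boundaries. The following is a minimal Python sketch of that post-processing step, not the actual scripts used here. The function name and the (instructions, accesses, misses) tuple layout are hypothetical and are not a Simplescalar output format.

```python
def interval_miss_ratios(snapshots):
    """Convert cumulative counter snapshots into per-interval miss ratios.

    `snapshots` is an assumed list of (instructions, accesses, misses)
    tuples, one per interval boundary (e.g. every 100 million
    instructions). Each value is cumulative since the start of the run.
    """
    ratios = []
    prev_accesses, prev_misses = 0, 0
    for _instructions, accesses, misses in snapshots:
        delta_accesses = accesses - prev_accesses
        delta_misses = misses - prev_misses
        # An interval with no accesses contributes a ratio of 0.0.
        ratios.append(delta_misses / delta_accesses if delta_accesses else 0.0)
        prev_accesses, prev_misses = accesses, misses
    return ratios
```

Differencing keeps each interval independent of the history before it, which is what makes phase behavior visible in interval data.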

All benchmarks were compiled statically with heavy optimization for the Alpha AXP instruction set. Optimizations were targeted at the Alpha 21264 processor implementation, and include prefetches, square-root instructions, byte/word memory operations, and no-ops for alignment. All benchmarks were run to completion with all reference inputs, with three exceptions. Two of the data sets for Perl (253.perlbmk) required new processes to be spawned, which is not supported by the Simplescalar tools at this time. One of the data sets for 175.vpr does not produce correct results due to an undocumented bug in the Simplescalar emulation of the Alpha ISA. The benchmarks simulated execute over 7 trillion instructions for the reference input sets. Generating the functional L1 miss-ratio tables required simulating over 400 trillion instructions. The simulation load for all functional and timing-based simulations to be reported here totals 30 CPU-years.

All simulations were carried out on a combination of x86/Linux machines and Alpha/Tru64 servers in the University of Wisconsin-Madison's Computer Science Department and ECE Department. The majority of this load was managed by the Condor system, which distributed these jobs to vacant machines throughout the CS building.

All cache configurations simulated had 64B blocks for both L1 and L2 caches. All data reported here is for the LRU replacement policy, though data for other replacement policies was collected and may be placed on this site soon. This data does not include operating system effects, and caches were not flushed, either periodically or on system calls.
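
To make the simulated configuration concrete, here is a minimal Python sketch of a functional set-associative cache with 64B blocks and LRU replacement that is never flushed. It is an illustration only, not the sim-cache or sim-cheetah implementation, and the class and method names are hypothetical.

```python
from collections import OrderedDict

BLOCK_SIZE = 64  # bytes, matching the 64B blocks used for all configurations


class LRUCache:
    """Tag-only set-associative cache model with LRU replacement."""

    def __init__(self, size_bytes, ways):
        num_blocks = size_bytes // BLOCK_SIZE
        if num_blocks < 1 or ways < 1 or num_blocks % ways:
            raise ValueError("size must hold a whole number of sets")
        self.ways = ways
        self.num_sets = num_blocks // ways
        # One ordered dict per set: oldest entry first, newest last.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, address):
        """Record one reference and return True on a hit."""
        block = address // BLOCK_SIZE
        lines = self.sets[block % self.num_sets]
        self.accesses += 1
        if block in lines:
            lines.move_to_end(block)  # mark as most recently used
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used block
        lines[block] = None
        return False

    def miss_ratio(self):
        return self.misses / self.accesses if self.accesses else 0.0
```

In this sketch, `LRUCache(4096, 1)` models a 4KB direct-mapped cache, and a fully-associative cache is obtained by setting `ways` to the number of blocks, for example `LRUCache(4096, 64)`.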


Benchmarks Simulated

We have collected data for all 26 benchmarks with reference inputs: 12 integer, 14 floating-point.

Benchmark     Language   Type            Category                                        In SPEC95?
------------  ---------  --------------  ----------------------------------------------  ----------
164.gzip      C          Integer         Compression                                     No
175.vpr       C          Integer         FPGA Circuit Placement and Routing              No
176.gcc       C          Integer         C Compiler                                      Yes
181.mcf       C          Integer         Combinatorial Optimization                      No
186.crafty    C          Integer         Game Playing: Chess                             No
197.parser    C          Integer         Word processing                                 No
252.eon       C++        Integer         Computer Visualization                          No
253.perlbmk   C          Integer         PERL Programming Language                       Yes
254.gap       C          Integer         Group Theory, Interpreter                       No
255.vortex    C          Integer         Object-oriented Database                        Yes
256.bzip2     C          Integer         Compression                                     No
300.twolf     C          Integer         Place and Route Simulator (CAE)                 No
168.wupwise   Fortran77  Floating-Point  Physics, Quantum Chromodynamics                 No
171.swim      Fortran77  Floating-Point  Shallow Water Modeling                          Yes
172.mgrid     Fortran77  Floating-Point  Multi-grid Solver: 3D Potential Field           Yes
173.applu     Fortran77  Floating-Point  Parabolic/Elliptic Partial Diff. Eqns           Yes
177.mesa      C          Floating-Point  3-D Graphics Library                            No
178.galgel    Fortran90  Floating-Point  Computational Fluid Dynamics                    No
179.art       C          Floating-Point  Image Recognition / Neural Nets                 No
183.equake    C          Floating-Point  Seismic Wave Propagation                        No
187.facerec   Fortran90  Floating-Point  Image Processing: Face Recognition              No
188.ammp      C          Floating-Point  Computational Chemistry                         No
189.lucas     Fortran90  Floating-Point  Number Theory / Primality Testing               No
191.fma3d     Fortran90  Floating-Point  Finite-Element Crash Simulation                 No
200.sixtrack  Fortran77  Floating-Point  High Energy Nuclear Physics Accelerator Design  No
301.apsi      Fortran77  Floating-Point  Meteorology: Pollutant Distribution             Yes


Summary Data

The following summary information is independent of the cache size and associativity simulated. The third and fifth columns give the fraction of instruction fetches and data references, respectively, that access a different 64B block than the preceding access (i.e., the data was not obtained in the last cache access). For example, one instruction cache access returns a block of 16 instructions, many of which may be executed before a different block must be accessed (typically 10 for these benchmarks). This data was obtained by simulating caches with a single 64B block.

Benchmark          Instructions  I-Access/Inst          Data Refs  D-Access/Ref  Refs/Inst  % User Inst
------------  -----------------  -------------  -----------------  ------------  ---------  -----------
164.gzip        478,636,174,329         0.1138    142,700,878,428        0.6451     0.2981         99.9
175.vpr          84,125,622,844         0.1164     37,067,564,576        0.7811     0.4406         99.8
176.gcc         243,597,914,726         0.1295    116,093,336,744        0.6229     0.4766         98.9
181.mcf          61,870,158,860         0.1579     23,056,352,854        0.6465     0.3727         99.9
186.crafty      191,882,992,412         0.0747     70,222,383,696        0.1291     0.3660         99.9
197.parser      546,769,649,600         0.1243    190,517,359,797        0.6119     0.3484         99.9
252.eon         239,768,148,508         0.1070    118,246,210,844        0.5680     0.4932         99.9
253.perlbmk     143,122,956,639         0.1163     61,829,661,201        0.5805     0.4320         99.9
254.gap         213,813,801,949         0.1198     80,924,423,445        0.6642     0.3785         99.9
255.vortex      390,700,613,872         0.1310    161,133,019,186        0.6672     0.4124         99.9
256.bzip2       377,370,326,800         0.1179    145,002,261,443        0.6787     0.3842         99.9
300.twolf       346,489,363,383         0.1025    111,857,479,345        0.7525     0.3228         99.9
168.wupwise     349,623,875,977         0.0938    107,613,170,820        0.5629     0.3078         99.9
171.swim        225,830,970,951         0.0685     74,341,437,755        0.9314     0.3292         99.9
172.mgrid       419,156,008,460         0.0638    153,909,315,484        0.9151     0.3672         99.9
173.applu       223,883,653,813         0.0641     85,459,068,028        0.7993     0.3817         99.9
177.mesa        281,775,068,600         0.1086    108,712,910,562        0.5886     0.3858         99.8
178.galgel      409,366,700,368         0.1008    178,742,879,478        0.8742     0.4366         99.9
179.art          86,834,976,688         0.1514     30,279,186,530        0.9089     0.3487         99.9
183.equake      131,518,705,120         0.0886     58,248,603,550        0.7544     0.4429         99.9
187.facerec     211,027,395,856         0.0857     66,872,909,521        0.6323     0.3169         99.9
188.ammp        326,549,217,724         0.0833    125,189,421,217        0.6624     0.3834         99.8
189.lucas       142,398,814,292         0.0707     31,507,111,538        0.7807     0.2213         99.9
191.fma3d       268,361,331,300         0.0797    118,043,791,674        0.7015     0.4399         99.9
200.sixtrack    470,950,788,817         0.0683    116,965,122,302        0.7646     0.2484         98.2
301.apsi        347,923,962,507         0.0798    129,508,475,036        0.7299     0.3722         99.8
Int Total     3,318,147,723,922         0.1189  1,258,650,931,559        0.6376     0.3793
Int Mean        276,512,310,327         0.1176    104,887,577,630        0.6123     0.3938
FP Total      3,895,201,470,473         0.0828  1,385,393,403,495        0.7560     0.3557
FP Mean         278,228,676,462         0.0862     98,956,671,679        0.7576     0.3559
Ovrl Total    7,213,349,194,415         0.1001  2,644,044,335,048        0.6968     0.3665
Ovrl Mean       277,436,507,478         0.1007    101,694,012,887        0.6905     0.3734

Note: For columns that already contain ratios, the "Total" represents the sum of all the numerators divided by the sum of all the denominators, and the "Mean" represents the arithmetic mean of the computed ratios.
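
The difference between the two aggregates can be seen with a small Python sketch that applies both definitions to the Refs/Inst column. For brevity it uses only three rows copied from the table, so its results are not the table's "Int" figures. The function names are illustrative.

```python
# (instructions, data references) for three rows of the summary table
ROWS = {
    "164.gzip": (478_636_174_329, 142_700_878_428),
    "175.vpr": (84_125_622_844, 37_067_564_576),
    "176.gcc": (243_597_914_726, 116_093_336_744),
}


def total_ratio(rows):
    """'Total': sum of numerators divided by sum of denominators."""
    instructions = sum(inst for inst, _ in rows.values())
    references = sum(refs for _, refs in rows.values())
    return references / instructions


def mean_ratio(rows):
    """'Mean': arithmetic mean of the per-benchmark ratios."""
    ratios = [refs / inst for inst, refs in rows.values()]
    return sum(ratios) / len(ratios)


total = total_ratio(ROWS)
mean = mean_ratio(ROWS)
```

For these three rows the "Total" comes out near 0.367 and the "Mean" near 0.405. The "Total" weights each benchmark by its instruction count, so the long-running 164.gzip dominates it, while the "Mean" weights every benchmark equally.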


Table Format

All miss-ratio tables are in ASCII text format, generated with Perl scripts. They include the name of the file, the name of the benchmark, the command line for the benchmark, the number of instructions, the number of data references, miss ratios (misses/reference) for a set of cache sizes and associativities, and compulsory miss ratios. For each benchmark and data set, miss ratios are rounded to 8 decimal places. The computed arithmetic means for each benchmark are rounded to 7 digits, and the overall means are rounded to 6 digits. Miss ratios are reported for sizes of 1KB to 1MB, with associativities of 1-way, 2-way, 4-way, 8-way, and full. In all cases the block size was 64B and the replacement policy was LRU. Compulsory miss ratios were measured as the miss ratio of a fully-associative 256MB cache with no flushing, and rounded to 12 places. Note that there is sufficient data to calculate the 3 C's (compulsory, capacity, and conflict misses) for the various configurations. See the example below (overall arithmetic mean for selected benchmarks).

--------------------------------------------------------------------------
|                    Block size: 64 bytes, Repl: LRU                     |
|------------------------------------------------------------------------|
|               Arithmetic Mean for Instruction References               |
|------------------------------------------------------------------------|
|       |                          Associativity                         |
| Size  |----------------------------------------------------------------|
|       |      1     |      2     |      4     |      8     |    full    |
|-------+------------+------------+------------+------------+------------|
|    1K | 0.040115-- | 0.038059-- | 0.038609-- | 0.038631-- | 0.038770-- |
|    2K | 0.028248-- | 0.026708-- | 0.026033-- | 0.026023-- | 0.026006-- |
|    4K | 0.019655-- | 0.017775-- | 0.017586-- | 0.017514-- | 0.017421-- |
|    8K | 0.013024-- | 0.011229-- | 0.010171-- | 0.010013-- | 0.009931-- |
|   16K | 0.007394-- | 0.004766-- | 0.003666-- | 0.003405-- | 0.004296-- |
|   32K | 0.003237-- | 0.001233-- | 0.000651-- | 0.000388-- | 0.000239-- |
|   64K | 0.001060-- | 0.000360-- | 0.000127-- | 0.000049-- | 0.000016-- |
|  128K | 0.000454-- | 0.000148-- | 0.000014-- | 0.000004-- | 0.000002-- |
|  256K | 0.000090-- | 0.000011-- | 0.000002-- | 0.000001-- | 0.000001-- |
|  512K | 0.000009-- | 0.000003-- | 0.000001-- | 0.000000-- | 0.000001-- |
| 1024K | 0.000000-- | 0.000000-- | 0.000000-- | 0.000000-- | 0.000000-- |
--------------------------------------------------------------------------
Compulsory: 0.0000000416--


--------------------------------------------------------------------------
|                    Block size: 64 bytes, Repl: LRU                     |
|------------------------------------------------------------------------|
|                  Arithmetic Mean for Data References                   |
|------------------------------------------------------------------------|
|       |                          Associativity                         |
| Size  |----------------------------------------------------------------|
|       |      1     |      2     |      4     |      8     |    full    |
|-------+------------+------------+------------+------------+------------|
|    1K | 0.275311-- | 0.232072-- | 0.207868-- | 0.191097-- | 0.185660-- |
|    2K | 0.191787-- | 0.155995-- | 0.137516-- | 0.123602-- | 0.115772-- |
|    4K | 0.145548-- | 0.114026-- | 0.105337-- | 0.094777-- | 0.089413-- |
|    8K | 0.106719-- | 0.085133-- | 0.078486-- | 0.074013-- | 0.069963-- |
|   16K | 0.082798-- | 0.067679-- | 0.064007-- | 0.061553-- | 0.059314-- |
|   32K | 0.069504-- | 0.056942-- | 0.055286-- | 0.053659-- | 0.052217-- |
|   64K | 0.060102-- | 0.052060-- | 0.050989-- | 0.049836-- | 0.048541-- |
|  128K | 0.051134-- | 0.048766-- | 0.048341-- | 0.046895-- | 0.045834-- |
|  256K | 0.046695-- | 0.044774-- | 0.044566-- | 0.044497-- | 0.043546-- |
|  512K | 0.041238-- | 0.040808-- | 0.041690-- | 0.041878-- | 0.040885-- |
| 1024K | 0.033697-- | 0.032618-- | 0.033644-- | 0.034391-- | 0.034436-- |
--------------------------------------------------------------------------
Compulsory: 0.0000293378--

For example, for a 4KB direct-mapped L1 data cache with 64-byte blocks, approximately 146 out of every 1,000 data references miss. Neglecting the operating system, 29 out of every 1,000,000 data references cause a compulsory miss.
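
As a rough illustration of how the three C's (compulsory, capacity, and conflict misses) might be derived from these tables, the Python sketch below assumes the usual decomposition: capacity is the fully-associative miss ratio at the same size minus the compulsory miss ratio, and conflict is the remaining difference. The function name is hypothetical, and the inputs are the 4KB direct-mapped and 4KB fully-associative entries of the data table above.

```python
def three_cs(miss_ratio, fully_assoc_miss_ratio, compulsory_miss_ratio):
    """Split a miss ratio into (compulsory, capacity, conflict) parts.

    Assumes the fully-associative ratio is for the same cache size and
    the same LRU policy as `miss_ratio`.
    """
    compulsory = compulsory_miss_ratio
    capacity = fully_assoc_miss_ratio - compulsory_miss_ratio
    conflict = miss_ratio - fully_assoc_miss_ratio
    return compulsory, capacity, conflict


# 4KB direct-mapped data cache: 0.145548 overall, 0.089413 fully
# associative at 4KB, 0.0000293378 compulsory.
compulsory, capacity, conflict = three_cs(0.145548, 0.089413, 0.0000293378)
```

Under these assumptions, about 0.0561 of data references are conflict misses and about 0.0894 are capacity misses for this configuration. The conflict term can come out negative when the fully-associative LRU cache misses more often than a set-associative cache of the same size, as in the 1024K row of the data table.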


Miss Ratio Tables


Experimental Error

The miss ratios were calculated from data collected by functional, user-mode simulations of optimized benchmarks. As a result, the cache miss ratios reported above may not be representative of a real platform. A few sources of error are discussed below.

First, only primary misses were counted by the simulator. Once a reference missed in the cache, the data was loaded and all subsequent accesses to the line hit. A modern processor may also experience secondary misses, or references to data that has yet to be loaded from a prior cache miss. There is a nonzero miss latency, and a real processor may execute other instructions while waiting for the data. The sequential model used in functional simulations is optimistic in this respect.

Second, a modern processor will have optimizations that affect cache performance. Hardware prefetching of instructions and data can have the positive effect of reducing the number of cache misses. However, prefetching can also cause cache pollution. Further, speculative execution can result in increased memory traffic for speculatively issued loads, and I-cache pollution from incorrect branch predictions. This also makes the results optimistic.

Third, the operating system was ignored. System calls cause additional cache misses to bring in OS code and data, and in doing so they replace cache lines from the user program. This increases the number of conflict and capacity misses for the user program in a real system. Since the additional misses from OS intervention were not modeled, our results are optimistic. One possibility is to flush the caches on system calls. However, this is the other extreme, and would have made it impossible to measure the compulsory miss rates.

Fourth, all prefetch instructions (loads to R31) were treated as normal references. All were executed, and references from prefetch instructions were included in the overall statistics. Although prefetch instructions may prevent (or reduce the impact of) cache misses from instructions in the original code, the misses still occur (just sooner). However, prefetch instructions increase the overall hit ratio because the subsequent loads and stores that hit in the cache add to the overall hit count. One possibility is to ignore prefetch instructions altogether (the Alpha ISA allows this). Another possibility is to count the misses from the prefetches, but not count them as instructions.
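
The effect of the accounting choice on the reported ratio can be sketched in Python as follows. The counter names are hypothetical, and the second ratio assumes that "not count them as instructions" means leaving prefetches out of the reference count in the denominator. Ignoring prefetches altogether changes which later references miss, so that option would need a separate simulation and is not computed here.

```python
def prefetch_miss_ratios(demand_refs, demand_misses,
                         prefetch_refs, prefetch_misses):
    """Return (as_normal, misses_only) miss ratios.

    as_normal:   prefetches counted as ordinary references, which is
                 the accounting used for the data reported here.
    misses_only: prefetch misses counted, but prefetches excluded from
                 the reference count (an assumed reading of the
                 alternative described above).
    """
    all_misses = demand_misses + prefetch_misses
    as_normal = all_misses / (demand_refs + prefetch_refs)
    misses_only = all_misses / demand_refs
    return as_normal, misses_only
```

With made-up counts of 900 demand references, 40 demand misses, 100 prefetches, and 10 prefetch misses, the first ratio is 0.05 and the second is slightly higher, since the same misses are divided by fewer references.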

Fifth, the benchmarks were optimized for an Alpha 21264 processor. The binaries may have been tuned to perform well with the 21264 cache hierarchy (64KB 2-way L1 caches). Ideally, the binary should not favor a particular cache configuration. Further, the binary contains no-ops for alignment and steering of dependent operations in the clustered microarchitecture of the 21264. These no-ops increase the overall instruction count for the functional simulation.


Future Work


Related Work


Acknowledgements


Publications


Disclaimer

Data in this directory is correct to the best of our knowledge. However, we provide it *AS IS*, without an express or implied warranty, and we accept no responsibility for the consequences of the use or misuse of this data.


Revision History


Last updated January 2002, jfc. Report any dead links or errors to jfc.