Cache Performance for SPEC CPU2000 Benchmarks

Version 3.0

May 2003

Jason F. Cantin
Department of Electrical and Computer Engineering
1415 Engineering Drive
University of Wisconsin-Madison
Madison, WI 53706-1691
jcantin@ece.wisc.edu
http://www.jfred.org

Mark D. Hill
Department of Computer Science
1210 West Dayton Street
University of Wisconsin-Madison
Madison, WI 53706-1685
markhill@cs.wisc.edu
http://www.cs.wisc.edu/~markhill

http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data


Abstract

The SPEC CPU2000 benchmark suite (http://www.spec.org/osg/cpu2000) is a collection of 26 compute-intensive, non-trivial programs used to evaluate the performance of a computer's CPU, memory system, and compilers. The benchmarks in this suite were chosen to represent real-world applications, and thus exhibit a wide range of runtime behaviors. On this webpage, we present functional cache miss ratios and related statistics for the SPEC CPU2000 suite: in particular, for L1 instruction, L1 data, and L1 unified caches ranging from 1KB to 1MB, with 64B blocks and associativities of 1, 2, 4, 8, and full. Prefetch operations were always executed, but results are posted both with and without them counted in the hit ratios. Most of this data was collected at the University of Wisconsin-Madison with the aid of the Simplescalar toolset (http://www.simplescalar.org).



Methodology

All functional data was collected with modified simulators from the Alpha version of the Simplescalar toolset, version 3.0. The simulators were modified to simulate multiple caches at once and to report statistics every 1 billion instructions. Further, we modified the Simplescalar code to correctly distinguish between binding loads, prefetch instructions made from loads to R31, and the universal NOP composed from an unaligned quadword load to R31. A combination of Perl and Tcsh scripts was used to launch, manage, and process the results of these simulations.

All benchmarks were compiled statically with heavy optimization for the Alpha AXP instruction set. Optimizations were targeted at the Alpha 21264 implementation, and included prefetches, square-root instructions, byte/word memory operations, and no-ops for alignment. They did not include profile-directed optimizations. All benchmarks were run to completion with all reference inputs, with three exceptions. Two of the data sets for Perl (253.perlbmk) require new processes to be spawned, which is not supported by the Simplescalar tools. One of the data sets for 175.vpr does not produce correct results due to an undocumented bug in the Simplescalar emulation of the Alpha ISA. The benchmarks simulated comprise over 7 trillion dynamic instructions for the reference input sets, and generating the functional L1 miss-ratio tables required over 400 trillion simulated instructions. In all, the functional simulations reported here consumed 7.6 CPU-years.

All caches simulated here used the LRU replacement policy. This data does not include operating system effects, and caches were not flushed on system calls. Thus, actual miss-rates will be higher than those reported here. See the section on experimental error below.
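As an illustration of this style of functional simulation, the sketch below implements an LRU set-associative cache in Python and feeds several configurations from a single pass over a reference stream, much as the modified simulators did. It is a toy model with made-up sizes and addresses, not the actual (C-based) Simplescalar code.

```python
from collections import OrderedDict

class LRUCache:
    """Functional cache model: counts hits and misses, no timing."""

    def __init__(self, size_bytes, assoc, block_bytes=64):
        n_blocks = size_bytes // block_bytes
        self.n_sets = n_blocks // assoc   # assoc == n_blocks gives a fully associative cache
        self.assoc = assoc
        self.block_bytes = block_bytes
        # One OrderedDict per set; insertion order tracks recency of use.
        self.sets = [OrderedDict() for _ in range(self.n_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, addr):
        self.accesses += 1
        block = addr // self.block_bytes
        s = self.sets[block % self.n_sets]
        if block in s:
            s.move_to_end(block)          # hit: mark block most recently used
        else:
            self.misses += 1              # primary miss: load the block
            if len(s) >= self.assoc:
                s.popitem(last=False)     # evict the least recently used block
            s[block] = True

# A single pass over the reference stream updates every configuration at once:
caches = [LRUCache(size, assoc)
          for size in (1024, 32 * 1024)   # hypothetical sizes
          for assoc in (1, 2)]            # hypothetical associativities
for addr in (0, 64, 0, 4096, 0):          # hypothetical address stream
    for c in caches:
        c.access(addr)
```

A production version would additionally snapshot each cache's counters every billion instructions, as described above.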



Benchmarks Simulated

We have collected data for all 26 benchmarks with reference inputs: 12 integer, 14 floating-point. See the table below (Table 1).

Benchmark Language Type Category Inputs % User Time
164.gzip C Integer Compression 5 99.9
175.vpr C Integer FPGA Circuit Placement and Routing 2 99.8
176.gcc C Integer C Compiler 5 98.9
181.mcf C Integer Combinatorial Optimization 1 99.9
186.crafty C Integer Game Playing: Chess 1 99.9
197.parser C Integer Word processing 1 99.9
252.eon C++ Integer Computer Visualization 3 99.9
253.perlbmk C Integer PERL Programming Language 7 99.9
254.gap C Integer Group Theory, Interpreter 1 99.9
255.vortex C Integer Object-oriented Database 3 99.9
256.bzip2 C Integer Compression 3 99.9
300.twolf C Integer Place and Route Simulator (CAE) 1 99.9
168.wupwise Fortran77 Float Physics, Quantum Chromodynamics 1 99.9
171.swim Fortran77 Float Shallow Water Modeling 1 99.9
172.mgrid Fortran77 Float Multi-grid Solver: 3D Potential Field 1 99.9
173.applu Fortran77 Float Parabolic/Elliptic Partial Diff. Eqns 1 99.9
177.mesa C Float 3-D Graphics Library 1 99.8
178.galgel Fortran90 Float Computational Fluid Dynamics 1 99.9
179.art C Float Image Recognition / Neural Nets 2 99.9
183.equake C Float Seismic Wave Propagation 1 99.9
187.facerec Fortran90 Float Image Processing: Face Recognition 1 99.9
188.ammp C Float Computational Chemistry 1 99.8
189.lucas Fortran90 Float Number Theory / Primality Testing 1 99.9
191.fma3d Fortran90 Float Finite-Element Crash Simulation 1 99.9
200.sixtrack Fortran77 Float High Energy Nuclear Physics Accelerator Design 1 98.2
301.apsi Fortran77 Float Meteorology: Pollutant Distribution 1 99.8
Table 1



Summary Data

The summary information below (Table 2) is independent of cache size and associativity. The prefetches are included in the counts/rates of loads and stores, since they were made from loads to R31. Means for each of the counts are arithmetic, while means for the per-instruction rates are harmonic. The integer and floating-point benchmarks are averaged separately, and those suite means are then averaged in the same fashion for the overall mean. This gives equal weight to each benchmark within a suite, and equal weight to the integer and floating-point suites.

Benchmark Instructions Loads/Stores Load/Store Rate Prefetches Prefetch Rate
164.gzip 478,636,153,300 106,339,774,183 0.222172 210,749,035 0.000440
175.vpr 84,068,682,462 34,290,404,520 0.407886 28,489,513 0.000339
176.gcc 242,873,996,717 76,347,070,522 0.314348 3,223,614,266 0.013273
181.mcf 61,867,398,195 21,224,890,959 0.343071 660,286,828 0.010673
186.crafty 191,882,992,463 63,216,831,368 0.329455 502,573 0.000003
197.parser 546,749,971,166 157,370,546,967 0.287829 7,148,027 0.000013
252.eon 239,766,530,848 107,716,995,277 0.449258 3,956,650,303 0.016502
253.perlbmk 407,741,332,530 178,536,860,186 0.437868 70,854,377 0.000174
254.gap 213,814,233,713 68,971,552,540 0.322577 24,459,785 0.000114
255.vortex 390,698,783,316 153,384,434,949 0.392590 1,975,965,046 0.005058
256.bzip2 377,370,320,254 133,004,113,614 0.352450 133,248,276 0.000353
300.twolf 346,484,742,706 97,313,583,371 0.280860 23,200,957 0.000067
168.wupwise 349,623,881,589 94,914,535,141 0.271476 8,264,448,086 0.023638
171.swim 225,830,975,667 74,301,504,811 0.329014 10,310,829,246 0.045657
172.mgrid 419,156,007,205 153,695,716,464 0.366679 9,254,489,893 0.022079
173.applu 223,883,656,387 85,299,162,334 0.380998 1,961,452,391 0.008761
177.mesa 281,694,536,771 100,831,728,471 0.357947 350,031,487 0.001243
178.galgel 409,366,708,755 177,448,399,787 0.433471 58,409,867,463 0.142683
179.art 86,831,419,686 28,608,040,144 0.329466 6,045,927,514 0.069628
183.equake 131,518,590,672 55,938,011,470 0.425324 126,096,577 0.000959
187.facerec 211,027,958,400 63,321,687,774 0.300063 1,832,311,610 0.008683
188.ammp 326,548,871,460 109,074,260,280 0.334021 1,455,221,518 0.004456
189.lucas 142,398,816,802 30,986,842,904 0.217606 2,134,334,895 0.014988
191.fma3d 268,370,708,343 115,831,207,504 0.431609 1,349,475,397 0.005028
200.sixtrack 470,949,157,978 108,391,446,800 0.230155 10,556,998,171 0.022416
301.apsi 347,924,060,406 124,663,297,530 0.358306 5,381,960,075 0.015469
Int Total 3,581,955,137,670 1,197,717,058,456 10,315,168,986
Int Mean 298,496,261,472 99,809,754,871 0.332290 859,597,415 0.000024
FP Total 3,895,125,350,121 1,323,305,841,414 117,433,444,323
FP Mean 278,223,239,294 94,521,845,815 0.326063 8,388,103,165 0.004987
Ovrl Total 7,477,080,487,791 2,521,022,899,870 127,748,613,309
Ovrl Mean 288,359,750,383 97,165,800,343 0.329147 4,623,850,290 0.000048
Table 2
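The averaging scheme described above can be sketched in Python as follows; the rates here are made up for illustration and are not the Table 2 data.

```python
def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def harmonic_mean(xs):
    # The harmonic mean is dominated by small values, which is why the
    # Int Mean prefetch rate in Table 2 sits far below most of its inputs.
    return len(xs) / sum(1.0 / x for x in xs)

# Hypothetical per-instruction rates for the two suites:
int_rates = [0.33, 0.29, 0.41]
fp_rates = [0.27, 0.33]

# Average each suite separately, then average the two suite means,
# giving each benchmark within a suite, and each suite, equal weight.
int_mean = harmonic_mean(int_rates)
fp_mean = harmonic_mean(fp_rates)
overall = harmonic_mean([int_mean, fp_mean])   # counts would use arithmetic_mean
```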

The table below (Table 3) shows how often a new block must be obtained to satisfy a cache request. For example, one instruction-cache access returns a 64B block of 16 instructions, many of which may be executed before a different block must be obtained (typically about 10 for these benchmarks).

This data was obtained by simulating caches with a single 64-Byte block, and may be more meaningful for implementations that can buffer blocks or merge requests. The means are computed in the same way as Table 2 above.

Benchmark I$-Access I$-Access Rate D$-Access D$-Access Rate U$-Access U$-Access Rate
164.gzip 54,470,768,520 0.113804 76,846,749,520 0.160554 264,464,152,198 0.552537
175.vpr 9,783,417,461 0.116374 26,450,333,116 0.314628 75,711,030,995 0.900585
176.gcc 31,450,635,056 0.129494 43,842,754,917 0.180516 181,733,952,330 0.748264
181.mcf 9,766,961,907 0.157869 13,989,421,988 0.226119 52,061,791,386 0.841506
186.crafty 21,974,406,397 0.114520 47,422,033,028 0.247140 145,207,265,223 0.756749
197.parser 67,980,418,633 0.124335 99,595,662,954 0.182159 373,494,790,264 0.683118
252.eon 25,652,922,924 0.106991 62,785,515,113 0.261861 235,089,422,313 0.980493
253.perlbmk 47,401,063,428 0.116253 105,176,984,033 0.257950 395,433,450,861 0.969814
254.gap 25,606,586,268 0.119761 49,105,988,033 0.229667 161,143,617,345 0.753662
255.vortex 51,181,968,365 0.131001 103,480,838,369 0.264861 348,800,683,706 0.892761
256.bzip2 45,086,528,330 0.119476 79,924,432,801 0.211793 302,917,293,906 0.802706
300.twolf 35,524,277,102 0.102528 73,552,792,605 0.212283 222,877,780,050 0.643254
168.wupwise 32,803,956,134 0.093826 54,126,522,525 0.154814 216,266,923,190 0.618570
171.swim 15,463,978,992 0.068476 69,220,294,011 0.306514 159,885,506,394 0.707987
172.mgrid 26,737,536,917 0.063789 140,740,025,878 0.335770 324,975,820,808 0.775310
173.applu 14,351,426,465 0.064102 68,246,688,564 0.304831 180,277,378,414 0.805228
177.mesa 30,602,545,093 0.108637 60,630,249,848 0.215234 225,901,437,732 0.801938
178.galgel 41,243,445,388 0.100749 155,587,844,182 0.380070 389,597,173,029 0.951707
179.art 13,144,486,961 0.151379 26,652,374,302 0.306944 69,084,709,620 0.795619
183.equake 11,658,423,445 0.088645 43,425,718,299 0.330187 119,186,908,018 0.906236
187.facerec 18,076,470,614 0.085659 40,091,054,768 0.189980 142,125,671,080 0.673492
188.ammp 27,201,946,160 0.083301 76,187,755,090 0.233312 241,195,708,949 0.738621
189.lucas 10,063,843,673 0.070674 24,380,575,286 0.171213 70,125,610,574 0.492459
191.fma3d 21,390,682,566 0.079706 81,415,531,600 0.303370 246,070,861,667 0.916907
200.sixtrack 32,179,807,823 0.068330 85,712,543,482 0.182000 240,948,906,996 0.511624
301.apsi 27,762,425,232 0.079794 92,471,252,932 0.265780 270,809,995,354 0.778359
Int Total 425,879,954,391 782,173,506,477 2,758,935,230,577
Int Mean 35,489,996,199 0.119680 65,181,125,539 0.221556 229,911,269,214 0.772763
FP Total 322,680,975,463 1,018,888,430,767 2,896,452,611,825
FP Mean 23,048,641,104 0.081825 72,777,745,054 0.243528 206,889,472,273 0.720977
Ovrl Total 748,560,929,854 1,801,061,937,244 5,655,387,842,402
Ovrl Mean 29,269,318,652 0.097197 68,979,435,297 0.232023 218,400,370,744 0.745973
Table 3
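As a sketch, an access in Table 3 corresponds to a miss in a cache that holds exactly one 64B block: a new block must be obtained whenever the referenced block differs from the previous one. The reference stream below is hypothetical.

```python
BLOCK_BYTES = 64

def block_accesses(addrs):
    """Miss count of a single-block 64B cache: how often a new block
    must be obtained to satisfy the reference stream."""
    count = 0
    current = None                 # block address currently held
    for a in addrs:
        block = a // BLOCK_BYTES
        if block != current:
            count += 1             # a different block must be fetched
            current = block
    return count

# Sixteen 4-byte Alpha instructions fit in one 64B block, so a purely
# sequential fetch stream touches a new block once per 16 instructions:
pcs = range(0, 128, 4)             # 32 sequential instruction addresses
print(block_accesses(pcs))         # prints 2: two blocks touched
```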



Table Format

All miss-ratio tables (.tab files) are ASCII text. They include the name and command line for each benchmark; the number of instructions, data references, and data prefetches; and the miss ratios (misses/instruction, rounded to 9 places) for cache sizes ranging from 1KB to 1MB and associativities of 1, 2, 4, 8, and full. The tables also contain compulsory miss-rates and access-rates. Both arithmetic and harmonic means were computed over the data sets of each benchmark (rounded to 8 places). The means for each benchmark were then averaged together and rounded to 7 places. In all cases the block size was 64B and the replacement policy was LRU (least recently used). Compulsory miss-rates were measured as the miss-rate of a 2-way set-associative 256MB cache with no flushing on system calls (rounded to 12 places). Access-rates were measured as the miss-rate of a direct-mapped 64B cache, i.e., a cache with just one block. Note that there is sufficient data to calculate the 3C's for the various configurations. See the example below.

-----------------------------------------------------------------------------
| U-cache misses/inst: 584,975,927,483 unified refs (1.231289-/inst);         |
|-----------------------------------------------------------------------------|
| 264,464,152,198 U-cache 64-Byte block accesses (0.560264-/inst)             |
|-----------------------------------------------------------------------------|
|  Size |   Direct    |  2-way LRU  |  4-way LRU  |  8-way LRU  |  Full LRU   |
|-------+-------------+-------------+-------------+-------------+-------------|
|   1KB | 0.17096965- | 0.11586859- | 0.10006949- | 0.09539379- | 0.09356626- |
|   2KB | 0.09933510- | 0.07301168- | 0.06116419- | 0.05846171- | 0.06111943- |
|   4KB | 0.06756154- | 0.04373756- | 0.03517036- | 0.02650259- | 0.02410843- |
|   8KB | 0.05398704- | 0.02824148- | 0.02123935- | 0.02024346- | 0.01982071- |
|  16KB | 0.03316842- | 0.02309782- | 0.01727542- | 0.01709368- | 0.01694758- |
|  32KB | 0.02622252- | 0.01814185- | 0.01381846- | 0.01369148- | 0.01354134- |
|  64KB | 0.01397891- | 0.01160836- | 0.00835915- | 0.00821407- | 0.00807335- |
| 128KB | 0.00583968- | 0.00267375- | 0.00189210- | 0.00172267- | 0.00151421- |
| 256KB | 0.00343062- | 0.00054402- | 0.00040227- | 0.00038742- | 0.00038681- |
| 512KB | 0.00198606- | 0.00033332- | 0.00027255- | 0.00026623- | 0.00026589- |
|   1MB | 0.00193081- | 0.00026416- | 0.00026161- | 0.00026133- | 0.00026133- |
 -----------------------------------------------------------------------------
 Compulsory: 0.00001698143-

In this example (164.gzip), a 32KB 2-way set-associative L1 unified cache with 64-Byte blocks has approximately 18 cache misses per 1,000 instructions.
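As a sketch of the 3C calculation mentioned above: under the standard three-C model (compulsory, capacity, conflict), capacity misses are the fully associative LRU misses beyond the compulsory misses, and conflict misses are whatever a set-associative configuration adds beyond that. The numbers below come from the 164.gzip example table, comparing the 32KB direct-mapped and 32KB fully associative columns.

```python
def three_cs(miss_ratio, full_assoc_ratio, compulsory_ratio):
    """Decompose a miss ratio (misses/instruction) into 3C components."""
    capacity = full_assoc_ratio - compulsory_ratio
    conflict = miss_ratio - full_assoc_ratio
    return compulsory_ratio, capacity, conflict

# 164.gzip unified cache: 32KB direct-mapped, 32KB full LRU, compulsory.
comp, cap, conf = three_cs(0.02622252, 0.01354134, 0.00001698143)
# conflict is about 0.0127 misses/instruction: roughly half of this
# configuration's misses come from its limited associativity.
```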

The tables of miss-ratios are organized into a set of files. For a given number of simulated instructions, there is one file for each benchmark-dataset combination, two files for the arithmetic and harmonic means over all the datasets of each benchmark, and six files for the arithmetic and harmonic means over all the benchmarks (with separate results for just the integer benchmarks and just the floating-point benchmarks). Each of the files contains seven tables. The first is for instruction caches; the second, third, and fourth for data caches; and the fifth, sixth, and seventh for unified caches. For the data caches and unified caches, the first table contains the miss ratios for all of the associated memory references, the second does not count references or misses caused by prefetch operations (they do, however, affect cache state), and the third has statistics for only the prefetch operations.



Miss Ratio Tables



Experimental Error

The miss ratios were calculated from data collected by functional, user-mode simulations of optimized benchmarks. As a result, the cache miss ratios reported above may not be representative of a real platform. A few sources of error are discussed below.

First, only primary misses were counted by the simulator. Once a reference missed in the cache, the block was loaded immediately and all subsequent accesses to the line hit. A modern processor may also experience secondary misses: references to a block for which a miss is already outstanding. There is a nonzero miss latency, and a real processor may execute other instructions, and issue other memory references, while waiting for the data. The sequential model used in functional simulations is optimistic in this respect.

Second, a modern processor has optimizations that affect cache performance. Hardware prefetching of instructions and data can have the positive effect of reducing the number of cache misses; however, prefetching can also cause cache pollution. Further, speculative execution can result in increased memory traffic from speculatively issued loads, and in I-cache pollution from incorrect branch predictions; these effects also make the results optimistic.

Third, the operating system was ignored. System calls cause additional cache misses to bring in OS code and data, and in doing so they replace cache lines belonging to the user program. This increases the number of conflict and capacity misses for the user program on a real system. Since the additional misses from OS intervention were not modeled, our results are optimistic (though experiments showed these benchmarks typically spend less than 0.1% of their time in the OS). One possibility would be to flush the caches on system calls; however, this is the other extreme, and it would also have made it impossible to measure the compulsory miss rates.

Fourth, the benchmarks were optimized for an Alpha 21264 processor. The binaries may have been tuned to perform well with the 21264 cache hierarchy (split 64KB, 2-way set-associative L1 caches). Ideally, a binary should not favor a particular cache configuration. Further, the binaries contain no-ops for alignment and for steering dependent operations in the clustered microarchitecture of the 21264; these no-ops inflate the overall instruction count for the functional simulation.

Fifth, since this is a functional simulation, the timeliness of prefetch operations is not considered. Prefetch operations can only prevent a cache miss on a demand reference if they are initiated early enough. Here, all subsequent accesses to a prefetched block hit in the cache. However, experiments with several benchmarks indicate that the compiler inserted prefetches sufficiently far in advance of the first use of data to cover an L1 miss, with 10 to 100 committed instructions between a prefetch and the first use.



Related Work



Acknowledgements



Publications



Disclaimer

Data in this directory is correct to the best of our knowledge. However, we provide it *AS IS*, without an expressed or implied warranty, and we accept no responsibility for the consequences of the use or misuse of this data.



Revision History



Last updated May 2003, jfc.