May 2003
Jason F. Cantin
Department of Electrical and Computer Engineering
1415 Engineering Drive
University of Wisconsin-Madison
Madison, WI 53706-1691
jcantin@ece.wisc.edu
http://www.jfred.org
Mark D. Hill
Department of Computer Science
1210 West Dayton Street
University of Wisconsin-Madison
Madison, WI 53706-1685
markhill@cs.wisc.edu
http://www.cs.wisc.edu/~markhill
http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data
The SPEC CPU2000 benchmark suite (http://www.spec.org/osg/cpu2000) is a collection of 26 compute-intensive, non-trivial programs used to evaluate the performance of a computer's CPU, memory system, and compilers. The benchmarks in this suite were chosen to represent real-world applications, and thus exhibit a wide range of runtime behaviors. On this webpage, we present functional cache miss ratios and related statistics for the SPEC CPU2000 suite. In particular, we report results for L1 instruction, L1 data, and L1 unified caches ranging from 1KB to 1MB, with 64B blocks and associativities of 1, 2, 4, 8, and full. Prefetch operations were always executed, but results are posted both with and without them counted in the miss ratios. Most of this data was collected at the University of Wisconsin-Madison with the aid of the Simplescalar toolset (http://www.simplescalar.org).
All functional data was collected with modified simulators from the Alpha version of the Simplescalar toolset, version 3.0. The simulators were modified to simulate multiple caches at once and to report statistics every 1 billion instructions. Further, we modified the Simplescalar code to correctly distinguish between binding loads, prefetch instructions made from loads to R31, and the universal NOP composed from an unaligned quadword load to R31. A combination of Perl and Tcsh scripts was used to launch, manage, and process the results of these simulations.
All benchmarks were compiled statically with heavy optimization for the Alpha AXP instruction set. Optimizations were targeted at the Alpha 21264 implementation, and include prefetches, square-root instructions, byte/word memory operations, and no-ops for alignment. The binaries do not contain profile-directed optimizations. All benchmarks were run to completion with all reference inputs, with three exceptions. Two of the data sets for Perl (253.perlbmk) required new processes to be spawned, which is not supported by the Simplescalar tools. One of the data sets for 175.vpr does not produce correct results due to an undocumented bug in the Simplescalar emulation of the Alpha ISA. The benchmarks simulated comprise over 7 trillion dynamic instructions for the reference input sets. Generating the functional L1 miss-ratio tables required over 400 trillion simulated instructions. The functional simulations reported here required a total of 7.6 CPU-years.
All caches simulated here used the LRU replacement policy. This data does not include operating system effects, and caches were not flushed on system calls. Thus, actual miss-rates will be higher than those reported here. See the section on experimental error below.
We have collected data for all 26 benchmarks with reference inputs: 12 integer, 14 floating-point. See the table below (Table 1).
Benchmark | Language | Type | Category | Inputs | % User Time |
---|---|---|---|---|---|
164.gzip | C | Integer | Compression | 5 | 99.9 |
175.vpr | C | Integer | FPGA Circuit Placement and Routing | 2 | 99.8 |
176.gcc | C | Integer | C Compiler | 5 | 98.9 |
181.mcf | C | Integer | Combinatorial Optimization | 1 | 99.9 |
186.crafty | C | Integer | Game Playing: Chess | 1 | 99.9 |
197.parser | C | Integer | Word processing | 1 | 99.9 |
252.eon | C++ | Integer | Computer Visualization | 3 | 99.9 |
253.perlbmk | C | Integer | PERL Programming Language | 7 | 99.9 |
254.gap | C | Integer | Group Theory, Interpreter | 1 | 99.9 |
255.vortex | C | Integer | Object-oriented Database | 3 | 99.9 |
256.bzip2 | C | Integer | Compression | 3 | 99.9 |
300.twolf | C | Integer | Place and Route Simulator (CAE) | 1 | 99.9 |
168.wupwise | Fortran77 | Float | Physics, Quantum Chromodynamics | 1 | 99.9 |
171.swim | Fortran77 | Float | Shallow Water Modeling | 1 | 99.9 |
172.mgrid | Fortran77 | Float | Multi-grid Solver: 3D Potential Field | 1 | 99.9 |
173.applu | Fortran77 | Float | Parabolic/Elliptic Partial Diff. Eqns | 1 | 99.9 |
177.mesa | C | Float | 3-D Graphics Library | 1 | 99.8 |
178.galgel | Fortran90 | Float | Computational Fluid Dynamics | 1 | 99.9 |
179.art | C | Float | Image Recognition / Neural Nets | 2 | 99.9 |
183.equake | C | Float | Seismic Wave Propagation | 1 | 99.9 |
187.facerec | Fortran90 | Float | Image Processing: Face Recognition | 1 | 99.9 |
188.ammp | C | Float | Computational Chemistry | 1 | 99.8 |
189.lucas | Fortran90 | Float | Number Theory / Primality Testing | 1 | 99.9 |
191.fma3d | Fortran90 | Float | Finite-Element Crash Simulation | 1 | 99.9 |
200.sixtrack | Fortran77 | Float | High Energy Nuclear Physics Accelerator Design | 1 | 98.2 |
301.apsi | Fortran77 | Float | Meteorology: Pollutant Distribution | 1 | 99.8 |
The summary information below (Table 2) is independent of cache size and associativity. The prefetches are included in the counts/rates of loads and stores, since they were made from loads to R31. Means for each of the counts are arithmetic, while means for the per-instruction rates are harmonic. The integer and floating-point benchmarks are averaged separately, and then those means are averaged for the overall mean. This gives equal weight to each benchmark in a set, and equal weighting of integer and floating-point data.
Benchmark | Instructions | Loads/Stores | Load/Store Rate | Prefetches | Prefetch Rate |
---|---|---|---|---|---|
164.gzip | 478,636,153,300 | 106,339,774,183 | 0.222172 | 210,749,035 | 0.000440 |
175.vpr | 84,068,682,462 | 34,290,404,520 | 0.407886 | 28,489,513 | 0.000339 |
176.gcc | 242,873,996,717 | 76,347,070,522 | 0.314348 | 3,223,614,266 | 0.013273 |
181.mcf | 61,867,398,195 | 21,224,890,959 | 0.343071 | 660,286,828 | 0.010673 |
186.crafty | 191,882,992,463 | 63,216,831,368 | 0.329455 | 502,573 | 0.000003 |
197.parser | 546,749,971,166 | 157,370,546,967 | 0.287829 | 7,148,027 | 0.000013 |
252.eon | 239,766,530,848 | 107,716,995,277 | 0.449258 | 3,956,650,303 | 0.016502 |
253.perlbmk | 407,741,332,530 | 178,536,860,186 | 0.437868 | 70,854,377 | 0.000174 |
254.gap | 213,814,233,713 | 68,971,552,540 | 0.322577 | 24,459,785 | 0.000114 |
255.vortex | 390,698,783,316 | 153,384,434,949 | 0.392590 | 1,975,965,046 | 0.005058 |
256.bzip2 | 377,370,320,254 | 133,004,113,614 | 0.352450 | 133,248,276 | 0.000353 |
300.twolf | 346,484,742,706 | 97,313,583,371 | 0.280860 | 23,200,957 | 0.000067 |
168.wupwise | 349,623,881,589 | 94,914,535,141 | 0.271476 | 8,264,448,086 | 0.023638 |
171.swim | 225,830,975,667 | 74,301,504,811 | 0.329014 | 10,310,829,246 | 0.045657 |
172.mgrid | 419,156,007,205 | 153,695,716,464 | 0.366679 | 9,254,489,893 | 0.022079 |
173.applu | 223,883,656,387 | 85,299,162,334 | 0.380998 | 1,961,452,391 | 0.008761 |
177.mesa | 281,694,536,771 | 100,831,728,471 | 0.357947 | 350,031,487 | 0.001243 |
178.galgel | 409,366,708,755 | 177,448,399,787 | 0.433471 | 58,409,867,463 | 0.142683 |
179.art | 86,831,419,686 | 28,608,040,144 | 0.329466 | 6,045,927,514 | 0.069628 |
183.equake | 131,518,590,672 | 55,938,011,470 | 0.425324 | 126,096,577 | 0.000959 |
187.facerec | 211,027,958,400 | 63,321,687,774 | 0.300063 | 1,832,311,610 | 0.008683 |
188.ammp | 326,548,871,460 | 109,074,260,280 | 0.334021 | 1,455,221,518 | 0.004456 |
189.lucas | 142,398,816,802 | 30,986,842,904 | 0.217606 | 2,134,334,895 | 0.014988 |
191.fma3d | 268,370,708,343 | 115,831,207,504 | 0.431609 | 1,349,475,397 | 0.005028 |
200.sixtrack | 470,949,157,978 | 108,391,446,800 | 0.230155 | 10,556,998,171 | 0.022416 |
301.apsi | 347,924,060,406 | 124,663,297,530 | 0.358306 | 5,381,960,075 | 0.015469 |
Int Total | 3,581,955,137,670 | 1,197,717,058,456 | | 10,315,168,986 | |
Int Mean | 298,496,261,472 | 99,809,754,871 | 0.332290 | 859,597,415 | 0.000024 |
FP Total | 3,895,125,350,121 | 1,323,305,841,414 | | 117,433,444,323 | |
FP Mean | 278,223,239,294 | 94,521,845,815 | 0.326063 | 8,388,103,165 | 0.004987 |
Ovrl Total | 7,477,080,487,791 | 2,521,022,899,870 | | 127,748,613,309 | |
Ovrl Mean | 288,359,750,383 | 97,165,800,343 | 0.329147 | 4,623,850,290 | 0.000048 |
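The averaging scheme described above can be sketched in a few lines of Python. This is only an illustration of the stated method (arithmetic means for counts, harmonic means for rates, with the integer and floating-point means then averaged together); the function and variable names are ours, and the input numbers are copied from Table 2.

```python
def arithmetic_mean(values):
    """Plain average, used for the counts (instructions, loads/stores, ...)."""
    return sum(values) / len(values)


def harmonic_mean(values):
    """Harmonic mean, used for the per-instruction rates."""
    return len(values) / sum(1.0 / v for v in values)


# Load/store rates of the 12 integer benchmarks, copied from Table 2.
int_rates = [0.222172, 0.407886, 0.314348, 0.343071, 0.329455, 0.287829,
             0.449258, 0.437868, 0.322577, 0.392590, 0.352450, 0.280860]
int_mean_rate = harmonic_mean(int_rates)

# "FP Mean" load/store rate, copied from Table 2 rather than recomputed.
fp_mean_rate = 0.326063

# The overall rate is the harmonic mean of the two group means, so the
# integer and floating-point suites carry equal weight.
overall_rate = harmonic_mean([int_mean_rate, fp_mean_rate])

# Counts are combined the same way, but with arithmetic means.
int_mean_instructions = 298_496_261_472  # "Int Mean" from Table 2
fp_mean_instructions = 278_223_239_294   # "FP Mean" from Table 2
overall_instructions = arithmetic_mean(
    [int_mean_instructions, fp_mean_instructions])

print(f"Int mean load/store rate:  {int_mean_rate:.6f}")
print(f"Overall load/store rate:   {overall_rate:.6f}")
print(f"Overall mean instructions: {overall_instructions:,.0f}")
```

Averaging the two group means, rather than all 26 benchmarks at once, keeps the 14 floating-point benchmarks from outweighing the 12 integer ones.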
The table below (Table 3) shows how often a new block must be obtained from the cache to satisfy the request. For example, one instruction cache access returns a block of 16 instructions, many of which may be executed before a different block must be obtained (typically 10 for these benchmarks).
This data was obtained by simulating caches with a single 64-Byte block, and may be more meaningful for implementations that can buffer blocks or merge requests. The means are computed in the same way as Table 2 above.
Benchmark | I$-Access | I$-Access Rate | D$-Access | D$-Access Rate | U$-Access | U$-Access Rate |
---|---|---|---|---|---|---|
164.gzip | 54,470,768,520 | 0.113804 | 76,846,749,520 | 0.160554 | 264,464,152,198 | 0.552537 |
175.vpr | 9,783,417,461 | 0.116374 | 26,450,333,116 | 0.314628 | 75,711,030,995 | 0.900585 |
176.gcc | 31,450,635,056 | 0.129494 | 43,842,754,917 | 0.180516 | 181,733,952,330 | 0.748264 |
181.mcf | 9,766,961,907 | 0.157869 | 13,989,421,988 | 0.226119 | 52,061,791,386 | 0.841506 |
186.crafty | 21,974,406,397 | 0.114520 | 47,422,033,028 | 0.247140 | 145,207,265,223 | 0.756749 |
197.parser | 67,980,418,633 | 0.124335 | 99,595,662,954 | 0.182159 | 373,494,790,264 | 0.683118 |
252.eon | 25,652,922,924 | 0.106991 | 62,785,515,113 | 0.261861 | 235,089,422,313 | 0.980493 |
253.perlbmk | 47,401,063,428 | 0.116253 | 105,176,984,033 | 0.257950 | 395,433,450,861 | 0.969814 |
254.gap | 25,606,586,268 | 0.119761 | 49,105,988,033 | 0.229667 | 161,143,617,345 | 0.753662 |
255.vortex | 51,181,968,365 | 0.131001 | 103,480,838,369 | 0.264861 | 348,800,683,706 | 0.892761 |
256.bzip2 | 45,086,528,330 | 0.119476 | 79,924,432,801 | 0.211793 | 302,917,293,906 | 0.802706 |
300.twolf | 35,524,277,102 | 0.102528 | 73,552,792,605 | 0.212283 | 222,877,780,050 | 0.643254 |
168.wupwise | 32,803,956,134 | 0.093826 | 54,126,522,525 | 0.154814 | 216,266,923,190 | 0.618570 |
171.swim | 15,463,978,992 | 0.068476 | 69,220,294,011 | 0.306514 | 159,885,506,394 | 0.707987 |
172.mgrid | 26,737,536,917 | 0.063789 | 140,740,025,878 | 0.335770 | 324,975,820,808 | 0.775310 |
173.applu | 14,351,426,465 | 0.064102 | 68,246,688,564 | 0.304831 | 180,277,378,414 | 0.805228 |
177.mesa | 30,602,545,093 | 0.108637 | 60,630,249,848 | 0.215234 | 225,901,437,732 | 0.801938 |
178.galgel | 41,243,445,388 | 0.100749 | 155,587,844,182 | 0.380070 | 389,597,173,029 | 0.951707 |
179.art | 13,144,486,961 | 0.151379 | 26,652,374,302 | 0.306944 | 69,084,709,620 | 0.795619 |
183.equake | 11,658,423,445 | 0.088645 | 43,425,718,299 | 0.330187 | 119,186,908,018 | 0.906236 |
187.facerec | 18,076,470,614 | 0.085659 | 40,091,054,768 | 0.189980 | 142,125,671,080 | 0.673492 |
188.ammp | 27,201,946,160 | 0.083301 | 76,187,755,090 | 0.233312 | 241,195,708,949 | 0.738621 |
189.lucas | 10,063,843,673 | 0.070674 | 24,380,575,286 | 0.171213 | 70,125,610,574 | 0.492459 |
191.fma3d | 21,390,682,566 | 0.079706 | 81,415,531,600 | 0.303370 | 246,070,861,667 | 0.916907 |
200.sixtrack | 32,179,807,823 | 0.068330 | 85,712,543,482 | 0.182000 | 240,948,906,996 | 0.511624 |
301.apsi | 27,762,425,232 | 0.079794 | 92,471,252,932 | 0.265780 | 270,809,995,354 | 0.778359 |
Int Total | 425,879,954,391 | | 782,173,506,477 | | 2,758,935,230,577 | |
Int Mean | 35,489,996,199 | 0.119680 | 65,181,125,539 | 0.221556 | 229,911,269,214 | 0.772763 |
FP Total | 322,680,975,463 | | 1,018,888,430,767 | | 2,896,452,611,825 | |
FP Mean | 23,048,641,104 | 0.081825 | 72,777,745,054 | 0.243528 | 206,889,472,273 | 0.720977 |
Ovrl Total | 748,560,929,854 | | 1,801,061,937,244 | | 5,655,387,842,402 | |
Ovrl Mean | 29,269,318,651 | 0.097197 | 68,979,435,297 | 0.232023 | 218,400,370,743 | 0.745973 |
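As a rough illustration of how to read the access rates in Table 3, the reciprocal of a block-access rate gives the average number of instructions executed per new 64-Byte block. The sketch below uses a few I-cache rates copied from the table; the helper name is ours.

```python
def instructions_per_block_access(access_rate):
    """Average number of instructions executed per new block access.

    access_rate is block accesses per instruction, as listed in Table 3.
    """
    return 1.0 / access_rate


# I$-Access Rate values copied from Table 3.
icache_access_rates = {
    "164.gzip": 0.113804,
    "Int Mean": 0.119680,
    "FP Mean": 0.081825,
    "Ovrl Mean": 0.097197,
}

for name, rate in icache_access_rates.items():
    per_block = instructions_per_block_access(rate)
    print(f"{name:10s} {per_block:5.1f} instructions per I-cache block access")
```

The overall mean works out to roughly 10 instructions per block, out of the 16 that a 64-Byte block can hold.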
All miss-ratio tables (.tab files) are ASCII text. They include the name and command line for each benchmark; the number of instructions, data references, and data prefetches; and the miss-ratios (misses/instruction, rounded to 9 places) for cache sizes ranging from 1KB to 1MB and associativities of 1, 2, 4, 8, and full. The tables also contain compulsory miss-rates and access-rates. Both arithmetic means and harmonic means were computed for the data sets of each benchmark (rounded to 8 places). The means for each benchmark were then averaged together and rounded to 7 places. In all cases the block size was 64B and the replacement policy was LRU (least recently used). Compulsory miss-rates were measured as the miss-rate of a 2-way set-associative 256MB cache with no flushing on system calls (rounded to 12 places). Access-rates were measured as the miss-rate of a direct-mapped 64B cache, that is, a cache with just one block. Note that there is sufficient data to calculate the 3C's (compulsory, capacity, and conflict misses) for the various configurations. See the example below.
 -----------------------------------------------------------------------------
| U-cache misses/inst: 584,975,927,483 unified refs (1.231289-/inst);         |
|-----------------------------------------------------------------------------|
| 264,464,152,198 U-cache 64-Byte block accesses (0.560264-/inst)             |
|-----------------------------------------------------------------------------|
| Size  |   Direct    |  2-way LRU  |  4-way LRU  |  8-way LRU  |  Full LRU   |
|-------+-------------+-------------+-------------+-------------+-------------|
| 1KB   | 0.17096965- | 0.11586859- | 0.10006949- | 0.09539379- | 0.09356626- |
| 2KB   | 0.09933510- | 0.07301168- | 0.06116419- | 0.05846171- | 0.06111943- |
| 4KB   | 0.06756154- | 0.04373756- | 0.03517036- | 0.02650259- | 0.02410843- |
| 8KB   | 0.05398704- | 0.02824148- | 0.02123935- | 0.02024346- | 0.01982071- |
| 16KB  | 0.03316842- | 0.02309782- | 0.01727542- | 0.01709368- | 0.01694758- |
| 32KB  | 0.02622252- | 0.01814185- | 0.01381846- | 0.01369148- | 0.01354134- |
| 64KB  | 0.01397891- | 0.01160836- | 0.00835915- | 0.00821407- | 0.00807335- |
| 128KB | 0.00583968- | 0.00267375- | 0.00189210- | 0.00172267- | 0.00151421- |
| 256KB | 0.00343062- | 0.00054402- | 0.00040227- | 0.00038742- | 0.00038681- |
| 512KB | 0.00198606- | 0.00033332- | 0.00027255- | 0.00026623- | 0.00026589- |
| 1MB   | 0.00193081- | 0.00026416- | 0.00026161- | 0.00026133- | 0.00026133- |
 -----------------------------------------------------------------------------
Compulsory: 0.00001698143-
In this example (164.gzip), a 32KB 2-way set-associative L1 unified cache with 64-Byte blocks has approximately 18 cache misses per 1,000 instructions.
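The 3C's for a configuration can be derived from one row of such a table. The sketch below assumes the usual definitions: compulsory misses are those of the very large cache, capacity misses are the fully associative LRU misses minus the compulsory misses, and conflict misses are whatever the set-associative cache adds beyond the fully associative one. The function name is ours, and the numbers are the 32KB row and the compulsory rate from the 164.gzip example.

```python
def three_cs(miss_set_assoc, miss_full_lru, miss_compulsory):
    """Split a misses/instruction figure into the 3C's.

    All arguments and results are in misses per instruction.
    """
    compulsory = miss_compulsory
    capacity = miss_full_lru - miss_compulsory
    conflict = miss_set_assoc - miss_full_lru
    return compulsory, capacity, conflict


# 164.gzip, unified cache, 32KB row of the example table.
miss_2way = 0.01814185         # 2-way LRU column
miss_full = 0.01354134         # Full LRU column
miss_cold = 0.00001698143      # "Compulsory" line

compulsory, capacity, conflict = three_cs(miss_2way, miss_full, miss_cold)

# Misses per 1,000 instructions for the 2-way configuration.
mpki = 1000 * miss_2way

print(f"compulsory: {compulsory:.8f}")
print(f"capacity:   {capacity:.8f}")
print(f"conflict:   {conflict:.8f}")
print(f"misses per 1,000 instructions: {mpki:.1f}")
```

With these definitions the conflict component can come out negative when a set-associative cache happens to miss less often than the fully associative one, as in the 2KB row, where the 8-way figure is lower than the fully associative figure.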
The tables of miss-ratios are organized into a set of files. For a given number of simulated instructions, there is one file for each benchmark-dataset combination, two files for the arithmetic and harmonic means of all the datasets for each benchmark, and six files for arithmetic and harmonic means of all the benchmarks (results for just the integer benchmarks and just the floating point benchmarks are provided). Each of the files contains seven tables. The first is for instruction caches; the second, third, and fourth for data caches; and the fifth, sixth, and seventh for unified caches. For the data caches and unified caches, the first table contains the miss ratios for all of the associated memory references, while the second does not count references or misses caused by prefetch operations (they do, however, affect cache state), and the third has statistics for only the prefetch operations.
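For readers who want to process the tables programmatically, the following is a minimal sketch of a parser for the data rows. It assumes that every data row looks like the rows in the 164.gzip example earlier (a size label followed by five miss ratios, each ending in a `-` character) and that the trailing `-` can simply be stripped; the names `parse_rows` and `ASSOCIATIVITIES` are ours and not part of the data set. Because each file holds seven tables, a real script would also need to split the file into tables first, which is not shown here.

```python
import re

# Column order of the miss-ratio tables, following the example header row.
ASSOCIATIVITIES = ["direct", "2-way", "4-way", "8-way", "full"]

# A data row starts with a size such as "32KB" or "1MB" in the first cell.
ROW = re.compile(r"^\|\s*(\d+(?:KB|MB))\s*\|(.*)\|\s*$")


def parse_rows(lines):
    """Return {size: {associativity: misses_per_instruction}}."""
    table = {}
    for line in lines:
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, separator, or unrelated line
        cells = [c.strip().rstrip("-") for c in match.group(2).split("|")]
        if len(cells) != len(ASSOCIATIVITIES):
            continue  # not a row in the assumed layout
        table[match.group(1)] = dict(
            zip(ASSOCIATIVITIES, (float(c) for c in cells)))
    return table


# Three lines from the example table, used as sample input.
sample = [
    "| Size  |   Direct    |  2-way LRU  |  4-way LRU  |  8-way LRU  |  Full LRU   |",
    "|-------+-------------+-------------+-------------+-------------+-------------|",
    "| 32KB  | 0.02622252- | 0.01814185- | 0.01381846- | 0.01369148- | 0.01354134- |",
    "| 1MB   | 0.00193081- | 0.00026416- | 0.00026161- | 0.00026133- | 0.00026133- |",
]

ratios = parse_rows(sample)
print(ratios["32KB"]["2-way"])
```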
The miss ratios were calculated from data collected by functional, user-mode simulations of optimized benchmarks. As a result, the cache miss ratios reported above may not be representative of a real platform. A few sources of error are discussed below.
First, only primary misses were counted by the simulator. Once a reference missed in the cache, the data was loaded and all subsequent accesses to the line hit. A modern processor may also experience secondary misses, or references to data that has yet to be loaded from a prior cache miss. There is a nonzero miss latency, and a real processor may execute other instructions while waiting for the data. The sequential model used in functional simulations is optimistic in this respect.
Second, a modern processor will have optimizations that affect cache performance. Hardware prefetching of instructions and data can have the positive effect of reducing the number of cache misses. However, prefetching can also cause cache pollution. Further, speculative execution can result in increased memory traffic for speculatively issued loads, and I-cache pollution from incorrect branch predictions. This also makes the results optimistic.
Third, the operating system was ignored. System calls cause additional cache misses to bring in OS code and data, and in doing so they replace cache lines from the user program. This increases the number of conflict and capacity misses for the user program in a real system. Since the additional misses from OS intervention were not modeled, our results are optimistic (though our experiments showed that these benchmarks typically spend less than 0.1% of their time in the OS). One possibility is to flush the caches on system calls. However, this is the other extreme, and would have made it impossible to measure the compulsory miss rates.
Fourth, the benchmarks were optimized for an Alpha 21264 processor. The binaries may have been tuned to perform well with the 21264 cache hierarchy (split 64KB 2-way set-associative L1 caches). Ideally, the binary should not favor a particular cache configuration. Further, the binary contains no-ops for alignment and steering of dependent operations in the clustered microarchitecture of the 21264. These no-ops increase the overall instruction count for the functional simulation.
Fifth, since this is a functional simulation, the timeliness of prefetch operations is not considered. Prefetch operations can only prevent a cache miss on a demand reference if they are initiated early enough. Here, all subsequent accesses to a prefetched block hit in the cache. However, experiments with several benchmarks indicate that the compiler inserted prefetches sufficiently far in advance of the first use of data to cover an L1 miss, with 10 to 100 committed instructions between a prefetch and the first use.
Data in this directory is correct to the best of our knowledge. However, we provide it *AS IS*, without any express or implied warranty, and we accept no responsibility for the consequences of the use or misuse of this data.
Last updated May 2003, jfc.