January 2002
Jason F. Cantin
Department of Electrical and Computer Engineering
1415 Engineering Drive
University of Wisconsin-Madison
Madison, WI 53706-1691
jcantin@ece.wisc.edu
http://www.jfred.org
Mark D. Hill
Department of Computer Science
1210 West Dayton Street
University of Wisconsin-Madison
Madison, WI 53706-1685
markhill@cs.wisc.edu
http://www.cs.wisc.edu/~markhill
http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data
The SPEC CPU2000 benchmark suite (http://www.spec.org/osg/cpu2000) is a collection of 26 compute-intensive, non-trivial programs used to evaluate the performance of a computer's CPU, memory system, and compilers. The benchmarks in this suite were chosen to represent real-world applications, and thus exhibit a wide range of runtime behaviors. On this webpage, we present functional cache miss ratios and related statistics for the SPEC CPU2000 suite. In particular, we report miss ratios for split L1 caches ranging in size from 1KB to 1MB, with 64B blocks and associativities of 1, 2, 4, 8, and full. Most of this data was collected at the University of Wisconsin-Madison with the aid of the Simplescalar toolset (http://www.simplescalar.org).
All functional data was collected with simulators from the Alpha version of the Simplescalar toolset, version 3.0. These include sim-cache, sim-cheetah, and sim-outorder. Some of these simulators were modified for the task (for example, sim-cheetah was modified to handle programs longer than 2 billion instructions). For interval cache data, the simulators were modified to print statistics every 100 million executed instructions. A combination of Perl and Tcsh scripts was used to launch, manage, and process the results of these simulations.
All benchmarks were compiled statically with heavy optimization for the Alpha AXP instruction set. Optimizations were targeted at the Alpha 21264 processor implementation, and include prefetches, square-root instructions, byte/word memory operations, and no-ops for alignment. All benchmarks were run to completion with all reference inputs, with three exceptions. Two of the data sets for Perl (253.perlbmk) required new processes to be spawned, which is not supported by the Simplescalar tools at this time. One of the data sets for 175.vpr does not produce correct results due to an undocumented bug in the Simplescalar emulation of the Alpha ISA. The benchmarks simulated execute over 7 trillion instructions for the reference input sets. Generating the functional L1 miss-ratio tables required simulating over 400 trillion instructions. The simulation load for all functional and timing-based simulations to be reported here totals 30 CPU-years.
All simulations were carried out on a combination of x86/Linux machines and Alpha/Tru64 servers in the University of Wisconsin-Madison's Computer Science Department and ECE Department. The majority of this load was managed by the Condor system, which distributed these jobs to vacant machines throughout the CS building.
All cache configurations simulated had 64B blocks for both L1 and L2 caches. All data reported here is for the LRU replacement policy, though data for other replacement policies was collected and may be placed on this site soon. This data does not include operating system effects, and caches were not flushed periodically nor on system calls.
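To make the term "functional miss ratio" concrete, the following is a minimal sketch of a set-associative cache with 64B blocks and LRU replacement, never flushed, processing references strictly in order. It is not the Simplescalar code; the class name `LRUCache` and the helper `demo` are hypothetical names introduced only for this illustration.

```python
from collections import OrderedDict

BLOCK_SIZE = 64  # bytes, as in all configurations reported here


class LRUCache:
    """Functional model of a set-associative cache with LRU replacement."""

    def __init__(self, size_bytes, assoc=None, block_size=BLOCK_SIZE):
        num_blocks = size_bytes // block_size
        # assoc=None means fully associative: one set holding every block.
        self.assoc = num_blocks if assoc is None else assoc
        self.num_sets = num_blocks // self.assoc
        self.block_size = block_size
        # One OrderedDict per set; key order tracks recency (oldest first).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.refs = 0
        self.misses = 0

    def access(self, addr):
        """Reference one byte address; return True on a hit."""
        block = addr // self.block_size
        lru = self.sets[block % self.num_sets]
        self.refs += 1
        if block in lru:
            lru.move_to_end(block)  # mark as most recently used
            return True
        self.misses += 1
        if len(lru) >= self.assoc:
            lru.popitem(last=False)  # evict the least recently used block
        lru[block] = True
        return False

    def miss_ratio(self):
        return self.misses / self.refs if self.refs else 0.0


def demo(assoc):
    """Run a tiny made-up address trace through a 1KB cache."""
    cache = LRUCache(1024, assoc=assoc)
    # Addresses 0 and 1024 are 1KB apart, so they collide when direct-mapped.
    for addr in [0, 1024, 0, 1024, 1032, 64]:
        cache.access(addr)
    return cache.misses, cache.refs
```

On this toy trace, `demo(1)` (direct-mapped) takes 5 misses in 6 references, while `demo(2)` and `demo(None)` (fully associative) take 3, because the two colliding blocks can then coexist in one set.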
We have collected data for all 26 benchmarks with reference inputs: 12 integer, 14 floating-point.
| Benchmark | Language | Type | Category | In SPEC95? |
|---|---|---|---|---|
| 164.gzip | C | Integer | Compression | No |
| 175.vpr | C | Integer | FPGA Circuit Placement and Routing | No |
| 176.gcc | C | Integer | C Compiler | Yes |
| 181.mcf | C | Integer | Combinatorial Optimization | No |
| 186.crafty | C | Integer | Game Playing: Chess | No |
| 197.parser | C | Integer | Word processing | No |
| 252.eon | C++ | Integer | Computer Visualization | No |
| 253.perlbmk | C | Integer | PERL Programming Language | Yes |
| 254.gap | C | Integer | Group Theory, Interpreter | No |
| 255.vortex | C | Integer | Object-oriented Database | Yes |
| 256.bzip2 | C | Integer | Compression | No |
| 300.twolf | C | Integer | Place and Route Simulator (CAE) | No |
| 168.wupwise | Fortran77 | Floating-Point | Physics, Quantum Chromodynamics | No |
| 171.swim | Fortran77 | Floating-Point | Shallow Water Modeling | Yes |
| 172.mgrid | Fortran77 | Floating-Point | Multi-grid Solver: 3D Potential Field | Yes |
| 173.applu | Fortran77 | Floating-Point | Parabolic/Elliptic Partial Diff. Eqns | Yes |
| 177.mesa | C | Floating-Point | 3-D Graphics Library | No |
| 178.galgel | Fortran90 | Floating-Point | Computational Fluid Dynamics | No |
| 179.art | C | Floating-Point | Image Recognition / Neural Nets | No |
| 183.equake | C | Floating-Point | Seismic Wave Propagation | No |
| 187.facerec | Fortran90 | Floating-Point | Image Processing: Face Recognition | No |
| 188.ammp | C | Floating-Point | Computational Chemistry | No |
| 189.lucas | Fortran90 | Floating-Point | Number Theory / Primality Testing | No |
| 191.fma3d | Fortran90 | Floating-Point | Finite-Element Crash Simulation | No |
| 200.sixtrack | Fortran77 | Floating-Point | High Energy Nuclear Physics Accelerator Design | No |
| 301.apsi | Fortran77 | Floating-Point | Meteorology: Pollutant Distribution | Yes |
The following summary information is independent of the cache size and associativity simulated. The third and fifth columns give the fraction of instruction fetches and data references that access a different 64B block than the previous access (i.e., the data was not obtained in the last cache access). For example, one instruction cache access returns a block of 16 instructions, many of which may be executed before a different block must be accessed (typically 10 for these benchmarks). This data was obtained by simulating caches with a single 64B block.
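The single-block measurement can be sketched as follows: a reference counts as a new cache access only when its 64B block differs from the block touched by the previous reference. This is an illustrative sketch, not the simulator's code; `block_access_ratio` and `demo` are hypothetical names, and the demo trace is made up.

```python
BLOCK_SIZE = 64  # bytes


def block_access_ratio(addresses, block_size=BLOCK_SIZE):
    """Fraction of references that touch a different block than the previous one.

    This equals the miss ratio of a cache that holds a single block.
    """
    accesses = 0
    refs = 0
    current = None  # block currently held by the one-block cache
    for addr in addresses:
        refs += 1
        block = addr // block_size
        if block != current:
            accesses += 1
            current = block
    return accesses / refs if refs else 0.0


def demo():
    # 32 sequential 4-byte Alpha instructions starting on a block boundary
    # span exactly two 64B blocks (16 instructions each).
    pcs = [0x1000 + 4 * i for i in range(32)]
    return block_access_ratio(pcs)
```

Straight-line code gives 2/32 = 0.0625, the lower bound for 4-byte instructions. Taken branches leave a block before all 16 instructions are used, which is consistent with the measured instruction ratios in the table being closer to 0.1.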
| Benchmark | Instructions | I-Access/Inst | Data Refs | D-Access/Ref | Refs/Inst | % User Inst |
|---|---|---|---|---|---|---|
| 164.gzip | 478,636,174,329 | 0.1138 | 142,700,878,428 | 0.6451 | 0.2981 | 99.9 |
| 175.vpr | 84,125,622,844 | 0.1164 | 37,067,564,576 | 0.7811 | 0.4406 | 99.8 |
| 176.gcc | 243,597,914,726 | 0.1295 | 116,093,336,744 | 0.6229 | 0.4766 | 98.9 |
| 181.mcf | 61,870,158,860 | 0.1579 | 23,056,352,854 | 0.6465 | 0.3727 | 99.9 |
| 186.crafty | 191,882,992,412 | 0.0747 | 70,222,383,696 | 0.1291 | 0.3660 | 99.9 |
| 197.parser | 546,769,649,600 | 0.1243 | 190,517,359,797 | 0.6119 | 0.3484 | 99.9 |
| 252.eon | 239,768,148,508 | 0.1070 | 118,246,210,844 | 0.5680 | 0.4932 | 99.9 |
| 253.perlbmk | 143,122,956,639 | 0.1163 | 61,829,661,201 | 0.5805 | 0.4320 | 99.9 |
| 254.gap | 213,813,801,949 | 0.1198 | 80,924,423,445 | 0.6642 | 0.3785 | 99.9 |
| 255.vortex | 390,700,613,872 | 0.1310 | 161,133,019,186 | 0.6672 | 0.4124 | 99.9 |
| 256.bzip2 | 377,370,326,800 | 0.1179 | 145,002,261,443 | 0.6787 | 0.3842 | 99.9 |
| 300.twolf | 346,489,363,383 | 0.1025 | 111,857,479,345 | 0.7525 | 0.3228 | 99.9 |
| 168.wupwise | 349,623,875,977 | 0.0938 | 107,613,170,820 | 0.5629 | 0.3078 | 99.9 |
| 171.swim | 225,830,970,951 | 0.0685 | 74,341,437,755 | 0.9314 | 0.3292 | 99.9 |
| 172.mgrid | 419,156,008,460 | 0.0638 | 153,909,315,484 | 0.9151 | 0.3672 | 99.9 |
| 173.applu | 223,883,653,813 | 0.0641 | 85,459,068,028 | 0.7993 | 0.3817 | 99.9 |
| 177.mesa | 281,775,068,600 | 0.1086 | 108,712,910,562 | 0.5886 | 0.3858 | 99.8 |
| 178.galgel | 409,366,700,368 | 0.1008 | 178,742,879,478 | 0.8742 | 0.4366 | 99.9 |
| 179.art | 86,834,976,688 | 0.1514 | 30,279,186,530 | 0.9089 | 0.3487 | 99.9 |
| 183.equake | 131,518,705,120 | 0.0886 | 58,248,603,550 | 0.7544 | 0.4429 | 99.9 |
| 187.facerec | 211,027,395,856 | 0.0857 | 66,872,909,521 | 0.6323 | 0.3169 | 99.9 |
| 188.ammp | 326,549,217,724 | 0.0833 | 125,189,421,217 | 0.6624 | 0.3834 | 99.8 |
| 189.lucas | 142,398,814,292 | 0.0707 | 31,507,111,538 | 0.7807 | 0.2213 | 99.9 |
| 191.fma3d | 268,361,331,300 | 0.0797 | 118,043,791,674 | 0.7015 | 0.4399 | 99.9 |
| 200.sixtrack | 470,950,788,817 | 0.0683 | 116,965,122,302 | 0.7646 | 0.2484 | 98.2 |
| 301.apsi | 347,923,962,507 | 0.0798 | 129,508,475,036 | 0.7299 | 0.3722 | 99.8 |
| Int Total | 3,318,147,723,922 | 0.1189 | 1,258,650,931,559 | 0.6376 | 0.3793 | |
| Int Mean | 276,512,310,327 | 0.1176 | 104,887,577,630 | 0.6123 | 0.3938 | |
| FP Total | 3,895,201,470,473 | 0.0828 | 1,385,393,403,495 | 0.7560 | 0.3557 | |
| FP Mean | 278,228,676,462 | 0.0862 | 98,956,671,679 | 0.7576 | 0.3559 | |
| Ovrl Total | 7,213,349,194,415 | 0.1001 | 2,644,044,335,048 | 0.6968 | 0.3665 | |
| Ovrl Mean | 277,436,507,478 | 0.1007 | 101,694,012,887 | 0.6905 | 0.3734 | |
Note: For columns that already contain ratios, the "Total" represents the sum of all the numerators divided by the sum of all the denominators, and the "Mean" represents the arithmetic means of the computed ratios.
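As a check on the two definitions in the note, the sketch below recomputes the Refs/Inst "Total" and "Mean" for the 12 integer benchmarks from the instruction and data-reference counts in the table. The function names are hypothetical and chosen only for this illustration.

```python
# (instructions, data references) for the integer benchmarks, from the table.
INT_BENCHMARKS = {
    "164.gzip": (478_636_174_329, 142_700_878_428),
    "175.vpr": (84_125_622_844, 37_067_564_576),
    "176.gcc": (243_597_914_726, 116_093_336_744),
    "181.mcf": (61_870_158_860, 23_056_352_854),
    "186.crafty": (191_882_992_412, 70_222_383_696),
    "197.parser": (546_769_649_600, 190_517_359_797),
    "252.eon": (239_768_148_508, 118_246_210_844),
    "253.perlbmk": (143_122_956_639, 61_829_661_201),
    "254.gap": (213_813_801_949, 80_924_423_445),
    "255.vortex": (390_700_613_872, 161_133_019_186),
    "256.bzip2": (377_370_326_800, 145_002_261_443),
    "300.twolf": (346_489_363_383, 111_857_479_345),
}


def total_ratio(rows):
    """'Total': sum of numerators divided by sum of denominators."""
    insts = sum(i for i, _ in rows)
    refs = sum(r for _, r in rows)
    return refs / insts


def mean_ratio(rows):
    """'Mean': arithmetic mean of the per-benchmark ratios."""
    ratios = [r / i for i, r in rows]
    return sum(ratios) / len(ratios)


rows = list(INT_BENCHMARKS.values())
int_total = total_ratio(rows)
int_mean = mean_ratio(rows)
```

Here `int_total` comes out at approximately 0.3793 and `int_mean` at approximately 0.3938, in line with the "Int Total" and "Int Mean" rows. The two differ because the total weights each benchmark by its instruction count, while the mean weights all benchmarks equally.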
All miss-ratio tables are in ASCII text format, generated with Perl scripts. They include the name of the file, the name of the benchmark, the command line for the benchmark, the number of instructions, the number of data references, miss ratios (misses/reference) for a set of cache sizes and associativities, and compulsory miss ratios. For each benchmark and data set, miss ratios are rounded to 8 decimal places. The computed arithmetic means for each benchmark are rounded to 7 digits, and the overall means are rounded to 6 digits. Miss ratios are reported for sizes of 1KB - 1MB, with associativities of 1-way, 2-way, 4-way, 8-way, and full. In all cases the block size was 64B and the replacement policy was LRU. Compulsory miss ratios were measured as the miss ratio of a fully-associative 256MB cache with no flushing, and rounded to 12 places. Note that there is sufficient data to calculate the 3C's (compulsory, capacity, and conflict misses) for the various configurations. See the example below (overall arithmetic mean for selected benchmarks).
Arithmetic Mean for Instruction References (block size: 64 bytes, replacement: LRU; columns are associativities)
| Size | 1 | 2 | 4 | 8 | full |
|---|---|---|---|---|---|
| 1K | 0.040115-- | 0.038059-- | 0.038609-- | 0.038631-- | 0.038770-- |
| 2K | 0.028248-- | 0.026708-- | 0.026033-- | 0.026023-- | 0.026006-- |
| 4K | 0.019655-- | 0.017775-- | 0.017586-- | 0.017514-- | 0.017421-- |
| 8K | 0.013024-- | 0.011229-- | 0.010171-- | 0.010013-- | 0.009931-- |
| 16K | 0.007394-- | 0.004766-- | 0.003666-- | 0.003405-- | 0.004296-- |
| 32K | 0.003237-- | 0.001233-- | 0.000651-- | 0.000388-- | 0.000239-- |
| 64K | 0.001060-- | 0.000360-- | 0.000127-- | 0.000049-- | 0.000016-- |
| 128K | 0.000454-- | 0.000148-- | 0.000014-- | 0.000004-- | 0.000002-- |
| 256K | 0.000090-- | 0.000011-- | 0.000002-- | 0.000001-- | 0.000001-- |
| 512K | 0.000009-- | 0.000003-- | 0.000001-- | 0.000000-- | 0.000001-- |
| 1024K | 0.000000-- | 0.000000-- | 0.000000-- | 0.000000-- | 0.000000-- |
Compulsory: 0.0000000416--
Arithmetic Mean for Data References (block size: 64 bytes, replacement: LRU; columns are associativities)
| Size | 1 | 2 | 4 | 8 | full |
|---|---|---|---|---|---|
| 1K | 0.275311-- | 0.232072-- | 0.207868-- | 0.191097-- | 0.185660-- |
| 2K | 0.191787-- | 0.155995-- | 0.137516-- | 0.123602-- | 0.115772-- |
| 4K | 0.145548-- | 0.114026-- | 0.105337-- | 0.094777-- | 0.089413-- |
| 8K | 0.106719-- | 0.085133-- | 0.078486-- | 0.074013-- | 0.069963-- |
| 16K | 0.082798-- | 0.067679-- | 0.064007-- | 0.061553-- | 0.059314-- |
| 32K | 0.069504-- | 0.056942-- | 0.055286-- | 0.053659-- | 0.052217-- |
| 64K | 0.060102-- | 0.052060-- | 0.050989-- | 0.049836-- | 0.048541-- |
| 128K | 0.051134-- | 0.048766-- | 0.048341-- | 0.046895-- | 0.045834-- |
| 256K | 0.046695-- | 0.044774-- | 0.044566-- | 0.044497-- | 0.043546-- |
| 512K | 0.041238-- | 0.040808-- | 0.041690-- | 0.041878-- | 0.040885-- |
| 1024K | 0.033697-- | 0.032618-- | 0.033644-- | 0.034391-- | 0.034436-- |
Compulsory: 0.0000293378--
For example, for a 4KB direct-mapped L1 data cache with 64-byte blocks, approximately 146 out of every 1,000 data references miss. Neglecting the operating system, about 29 out of every 1,000,000 data references cause a compulsory miss.
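The 3C's mentioned above can be derived from three numbers for a given cache size. The sketch below assumes the usual decomposition: compulsory misses are those of the 256MB fully-associative cache, capacity misses are the additional misses of a fully-associative cache of the given size, and conflict misses are the further misses of the set-associative cache. The function name `three_cs` is hypothetical.

```python
def three_cs(miss_ratio, full_assoc_miss_ratio, compulsory_ratio):
    """Split a miss ratio into compulsory, capacity, and conflict parts.

    miss_ratio: miss ratio of the cache of interest.
    full_assoc_miss_ratio: fully-associative cache of the same size.
    compulsory_ratio: the "Compulsory" value reported with the table.
    """
    return {
        "compulsory": compulsory_ratio,
        "capacity": full_assoc_miss_ratio - compulsory_ratio,
        "conflict": miss_ratio - full_assoc_miss_ratio,
    }


def demo():
    # 4KB direct-mapped data cache, using the mean data-reference values above.
    return three_cs(0.145548, 0.089413, 0.0000293378)
```

For the 4KB direct-mapped data cache this gives a conflict component of about 0.0561 and a capacity component of about 0.0894. The conflict term can come out negative under this definition: in the data table, the 1024K direct-mapped miss ratio (0.033697) is lower than the fully-associative one (0.034436), because LRU replacement in a fully-associative cache is not always better than a set-associative mapping.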
The miss ratios were calculated from data collected by functional, user-mode simulations of optimized benchmarks. As a result, the cache miss ratios reported above may not be representative of a real platform. A few sources of error are discussed below.
First, only primary misses were counted by the simulator. Once a reference missed in the cache, the data was loaded and all subsequent accesses to the line hit. A modern processor may also experience secondary misses, or references to data that has yet to be loaded from a prior cache miss. There is a nonzero miss latency, and a real processor may execute other instructions while waiting for the data. The sequential model used in functional simulations is optimistic in this respect.
Second, a modern processor will have optimizations that affect cache performance. Hardware prefetching of instructions and data can have the positive effect of reducing the number of cache misses. However, prefetching can also cause cache pollution. Further, speculative execution can result in increased memory traffic for speculatively issued loads, and I-cache pollution from incorrect branch predictions. This also makes the results optimistic.
Third, the operating system was ignored. System calls cause additional cache misses to bring in OS code and data, and in doing so they replace cache lines from the user program. This increases the number of conflict and capacity misses for the user program in a real system. Since the additional misses from OS intervention were not modeled, our results are optimistic. One possibility is to flush the caches on system calls. However, this is the other extreme, and would have made it impossible to measure the compulsory miss rates.
Fourth, all prefetch instructions (loads to R31) were treated as normal references. All were executed, and references from prefetch instructions were included in the overall statistics. Although prefetch instructions may prevent (or reduce the impact of) cache misses from instructions in the original code, the misses still occur (just sooner). However, prefetch instructions increase the overall hit ratio because the subsequent loads and stores that hit in the cache add to the overall hit count. One possibility is to ignore prefetch instructions altogether (the Alpha ISA allows this). Another possibility is to count the misses from the prefetches, but not count them as instructions.
Fifth, the benchmarks were optimized for an Alpha 21264 processor. The binaries may have been tuned to perform well with the 21264 cache hierarchy (64KB 2-way L1 caches). Ideally, the binary should not favor a particular cache configuration. Further, the binary contains no-ops for alignment and steering of dependent operations in the clustered microarchitecture of the 21264. These no-ops increase the overall instruction count for the functional simulation.
Data in this directory is correct to the best of our knowledge. However, we provide it *AS IS*, without express or implied warranty, and we accept no responsibility for the consequences of the use or misuse of this data.
Last updated January 2002, jfc. Report any dead links or errors to jfc.