

# **Decoupled Compressed Cache:**

**Exploiting Spatial Locality for Energy-Optimized Compressed Caching** 

Somayeh Sardashti and Professor David A. Wood

**University of Wisconsin-Madison** 

### **Optimizing Memory Hierarchy for Energy**

Maximize LLC effective capacity to reduce system energy!

Access to main memory vs. LLC: **6X Longer Latency, 60X Higher Energy Cost** 

Why not double the LLC? The LLC already takes 15%-30% of on-chip area, and doubling it means **2X LLC Area**.
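The 6X and 60X ratios above can be plugged into a simple average-cost model to see why fewer LLC misses matter more for energy than for latency. This is an illustrative sketch only: it normalizes an LLC hit to a cost of 1, assumes every miss pays the LLC lookup plus the main-memory access, and uses hypothetical miss rates that do not come from this work.

```python
# Illustrative model only; the miss rates used below are hypothetical.
MEM_LATENCY_RATIO = 6.0   # main-memory latency relative to an LLC hit (from the poster)
MEM_ENERGY_RATIO = 60.0   # main-memory energy relative to an LLC hit (from the poster)


def average_cost(miss_rate: float, mem_ratio: float) -> float:
    """Average cost per LLC access, normalized to one LLC hit.

    Assumes every access pays the LLC lookup (cost 1) and every miss
    additionally pays the main-memory access (cost mem_ratio).
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be in [0, 1]")
    return 1.0 + miss_rate * mem_ratio


def relative_saving(old_miss: float, new_miss: float, mem_ratio: float) -> float:
    """Fraction of the average cost saved when the miss rate drops."""
    old = average_cost(old_miss, mem_ratio)
    new = average_cost(new_miss, mem_ratio)
    return (old - new) / old


if __name__ == "__main__":
    # Hypothetical case: a larger effective LLC halves the miss rate.
    for name, ratio in (("latency", MEM_LATENCY_RATIO), ("energy", MEM_ENERGY_RATIO)):
        saving = relative_saving(0.20, 0.10, ratio)
        print(f"{name}: {saving:.1%} lower average cost")
```

Under these assumptions, the same miss-rate reduction removes a larger share of the average energy than of the average latency, because the energy ratio is ten times the latency ratio.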



#### Intel Nehalem

## **Exploiting Spatial Locality**



### **Potentials and Limits of Compressed Caches**

**Compressed Caching: Compressing and compacting cache blocks** 

### **Potentials:**

- + Higher effective cache size
- + Low area overhead
- + Higher system performance
- + Lower system energy

#### **Potentially 3.9X larger LLC**



### **Limits of previous work:**

- Limited number of tags
- Internal fragmentation
- **Energy-expensive compactions**
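Internal fragmentation arises when a compressed block must be rounded up to a whole number of fixed-size allocation units. The following is a minimal sketch of that arithmetic; the compressed sizes and the allocation granularities are hypothetical and are not taken from the poster.

```python
def allocated_bytes(compressed_size: int, granularity: int) -> int:
    """Round a compressed size up to a whole number of allocation units."""
    if compressed_size <= 0 or granularity <= 0:
        raise ValueError("sizes must be positive")
    units = (compressed_size + granularity - 1) // granularity  # ceiling division
    return units * granularity


def internal_fragmentation(sizes, granularity: int) -> int:
    """Total bytes that are allocated but hold no compressed data."""
    return sum(allocated_bytes(s, granularity) - s for s in sizes)


if __name__ == "__main__":
    sizes = [20, 33, 9, 64]  # hypothetical compressed block sizes in bytes
    for granularity in (32, 16, 8):
        wasted = internal_fragmentation(sizes, granularity)
        print(f"{granularity}-byte units: {wasted} bytes wasted")
```

In this toy example, finer allocation units waste fewer bytes, at the price of tracking more units per block.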



Figure: per-workload results for Baseline, FixedC, and VSC-2X. Workloads: apache, jbb, oltp, zeus, ammp, applu, equake, mgrid, wupwise, black, canneal, freqmine, m2 through m8, and GEOMEAN.

## **Decoupled Compressed Cache (DCC)**


### Decoupled **Super-Blocks**

### **Non-contiguous Sub-Blocks**

### (Co-)DCC

- DCC exploits spatial locality to improve compression effectiveness:
  - Uses decoupled super-blocking to track more blocks with low area overhead.
  - Compresses and allocates a block into non-contiguous data sub-blocks.
- **Co-DCC (Co-compacted DCC):** 
  - Co-compacts the blocks of a super-block to reduce internal fragmentation.
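The two DCC ideas listed above, one tag shared by the blocks of a super-block and data placed in whatever sub-blocks are free, can be sketched as a toy software model. This is not the authors' hardware design: `ToyDecoupledCache`, `SuperBlockTag`, the four blocks per super-block, the 16-byte sub-block, and the per-sub-block back pointer are all assumptions made for illustration, and replacement and Co-DCC co-compaction are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

BLOCKS_PER_SUPER_BLOCK = 4  # assumption, not stated on the poster
SUB_BLOCK_BYTES = 16        # assumption, not stated on the poster


@dataclass
class SuperBlockTag:
    """One tag shared by all blocks of a super-block."""
    valid: List[bool] = field(
        default_factory=lambda: [False] * BLOCKS_PER_SUPER_BLOCK
    )


class ToyDecoupledCache:
    """Hypothetical software model; the replacement policy is left out."""

    def __init__(self, num_sub_blocks: int) -> None:
        self.tags: Dict[int, SuperBlockTag] = {}
        # Per data sub-block: owner (super-block address, block id) or None.
        self.back_pointers: List[Optional[Tuple[int, int]]] = [None] * num_sub_blocks

    def lookup(self, super_addr: int, block_id: int) -> List[int]:
        """Return the sub-block indices holding a block ([] on a miss)."""
        tag = self.tags.get(super_addr)
        if tag is None or not tag.valid[block_id]:
            return []
        owner = (super_addr, block_id)
        return [i for i, bp in enumerate(self.back_pointers) if bp == owner]

    def evict(self, super_addr: int, block_id: int) -> None:
        """Free a block's sub-blocks; drop the tag when no block is left."""
        for i in self.lookup(super_addr, block_id):
            self.back_pointers[i] = None
        tag = self.tags.get(super_addr)
        if tag is not None:
            tag.valid[block_id] = False
            if not any(tag.valid):
                del self.tags[super_addr]

    def insert(self, super_addr: int, block_id: int, compressed_bytes: int) -> List[int]:
        """Place a compressed block into any free sub-blocks."""
        self.evict(super_addr, block_id)  # replace an older copy, if any
        needed = -(-compressed_bytes // SUB_BLOCK_BYTES)  # ceiling division
        free = [i for i, bp in enumerate(self.back_pointers) if bp is None]
        if len(free) < needed:
            raise MemoryError("no room; replacement is omitted in this sketch")
        chosen = free[:needed]
        for i in chosen:
            self.back_pointers[i] = (super_addr, block_id)
        self.tags.setdefault(super_addr, SuperBlockTag()).valid[block_id] = True
        return chosen


if __name__ == "__main__":
    cache = ToyDecoupledCache(num_sub_blocks=8)
    cache.insert(0x40, 0, 20)         # two sub-blocks
    cache.insert(0x40, 1, 16)         # one sub-block
    cache.evict(0x40, 0)              # leaves a hole at the front
    print(cache.insert(0x40, 2, 40))  # three sub-blocks, not all adjacent
    print(len(cache.tags))            # one tag covers the whole super-block
```

Because each sub-block records its owner, a block can fill a hole left by an earlier eviction without moving its neighbors, which is the property that lets this model avoid compaction.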

### **DCC Implementation**

- ✓ We integrate (Co-)DCC with the AMD Bulldozer LLC.
- ✓ No need for an alignment network.
- ✓ We implement the tag match and the sub-selection logic in Verilog.

Figures: per-workload bar charts comparing Baseline, 2X, FixedC, VSC-2X, DCC, and Co-DCC, with a GEOMEAN bar. Recoverable axis labels: Normalized Runtime, Normalized System …, Normalized LLC ….

## **Evaluation**

### **Experimental Methodology**

- We model a multicore system with GEMS.
- We use commercial workloads, SPEC-OMP, PARSEC, and SPEC CPU2006.
- We use CACTI to estimate (Co-)DCC power and area.

### **Results**

(Co-)DCC:

- Performs better than a conventional LLC of twice the capacity.
- Boosts system performance by 14% on average (up to 38%).
- Saves system energy by 12% on average (up to 39%).

### **Baseline Configuration**

| Component   | Configuration                               |
|-------------|---------------------------------------------|
| Cores       | Eight OOO cores, 3.2 GHz                    |
| L1I\$/L1D\$ | Private, 32-KB, 8-way                       |
| L2\$        | Private, 256-KB, 8-way                      |
| L3\$        | Shared, 8-MB, 16-way, 8 banks               |
| Main Memory | 4 GB, 16 banks, DDR3, 800 MHz bus frequency |
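As a quick sanity check on the shared L3 parameters above, the set count can be derived from capacity, associativity, and bank count. The 64-byte block size is an assumption, since the poster does not state it, and `cache_geometry` is a hypothetical helper.

```python
def cache_geometry(capacity_bytes: int, ways: int, banks: int, block_bytes: int = 64) -> dict:
    """Derive block and set counts for a set-associative, banked cache.

    block_bytes defaults to 64, which is an assumption and not a value
    taken from the poster.
    """
    blocks = capacity_bytes // block_bytes
    sets = blocks // ways
    return {"blocks": blocks, "sets": sets, "sets_per_bank": sets // banks}


if __name__ == "__main__":
    # Shared L3 from the table: 8 MB, 16-way, 8 banks.
    l3 = cache_geometry(8 * 1024 * 1024, ways=16, banks=8)
    print(l3)
```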