# Hardware Support for NVM Programming

### Outline

- Ordering
- Transactions
- Write endurance

### Volatile Memory Ordering

- Write-back caching
  - Improves performance
  - Reorders writes to DRAM



- Reordering to DRAM does not break correctness
- Memory consistency orders stores between CPUs

### Persistent Memory (PM) Ordering

• Recovery depends on write ordering

```
STORE data[0] = 0xF00D
STORE data[1] = 0xBEEF
STORE valid = 1
```



Reordering breaks recovery: if `valid` reaches NVM before `data[0]` and `data[1]`, recovery incorrectly treats garbage as valid data.
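To make the failure concrete, here is a minimal Python sketch. It is a toy model, not a description of real cache behaviour: each store is treated as a dirty line that may be written back in any order, and a crash may happen between any two write-backs. The names `crash_states` and `recovery_is_safe` are illustrative assumptions.

```python
from itertools import permutations

# The three stores from the example; None models uninitialized (garbage) contents.
STORES = [("data[0]", 0xF00D), ("data[1]", 0xBEEF), ("valid", 1)]
INITIAL = {"data[0]": None, "data[1]": None, "valid": 0}


def crash_states(stores, initial, orders=None):
    """Return every NVM image a crash could expose.

    Each order is one sequence in which dirty lines are written back;
    power may fail before the first write-back or after any of them.
    """
    if orders is None:
        orders = permutations(stores)  # the cache may reorder freely
    states = []
    for order in orders:
        nvm = dict(initial)
        states.append(dict(nvm))
        for name, value in order:
            nvm[name] = value
            states.append(dict(nvm))
    return states


def recovery_is_safe(nvm):
    """Recovery trusts the data only when valid == 1."""
    if nvm["valid"] != 1:
        return True  # data is ignored, so garbage is harmless
    return nvm["data[0]"] == 0xF00D and nvm["data[1]"] == 0xBEEF


reordered = crash_states(STORES, INITIAL)
in_order = crash_states(STORES, INITIAL, orders=[STORES])
broken = [s for s in reordered if not recovery_is_safe(s)]
```

In this model `broken` is non-empty (for example, `valid` = 1 with both data words still garbage), while every image in `in_order` is recoverable.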

# **Simple Solutions**

- Disable caching
- Write-through caching
- Flush entire cache at commit

### **Generalizing PM Ordering**

- 1: STORE data[0] = 0xF00D
- 2: STORE data[1] = 0xBEEF
- 3: STORE valid = 1

### **Generalizing PM Ordering**

- 1: PERSIST data[0]
- 2: PERSIST data[1]
- 3: PERSIST valid

Program order implies unnecessary constraints: `data[0]` and `data[1]` may persist in either order, as long as both persist before `valid`.

Need interface to describe necessary constraints

Expose persist concurrency; sounds like consistency!
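The gap between program order and the necessary constraints can be counted directly. The sketch below is an illustration only; the two constraint sets and the helper `allowed_orders` are assumptions introduced here. It lists every total persist order that each set of constraints permits.

```python
from itertools import permutations

PERSISTS = ["data[0]", "data[1]", "valid"]

# Constraints are (earlier, later) pairs.
PROGRAM_ORDER = {("data[0]", "data[1]"), ("data[1]", "valid")}
NECESSARY = {("data[0]", "valid"), ("data[1]", "valid")}


def allowed_orders(persists, constraints):
    """Return every total persist order that respects the constraints."""
    result = []
    for order in permutations(persists):
        position = {p: i for i, p in enumerate(order)}
        if all(position[a] < position[b] for a, b in constraints):
            result.append(order)
    return result


strict = allowed_orders(PERSISTS, PROGRAM_ORDER)
relaxed = allowed_orders(PERSISTS, NECESSARY)
```

Program order admits a single persist order, whereas the necessary constraints admit two: the persists of `data[0]` and `data[1]` are free to proceed concurrently.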

# Memory Persistency: Memory Consistency for NVM [Pelley, ISCA14]

• Framework to reason about persist order while maximizing concurrency

- Memory consistency
  - Constrains order of loads and stores between CPUs
- Memory persistency
  - Constrains order of writes with respect to failure

# Memory Persistency = Consistency + Recovery Observer

• Abstract failure as recovery observer

Observer sees writes to NVM

- Memory persistency
  - Constrains order of writes with respect to observer



## Ordering With Respect to Recovery Observer

- STORE data[0] = 0xF00D
- STORE data[1] = 0xBEEF
- STORE valid = 1

[Figure: memory persistency view after a crash. The recovery observer sees `data` = GARBAGE and `valid` = 1.]

### **Persistency Design Space**

- Strict persistency: single memory order
- Relaxed persistency: separate volatile and (new) persistent memory orders

[Figure: happens-before relations in the volatile memory order and the persistent memory order.]



### Outline

- Ordering
  - Intel x86 ISA extensions [Intel14]
  - BPFS epoch barriers [Condit, SOSP09] (relaxed persistency)
  - Strand persistency [Pelley, ISCA14] (relaxed persistency)
- Transactions
- Write endurance

### Ordering with Existing Hardware

• Order writes by flushing cachelines via CLFLUSH

STORE data[0] = 0xF00D
STORE data[1] = 0xBEEF
CLFLUSH data[0]
CLFLUSH data[1]
STORE valid = 1

- But CLFLUSH:
  - Stalls the CPU pipeline and serializes execution
  - Invalidates the cacheline
  - Only sends data to the memory subsystem; does not commit data to NVM

### Fixing CLFLUSH: Intel x86 Extensions

- CLFLUSHOPT
- CLWB
- PCOMMIT

#### **CLFLUSHOPT**

- Provides unordered version of CLFLUSH
- Supports efficient cache flushing

STORE data[0] = 0xF00D
STORE data[1] = 0xBEEF
CLFLUSHOPT data[0]
CLFLUSHOPT data[1]
SFENCE // explicit ordering point
STORE valid = 1

#### CLWB

- Writes back modified data of a cacheline
- Does not invalidate the line from the cache
  - Marks the line as non-modified
- <u>Note</u>: Following examples use CLWB

#### PCOMMIT

• Commits data writes queued in the memory subsystem to NVM

STORE data[0] = 0xF00D
STORE data[1] = 0xBEEF
CLWB data[0]
CLWB data[1]
SFENCE // orders subsequent PCOMMIT
PCOMMIT // commits data[0], data[1]
SFENCE // orders subsequent stores
STORE valid = 1

Limitation: PCOMMITs execute serially
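The role of each instruction in the sequence above can be checked with a small Python model. This is a sketch under strong assumptions, not a model of real x86 hardware: it has three levels (dirty cache lines, a memory-subsystem queue, NVM), it executes instructions one at a time, and at a crash it lets any subset of not-yet-committed writes reach NVM. The function names and program encoding are hypothetical.

```python
from itertools import combinations


def crash_safe(cache, queue, nvm):
    """Check every NVM image that a crash at this point could expose."""
    pending = list({**queue, **cache}.items())  # writes not yet durable
    for k in range(len(pending) + 1):
        for subset in combinations(pending, k):
            image = {**nvm, **dict(subset)}
            data_ok = (image.get("data[0]") == 0xF00D
                       and image.get("data[1]") == 0xBEEF)
            if image.get("valid") == 1 and not data_ok:
                return False
    return True


def run(program):
    """Return True if the program is recoverable at every crash point."""
    cache, queue, nvm = {}, {}, {"valid": 0}
    for op, *args in program:
        if op == "STORE":
            addr, value = args
            cache[addr] = value                # dirty line in the volatile cache
        elif op == "CLWB":
            (addr,) = args
            if addr in cache:
                queue[addr] = cache.pop(addr)  # written back; line is now clean
        elif op == "PCOMMIT":
            nvm.update(queue)                  # queued writes become durable
            queue.clear()
        # SFENCE only orders instructions; this sequential model needs no action.
        if not crash_safe(cache, queue, nvm):
            return False
    return True


WITH_PCOMMIT = [
    ("STORE", "data[0]", 0xF00D), ("STORE", "data[1]", 0xBEEF),
    ("CLWB", "data[0]"), ("CLWB", "data[1]"),
    ("SFENCE",), ("PCOMMIT",), ("SFENCE",),
    ("STORE", "valid", 1),
]
WITHOUT_PCOMMIT = [i for i in WITH_PCOMMIT if i[0] not in ("PCOMMIT", "SFENCE")]
```

Under these assumptions `run(WITH_PCOMMIT)` succeeds, while `run(WITHOUT_PCOMMIT)` fails: the data has only reached the memory subsystem when `valid` is stored, so `valid` may become durable first.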





### Example: Copy on Write

STORE Y'
STORE Z'
CLWB Y'
CLWB Z'
PCOMMIT
STORE R'
CLWB R'
PCOMMIT
STORE X'
STORE Z''
CLWB X'
CLWB Z''
PCOMMIT
STORE R'
CLWB R'

### Example: Copy on Write – Timeline

[Figure: timeline of the copy-on-write example with CLWB and PCOMMIT.]



## Outline

- Ordering
  - Intel x86 ISA extensions [Intel14]
  - BPFS epoch barriers [Condit, SOSP09] (relaxed persistency)
  - Strand persistency [Pelley, ISCA14] (relaxed persistency)
- Transactions
- Write endurance

### **BPFS Epoch Barriers**

[Condit, SOSP09]

• Barriers separate execution into epochs; an epoch is a sequence of writes to NVM from the same thread

Epoch: STORE ...
EPOCH\_BARRIER
Epoch: STORE ...
EPOCH\_BARRIER
Epoch: STORE ...
EPOCH\_BARRIER
Epoch: STORE ...

• A younger write is issued to NVM only after all previous epochs commit
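The ordering rule can be written down as a set of persist constraints. The following Python sketch is illustrative only; `split_epochs` and `persist_constraints` are hypothetical helpers, and the program is a plain list of strings rather than real instructions.

```python
def split_epochs(program):
    """Split an instruction list into epochs at each EPOCH_BARRIER."""
    epochs, current = [], []
    for instr in program:
        if instr == "EPOCH_BARRIER":
            epochs.append(current)
            current = []
        else:
            current.append(instr)
    epochs.append(current)
    return epochs


def persist_constraints(program):
    """Return (earlier, later) pairs: a write follows all writes of older epochs."""
    epochs = split_epochs(program)
    pairs = set()
    for i, older in enumerate(epochs):
        for younger in epochs[i + 1:]:
            pairs.update((a, b) for a in older for b in younger)
    return pairs


PROGRAM = ["STORE Y'", "STORE Z'", "EPOCH_BARRIER", "STORE R'"]
constraints = persist_constraints(PROGRAM)
```

For this program the only constraints are that `Y'` and `Z'` persist before `R'`; the two writes inside the first epoch remain unordered with respect to each other.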



#### Example: Copy on Write

STORE Y'
STORE Z'
EPOCH\_BARRIER
STORE R'
EPOCH\_BARRIER
STORE X'
STORE Z''
EPOCH\_BARRIER
STORE R'

#### Example: Copy on Write - Failure

STORE Y'
STORE Z'
EPOCH\_BARRIER
STORE R'

WRITEBACK X'
WRITEBACK Y'



### **PCOMMIT/CLWB vs. Epoch Barriers**

[Figure: comparison of PCOMMIT/CLWB and epoch barriers.]

### BPFS Epoch Barriers: Ordering Between Threads

• Epochs also capture read-write dependencies between threads

| Thread 0             | Thread 1 | Recovery Observer |
|----------------------|----------|-------------------|
| Epoch: STORE, STORE  |          |                   |
| EPOCH_BARRIER        |          |                   |
| Epoch: STORE R       |          | PERSIST R         |
| EPOCH_BARRIER        |          |                   |
| Epoch: STORE         | STORE V  | PERSIST V         |

- Memory consistency makes the dependency visible to thread 1
- Must make the dependency visible to NVM to ensure crash consistency



### **Epoch Hardware Proposal**



- Per-processor epoch ID tags writes
- Cache line stores epoch ID when it is modified
- Cache tracks oldest in-flight epoch per CPU
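A rough Python sketch of how epoch tags could drive write-back order within one thread follows. It is an assumption-laden illustration, not the BPFS hardware design: it models a single CPU, and the two rules it applies (evicting a line first writes back all lines from older epochs; overwriting a line from an older epoch first flushes that epoch and everything older) are one reading of the cascading-writeback and overwrite cases. The class and method names are hypothetical.

```python
class EpochCache:
    """Toy single-CPU cache that tags each modified line with an epoch ID."""

    def __init__(self):
        self.epoch = 0      # current epoch ID of the CPU
        self.lines = {}     # addr -> (value, epoch ID when modified)
        self.nvm_log = []   # (addr, value, epoch) in the order lines reach NVM

    def barrier(self):
        self.epoch += 1

    def store(self, addr, value):
        if addr in self.lines and self.lines[addr][1] < self.epoch:
            # Overwrite of a line from an older epoch: flush up to that epoch.
            self._flush_through(self.lines[addr][1])
        self.lines[addr] = (value, self.epoch)

    def evict(self, addr):
        _, epoch = self.lines[addr]
        self._flush_through(epoch - 1)  # cascading write-back of older epochs
        self._write_back(addr)

    def oldest_in_flight(self):
        return min((e for _, e in self.lines.values()), default=None)

    def _flush_through(self, epoch):
        old_epochs = sorted({e for _, e in self.lines.values() if e <= epoch})
        for e in old_epochs:
            for addr in [a for a, (_, le) in self.lines.items() if le == e]:
                self._write_back(addr)

    def _write_back(self, addr):
        value, epoch = self.lines.pop(addr)
        self.nvm_log.append((addr, value, epoch))


cache = EpochCache()
cache.store("Y'", 1)
cache.store("Z'", 2)
cache.barrier()
cache.store("R'", 3)
cache.evict("R'")  # forces Y' and Z' out to NVM before R'
```

After the eviction, `cache.nvm_log` lists both epoch-0 lines ahead of `R'`, so NVM never observes a younger epoch before an older one.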


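### Epoch HW: Ordering Within a Thread (Cascading Writebacks)

[Figure: cascading writebacks within a thread.]

### Epoch HW: Ordering Within a Thread (Overwrites)

[Figure: overwriting a line within a thread; the surviving label reads "... and older epochs".]

### **Epoch HW: Ordering Between Threads**

[Figure: dependency between threads.]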

#### Summary

| Ordering primitive | Persists | Commits      |
|--------------------|----------|--------------|
| CLFLUSH            | Serial   | N/A          |
| PCOMMIT/CLWB       | Parallel | Synchronous  |
| Epochs             | Parallel | Asynchronous |

# Outline

- Ordering
  - Intel x86 ISA extensions
  - BPFS epochs barriers [Condit, SOSP09]
  - Strand persistency [Pelley, ISCA14]
- Transactions
- Write endurance

### Non-conflicting Writes

seek (fd, 1024, SEEK\_SET);
write (fd, data, 128);

STORE B
EPOCH\_BARRIER
STORE X
EPOCH\_BARRIER

seek (fd, 2048, SEEK\_SET);
write (fd, data, 128);

STORE D
EPOCH\_BARRIER
STORE Y

Epoch barriers force a total persist order even though the two writes do not conflict:

1: PERSIST B, 2: PERSIST X, 3: PERSIST D, 4: PERSIST Y



Can we expose more persist concurrency?

#### **Strand Persistency**

- Divide execution into strands
- Each strand is an independent set of persists
  - All strands initially unordered
  - Conflicting accesses establish persist order
- NewStrand instruction begins each strand
- Barriers continue to order persists within each strand as in epoch persistency
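The rules above can be sketched as a function that derives persist constraints from an instruction stream. This is an illustrative Python model, not the formal definition from the paper: the string encoding of instructions, the helper names, and the simplification that "conflicting accesses" means two stores to the same address are all assumptions made here.

```python
def parse(program):
    """Return strands; each strand is a list of epochs of (index, addr) writes."""
    strands = [[[]]]
    for idx, instr in enumerate(program):
        if instr == "NEW_STRAND":
            strands.append([[]])
        elif instr == "EPOCH_BARRIER":
            strands[-1].append([])
        else:  # "STORE <addr>"
            strands[-1][-1].append((idx, instr.split()[1]))
    return [s for s in strands if any(s)]  # drop strands without writes


def persist_order(program):
    """Return (earlier, later) persist constraints under strand persistency."""
    strands = parse(program)
    pairs = set()
    # Barriers order persists within a strand, as in epoch persistency.
    for strand in strands:
        for i, older in enumerate(strand):
            for younger in strand[i + 1:]:
                pairs.update((a, b) for a in older for b in younger)
    # Stores to the same address are ordered by program order.
    writes = sorted(w for s in strands for e in s for w in e)
    for i, first in enumerate(writes):
        for second in writes[i + 1:]:
            if first[1] == second[1]:
                pairs.add((first, second))
    return pairs


EPOCHS_ONLY = ["STORE B", "EPOCH_BARRIER", "STORE X", "EPOCH_BARRIER",
               "STORE D", "EPOCH_BARRIER", "STORE Y"]
WITH_STRANDS = ["NEW_STRAND", "STORE B", "EPOCH_BARRIER", "STORE X",
                "NEW_STRAND", "STORE D", "EPOCH_BARRIER", "STORE Y"]
```

With epochs only, all four persists are totally ordered (six pairwise constraints). With strands, only B before X and D before Y remain, so the two strands can persist concurrently.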

#### Strand Persistency: Example

[Figure: strand persistency example.]

Strands remove unnecessary ordering constraints

#### Strands Expose More Persist Concurrency

seek (fd, 1024, ...);
write (fd, data, 128);

NEW\_STRAND
STORE B
EPOCH\_BARRIER
STORE X

seek (fd, 2048, ...);
write (fd, data, 128);

NEW\_STRAND
STORE D
EPOCH\_BARRIER
STORE Y

1: PERSIST B[0], 2: PERSIST B[1], 3: PERSIST X, 4: PERSIST D[0], 5: PERSIST D[1], 6: PERSIST Y

Persists 1 to 3 and persists 4 to 6 belong to different strands, so the two groups may persist concurrently.



#### Summary

| Ordering primitive | Persists | Commits                 |
|--------------------|----------|-------------------------|
| CLFLUSH            | Serial   | N/A                     |
| PCOMMIT/CLWB       | Parallel | Synchronous             |
| Epochs             | Parallel | Asynchronous            |
| Strands            | Parallel | Asynchronous + Parallel |

# Outline

- Ordering
- Transactions
  - Restricted transactional memory [Dulloor, EuroSys14]
  - Multiversioned memory hierarchy [Zhao, MICRO13]
- Write endurance

# Software-based Atomicity is Costly

- Atomicity relies on multiple data copies (versions) for recovery
  - Write-ahead logging: write intended updates to a log
  - Copy on write: write updates to new locations

- Software cost
  - Memory copying to create multiple data versions
  - Bookkeeping information for maintaining versions
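The two costs show up even in a very small write-ahead log. The Python sketch below is a toy redo log, offered as an illustration rather than any particular system's design. It assumes each step persists in order (which real code would enforce with the ordering primitives discussed earlier), and the `nvm` dictionary layout, `atomic_update`, and `recover` are hypothetical names.

```python
class Crash(Exception):
    """Models a power failure between two persistent writes."""


def atomic_update(nvm, updates, crash_after=None):
    """Apply updates failure-atomically; optionally crash after N steps."""
    done = 0

    def checkpoint():
        nonlocal done
        if crash_after is not None and done == crash_after:
            raise Crash
        done += 1

    for addr, value in updates.items():
        checkpoint()
        nvm["log"].append((addr, value))  # first copy: the log entry
    checkpoint()
    nvm["committed"] = True               # bookkeeping: commit record
    for addr, value in updates.items():
        checkpoint()
        nvm["data"][addr] = value         # second copy: the in-place update
    checkpoint()
    nvm["committed"] = False              # bookkeeping: retire the log
    nvm["log"] = []


def recover(nvm):
    """Replay a committed log; discard an uncommitted one."""
    if nvm["committed"]:
        for addr, value in nvm["log"]:
            nvm["data"][addr] = value
    nvm["committed"] = False
    nvm["log"] = []
```

Every value is written twice (once to the log, once in place), and the commit flag and log management are pure bookkeeping. That overhead is what the hardware proposals in this part aim to remove.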

### **Restricted Transactional Memory (RTM)**

• Intel's RTM supports failure-atomic 64-byte cache line writes

XBEGIN
STORE A
STORE B
XEND

- RTM prevents A, B from leaving the cache **before** commit (for isolation)
- A, B can now be written back to NVM atomically
- Existence proof that PM can leverage hardware TM

# Multiversioning: Leveraging Caching for In-place Updates

- How does a write-back cache work?
  - A processor writes a value
  - Old values remain in lower levels
  - Until the new value gets evicted
- Insight: multiversioned system by nature
  - Allow in-place updates to directly overwrite original data
  - No need for logging or copy-on-write



# A Multiversioned Persistent Memory Hierarchy

# Preserving Write Ordering: Out-of-order Writes + In-order Commits

- Out-of-order writes to NV-LLC
  - NV-LLC remembers the committing state of each cache line
- In-order commits of transactions
  - Example: $T_A$ before $T_B$, where $T_A = \{A_1, A_2, A_3\}$ and $T_B = \{B_1, B_2\}$
  - $T_B$ will not commit until $A_2'$ arrives in NV-LLC
- Committing a transaction
  - Flush higher-level caches (very fast)
  - Change cache line states in NV-LLC

[Figure: lines leave the higher-level caches out of order ($A_3'$, $B_2'$, $A_1'$, $B_1'$, $A_2'$) on their way to the NV-LLC and then NVM.]
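The in-order commit rule can be traced with a few lines of Python. This is a simplified sketch, not the proposed hardware: it assumes a transaction may commit once all of its lines have arrived in the NV-LLC and every older transaction has committed, and `commit_order` is a hypothetical helper.

```python
def commit_order(transactions, arrivals):
    """Return (triggering line, transaction) pairs in commit order.

    transactions: dict of name -> set of lines, listed oldest first.
    arrivals: the order in which updated lines reach the NV-LLC.
    """
    arrived = set()
    pending = list(transactions)  # oldest transaction first
    events = []
    for line in arrivals:
        arrived.add(line)
        # Commit strictly in order: only the oldest pending transaction may go.
        while pending and transactions[pending[0]] <= arrived:
            events.append((line, pending.pop(0)))
    return events


TRANSACTIONS = {"T_A": {"A1", "A2", "A3"}, "T_B": {"B1", "B2"}}
ARRIVALS = ["A3", "B2", "A1", "B1", "A2"]  # out-of-order writes from the example
events = commit_order(TRANSACTIONS, ARRIVALS)
```

Although both lines of `T_B` have arrived once `B1` is written, neither transaction commits until `A2` arrives; at that point `T_A` commits first and `T_B` follows.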

### A Hardware Memory Barrier

- Why
  - Prevents early eviction of uncommitted transactions
  - Avoids violating atomicity

$$T_A = \{A_1, A_2, A_3\}$$



- How
  - Extend replacement policy with transaction-commit info to keep uncommitted transactions in NV-LLC
  - Handle NV-LLC overflows using OS supported CoW

# Outline

- Ordering
- Transactions
- Write endurance
  - Start-gap wear leveling [Qureshi, MICRO09]
  - Dynamically replicated memory [Ipek, ASPLOS10]

<u>Note</u>: Mechanisms target NVM-based main memory

#### Start-Gap Wear Leveling

- Table-based wear leveling is too costly for NVM
  - Storage overheads and indirection latency
- Instead, use algebraic mapping between logical and physical addresses
  - Periodically remap a line to its neighbor
  - Move *gap* every 100 memory writes
  - Move *start* every one gap rotation

[Figure: memory lines with START and GAP pointers; the gap line moves through memory as wear leveling proceeds.]

`NVMAddr = (Start + Addr) mod N; if (NVMAddr >= Gap) NVMAddr++`, where N is the number of memory lines excluding the gap line
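The mapping and the gap movement fit in a short Python class. This is a behavioural sketch under stated assumptions, not a hardware design: memory is a Python list with one spare gap line, the gap moves every `gap_interval` writes (100 on the slide), the address computation wraps modulo the number of logical lines, and the class and method names are invented for illustration.

```python
class StartGap:
    """Toy start-gap remapper over N logical lines and N + 1 physical lines."""

    def __init__(self, num_lines, gap_interval=100):
        self.n = num_lines
        self.start = 0
        self.gap = num_lines              # the gap begins at the spare last line
        self.gap_interval = gap_interval
        self.writes_since_move = 0
        self.mem = [None] * (num_lines + 1)

    def translate(self, addr):
        """Algebraic logical-to-physical mapping; no table lookup."""
        phys = (addr + self.start) % self.n
        if phys >= self.gap:
            phys += 1
        return phys

    def _move_gap(self):
        if self.gap == 0:
            # Gap wraps around: one full rotation is done, so advance start.
            self.mem[0] = self.mem[self.n]
            self.mem[self.n] = None
            self.gap = self.n
            self.start = (self.start + 1) % self.n
        else:
            # Copy the neighbouring line into the gap; the gap moves down by one.
            self.mem[self.gap] = self.mem[self.gap - 1]
            self.mem[self.gap - 1] = None
            self.gap -= 1

    def write(self, addr, value):
        self.mem[self.translate(addr)] = value
        self.writes_since_move += 1
        if self.writes_since_move == self.gap_interval:
            self.writes_since_move = 0
            self._move_gap()

    def read(self, addr):
        return self.mem[self.translate(addr)]
```

A quick way to exercise it is to write distinct values to every logical line, keep hammering one hot line so the gap rotates several times, and confirm that `read` still returns the right value for each address while the hot line's physical location keeps changing.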

# **Dynamically Replicated Memory**

• Reuse faulty pages with non-overlapping faults



• Record pairings in a new level of indirection
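As an illustration of the pairing idea, the Python sketch below pairs pages greedily. It is not the pairing algorithm from the cited paper: fault maps are modelled as sets of faulty byte offsets, the first compatible partner is taken, and `pair_pages`, `healthy_copy`, and `real_table` are hypothetical names.

```python
def pair_pages(faults):
    """Greedily pair pages whose faulty byte offsets do not overlap.

    faults: dict of page id -> set of faulty byte offsets.
    Pages that find no compatible partner are left out.
    """
    unpaired = list(faults)
    pairs = []
    while unpaired:
        page = unpaired.pop(0)
        partner = next(
            (p for p in unpaired if not (faults[page] & faults[p])), None
        )
        if partner is not None:
            unpaired.remove(partner)
            pairs.append((page, partner))
    return pairs


def healthy_copy(pair, faults, offset):
    """Pick the page of a pair whose byte at `offset` is not faulty."""
    for page in pair:
        if offset not in faults[page]:
            return page
    raise ValueError("offset is faulty in both pages")


FAULTS = {"P0": {3, 17}, "P1": {17, 40}, "P2": {5}, "P3": {3}}
pairs = pair_pages(FAULTS)
# The extra level of indirection: one usable page number per recorded pair.
real_table = dict(enumerate(pairs))
```

Here `P0` cannot pair with `P1` because both are faulty at offset 17, so it pairs with `P2` instead, leaving `P1` and `P3` to form the second pair. Every offset then has at least one healthy copy within its pair.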



## Summary

- Ordering support
  - Reduces unnecessary ordering constraints
  - Exposes persist concurrency
- Transaction support
  - Removes versioning software overheads
- Endurance support further increases lifetime

**Questions**?

#### **Backup Slides**

#### Randomized Start Gap

- Start gap may move spatially-close hot lines to other hot lines
- Randomize address space to spread hot regions uniformly

