



### Programmable Hardware Acceleration

Vinay Gangadhar

PhD Final Examination Thursday, Nov 16<sup>th</sup>, 2017

Advisor: Karu Sankaralingam

**Committee:** Mark Hill, Mikko Lipasti, David Wood, Dimitris Papailiopoulos



### **Computing Trends**





#### Device scaling slowdown (or dead) & Dark silicon problem





#### Emerging applications driving computing with new demands

The Big-Data Future Has Arrived



**Dissertation Talk** 



### Era of Specialization

Input Stream

W bytes per cycle

Character Clas



General Purpose Processor

PE PE Bus Sched

PE PE

0 0

#### **Traditional Multicore**



#### 11/16/2017

Performance/Area

#### **Dissertation Talk**



#### Caveats of Domain-Specific Accelerators (DSAs)





- Minimally programmable/ Not Re-configurable
- Obsoletion prone
- Domains targeting each device type
- Architecture, design, verification and fabrication cost
- Multi-DSA chip for "N" application domains →
   Area and cost inefficient
   11/16/2017 Dissertation Talk





#### The Universal Accelerator Dream...





#### matching the efficiency of Domain Specific Accelerators (DSAs) with an efficient hardware-software interface



### **Specialization Paradigms**







#### **Research Overview**





**Dissertation Talk** 



#### **Research Overview**







### **Dissertation Research Goal**



#### **Programmable Hardware Acceleration**

- 1. Explore the commonality in the way the DSAs specialize *Specialization Principles*
- 2. **General Mechanisms** for the design of a generic programmable hardware accelerator matching the efficiency of DSAs

3. A programmable/re-configurable accelerator architecture with an efficient accelerator hardware-software (ISA) interface

4. Easy adaptation of new acceleratable algorithms in a domain-agnostic way



#### **Dissertation Statement**



#### **Programmable Hardware Acceleration**

A *programmable hardware accelerator* nearing the efficiency of a domain-specific accelerator (DSA) is feasible to build by:

- Identifying the common principles of architectural specialization
- Applying general set of micro-architectural mechanisms for the identified principles
- Having an efficient hardware-software interface to be able to express any typical accelerator application



### Contributions



#### Modeling Programmable Hardware Acceleration

- Exploring the common principles of architectural specialization
- Modeling a general set of mechanisms to exploit the specialization principles – GenAccel Model
- Quantitative evaluation of GenAccel Model with four DSAs
- System-Level Tradeoffs of GenAccel Model vs. DSAs

#### **Architectural Realization with Stream-Dataflow Acceleration**

- Stream-Dataflow programmable accelerator architecture with:
  - Programming abstractions and execution model
  - ISA interface
- Detailed micro-architecture with an efficient architectural realization of stream-dataflow accelerator – Softbrain
- Quantitative evaluation of Softbrain with state-of-the-art DSA solutions





### Modeling Programmable Hardware Acceleration\*

\*Published in HPCA 2016, IEEE Micro Top Picks 2017



### Outline



- Principles of architectural specialization
  - Embodiment of principles in DSAs
- Modeling mechanisms exploiting specialization principles for a generic programmable accelerator (GenAccel Model)
- Evaluation of GenAccel with 4 DSAs (Performance, power & area)
- System-level energy efficiency tradeoffs with GenAccel and DSA









11/16/2017



# Key Insight: Commonality in DSAs' Specialization Principles





#### Most DSAs employ 5 common Specialization Principles





### Principles of Architectural Specialization



- Match hardware **concurrency** to that of algorithm
- Problem-specific **computation** units
- Explicit communication as opposed to implicit communication
- Customized structures for data reuse
- Hardware coordination using simple low-power control logic







#### **5** Specialization Principles



How do DSAs embody these principles in a domain specific way ?





### **Principles in DSAs**





### Outline



- Principles of architectural specialization
  - Embodiment of principles in DSAs
- Modeling mechanisms exploiting specialization principles for a generic programmable accelerator (GenAccel Model)
- Evaluation of GenAccel with 4 DSAs (Performance, power & area)
- System-level energy efficiency tradeoffs with GenAccel and DSA









11/16/2017

### Implementation of Principles in a General Way



Composition of simple micro-architectural mechanisms

- Concurrency: Multiple tiles (Tile hardware for coarse grain unit of work)
- Computation: Special FUs in spatial fabric
- **Communication:** Dataflow + spatial fabric
- Data Reuse: Scratchpad (SRAMs)
- **Coordination:** Low-power simple core



**Each Tile** 

## Modeling the Generic



### **Programmable Accelerator Design**





### **Instantiating GenAccel**





#### GenAccel Usage, Design point selection & Synthesis etc. More details in backup.....

**<sup>★</sup>F/g6/⁄20**\$1not to scale

**Dissertation Talk** 



### Outline



- Principles of architectural specialization
  - Embodiment of principles in DSAs
- Modeling mechanisms exploiting specialization principles for a generic programmable accelerator (GenAccel Model)
- Evaluation of GenAccel with 4 DSAs (Performance, power & area)
- System-level energy efficiency tradeoffs with GenAccel and DSA



Concurrency







11/16/2017



### Methodology



- Modeling framework for GenAccel
  - Performance: Trace driven simulator + application specific modeling
  - Power & Area: Synthesized modules, CACTI and McPAT
- Compared to four DSAs (published perf., area & power)
- Four parameterized GenAccels



One combined balanced GenAccel



- Provisioned to *match* performance of DSAs
  - Other tradeoffs possible (power, area, energy etc.)



### Performance Analysis GenAccel vs DSAs





#### Baseline – 4 wide OOO core (Intel 3770K)

**Dissertation Talk** 





#### **Domain Provisioned GenAccels**

#### GenAccel area & power compared to a single DSA ?





\*Detailed area breakdown in backup





#### **Balanced GenAccel design**

#### Area and power of GenAccel Balanced design, when multiple domains mapped\* ?

\* Still provisioned to match the performance of each DSA



### Outline



- Introduction
- Principles of architectural specialization Embodiment of principles in DSAs
- Modeling mechanisms exploiting specialization principles for a generic programmable accelerator (GenAccel Model)
- Evaluation of GenAccel with 4 DSAs (Performance, power & area)
- System-level energy efficiency tradeoffs with GenAccel and DSA 11/16/2017









#### **Conclusion – Modeling Programmable** Hardware Acceleration

• 5 common principles for architectural specialization



#### **Dissertation Talk**



### **Dissertation Research Goal**



#### **Programmable Hardware Acceleration**

- 1. Explore the commonality in the way the DSAs specialize *Specialization Principles*
- 2. **General Mechanisms** for the design of a generic programmable hardware accelerator matching the efficiency of DSAs

3. A programmable/re-configurable accelerator architecture with an efficient accelerator hardware-software (ISA) interface

4. Easy adaptation of new acceleratable algorithms in a domain-agnostic way



### Contributions



#### Modeling Programmable Hardware Acceleration

- Exploring the common principles of architectural specialization
- Modeling a general set of mechanisms to exploit the specialization principles – GenAccel Model
- Quantitative evaluation of GenAccel Model with four DSAs
- System-Level Tradeoffs of GenAccel Model vs. DSAs

#### Architectural Realization with Stream-Dataflow Acceleration

- Stream-Dataflow programmable accelerator architecture with:
  - Programming abstractions and execution model
  - ISA interface
- Detailed micro-architecture with an efficient architectural realization of stream-dataflow accelerator – Softbrain
- Quantitative evaluation of Softbrain with state-of-the-art DSA solutions





#### **Stream-Dataflow Acceleration\***

\*Published in ISCA 2017, Submitted to IEEE Micro Top-Picks 2018



#### Architectural Realization of Programmable Hardware Acceleration

- Workloads characteristics:
  - Regular streaming memory accesses with straightforward patterns
  - Computationally intensive with long execution phases
  - Ample data-level parallelism with large datapath
  - Small instruction footprints with simple control flow
- Accelerator architecture to accelerate data-streaming applications
  - Instantiates the hardware primitives from GenAccel model
    - Exploit all the five specialization principles
  - Stream-Dataflow high-performance compute substrate with *Dataflow* and *Stream* specialization components
  - Exposes a novel stream-dataflow ISA interface for programming the accelerator



### **Stream-Dataflow Acceleration**





Exploit common accelerator application behavior:

#### **Dataflow Computation**

 Stream-Dataflow Execution model – Abstracts typical accelerator computation phases

#### **Stream Patterns and Interface**

 Stream-Dataflow ISA encoding and Hardware-Software interface – Exposes parallelism available in these phases

#### **Synchronization Primitives**

 Barrier commands to facilitate data coordination and data consistency



### **Stream-Dataflow Acceleration**







# Outline





- Stream-Dataflow Execution Model
- Hardware-Software (ISA) Interface for Programmable Hardware Accelerator
- Stream-Dataflow Accelerator Architecture and Example program
- Stream-Dataflow Micro-Architecture *Softbrain*
- Evaluation and Results



Dataflov

Computation

Memory Stream











## **Stream-Dataflow Execution Model**



Architectural Abstractions for Stream-Dataflow Model



#### **Dissertation Talk**

# **Stream-Dataflow Execution Model**



W

Programmer Abstractions for Stream-Dataflow Model



 Computation abstraction – Dataflow Graph (DFG) with input/output vector ports

Data abstraction - Streams of data fetched

- Separates the data-movement from computation
- Achieves high-concurrency through the execution of coarser-grained data streams alongside dataflow computation





# Outline





- Stream-Dataflow Execution Model
- Hardware-Software (ISA) Interface for Programmable Hardware Accelerator
- Stream-Dataflow Accelerator Architecture and Example program
- Stream-Dataflow Micro-Architecture Softbrain
- Evaluation and Results



Dataflow

Computat

Aemory Stream















### Stream-Dataflow ISA Interface

Express any data-stream pattern of accelerator applications using simple, flexible and yet efficient encoding scheme



# **Stream-Dataflow ISA**

#### V E R T I C A L

#### • Set-up Interface:

**SD\_Config** – Configuration data stream for dataflow computation fabric (CGRA)

• Control Interface:

SD\_Barrier\_Scratch\_Rd, SD\_Barrier\_Scratch\_Wr, SD\_Barrier\_All

Stream Interface → SD\_[source]\_[dest]

Source/Dest Parameters: Address (memory or local\_storage), DFG Port number Pattern Parameters: access\_size, stride\_size, num\_strides





### Stream-Dataflow Programming Interface







# **Stream-Dataflow ISA Encoding**





### Dataflow:



11/16/2017



# Example Pseudo-Code: Dot Product



### **Original Program**

# for(int i = 0 to N) { c += a[i] \* b[i]; }



### **Stream ISA Encoding**

| Put    | a[0: N] | → P1 |
|--------|---------|------|
| Put    | b[0: N] | → P2 |
| Recur  | P3, N - | 1    |
| Get P3 | → c     |      |

### **Dataflow Encoding**





### New ISA Class for Programmable Hardware Acceleration



### **Stream-Dataflow ISA**

- Expresses long memory streams and access patterns efficiently
  - Address generation hardware becomes much simpler
- Decouples access and execute phases
- Reduces instruction overheads
- Dependences are explicitly encoded
- Reduces cache requests and pressure by encoding alias-free memory requests

   Implicit coalescing for concurrent memory accesses
- Separates architecture abstractions from the implementation details



### A New ISA Paradigm for Acceleration

- Need to embody common accelerator principles and execution model
- Need to represent programs without requiring complex micro-architecture techniques for performance
  - VLIW, SIMT and SIMD have their own drawbacks for accelerators
- Micro-Architecture for C-programmable ASICs
  - Enables 'hardened' ASIC compute substrate implementation
  - Separates the memory interface primitives and interaction



# Outline





- Stream-Dataflow Execution Model
- Hardware-Software (ISA) Interface for Programmable Hardware Accelerator
- Stream-Dataflow Accelerator Architecture and Example program
- Stream-Dataflow Micro-Architecture Softbrain
- Evaluation and Results



Dataflos











# Requirements for Stream-Dataflow Accelerator Architecture

1. Should employ the common specialization principles and hardware mechanisms explored in GenAccel model (\*IEEE Micro Top-Picks 2017: *Domain Specialization is Generally Unnecessary for Accelerators*)



 Programmability features without the inefficiencies of existing data-parallel architectures (with less power, area and control overheads)



R L

8

# Inefficiencies in Data-Parallel





- Vector architectures Efficient parallel memory interface
- Spatial Architectures Efficient parallel computation interface
  - Application/Domain Specific Architectures Efficient datapath for pipelined concurrent execution

| Irregular<br>execution<br>support | <ul> <li>Inefficient general<br/>pipeline</li> </ul> | <ul> <li>Warp divergence<br/>hardware support</li> </ul> | <ul> <li>Re-convergence<br/>for diverged<br/>vector threads</li> </ul> | -  |
|-----------------------------------|------------------------------------------------------|----------------------------------------------------------|------------------------------------------------------------------------|----|
| 11/16/2017                        |                                                      | Dissertation Talk                                        |                                                                        | 50 |



# Stream-Dataflow Accelerator Architecture Opportunities



- Reduce address generation & duplication overheads
- Distributed control to boost pipelined concurrent execution
- High utilization of execution resources w/o massive multithreading, reducing cache pressure or using multiported scratchpad
- Decouple access and execute phases of programs
- Simplest hardware fallback mechanism for irregular memory access support
- Able to be easily customizable/configurable for new application domain

#### **Stream Dataflow**





### **Stream-Dataflow Accelerator**



### **Dataflow:**

Acc(1)

**512b ---** 64b





B(3)

Programmable scratch page and supporting stream-e X or data y and data-reuse X

A(3)

- Memory stream-engine to racilitate data streaming in and of the accelerator
- + Recurrence st engine to support recurrent data stream
- Indirect vector port int for streaming addresses (indirect load, \_\_\_\_\_res)







11/16/2017

**Dissertation Talk** 



## Stream-Dataflow Accelerator Architecture Integration



### **Multi-Tile Stream-Dataflow Accelerator**



- Each tile is connected to higher-L2 cache interface
- Need a simple scheduler logic to schedule the offloaded streamdataflow kernels to each tile



## Programming Stream-Dataflow Accelerator



- 1. Specify Datapath for the CGRA
  - Simple Dataflow Language for DFG
- 2. Orchestrate the parallel execution of hardware components
  - Coarse-grained stream commands using the stream-interface





# **Classifier Layer (Original)**







### Dataflow Graph (DFG) for CGRA: *Classifier Kernel*





**Dissertation Talk** 



## Stream Dataflow Program: Classifier Kernel



// Configure the CGRA
SD\_CONFIG(class\_cfg, sizeof(class\_cfg));

// Stream the data from memory to ports
SD\_MEM\_PORT(synapse, 8, 8, Ni \* Nn/ 4, Port\_S);
SD\_MEM\_PORT(neuron\_i, 8, 8, Ni/4, Port\_N);

for (n = 0; n < Nn/nthreads; n++) {
 // Stream the constant values to constant ports
 SD\_CONST(Port\_acc, 0, 1);
 SD\_CONST(Port\_do\_sig, 0, Ni - 1);</pre>

// Recur the computed data back for accumulation
SD\_PORT\_PORT(Port\_out, N - 1, Port\_acc);

// Sigmoid computation and output neuron written
SD\_CONST(Port\_do\_sig, 1, 1);
SD\_PORT\_MEM(Port\_out, 2, 2, 1, &neuron\_n[n]);

SD\_BARRIER\_ALL();





# **Performance Considerations**



- Goal: Fully pipeline the largest dataflow graph
  - Increase performance [CGRA Instructions / Cycle]
  - Increase throughput [Graph computation instances per cycle]
- Primary Bottlenecks:
  - Computations per Size of Dataflow Graph
     Increase through Loop Unrolling/Vectorization
  - General Core (for Issuing Streams)
     Increase "length" of streams
  - Memory/Cache Bandwidth
     Use Scratchpad for data-reuse
  - Recurrence Serialization Overhead
     Increase Parallel Computations (tiling)



# Outline





- Stream-Dataflow Execution Model
- Hardware-Software (ISA) Interface for Programmable Hardware Accelerator
- Stream-Dataflow Accelerator Architecture and Example program
- Stream-Dataflow Micro-Architecture Softbrain
- Evaluation and Results



Dataflos

Computation

Memory Stream













### Micro-Architecture Design Principles

- 1. Low-overhead control structures
- 2. Efficient execution of concurrent stream commands with simple resource dependency tracking
  - 3. Not introduce power hungry or large CAM-like structures
    - 4. Parameterizable design



### Micro-Architecture of Stream-Dataflow Accelerator – *Softbrain*



#### **Dissertation Talk**

E R T

# Stream-Dispatcher of Softbrain





- Issues the stream commands to stream-engines
- Resource dependency tracking
  - Simple vector-port to stream-engine scoreboard mechanism
- Barriers Enforces the explicit stream-barriers for data-consistency in scratchpad as well as memory state
- Interfaces to the low-power core using a simple queue-based custom accelerator logic

#### **Dissertation Talk**



### Micro-Architecture of Stream-Dataflow Accelerator – *Softbrain*



E R T I C A L



Memory Stream-Engine (MSE)

Scratchpad Stream-Engine (SSE)

- Arbitration of multiple stream command requests
- Responsible for address generation for various data-stream access patterns
- Manages concurrent accesses to vector ports, scratchpad and the cache/memory hierarchy
- Dynamic switching of streams to account for L2 cache misses and maintain the high-bandwidth memory accesses



## Softbrain Stream-Engine Controller Request Pipeline





#### **Stream-Engine Controller**

#### Stream Request Pipeline

- Responsible for address generation for both *direct* and *indirect* data-streams
- Priority based selection among multiple queued data-steams
- Direct streams Affine Address Generation Unit (AGU) generates memory addresses
- Indirect Streams Non-affine AGU gets addresses, offsets from indirect vector ports

#### **Dissertation Talk**



### Micro-Architecture Flow of Softbrain





# Outline

- Overview
- Stream-Dataflow Execution Model
- Hardware-Software (ISA) Interface for Programmable Hardware Accelerator
- Stream-Dataflow Accelerator Architecture and Example program
- Stream-Dataflow Micro-Architecture Softbrain
- Evaluation and Results













## Stream-Dataflow Implementation: Softbrain





11/16/2017



# **Evaluation Methodology**



- Workloads
  - Deep Neural Networks (DNN) For domain provisioned comparison
  - Machsuite Accelerator Workloads For comparison with application specific accelerators
- Comparison
  - Domain Provisioned Softbrain vs. DianNao DSA
  - Broadly provisioned Softbrain vs. ASIC design points Aladdin\* generated performance, power and area
- Area and Power of Softbrain
  - Synthesized area, power estimates
  - CACTI for cache and SRAM estimates

\*Sophia, Shao et al. – Aladdin: a Pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures



## Domain-Specific Comparison (Softbrain vs DianNao DSA)







# Area-Power Estimates of Domain Provisioned Softbrain



Components Area (mm2) @ 28nm Power (mW)

Softbrain vs Diannao (DNN DSA)

- Perf. Able to match the performance
- Area 1.74x Overhead
- Power 2.28x Overhead

| 1 Softbrain Unit                | 0.47 | 119.3 |
|---------------------------------|------|-------|
| 8 Softbrain Units               | 3.76 | 954.4 |
| DianNao DSA                     | 2.16 | 418.3 |
| Softbrain / DianNao<br>Overhead | 1.74 | 2.28  |

11/16/2017

#### Broadly Provisioned Softbrain vs ASIC Performance Comparison



Aladdin\* generated ASIC design points – Resources constrained to be in ~15% of Softbrain Perf. to do iso-performance analysis

\*Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. Sophia Shao , .et. al

#### 11/16/2017

#### **Dissertation Talk**

#### Broadly Provisioned Softbrain vs ASIC Area & Power Comparison

ASIC Area Relative

#### **Softbrain vs ASIC designs**

**Energy Efficiency** 

- Perf. Able to match the performance
- Power **1.6x** overhead
- Energy **1.5x** overhead
- Area 8x overhead\*

**Power Efficiency Relative to** 

#### \*All 8 ASICs combined $\rightarrow$ 2.15x more area than Softbrain

| Surprain | ASIC | SUILDIAIII | ASIC | GM |
|----------|------|------------|------|----|
|          |      |            |      |    |



#### Conclusion – Stream-Dataflow Acceleration



- Stream-Dataflow Acceleration
  - Stream-Dataflow Execution Model Abstracts typical accelerator computation phases using a dataflow graph
  - Stream-Dataflow ISA Encoding and Hardware-Software Interface Exposes parallelism available in these phases
- Stream-Dataflow Accelerator Architecture
  - CGRA and vector ports for pipelined vector-dataflow computation
  - Highly parallel stream-engines for low-power stream communication
- Stream-Dataflow Prototype & Implementation Softbrain
  - Matches performance of domain provisioned accelerator (DianNao DSA) with ~2x overheads in area and power
  - Compared to application specific designs (ASICs), Softbrain has ~2x overheads in power and ~8x in area



#### **Dissertation Research Goal**



#### **Programmable Hardware Acceleration**

- 1. Explore the commonality in the way the DSAs specialize *Specialization Principles*
- 2. **General Mechanisms** for the design of a generic programmable hardware accelerator matching the efficiency of DSAs

3. A programmable/re-configurable accelerator architecture with an efficient accelerator hardware-software (ISA) interface

4. Easy adaptation of new acceleratable algorithms in a domain-agnostic way



#### Conclusion – *Programmable Hardware Acceleration*



d.

- New acceleration paradigm in specialization era
  - Programmable Hardware Acceleration breaking the limits of acceleration



A good enabler for exploring general purpose programmable hardware acceleration ....

- Reduce the orders of magnitude overheads of programmability and generality compared to ASICs
- Drives future accelerator research and innovation 11/16/2017 Dissertation Talk



#### **Future Work**



- Multiple DFG executions
  - Configuration cache for CGRA to switch between DFGs
- Further distribute the control into vector ports
  - Dynamic deadlock detection for buffer overflow
  - Concurrent execution of different set of streams (of different DFGs)
- Low-power dynamic credit-based CGRA schedule
  - Allow vector ports to run out-of-order reducing the overall latency
- 3D support for streams in ISA
- Partitioned scratchpad to support data dependent address generation
- Support for fine-grained configuration through FPGA slices (along with SRAM mats) next to CGRA for memory-dependent algorithm acceleration



#### **Related Work**



- Programmable specialization architectures:
  - Smart memories, Charm, Camel, Mosphosys, XLOOPS, Maven-VT
- Principles of Specialization
  - □ GPPs inefficient and need specialization Hameed. et. Al
  - Trace processing Beret
  - Transparent Specialization CCA, CRIB etc,
- Heterogeneous Cores GPP + Specialized engines
   *Composite cores, DySER, Cambricon*
- Streaming Engines:
  - □ RSVP arch, Imagine, Triggered instructions, MAD, CoRAM++



#### **Other Works**



- Open Source GPGPU MIAOW
  - Lead developer and contributor to open source hardware GPGPU MIAOW
  - AMD Southern Island based RTL implementation of GPGPU able to execute unmodified AMDAPP OpenCL kernels
  - Published in [ACM TACO 2015, HOTCHIPS' 2015, COOLCHIPS' 2015, HiPEAC' 2016]
- Von-Neumann/Dataflow Hybrid Architecture
  - A hybrid architecture aimed to exploit ILP in irregular applications
  - Lead developer of the micro-architecture of the dataflow offload engine Specialized Engine for Explicit Dataflow (SEED)
  - Published in [ISCA' 2015, IEEE MICRO Top Picks 2016]
- Open-source Hardware: Opportunities and Challenges
  - A position article on the advantages of open-source hardware for hardware innovation
  - Huge believer in open-source hardware and contribution
  - To be published in *IEEE Computer'* 17





#### **Back Up**



#### Programmable Hardware Acceleration



Idea 1: Specialization principles can be exploited in a general way

Idea 2: Composition of known *Micro-Architectural mechanisms* embodying the specialization principles

| P | ro | gra        | an         | hr | al | ole      | H | ar | dv | vare |
|---|----|------------|------------|----|----|----------|---|----|----|------|
|   |    | : <b>:</b> | : <b>:</b> |    |    |          |   |    |    | cel) |
|   |    | LC         |            | aı |    | <b>-</b> |   |    |    | LEIJ |

GenAccel as a programmable hardware design template to map **one** or **many** application domains

Deep Neural

**Domain provisioned GenAccel** 



**Balanced GenAccel** 

#### **Principles in DSAs**



#### NPU – Neural Proc. Unit



Computation

- Match hardware concurrency to that of algorithm
- Problem-specific **computation** units
- Explicit communication as opposed to implicit communication
- Customized structures for data reuse
- Hardware **coordination** using simple low-power control logic

11/16/2017

Concurrency

Dissertation Talk

Communication

Data Reuse

Coordination



#### **Accelerator Workloads**





- 1. Ample Parallelism
- 3. Large Datapath

- 2. Regular Memory
- 4. Computation Heavy

Dissertation Talk



### **GenAccel Modeling Strategy**



- Phase 1. Model Single-Core with PIN + Gem5 based trace simulation
  - The algorithm to specialize in the form of c-code/binary
  - Potential Core Types, CGRA sizes, any specialized instructions
  - Degree of memory customization (which memory accesses to be specialized, either with DMA or scratchpad)
  - Output: single-core perf./energy for "Pareto-optimal" designs
- Phase 2. Model coarse-grained parallelism
  - Use profiling information to determine parallel portion of the algorithm (or tell user to indicate or estimate)
  - Use simple Amdahl's law to get performance estimate
  - Use execution time, single-core energy estimate, and static power estimate to get overall energy estimate



#### **GenAccel in Practice**





#### **Programming GenAccel**







## **GenAccel Design Point Selection**



| Design                 | Concurrency                                  | Computation                      | Communication                    | Data Reuse                      | No. of<br>GenAccel<br>Units |
|------------------------|----------------------------------------------|----------------------------------|----------------------------------|---------------------------------|-----------------------------|
| GA <sub>N</sub>        | 24-tile CGRA<br>(8 Mul, 8 Add, 1 Sigmoid)    | 2k x 32b sigmoid<br>lookup table | 32b CGRA; 256b<br>SRAM interface | 2k x 32b<br>weight<br>buffer    | 1                           |
| GA <sub>c</sub>        | 64-tile CGRA<br>(32 Mul/Shift, 32 Add/logic) | Standard 16b<br>FUs              | 16b CGRA; 512b<br>SRAM interface | 512 x 16b<br>SRAM for<br>inputs | 1                           |
| <b>GA</b> <sub>D</sub> | 64-tile CGRA<br>(32 Mul, 32 Add, 2 Sigmoid)  | Piecewise linear sigmoid unit    | 32b CGRA; 512b<br>SRAM interface | 2k x 16b<br>SRAMs for<br>inputs | 8                           |
| <b>GA</b> <sub>Q</sub> | 32-tile CGRA<br>(16 ALU, 4 Agg, 4 Join)      | Join + Filter units              | 64b CGRA; 256b<br>SRAM interface | SRAMs for buffering             | 4                           |
| GA <sub>B</sub>        | 32-tile CGRA<br>(Combination of above)       | Combination of above FUs         | 64b CGRA; 512b<br>SRAM interface | 4KB SRAM                        | 8                           |

Mul: Multiplier, Add: Adder

**Dissertation Talk** 



#### Design-Time vs. Runtime Decisions



|               | Synthesis – Time                                                  | Run – Time                                                                         |
|---------------|-------------------------------------------------------------------|------------------------------------------------------------------------------------|
| Concurrency   | No. of GenAccel Units                                             | Power-gating unused GenAccel<br>Units                                              |
| Computation   | Spatial fabric FU mix                                             | Scheduling of spatial fabric and core                                              |
| Communication | Enabling spatial datapath<br>elements, & SRAM interface<br>widths | Configuration of spatial datapath,<br>switches and ports, memory access<br>pattern |
| Data Reuse    | Scratchpad (SRAM) size                                            | Scratchpad used as DMA/reuse<br>buffer                                             |



### Performance Analysis (1)



 $GA_N$  vs. NPU



**Dissertation Talk** 



#### **Source of Accelertion Benefits**







#### Algorithm/Concurrency

**Dissertation Talk** 

91



### **Performance Analysis (2)**







#### GenAccel Area & Power Numbers



|                   |                         | Area (mm²) | Power (mW) |
|-------------------|-------------------------|------------|------------|
| Neural<br>Approx. | GA <sub>N</sub>         | 0.37       | 149        |
|                   | NPU                     | 0.30       | 74         |
| <b>a</b> . 1      | GA <sub>c</sub>         | 0.15       | 108        |
| Stencil           | Conv. Engine            | 0.08       | 30         |
| Deep Neural.      | GA <sub>D</sub>         | 2.11       | 867        |
|                   | DianNao                 | 0.56       | 213        |
| Database          | GA <sub>Q</sub>         | 1.78       | 519        |
| Streaming         | Q100                    | 3.69       | 870        |
|                   | GA <sub>Balanaced</sub> | 2.74       | 352        |

\*Intel Ivybridge 3770K CPU 1 core Area – **12.9mm<sup>2</sup>** | Power – **4.95W** 

<sup>\*</sup>Intel Ivybridge 3770K iGPU 1 execution lane Area – 5.75mm<sup>2</sup>

<sup>+</sup>AMD Kaveri APU Tahiti based GPU 1CU Area – 5.02mm<sup>2</sup>

<sup>\*</sup>Source: http://www.anandtech.com/show/5771/the-intel-ivy-bridge-core-iZ-3770k-review/3 



### Power & Area Analysis (1)







#### Power & Area Analysis (2)









## Power & Area Analysis (3)

#### $LSSD_B \rightarrow Balanced LSSD design$





### Unsuitable Workloads for GenAccel /Stream-Dataflow



- Memory-dominated workloads
- Specifically small-memory footprint, but "irregular"
- Heavily serialized data dependent address generation
- Memory compression for example
  - A Scalable High-Bandwidth Architecture for Lossless Compression on FPGAs, Fower et. al
- Other examples:
  - IBM PowerEN Regular Expression
  - DFA based codes



#### **GenAccel vs. FPGA**



| Domain Kernel             | Number of States |
|---------------------------|------------------|
| NPU                       | 17               |
| <b>Convolution Engine</b> | 14               |
| Diannao                   | 86               |

- FPGAs are much lower frequency (global-routing and too fine-grained)
- BlockRAMs too small to gang-up
- Logical Multi-ported Register File needed to pass values between DSP slices to match high operand-level concurrency
- Altera's Stratix 10 seems headed exactly this direction





#### GenAccel's power overhead of 2x - 4x matter in a system with accelerator?

# In what scenarios you want to build DSA over GenAccel?



### **Energy Efficiency Tradeoffs**





Overall energy of the computation executed on system

$$E = Pacc * (U/S) * t + Psys * (1 - U + U/S) * t + Pcore * (1 - U) * t$$
Accel. energySystem energyCore energy

### Energy Efficiency Gains of GenAccel & DSA over OOO core



Efficiency gains of both GenAccel and DSA are almost similar & Baseline – 4, wide OOO core At higher speedups both get "capped" due to large system power 11/16/2017





# GenAccel's power overhead of 2x - 4x matter in a system with accelerator?

#### When P<sub>sys</sub> >> P<sub>ga</sub>, 2x - 4x power overheads of GenAccel become inconsequential



### Energy Efficiency Gains of DSA over GenAccel



Speedup<sub>ga</sub> = Speedup<sub>dsa</sub> (Speedup w.r.t OOO)



## AAtonigerespendents as **Realing Contracting 15% And 18% engand** feri Ancel

**Dissertation Talk** 





In what scenarios you want to build DSA over GenAccel?

Only when application speedups are small & small energy efficiency gains too important



#### When does accelerator power or DSA matter?



- GenAccel cannot match DSA for performance
- Accelerator is a "vertically-integrated" accelerator
  - Logic attached to memory or IO, that  $P_{sys}$  is affected
  - ShiDianNao for example (DNN attached to image sensor)
- Speedups are "small" and 10% energy difference is "valuable"



#### **Energy Efficiency Gains of** DianNao over GenAccel



Speedup<sub>GA</sub> = Speedup<sub>DianNao</sub> (Speedup w.r.t OOO)



**Dissertation Talk** 

### **Does Accelerator power matter?**



- At Speedups > 10x, DSA eff. is around 5%, when accelerator power == core power
- At smaller speedups, makes a bigger difference, up to 35%

**Dissertation Talk** 



Enqueued

Dispatched

R

**Resource idle** 

## Detailed Example of Stream-Dataflow Execution Model

Ο

Barrier

Dependency



### Stream-Dataflow Accelerator Potential

1. Dataflow based pipelined concurrent execution

#### 2. High Computation Activity Ratio: Number of Computations/Stream Commands

C6) Mem  $\rightarrow$  Port B

C7) All Barrier

CGRA fabric state

Low-power core state

11/16/2017

Program Order





# Example Code: Dot Product (Instruction Comparisons)



#### **Original Program**

for(int i = 0 to N) {
 dot\_prod += a[i] \* b[i]
}





| Scalar                                                                                         | Vector                                                                                                                         | <b>Stream-Dataflow</b>                                         |
|------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------|
| <pre>for(i = 0 to N) {    Send a[i] -&gt; P1    Send b[i] -&gt; P2 } Get P3 -&gt; result</pre> | <pre>for(i = 0 to N, i+=vec_len) {    Send a[i:i+vec_len] -&gt; P1    Send b[i:i+vec_len] -&gt; P2 } Get P3 -&gt; result</pre> | Send a[i:i+N] -> P1<br>Send b[i:i+N] -> P2<br>Get P3 -> result |
| ~2N Instructions                                                                               | ~2N/vec_len Instructions                                                                                                       | ~3 Instructions                                                |



### Stream-Dataflow ISA vs. TPU ISA



#### **Google TPU ISA**

- Design goal of TPU ISA
  - To be a programmable ISA with less instruction overheads
- Restricted to neural networks domain only → More of programmable ISA for NN domain
- CISC principle to run complex tasks  $\rightarrow$  To run fast multiple-add accumulations
- Uses matrix as a primitive instead of vector or scalar
  - Huge performance benefit for neural network applications
  - Reduced latency for inference [< 7ms]
  - ISA restricted heavily for certain type of computations

Read\_Host\_Memory, Read\_Weights, MatrxMultiply/Convolve, Activate, Write\_Host\_Memory]

- Heavily relies on host processor to send the instructions. Host software will be a bottleneck
- Does not decouple the memory and computation phases
   Dissertation Talk



### **TPU Compute Capability**





- 700 Mhz target frequency with 40W TDP. External accelerator and PCIe based interconnect to host – 12.5GB/s effective bandwidth
- An inference chip for MLPs, CNN and LSTM → Matrix-Matrix multiplication support
   65K operations per cycle using a 256 x 256 systolic array 2D pipeline
- Quantization helps performance to operate on 8-bit integers only



### Potential Performance Bottlenecks



- 1. Computations Per CGRA Instance
- 2. General Core Instructions
- 3. Cache  $\rightarrow$  GRA Bandwidth
- 4. Initialization/Draining Latency (Memory & CGRA)
- 5. Length of Recurrence through CGRA



### 1. Computations Per CGRA Instance



**Principle:** Few instructions control many computation instances



# *HINT:* This usually involves unrolling a loop – but not necessarily the inner loop.



# 2. General Core Instructions



- **Principle:** Few core instructions control many computation instances
  - Use as long streams as possible
  - Computation Instances > 2 \* Number of Commands





# 3. Cache → CGRA Bandwidth (1)



- Principle 1: Only 64-bytes per cycle can come from memory
  - Can feed One 8-wide port, Two 4-wide ports, Four 2-wide ports
  - Use scratch streams to supplement memory streams





# 3. Cache → CGRA Bandwidth (2)



• **Principle 2:** Not-accessed elements within a 64-byte cache line **COUNT** towards bandwidth

Stream: access\_size = 16 bytes stride\_size = 24 bytes

| Address Pattern: | 16 | 8  | 16 | 8 | 8 |  |
|------------------|----|----|----|---|---|--|
| Cache Line Size: |    |    |    |   |   |  |
|                  |    | 64 | Ļ  |   |   |  |

**HINT 1:** Don't use access patterns with "gaps" smaller than the cache line size.

HINT 2: Try to align accesses with cache line boundaries



# **Optimizing Classifier Layer**







Ni \* 2, 1, Port N);



SD\_Barrier\_All;

#### Optimization: Scratch for Memory B/W

11/16/2017

**Dissertation Talk** 



### 6. Initialization/Draining Latency (Memory & CGRA)



Principle: Hide memory latency by having "longer pipelined phases"





# 7. Length of Recurrence through CGRA



• **Principle:** Number of independent instances should be > the length of the longest recurrence.





### 7. Length of Recurrence through CGRA (2)





# **Recurrence Serialization Overhead**



#### Maximum Computation Rate = **#** Pipelinable Instances / Recurrence Length





# **Pipelining Classifier Layer**







### **2D Stencil Example**







### "Easy" Approach



```
Stencil Array
       Input Array
                                                        Output Array
                                          Σ
                         X
for (r = 0; r < row_size - 2; r++) {</pre>
  for (c = 0; c < col size - 2; c++) {</pre>
    SD_Constant(P_stencil_sb_carry, 1, 1);
    for (k1 = 0; k1 < 3; k1++) {
      SD_Mem_Port((orig + (r + k1) * col_size + c))
                sizeof(TYPE), sizeof(TYPE), 4, P_stencil_sb_I);
      SD Mem Port(filter + (k1 * 3),
                sizeof(TYPE), sizeof(TYPE), 4, P_stencil_sb_F);
    }
    SD_port_Port(P_stencil_sb_R, P_stencil_sb_carry, 2);
    SB_Port_Mem(P_stencil_sb_R, sizeof(TYPE),
               sizeof(TYPE), 1, sol + (r * col_size) + c);
SB_Barrier_All();
                                                                        24
```



# Easy Approach's Bottlenecks



- 1. Computations Per CGRA Instance (only 3 mults!)
- 2. General Core Instructions (core insts == CGRA insts)
- 3. Cache → CGRA Bandwidth (wasted b/c of acc\_size)
- 4. Initialization/Draining Latency
- 5. Length of Recurrence through CGRA

(no independent computations through CGRA)

Х



#### Input Array

Stencil Array

Output Array



Х



#### Input Array

|  | <br> |  |  |
|--|------|--|--|
|  |      |  |  |
|  |      |  |  |
|  |      |  |  |
|  |      |  |  |
|  |      |  |  |

Stencil Array

Output Array











**Output Array** 









# **Better Approach's Bottlenecks**



- 1. Computations Per CGRA Instance (up to 8 mults!)
- 2. General Core Instructions (core insts << CGRA insts)
- 3. Cache → CGRA Bandwidth (acc\_size > cache\_size)
- 4. Scratchpad  $\rightarrow$  CGRA Bandwidth
- 5. Memory  $\rightarrow$  Cache Bandwidth
- 6. Initialization/Draining Latency
- 7. Length of Recurrence through CGRA (if you stripmine the c-loop past the DFG width, you can stream multiple independent computations through the CGRA!)



# **Programming Restrictions**



- CGRA Instruction Types & Data-width
- Shape of the stream (strided)
- Width of input/output ports
- Number of simultaneous streams
- Issue to free-port (data always balanced)



## **Pipelining Classifier Layer**







### CGRA – Vector Port Interface





4 Entry Vector Port (512b or 64B wide) – Each element 8B or 64b)

- Vector ports facilitate "vector/SIMD execution and can store entire cache-line in a cycle (8 wide)
- Vector ports' offsets are connected to CGRA input links – Mapping done by hardware architects recorded as *Softbrain Hardware Parameter Model*
- Hardware parameter model is passed to scheduler/compiler for mapping software DFG ports to hardware vector ports
- Enable flexible hardware-software interface for variable width SIMD-execution

Input Vector Port Interface



Output Vector Port Interface

Example vector port to CGRA links mapping [VPORT\_Num]: [Offset]:[CGRA Link Num]

| VPORT_IN 0: | 0:2, 1:5, 2:8, 3:11, 4:17, 5:20, 6:23, 7:26  |
|-------------|----------------------------------------------|
| VPORT_IN 1: | 0:4, 1:7, 2:10, 3:16, 4:19, 5:22, 6:25, 7:31 |
| VPORT_OUT 0 | : 0:1, 1:3, 2:5, 3:6, 4:8, 5:9, 6:11, 7:12   |

### Workload Characterization for Application Specific Softbrain



| Implemented Codes | Stream Patterns                         | Datapath                       |  |
|-------------------|-----------------------------------------|--------------------------------|--|
| bfs               | Indirect Loads/Stores, Recurrence       | Compare/Increment              |  |
| gemm              | Affine, Recurrence                      | 8-Way Multiply-Accumulate      |  |
| md-knn            | Indirect Loads, Recurrence              | Large Irregular Datapath       |  |
| spmv-crs          | Indirect, Linear                        | Single Multiply-Accumulate     |  |
| spmv-ellpack      | Indirect, Linear, Recurrence            | 4-Way Multiply-Accumulate      |  |
| stencil2d         | Affine, Recurrence                      | 8-Way Multiply-Accumulate      |  |
| stencil3d         | Affine                                  | 6-1 Reduce and Multiplier Tree |  |
| viterbi           | Recurrence, Linear                      | 4-Way Add-Minimize Tree        |  |
| Unsuitable Codes  | Reason                                  |                                |  |
| aes               | Byte-level data manipulation            |                                |  |
| kmp               | Multi-level indirect pointer access     |                                |  |
| merge-sort        | Fine-grain data-dependent loads/control |                                |  |
| radix-sort        | Concurrent reads/writes to same address |                                |  |

### Softbrain vs. DianNao vs. GPU







### **ASIC Area Relative to Softbrain**













### Softbrain vs. ASIC Energy Efficiency Comparison







### Design Space Exploration for ASIC Comparison







### **DSA Architectures**







a) 8-PE NPU

b) Single processing engine (PE)

NPU



Q100



#### **Convolution Engine**



#### 11/16/2017

## **Convolutional Neural Network**







### **Rocket Core RoCC Interface**







### **Recurrent Neural Network**









#### **Specialization Spectrum**



#### More gains the lower you go

