

Tony Nowatzki<sup>+</sup>, **Vinay Gangadhar\***, Newsha Ardalani\*, Karu Sankaralingam\*

> 44<sup>th</sup> ISCA, Toronto, ON, Canada. Accelerator Session (6A-4), Tuesday June 27<sup>th</sup>, 2017

\*University of Wisconsin-Madison <sup>+</sup>University of California, Los Angeles



#### **Traditional Multicore**



Application domain specialization











**NVIDIA DGX-1 AI Accelerator**

#### **Domain Specific Acceleration**



Fixed-function Accelerators for specific domain: **Domain Specific Accelerators (DSAs)** 

#### + High Efficiency

10 – 100x Performance/Power or Performance/Area

Published DSA results excerpted on the slide: one reports three orders of magnitude less energy than a state-of-the-art software DBMS, with its performance-oriented design outperforming the same DBMS by **70X**; another reports an accelerator that is **117X** faster and reduces total energy by **21X**, with characteristics obtained after layout at 65nm.

- Not programmable or reconfigurable, and prone to obsolescence
- High architecture, design, verification and fabrication cost
- A multi-DSA chip for "N" application domains is area- and cost-inefficient

June 27, 2017

# The Universal Accelerator Dream...



#### matching the efficiency of Domain Specific Accelerators (DSAs) with an efficient hardware-software interface

[Figure: design-space plot with axes Generality and Efficiency (energy efficient computing); labeled design points include ASIC/DSA and GPGPU.]

#### **Background Work\***

\*IEEE Micro Top-Picks 2017: Domain Specialization is Generally Unnecessary for Accelerators





#### Our Work: Stream-Dataflow Acceleration



Exploit common accelerator application behavior:

#### **Dataflow Computation**

- Stream-Dataflow Execution model
  - Abstracts typical accelerator computation phases

#### **Stream Patterns and Interface**

- Stream-Dataflow ISA encoding and Hardware-Software interface
  - Exposes parallelism available in these phases





Programmable Stream-Dataflow Accelerator

















Outline

- Motivation and Overview
- Stream-Dataflow Execution Model
- Hardware-Software Interface and Example program
- Stream-Dataflow Accelerator Architecture
- Evaluation and Results
ISCA 2017 Stream-Dataflow Acceleration Talk



# **Stream-Dataflow Execution Model**

Programmer Abstractions for Stream-Dataflow Model



- **Computation abstraction**: Dataflow Graph (DFG) with input/output vector ports
- **Data abstraction**: Streams of data fetched from memory and stored back to memory
- **Reuse abstraction**: Streams of data fetched once from memory, stored in local storage (programmable scratchpad) and reused again
- **Communication abstraction**: Stream-Dataflow data movement commands and barriers
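
The computation abstraction can be pictured with a small functional model. The sketch below is only an illustration under stated assumptions: the `Node` type, the `run_dfg` helper and the example graph are hypothetical names, not part of the talk or of any real toolchain, and the graph is assumed to be listed in topological order. It fires the graph once per element arriving on the input vector ports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    op: Callable[..., float]   # operation applied once all inputs are available
    inputs: Tuple[str, ...]    # names of input ports or of earlier nodes


def run_dfg(nodes: Dict[str, Node], outputs: List[str],
            port_streams: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Fire the graph once per element arriving on the input ports."""
    n = min(len(stream) for stream in port_streams.values())
    results: Dict[str, List[float]] = {name: [] for name in outputs}
    for i in range(n):
        values = {port: stream[i] for port, stream in port_streams.items()}
        # Assumption: dict insertion order is a valid topological order.
        for name, node in nodes.items():
            values[name] = node.op(*(values[src] for src in node.inputs))
        for name in outputs:
            results[name].append(values[name])
    return results


# Hypothetical graph computing (A * B) + C on three input ports.
graph = {
    "mul": Node(lambda x, y: x * y, ("A", "B")),
    "add": Node(lambda x, y: x + y, ("mul", "C")),
}
out = run_dfg(graph, ["add"],
              {"A": [1, 2, 3], "B": [4, 5, 6], "C": [10, 20, 30]})
print(out["add"])
```

In this toy model the streams are plain Python lists; the data, reuse and communication abstractions would decide where those lists come from and when they arrive.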






### Stream-Dataflow ISA Interface

Express the data-stream patterns of accelerator applications using a simple, flexible, yet efficient encoding





- Set-up Interface:
  - **SD\_Config**: Configuration data stream for dataflow computation fabric (CGRA)
- Control Interface:
  - SD\_Barrier\_Scratch\_Rd, SD\_Barrier\_Scratch\_Wr, SD\_Barrier\_All
- Stream Interface → SD\_[source]\_[dest]

| Command Name | Parameters | Description |
|--------------|------------|-------------|
| SD_Config | Address, Size | Stream CGRA configuration from given address |
| SD_Mem_Scratch | Source Mem Address, Stride, Access Size, Num Strides, Dest. Scratch Address | Read from memory with pattern to scratchpad |
| SD_Scratch_Port | Source Scratch Address, Stride, Access Size, Num Strides, Input Port # | Read from scratchpad with pattern to input port |
| SD_Mem_Port | Source Mem Address, Stride, Access Size, Num Strides, Input Port # | Read from memory with pattern to input port |
| SD_Const_Port | Constant Value, Num Elements, Input Port # | Send constant value to input port |
| SD_Clean_Port | Num Elements, Output Port # | Throw away some elements from output port |
| SD_Port_Port | Output Port #, Num Elements, Input Port # | Issue recurrence between input-output port pairs |
| SD_Port_Scratch | Output Port #, Num Elements, Scratch Address | Write from port to scratchpad |
| SD_Port_Mem | Output Port #, Stride, Access Size, Num Strides, Dest. Mem Address | Write from port to memory with pattern |
| SD_Mem_IndPort | Source Mem Address, Stride, Access Size, Num Strides, Indirect Port # | Read the addresses from memory with pattern to indirect port |
| SD_IndPort_Port | Indirect Port #, Offset Address, Input Port # | Indirect load from addresses present in indirect port |
| SD_IndPort_Mem | Indirect Port #, Output Port #, Dest. Offset Address | Indirect store to addresses present in indirect port |
| SD_Barrier_Scratch_Rd | - | Barrier for scratchpad reads |
| SD_Barrier_Scratch_Wr | - | Barrier for scratchpad writes |
| SD_Barrier_All | - | Barrier to wait for all commands completion |



#### **Access Pattern**

- Source: Memory, Local Storage, DFG Port
- Destination: Memory, Local Storage, DFG Port

























# **Stream-Dataflow ISA Encoding**

### Stream:

Stream Encoding <address, access\_size, stride\_size, length>

### Dataflow:

*Specified in a Domain Specific Language (DSL)*
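
To make the stream encoding concrete, here is a minimal sketch of how a `<address, access_size, stride_size, length>` tuple could expand into individual addresses. It assumes byte addressing and that `length` counts the number of strides; both are illustrative assumptions, and the function name `stream_addresses` is hypothetical rather than taken from the talk.

```python
def stream_addresses(address, access_size, stride_size, length):
    """Yield the addresses touched by one stream.

    Assumed semantics: `length` accesses, each covering `access_size`
    contiguous bytes, with consecutive accesses starting `stride_size`
    bytes apart.
    """
    for k in range(length):
        base = address + k * stride_size
        for offset in range(access_size):
            yield base + offset


# stride == access size: a plain linear sweep
linear = list(stream_addresses(0, 4, 4, 2))
# stride > access size: gaps between accesses
strided = list(stream_addresses(0, 2, 4, 2))
# stride < access size: consecutive accesses overlap
overlapped = list(stream_addresses(0, 3, 1, 2))
# stride == 0: the same region is read repeatedly
repeating = list(stream_addresses(8, 2, 0, 2))
```

Under these assumptions a single four-field tuple covers linear, strided, overlapped and repeating accesses, which is the sense in which the encoding is compact.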



## **Example Code: Dot Product**

### **Original Program**

for (int i = 0; i < N; i++) {
 c += a[i] \* b[i];
}



#### **Dataflow Encoding**





#### **Stream ISA Encoding**
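
The slide's dataflow and stream encodings for the dot product were figures that did not survive extraction. As a stand-in, the following toy functional model shows how the loop could map onto stream commands. It is a sketch under loud assumptions: the class `ToyStreamDataflow`, its method names (lower-cased versions of `SD_Mem_Port` and `SD_Port_Mem`), the element-granular addressing and the accumulate-inside-the-DFG behaviour are all hypothetical, and configuration (`SD_Config`) and barriers are omitted.

```python
class ToyStreamDataflow:
    """Toy functional model for illustration; not the Softbrain design."""

    def __init__(self, memory):
        self.memory = memory                  # flat list standing in for memory
        self.in_ports = {"A": [], "B": []}    # input vector ports (FIFOs)
        self.out_ports = {"C": []}            # output vector port

    def sd_mem_port(self, addr, stride, access_size, num_strides, port):
        """Read from memory with pattern to an input port (element-granular)."""
        for k in range(num_strides):
            base = addr + k * stride
            self.in_ports[port].extend(self.memory[base:base + access_size])

    def run_dfg(self):
        """Assumed DFG: multiply pairs from A and B, accumulate, emit one sum."""
        total = 0
        while self.in_ports["A"] and self.in_ports["B"]:
            total += self.in_ports["A"].pop(0) * self.in_ports["B"].pop(0)
        self.out_ports["C"].append(total)

    def sd_port_mem(self, port, stride, access_size, num_strides, addr):
        """Write from an output port to memory with pattern."""
        for k in range(num_strides):
            for j in range(access_size):
                self.memory[addr + k * stride + j] = self.out_ports[port].pop(0)


N = 4
memory = [1, 2, 3, 4,   # a[0..3] at address 0
          5, 6, 7, 8,   # b[0..3] at address 4
          0]            # c at address 8
accel = ToyStreamDataflow(memory)
accel.sd_mem_port(0, N, N, 1, "A")   # stream a into port A
accel.sd_mem_port(4, N, N, 1, "B")   # stream b into port B
accel.run_dfg()                      # stands in for the CGRA draining the ports
accel.sd_port_mem("C", 1, 1, 1, 8)   # store the result to c
print(memory[8])
```

The point of the sketch is the shape of the program: a handful of coarse-grained commands replace the per-element loads, multiplies and adds of the original loop.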


# Requirements for Stream-Dataflow Accelerator Architecture

1. Should employ the common specialization principles and hardware mechanisms (IEEE Micro Top-Picks 2017: Domain Specialization is Generally Unnecessary for Accelerators)
2. Programmability features without the inefficiencies of existing data-parallel architectures\* (with less power, area and control overheads)

\*More detailed analysis contrasting data-parallel architectures and stream-dataflow architecture in paper

## Stream-Dataflow Accelerator Architecture

[Figure: accelerator block diagram; the legend distinguishes 512b and 64b links.]

### **Dataflow:**

- Coarse grained reconfigurable architecture (CGRA) for data parallel execution
- Direct vector port interface into and out of CGRA for vector execution

### **Stream Interface:**

- Programmable scratchpad and supporting stream-engine for data-locality and data-reuse
- Memory stream-engine to facilitate data streaming in and out of the accelerator
- Recurrence stream-engine to support recurrent data streams
- Indirect vector port interface for streaming addresses (indirect loads/stores)








# Stream-Dataflow Implementation: Softbrain







- Workloads
  - Deep Neural Networks (DNN): for domain provisioned comparison
  - MachSuite Accelerator Workloads: for comparison with application specific accelerators
- Comparison
  - Domain provisioned Softbrain vs. DianNao DSA
  - Broadly provisioned Softbrain vs. ASIC design points: *Aladdin* generated performance, power and area

# Domain-Specific Accelerator Comparison (Softbrain vs DianNao)

- DianNao Area: 2.16 mm<sup>2</sup>, DianNao Power: 420 mW
- Softbrain Area: **3.76 mm<sup>2</sup>**, Softbrain Power: **950 mW**
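
The overheads implied by these four figures can be checked with a few lines of arithmetic. The sketch below uses only the area and power numbers quoted on the slide; the dictionary and variable names are illustrative.

```python
# Area and power figures as quoted on the slide.
diannao = {"area_mm2": 2.16, "power_mw": 420}
softbrain = {"area_mm2": 3.76, "power_mw": 950}

area_overhead = softbrain["area_mm2"] / diannao["area_mm2"]
power_overhead = softbrain["power_mw"] / diannao["power_mw"]

print(f"area: {area_overhead:.2f}x, power: {power_overhead:.2f}x")
# prints "area: 1.74x, power: 2.26x"
```

Both ratios sit close to 2x, roughly 1.7x in area and 2.3x in power.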





## Softbrain vs ASIC Designs Comparison

Aladdin\* generated ASIC design points: resources constrained to be within ~15% of Softbrain performance for iso-performance analysis

\*Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. Sophia Shao et al.



## **Softbrain vs ASIC Comparison**

[Figure panels: Power Efficiency Relative to OOO4 (GM); Energy Efficiency Relative to OOO4 (GM); ASIC Area Relative to Softbrain (GM)]

### **Softbrain vs ASIC designs**

- Perf.: Able to match the performance
- Power: ~1.6x overhead
- Energy Efficiency: ~1.5x overhead
- Area: ~8x overhead\*

\*All 8 ASICs combined → 2.15x more area than Softbrain





- Stream-Dataflow Acceleration
  - Stream-Dataflow Execution Model: Abstracts typical accelerator computation phases using a dataflow graph
  - Stream-Dataflow ISA Encoding and Hardware-Software Interface: Exposes parallelism available in these phases
- Stream-Dataflow Accelerator Architecture
  - CGRA and vector ports for pipelined vector-dataflow computation
  - Highly parallel stream-engines for low-power stream communication
- Stream-Dataflow Prototype & Implementation: Softbrain
  - Matches performance of domain provisioned accelerator (DianNao DSA) with ~2x overheads in area and power
  - Compared to application specific designs (ASICs), Softbrain has ~2x overheads in power and ~8x in area



### **Getting There !!**

A good enabler for exploring general purpose programmable hardware acceleration ....



## Backup













### Stream-Dataflow Execution Model: Detailed Example

C[i] = A[i] \* B[i]

**Stream Commands** (in program order)

- C1) Mem → Scratch
- C2) Scratch Wr Barrier
- C3) Scratch → Port A
- C4) Mem → Port B
- C5) Port C → Mem
- C6) Mem → Port B
- C7) All Barrier

[Figure: execution timeline of commands C1 to C7 over time, with rows for the scratchpad, the CGRA fabric state (Processing) and the low-power core state (Command generation, Resume). Legend: Enqueued, Dispatched, Barrier, Dependency, Resource idle, Resource in use, Iter. boundary, All data at dest.]
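
The ordering constraints that the barriers impose on commands C1 to C7 can be sketched with a small dependency model. This is an illustration only, built on assumptions that the slides do not spell out: a scratchpad-write barrier is taken to hold back only later scratchpad reads until earlier scratchpad writes finish, an all-barrier is taken to order everything, and ordering between streams that share a port (C4 and C6 both target Port B) is not modelled. The `Cmd` type and `must_wait_for` function are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Set


class Cmd(NamedTuple):
    name: str
    kind: str                     # "stream", "barrier_scratch_wr" or "barrier_all"
    writes_scratch: bool = False
    reads_scratch: bool = False


def must_wait_for(program: List[Cmd]) -> Dict[str, List[str]]:
    """For each stream command, list the earlier commands it must wait for."""
    deps: Dict[str, List[str]] = {}
    fenced_scratch_writes: Set[str] = set()  # writers fenced by a scratch-write barrier
    fenced_all: Set[str] = set()             # everything before the latest all-barrier
    seen: List[Cmd] = []                     # stream commands issued so far
    for cmd in program:
        if cmd.kind == "barrier_scratch_wr":
            fenced_scratch_writes |= {c.name for c in seen if c.writes_scratch}
        elif cmd.kind == "barrier_all":
            fenced_all |= {c.name for c in seen}
        else:
            wait = set(fenced_all)
            if cmd.reads_scratch:
                wait |= fenced_scratch_writes
            deps[cmd.name] = sorted(wait)
            seen.append(cmd)
    return deps


program = [
    Cmd("C1", "stream", writes_scratch=True),   # Mem -> Scratch
    Cmd("C2", "barrier_scratch_wr"),            # Scratch Wr Barrier
    Cmd("C3", "stream", reads_scratch=True),    # Scratch -> Port A
    Cmd("C4", "stream"),                        # Mem -> Port B
    Cmd("C5", "stream"),                        # Port C -> Mem
    Cmd("C6", "stream"),                        # Mem -> Port B
    Cmd("C7", "barrier_all"),                   # All Barrier
]
deps = must_wait_for(program)
```

Under these assumptions only C3 has to wait (for C1), so C4, C5 and C6 are free to be dispatched while the scratchpad is still being filled, which is the kind of overlap the timeline figure illustrates.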



## **Stream-Dataflow Accelerator Potential**

1. Dataflow based pipelined concurrent execution
2. High Computation Activity Ratio: Number of Computations/Stream Commands





# Inefficiencies in Data-Parallel Architectures

# **Stream-Dataflow Accelerator** Architecture Opportunities

- Reduce address generation & duplication overheads
- Distributed control to boost pipelined concurrent execution
- High utilization of execution resources without massive multi-threading (reducing cache pressure) and without a multi-ported scratchpad
- Decouple access and execute phases of programs
- Easily customizable/configurable for new application domains





# Stream-Dataflow Accelerator Architecture

[Figure legend: 512b links, 64b links, Stream Command]

### **Multi-Tile Stream-Dataflow Accelerator**

- Each tile is connected to the higher-level L2 cache interface
- Needs simple scheduler logic to schedule the offloaded stream-dataflow kernels to each tile






#### Softbrain Stream Engine Request Pipeline

- Responsible for address generation for both affine and non-affine data streams
- Priority-based selection among multiple queued data streams
- Affine streams: the affine Address Generation Unit (AGU) generates memory addresses
- Non-affine streams: the non-affine AGU gets addresses and offsets from indirect vector ports
- A similar stream request pipeline is used for the scratchpad stream engines, with minimal changes
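
As a rough illustration of the affine and non-affine cases above, the following C sketch enumerates the addresses a stream engine might request. The descriptor fields (`start`, `stride`, `access_size`, `num_strides`), the function names, and the fixed request width `word` are assumptions made for this example; they are not the Softbrain hardware interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one affine (strided) stream. */
typedef struct {
    uint64_t start;        /* base address in bytes */
    uint64_t stride;       /* bytes between the starts of consecutive accesses */
    uint64_t access_size;  /* contiguous bytes fetched per access */
    uint64_t num_strides;  /* number of accesses */
} affine_stream_t;

/* Affine case: emit the address of every `word`-byte request.
 * Returns the number of addresses written (capped at max_out). */
size_t affine_agu(const affine_stream_t *s, uint64_t word,
                  uint64_t *out, size_t max_out)
{
    assert(word > 0);
    size_t n = 0;
    for (uint64_t i = 0; i < s->num_strides; i++) {
        uint64_t base = s->start + i * s->stride;
        for (uint64_t off = 0; off < s->access_size; off += word) {
            if (n == max_out)
                return n;
            out[n++] = base + off;
        }
    }
    return n;
}

/* Non-affine case: addresses are computed from an index stream,
 * standing in for values arriving on an indirect vector port. */
size_t indirect_agu(uint64_t base, const uint32_t *idx, size_t count,
                    uint64_t elem_size, uint64_t *out)
{
    for (size_t i = 0; i < count; i++)
        out[i] = base + (uint64_t)idx[i] * elem_size;
    return count;
}
```

With a stride of 0, the affine generator re-reads the same region `num_strides` times, which is one way a single command can express data reuse.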






# Programming Stream-Dataflow Accelerator

1. Specify the datapath for the CGRA
   - Simple dataflow language for the DFG
2. Orchestrate the parallel execution of hardware components
   - Coarse-grained stream commands using the stream interface





## **Example Code: Dot Product**

#### **Original Program**

for (int i = 0; i < N; i++) {
 dot\_prod += a[i] \* b[i];
}





| Scalar | Vector | <b>Stream-Dataflow</b> |
|---|---|---|
| <pre>for(i = 0 to N) {<br>   Send a[i] → P1<br>   Send b[i] → P2<br>}<br>Get P3 → result</pre> | <pre>for(i = 0 to N, i+=vec_len) {<br>   Send a[i:i+vec_len] → P1<br>   Send b[i:i+vec_len] → P2<br>}<br>Get P3 → result</pre> | <pre>Send a[0:N] → P1<br>Send b[0:N] → P2<br>Get P3 → result</pre> |
| ~2N instructions | ~2N/vec\_len instructions | ~3 instructions |
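
The instruction counts above can be checked with a small C sketch. It counts one instruction per `Send` and one for the final `Get`, which is an assumption about how the slide arrives at its approximate figures; the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar interface: two sends per element, plus one final get. */
uint64_t scalar_instrs(uint64_t n)
{
    return 2 * n + 1;
}

/* Vector interface: two sends per vec_len-element chunk, plus one get. */
uint64_t vector_instrs(uint64_t n, uint64_t vec_len)
{
    assert(vec_len > 0);
    uint64_t chunks = (n + vec_len - 1) / vec_len;  /* round up */
    return 2 * chunks + 1;
}

/* Stream interface: one command per stream, independent of n. */
uint64_t stream_instrs(void)
{
    return 3;
}
```

For N = 1024 and a vector length of 8, this gives 2049, 257, and 3 instructions respectively, matching the ~2N, ~2N/vec\_len, and ~3 figures.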



## **Existing Architectures for Data Parallel**

#### Vector Processor

(e.g., ARM NEON, x86 SSE)

- Amortized Instruction Issue
- Efficient Vector-Memory

#### **Spatial Processor**

(e.g., Tilera, TRIPS, WaveScalar)

- Efficient Dataflow between Units
- Flexible Computation Patterns




#### Vectorized memory interface + Spatial Datapath + Amortized Issue



## Dataflow Graph (DFG) for CGRA







#### **Stream Dataflow Program:**

```
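uint16_t synapse[Nn][Ni];
uint16_t neuron_i[Ni];
uint16_t neuron_n[Nn];

SD_CONFIG(dfg_config, dfg_size);                  // load the CGRA configuration
SD_DMA_READ(synapse, 8, 8, Ni*Nn/4, P_dfg_S);     // stream all synapses, 8 bytes (4 values) at a time
SD_DMA_READ(neuron_i, 0, Ni*2, Nn, P_dfg_N);      // stride 0: re-read the Ni input neurons Nn times
for (n = 0; n < Nn/nthreads; n++) {
  SD_CONST(P_dfg_acc, 0, 1);                      // seed the accumulator with 0
  SD_RECURRENCE(P_dfg_out, Ni/4-1, Port_acc);     // feed Ni/4-1 partial sums back to the accumulator
  SD_CONST(P_dfg_do_sig, 0, Ni/4-1);              // do_sig = 0 for the Ni/4-1 partial sums
  SD_CONST(P_dfg_do_sig, 1, 1);                   // do_sig = 1 for the final sum
  SD_DMA_WRITE(P_dfg_out, 2, 2, 1, &neuron_n[n]); // write one 2-byte output neuron
}
SD_WAIT_ALL();                                    // barrier: wait for all streams to finish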
```





# **Performance Considerations**

- Goal: Fully pipeline the largest dataflow graph!
- Primary bottlenecks and how to address them:
  - Size of the dataflow graph: increase through loop unrolling/strip-mining
  - General core (for issuing streams): increase the "length" of streams
  - Memory/cache bandwidth: use the scratchpad for reused data
  - Recurrence serialization overhead: either (1) increase parallel computations (tiling) or (2) use internal accumulation
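
The recurrence remedy above (more parallel computations) can be sketched in plain C as a dot product with several independent partial sums. This is a software analogy under assumed names (`dot_serial`, `dot_tiled`, `LANES`), not code for the accelerator itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* number of independent partial sums (assumed) */

/* Baseline: one accumulator, so every add waits on the previous one. */
int64_t dot_serial(const int32_t *a, const int32_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (int64_t)a[i] * b[i];
    return sum;
}

/* Tiled: LANES independent accumulators shorten the dependence chain;
 * the partial sums are combined once at the end. */
int64_t dot_tiled(const int32_t *a, const int32_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int64_t acc[LANES] = {0};
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] += (int64_t)a[i + l] * b[i + l];
    for (; i < n; i++)              /* leftover elements */
        acc[0] += (int64_t)a[i] * b[i];
    int64_t sum = 0;
    for (size_t l = 0; l < LANES; l++)
        sum += acc[l];
    return sum;
}
```

Both functions return the same result for integer inputs; the tiled version only changes the order in which the products are summed.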



#### **Optimized DFG**





# **Optimized Classifier Layer**

[Figure: optimized classifier layer. Labels: Synapses (Nn x Ni), Output Neurons (Nn).]

# **DianNao Power/Area Comparison**

| Component | Sub-component | Area (mm<sup>2</sup>) | Power (mW) |
|---|---|---|---|
| Control Core + 16kB I & D\$ | | 0.16 | 39.1 |
| CGRA | Network | 0.12 | 31.2 |
| | FUs (4×5) | 0.04 | 24.4 |
| | Total CGRA | 0.16 | 55.0 |
| 5×Stream Engines | | 0.02 | 18.3 |
| Scratchpad (4KB) | | 0.1 | 2.0 |
| Vector Ports (Input & Output) | | 0.03 | 3.0 |
| 1 Softbrain Total | | 0.47 | 119.3 |
| 8 Softbrain Units | | 3.76 | 954.4 |
| DianNao | | 2.16 | 418. |
| Softbrain / DianNao Overhead (ratio) | | 1.74 | 2.28 |

Table 3: Area and Power Breakdown / Comparison (All numbers normalized to 55nm process technology)
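
The totals and the area overhead in the table follow from simple scaling, which the C sketch below reproduces. The helper names are made up for this example, and it uses only values printed in full in the table.

```c
#include <assert.h>

/* Scale one unit's area or power to `units` identical units. */
double total_for_units(double per_unit, int units)
{
    return per_unit * (double)units;
}

/* Ratio of a design's area or power to a baseline's. */
double overhead_ratio(double design, double baseline)
{
    assert(baseline > 0.0);
    return design / baseline;
}
```

Eight units at 0.47 mm<sup>2</sup> each give 3.76 mm<sup>2</sup>, and 3.76 / 2.16 gives the 1.74 area overhead; eight units at 119.3 mW give 954.4 mW.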





















#### Softbrain vs. DianNao vs. GPU





## **ASIC Area Relative to Softbrain**





#### Softbrain vs. ASIC Power Efficiency Comparison





#### Softbrain vs. ASIC Energy Efficiency Comparison

