## **Accelerator-level Parallelism**



#### Mark D. Hill, Wisconsin & Vijay Janapa Reddi, Harvard

## @ Technion (Virtually), June 2020

Aspects of this work on Mobile SoCs and Gables were developed while the authors were "interns" with Google's Mobile Silicon Group. Thanks!

#### **Accelerator-level Parallelism Call to Action**

Future apps demand much more computing



Standard tech scaling & architecture NOT sufficient

Mobile SoCs show a promising approach:

ALP = Parallelism among workload components concurrently executing on multiple accelerators (IPs)

Call to action to develop "science" for ubiquitous ALP

#### Outline

- I. Computer History & X-level Parallelism
- II. Mobile SoCs as ALP Harbinger
- III. Gables ALP SoC Model
- IV. Call to Action for Accelerator-level Parallelism

#### 20<sup>th</sup> Century Information & Communication Technology

Has Changed Our World

<long list omitted>

Required innovations in algorithms, applications, programming languages, ..., & system software

Key (invisible) enablers of (cost-)performance gains

- Semiconductor technology ("Moore's Law")
- Computer architecture (~80x per Danowitz et al.)

#### **Enablers: Technology + Architecture**



#### How did Architecture Exploit Moore's Law?

MORE (& faster) transistors  $\rightarrow$  even faster computers

**Memory** – transistors in parallel

- Vast semiconductor memory (DRAM)
- Cache hierarchy for fast memory illusion

**Processing** – transistors in parallel

- Bit-, Instruction-, Thread-, & Data-level Parallelism

#### Now Accelerator-level Parallelism



#### 1 CPU

BLP+ILP Bit/Instrn-Level Parallelism

#### **Bit-level Parallelism (BLP)**

Early computers: few switches (transistors)

- → compute a result in many steps
- E.g., 1 multiplication partial product per cycle

**Bit-level** parallelism

- More transistors → compute more in parallel
- E.g., Wallace Tree multiplier (right)

Larger words help:  $8b \rightarrow 16b \rightarrow 32b \rightarrow 64b$ 

#### **Important: Easy for software**

#### NEW: Smaller word size, e.g. machine learning inference accelerators



## Instruction-level Parallelism (ILP)



E.g., Intel Skylake has 224-entry reorder buffer w/ 14-19-stage pipeline

#### Important: Easy for software



1 CPU: BLP+ILP (Bit/Instrn-Level Parallelism) → Multiprocessor: + TLP (Thread-Level Parallelism)

#### **Thread-level Parallelism (TLP)**

Thread-level Parallelism

- HW: Multiple sequential processor cores
- SW: Each runs asynchronous thread

SW must partition work, synchronize, & manage communication

- E.g., Pthreads, OpenMP, MPI

On-chip TLP called "multicore" – forced choice

#### Less easy for software but

- More TLP in cloud than desktop  $\rightarrow$  cloud!!
- Bifurcation: experts program TLP; others use it



CDC 6600, 1964, (TLP via multithreaded processor)



Intel Pentium Pro Extreme Edition, early 2000s



1 CPU: BLP+ILP (Bit/Instrn-Level Parallelism) → Multicore: + TLP (Thread-Level Parallelism)

#### **Data-level Parallelism (DLP)**

Need same operation on many data items

Do with parallelism → DLP

- Array of single instruction multiple data (SIMD)
- Deep pipelines like Cray vector machines
- Intel-like Streaming SIMD Extensions (SSE)



Illinois ILLIAC IV, 1966

Broad DLP success awaited General-Purpose GPUs

1. Single Instruction Multiple Thread (SIMT)
2. SW (CUDA) & libraries (math & ML)
3. Experimentation as \$1-10K not \$1-10M



NVIDIA Tesla

**Bifurcation again: experts program SIMT (TLP+DLP); others use it** 



1 CPU: BLP+ILP (Bit/Instrn-Level Parallelism) → Multicore: + TLP (Thread-Level Parallelism) → + Discrete GPU: + DLP (Data-Level Parallelism)



1 CPU: BLP+ILP (Bit/Instrn-Level Parallelism) → Multicore: + TLP (Thread-Level Parallelism) → + Integrated GPU: + DLP (Data-Level Parallelism)

Timeline axis: 1940 to 2020

#### Outline

- I. Computer History & X-level Parallelism
- **II.** Mobile SoCs as ALP Harbinger
- III. Gables ALP SoC Model
- IV. Call to Action for Accelerator-level Parallelism



1 CPU: BLP+ILP (Bit/Instrn-Level Parallelism) → Multicore: + TLP (Thread-Level Parallelism) → + Integrated GPU: + DLP (Data-Level Parallelism) → System on a Chip (SoC): + ALP (Accelerator-Level Parallelism)

## **Potential for Specialized Accelerators (IPs)**

**Accelerator** is a hardware component that executes a targeted computation class faster & usually with (much) less energy.



Chart items 16-20: 16 Encryption, 17 Hearing Aid, 18 FIR for disk read, 19 MPEG Encoder, 20 802.11 Baseband

[Brodersen & Meng, 2002]

## CPU, GPU, xPU (i.e., Accelerators or IPs)



2019 Apple A12 w/ 42 accelerators

42 Really?

The Hitchhiker's Guide to the Galaxy?



#### Mobile SoCs Run Usecases

| Usecases (rows) / Accelerators (IPs) → | CPUs (AP) | Display | Media Scaler | GPU | Image Signal Proc. | JPEG | Pixel Visual Core | Video Decoder | Video Encoder | Dozens More |
|----------------------------------------|-----------|---------|--------------|-----|--------------------|------|-------------------|---------------|---------------|-------------|
| Photo Enhancing                        | X         | X       |              | X   | X                  | X    | X                 |               |               |             |
| Video Capture                          | X         | X       |              | X   | X                  |      |                   |               | X             |             |
| Video Capture HDR                      | X         | X       |              | X   | X                  |      |                   |               | X             |             |
| Video Playback                         | X         | X       | X            | X   |                    |      |                   | X             |               |             |
| Image Recognition                      | X         | X       | X            | X   |                    |      |                   |               |               |             |

Must run each usecase sufficiently fast; no need to run faster

A usecase uses IPs concurrently: **more ALP** than serial

For each usecase, how much acceleration for each IP?

#### ALP(t) = #IPs concurrently active at time t
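The ALP(t) definition above can be sketched as a count over per-IP activity intervals. This is a minimal illustration only: the trace format, the function names `alp_at` and `average_alp`, and the timing numbers are assumptions made for the example, not measurements from a real SoC.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # half-open [start, end), arbitrary time units


def alp_at(activity: Dict[str, List[Interval]], t: float) -> int:
    """ALP(t): number of IPs with an active interval covering time t."""
    return sum(
        any(start <= t < end for start, end in intervals)
        for intervals in activity.values()
    )


def average_alp(activity: Dict[str, List[Interval]], t0: float, t1: float) -> float:
    """Time-weighted mean of ALP(t) over [t0, t1).

    Assumes the intervals listed for a single IP do not overlap each other.
    """
    busy = 0.0
    for intervals in activity.values():
        for start, end in intervals:
            # Clip each interval to the window before accumulating busy time.
            busy += max(0.0, min(end, t1) - max(start, t0))
    return busy / (t1 - t0)


# Hypothetical video-playback trace (numbers invented for illustration).
trace = {
    "CPU": [(0.0, 10.0)],
    "VideoDecoder": [(2.0, 8.0)],
    "Display": [(4.0, 10.0)],
}

print(alp_at(trace, 1.0))             # only the CPU is active here
print(alp_at(trace, 5.0))             # CPU, decoder, and display overlap
print(average_alp(trace, 0.0, 10.0))  # summed busy time / window length
```

An average well above 1 over a usecase's run indicates that the IPs really do overlap rather than hand work off serially.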



#### Outline

- I. Computer History & X-level Parallelism
- II. Mobile SoCs as ALP Harbinger
- III. Gables ALP SoC Model [HPCA'19]
- IV. Call to Action for Accelerator-level Parallelism

## Mobile SoCs Hard To Program For and Select

Envision usecases (years ahead)

Port to many SoCs??

Diversity hinders use [Facebook, HPCA'19]

How to reason about SoC performance?



#### **Mobile SoCs Hard To Design**

- Envision usecases (2-3 years ahead)
- Select IPs
- Size IPs
- Design Uncore



#### Which accelerators? How big? How to even start?

## **Computer Architecture & Performance Models**



Amdahl's Law

Multicore & Roofline

Models vs Simulation

- More insight
- Less effort

But less accuracy

Models give first answer, not final answer

**Gables** extends Roofline → first answer for SoC ALP

## **Roofline for Multicore Chips, 2009**

Multicore HW

- P<sub>peak</sub> = peak perf of all cores
- B<sub>peak</sub> = peak off-chip bandwidth



#### Multicore SW

- I = operational intensity = #operations/#off-chip-bytes
- E.g., 2 ops / 16 bytes  $\rightarrow$  I = 1/8

Output P<sub>att</sub> = upper bound on performance attainable
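The standard Roofline bound takes the lower of the compute roof and the bandwidth roof, P<sub>att</sub> = min(P<sub>peak</sub>, B<sub>peak</sub> × I). The sketch below evaluates it for the I = 1/8 example above; the peak numbers are hypothetical values chosen only to make the arithmetic visible.

```python
def roofline(p_peak: float, b_peak: float, intensity: float) -> float:
    """Upper bound on attainable performance (same units as p_peak).

    p_peak:    peak compute rate of all cores, e.g. Gops/s
    b_peak:    peak off-chip bandwidth, e.g. GB/s
    intensity: operational intensity I, operations per off-chip byte
    """
    return min(p_peak, b_peak * intensity)


# Hypothetical chip: 40 Gops/s peak compute, 10 GB/s off-chip bandwidth.
P_PEAK = 40.0
B_PEAK = 10.0

low = roofline(P_PEAK, B_PEAK, 2 / 16)  # I = 1/8: bandwidth-bound
high = roofline(P_PEAK, B_PEAK, 8.0)    # I = 8: compute-bound
ridge = P_PEAK / B_PEAK                 # intensity where the two roofs meet

print(low, high, ridge)
```

Software with intensity below the ridge point is limited by off-chip bandwidth; above it, by peak compute.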

#### **Roofline for Multicore Chips, 2009**



Compute v. Communication: Op. Intensity (I) = #operations / #off-chip bytes

## ALP System on Chip (SoC) Model: NEW Gables



2019 Apple A12 w/ 42 accelerators



Gables uses Roofline per IP to provide first answer!

- SW: performance model of a "gabled roof"?
- HW: select & size accelerators



#### Usecase at each IP[i]

- Operational intensity I<sub>i</sub> operations/byte
- Non-negative work f<sub>i</sub> (f<sub>i</sub>'s sum to 1) w/ IPs in parallel
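A simplified sketch of a Roofline-per-IP bound in the spirit of Gables follows. It assumes each IP has its own compute roof and its own link bandwidth, that all IPs run in parallel, and that they share one off-chip bandwidth. The field names `accel` and `bandwidth` and all numbers are illustrative assumptions; the HPCA'19 paper gives the exact formulation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IP:
    accel: float      # assumed: peak compute relative to p_peak (CPU = 1)
    bandwidth: float  # assumed: bandwidth between this IP and memory, bytes/s
    work: float       # f_i: fraction of the usecase's work run on this IP
    intensity: float  # I_i: operations per byte moved by this IP


def gables_bound(p_peak: float, b_peak: float, ips: List[IP]) -> float:
    """Upper bound on usecase performance with all IPs running in parallel."""
    assert abs(sum(ip.work for ip in ips) - 1.0) < 1e-9, "f_i must sum to 1"
    times = []
    total_bytes = 0.0
    for ip in ips:
        if ip.work == 0.0:
            continue  # an unused IP constrains nothing
        bytes_moved = ip.work / ip.intensity
        t_compute = ip.work / (ip.accel * p_peak)
        t_transfer = bytes_moved / ip.bandwidth
        times.append(max(t_compute, t_transfer))  # this IP's own roofline
        total_bytes += bytes_moved
    times.append(total_bytes / b_peak)  # shared off-chip bandwidth roof
    return 1.0 / max(times)             # the slowest component sets the bound


# Hypothetical two-IP SoC: a CPU plus one accelerator that is 5x faster.
soc = [
    IP(accel=1.0, bandwidth=6.0, work=0.25, intensity=8.0),
    IP(accel=5.0, bandwidth=15.0, work=0.75, intensity=8.0),
]
print(gables_bound(p_peak=40.0, b_peak=10.0, ips=soc))
```

With a single IP holding all the work, and a link at least as fast as off-chip memory, this sketch reduces to the multicore Roofline bound.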

#### **Example Balanced Design Start w/ Gables**











#### Approach: Combine Analytical and Simulation Models



# Case Study: IT Company + Synopsys

Two cases where: Gables >> Actual

1. Communication between two IP blocks
   - **Root:** Too few buffers to cover communication latency
   - Little's Law: # outstanding msgs = avg latency \* avg BW
   - https://www.sigarch.org/three-other-models-of-computer-system-performance-part-1/
   - **Solution:** Add buffers; actual performance  $\rightarrow$  Gables
2. More complex interaction among IP blocks
   - **Root:** Usecase work (task graph) not completely parallel
   - **Solution:** No change, but useful double-check
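The Little's Law step in the first case can be made concrete with a small calculation. The latency, bandwidth, and message size below are hypothetical values, and both function names are invented for this sketch; the case study does not report the actual figures.

```python
import math


def buffers_needed(latency_ns: float, bw_bytes_per_ns: float, msg_bytes: int) -> int:
    """Little's Law: outstanding messages = avg latency * avg message rate."""
    bytes_in_flight = latency_ns * bw_bytes_per_ns
    return math.ceil(bytes_in_flight / msg_bytes)


def achievable_bw(n_buffers: int, latency_ns: float, bw_bytes_per_ns: float,
                  msg_bytes: int) -> float:
    """Bandwidth (bytes/ns) sustainable when only n_buffers messages can be in flight."""
    return min(bw_bytes_per_ns, n_buffers * msg_bytes / latency_ns)


# Hypothetical link: 200 ns latency, 16 bytes/ns (16 GB/s), 64-byte messages.
need = buffers_needed(200, 16, 64)
starved = achievable_bw(20, 200, 16, 64)  # too few buffers
fixed = achievable_bw(need, 200, 16, 64)  # enough buffers to cover latency

print(need, starved, fixed)
```

With too few buffers the link is latency-bound and falls short of the bandwidth an analytical bound assumes, which fits the gap described in the first case.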

# Case Study: Allocating SRAM







Where SRAM?

- Private w/i each IP
- Shared resource

## Does more IP[i] SRAM help Op. Intensity (I<sub>i</sub>)?

Compute v. Communication: Op. Intensity (I) = #operations / #off-chip bytes



Non-linear function that increases when new footprint/working-set fits

#### Should consider these plots when sizing IP[i] SRAM

Later evaluation can use simulation performance on y-axis
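The step-like shape of such a plot can be illustrated with a toy model: off-chip traffic drops each time another working set fits in the IP's SRAM, so operational intensity jumps. The working-set sizes, the traffic savings, and the function name are all invented for illustration. Real curves come from measurement or simulation.

```python
from typing import List, Tuple

KIB = 1024


def operational_intensity(sram_bytes: int, ops: int, base_offchip_bytes: int,
                          working_sets: List[Tuple[int, int]]) -> float:
    """I = #operations / #off-chip bytes for a given private SRAM size.

    working_sets: (footprint_bytes, offchip_bytes_saved_if_it_fits) pairs.
    This toy model assumes the savings are independent and additive.
    """
    saved = sum(s for footprint, s in working_sets if footprint <= sram_bytes)
    return ops / (base_offchip_bytes - saved)


# Hypothetical usecase on one IP: 1M operations, 8 MB of traffic with no reuse.
OPS = 1_000_000
BASE = 8_000_000
SETS = [(32 * KIB, 4_000_000), (256 * KIB, 3_000_000)]

for size in (16 * KIB, 64 * KIB, 512 * KIB):
    print(size // KIB, "KiB ->", operational_intensity(size, OPS, BASE, SETS))
```

In this toy model, SRAM added between the steps buys no operational intensity, which is why the plot matters when sizing IP[i] SRAM.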

### **Gables Home Page**

[HPCA'19]

Model Extensions

Interactive tool

#### Gables Android Source at GitHub

http://research.cs.wisc.edu/multifacet/gables/





## Mobile System on Chip (SoC) & Gables



SW: Map usecase to IPs w/ many BWs & acceleration

HW: IP[i] under/over-provisioned for BW or acceleration?

Gables, like Amdahl's Law, gives intuition & a first answer

But still missing is SoC "architecture" & programming model

#### Outline

- I. Computer History & X-level Parallelism
- II. Mobile SoCs as ALP Harbinger
- III. Gables ALP SoC Model
- **IV.** Call to Action for Accelerator-level Parallelism

#### **Future Apps Demand Much More Computing**













### **Accelerator-level Parallelism Call to Action**

Future apps demand much more computing



- Standard tech scaling & architecture NOT sufficient
- Mobile SoCs show a promising approach:
- ALP = Parallelism among workload components concurrently executing on multiple accelerators (IPs)

Call to action to develop "science" for ubiquitous ALP

- An SoC architecture that exposes & hides?
- A whole SoC programming model/runtime?



Key: P == processor core; A-E == accelerators

## SW+HW Lessons from GP-GPUs?

Programming for data-level parallelism: **four decades**

SIMD→Vectors→SSE→SIMT!



Nvidia GK110 BLP+TLP+DLP

| Feature          | Then                                            |
|------------------|-------------------------------------------------|
| 1. Programming   | Graphics OpenGL                                 |
| 2. Concurrency   | Either CPU or GPU only;<br>Intra-GPU mechanisms |
| 3. Communication | Copy data between<br>host & device memories     |
| 4. Design        | Driven by graphics only;<br>GP: \$0B market     |

## **SW+HW Directions for ALP?**

Need programmability for broad success!!!!

In less than four decades?



Apple A12: BLP+ILP+TLP+DLP+ALP

| Feature                                      | Now                                             |
|----------------------------------------------|-------------------------------------------------|
| 1. Programming                               | Local: Per-IP DSL & SDK<br>Global: Ad hoc       |
| 2. Concurrency                               | Ad hoc                                          |
| 3. Communication                             | SW: Up/down OS stack<br>HW: Via off-chip memory |
| 4. Design, e.g., select, combine, & size IPs | Ad hoc                                          |

# **Opportunities**

#### 1. Programmability

Whither global model/runtime? DAG of streams for SoCs?

#### 2. Concurrency

HW assist for scheduling? Virtualize & partition?

#### 3. Communication

How should SW stack reason about local/global memory, caches, queues, & scratchpads?

#### 4. Design Space

When combine "similar" accelerators? Power vs. area?

Science

Hennessy & Patterson: A New Golden Age for Computer Architecture

# New Feb 2020!



A Primer on Memory Consistency and Cache Coherence, Second Edition

Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, David A. Wood

Synthesis Lectures on Computer Architecture

Natalie Enright Jerger & Margaret Martonosi, Series Editors