

Scheduling

# Multicore Digital Signal Processing

# Maxime Pelcat 2014 Slides from M. Pelcat, K. Desnos, J-F. Nezan, D. Ménard, M. Raulet, J Gorin





#### Introduction: Porting Algorithms to Multicore DSPs

2

#### **Typical Code Development Environment**



## Possible MPSoC Dataflow-based Development Environment



Sciences Appliquées nstitut National des

#### **Addressed Problem**

- Distributed embedded systems are harder and harder to program
- Difficult Constraints
  - High computation requirements
  - Low power consumption
  - Many hardware and software choices: lack of information/metrics
  - Real-time constraints (hard of soft real-time)
  - Need to reuse legacy code

#### Difficult Goals

Design both hardware and software Balance loads Obtain the most from a given architecture Respect constraints

#### **Addressed Problem**

#### Traditional code in C abstracts core architecture

Amount of registers Number of pipeline stages Instruction parallelism Loop optimizations Cache accesses Data representation

 C code can not be efficiently transformed into coarse grain parallel code

Assumed global state in a program Unique activity point Inspired by the Turing machine

## • The solution may come from dataflow MoCs

. . .

#### **Grail of Multi-core DSP Programming**



Applig ciences ab Lenol Aational de

#### Multi-core code porting $\rightarrow$ assignment, ordering and timing













#### • And other tasks:

Choose communication (shared memory, DMA, direct copy) Choose communication synchronization (polling or interrupts) Allocate in memory, order and time communications

## **Code Deployment**



#### **Grail of Multi-core DSP Programming**





#### **General Outline**

15

- Demo Platform
- Applications
- Models of Computation
- Architectures

Institut National des Sciences Appliquées

- Models of Architecture
- Partitioning and Scheduling Problem
- Compile-time and Runtime Tools

#### **Demo Platform**



Maxime Pelcat - mpelcat@insa-rennes.fr - June 2013

• EVM = Evaluation Module for theTMS320C6678



#### TMS320C6678

#### 8-Core DSP

8 C66x DSP Core Subsystems (C66x CorePacs), Each with:

- 1.0 GHz or 1.25 GHz C66x Fixed/Floating-Point CPU Core
   40 GMAC/Core for Fixed Point @ 1.25 GHz
   20 GFLOP/Core for Floating Point @ 1.25 GHz
   Total: 320 GMACs + 160 Gflops: hard to reach!
- 2 levels of Core Memory
  32K Byte L1P Per Core
  32K Byte L1D Per Core
  512K Byte Local L2 Per Core

#### • 4 MB of Internal Shared Memory

Multicore Shared Memory Controller (MSMC) L1D and L1P with automatic cache coherency in local Non coherent cache of the shared memory

#### Unified memory space for internal/external memory

#### **Demo Board**

• **512 MB of Shared DDR3 on the emulation board** Any core can access DDR3, 8G Byte of DDR3 Addressable Memory

#### Hardware coprocessors

For repetitive common operations Reduced because multi-purpose processor Cryptography Network

#### XDS510 JTAG

via USB Possibility to extend to XDS560 via extension

#### Packaging

40nm technology, 841-Pin Flip-Chip Plastic BGA (CYP)



#### **Inter-core Communication**

- KeyStoneTeraNet switch fabric (Network on Chip)
- Core Interrupt Controller
- Enhanced Direct Memory Access v3 (EDMA3)

Data movement Like a core with only MOV instructions

#### Multicore Navigator

8192 Multipurpose Hardware Queues with Queue Manager Data movement or zero-copy

#### Shared MSMC and DDR3

Data movement or zero-copy

#### **Inter-core Communication**

#### Multicore Navigator

Queue Manager Subsystem (QMSS) Packet DMA (PKTDMA) for Zero-Overhead Transfers Packet passing system between cores Abstracts real data transfer

#### Open Event Machine

Software runtime system provided by TI to offload code on cores Event driven processing runtime for multicore

#### Source: sprabh2a TI

#### **Inter-core Communication**

22



Sciences Appliquées Institut National des

#### Source: sprabh2a TI

#### **Inter-core Communication Throughputs**



#### **Cache Access Latencies**

#### Cache operations

The L1 has automatic cache coherence if local L2 is modified

L1 has no automatic cache coherence if non local memory is modified

- L2 has no automatic cache coherency
- $\rightarrow$ L1 and L2 cache « write back » and « invalidate » must be called

Up layer memory request: write back if modifications



#### Low layer memory request: invalidate if modifications



#### Source: sprabh2a TI

#### **Data Alignment and Performance**

25



|            |          |          | S            | ingle Read | Single Read | Burst Read | Burst Read |
|------------|----------|----------|--------------|------------|-------------|------------|------------|
|            | L1 Cache | L2 Cache | XMC Prefetch | No Victim  | Victim      | No Victim  | Victim     |
| All        | Hit      | NA       | NA           | 0          | NA          | . 0        | NA         |
| Local L2   | Miss     | NA       | NA           | 7          | 7           | 3,5        | 10         |
| MSMC (SL2) | Miss     | NA       | Hit          | 7,5        | 7,5         | 7,4        | 11         |
| MSMC (SL2) | Miss     | NA       | Miss         | 19,8       | 20,1        | 9,5        | 11,6       |
| MSMC (SL3) | Miss     | Hit      | NA           | 9          | 9           | 4,5        | 4,5        |
| MSMC (SL3) | Miss     | Miss     | Hit          | 10,6       | 15,6        | 9,7        | 129,6      |
| MSMC (SL3) | Miss     | Miss     | Miss         | 22         | 28,1        |            | 129,7      |
| DDR (SL2)  | Miss     | NA       | Hit          | 9          | 9           | 23,2       | 59,8       |
| DDR (SL2)  | Miss     | NA       | Miss         | 84         | 113,6       | 41,5       | 113        |
| DDR (SL3)  | Miss     | Hit      | NA           | 9          | 9           | 4,5        | 4,5        |
| DDR (SL3)  | Miss     | Miss     | Hit          | 12,3       | 59,8        | 30,7       | 287        |
| DDR (SL3)  | Miss     | Miss     | Miss         | 89         | 123,8       | 43,2       | 183        |

#### • Four Lanes of SRIO 2.1

1.24 to 5 GBaud Operation Supported Per Lane → up to 20 Gbauds

PCle Gen2

Single port supporting 1 or 2 lanes Supports Up To 5 GBaud Per Lane  $\rightarrow$  up to 10 Gbauds

HyperLink

Supports Connections to Other KeyStone → up to 50 Gbauds Architecture Devices Providing Resource Scalability

Gigabit Ethernet (GbE) Switch Subsystem

Two SGMII Ports Supports 10/100/1000 Mbps operation  $\rightarrow$  up to 2 Gbps

Other ports

UART Interface, I2C Interface, 16 GPIO Pins, SPI Interface

Remark

Uncompressed 1920x1080 4:2:0 video @ 60Hz = **1.5 Gbps** 

Code Composer Studio v5 (CCS) IDE

Based on Eclipse 3.7 Indigo Runs under Windows and Linux Integrable in an existing Eclipse

C66x compiler, linker, assembler, simulator...
 Delivered with CCS IDE

#### EVM Drivers

Installed with CCS Connect to the EVM JTAG (breakpoints...)

Maxime Pelcat - mpelcat@insa-rennes.fr - June 2013

#### **Using DMAs**







29

#### **SYS/BIOS**



#### • Multi-task OS :

enables sharing the DSP between several tasks

#### OS with Static and Dynamic configuration

Static : « configuration tool », .cfgfile Dynamic : specific functions to access interruptions, task creation, logs...

- Gives many informations on the system for debug
- Preemptive RTOS

Scheduler task priorities

# • Alternative to DSP-BIOS:

Enea Solutions : other RTOS for C6x

Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013

## **DSP BIOS Limits**





Maxime Pelcat - mpelcat@insa-rennes.fr - June 2013

#### **Preesm Software**

- Developed at IETR under Eclipse
- From a dataflow graph to a multicore code execution

PREESM

Automates multicore communication/synchronization



#### **Multi-core DSP Programming**



Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013



**High Performance DSP Applications** 

34

- Overview
- Standization Processes
- MPEG HEVC
  - **4G**

Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013

# **High Performance DSP Applications : Overview**

35

• Embedded system applications & High Performance computing applications

Base stations & software-defined radio Image and Audio processing Industrial control systems Aeronautics & transports Radar / Sonar Medical Scientific computing & numerical simulation High Performance Computing (HPC)

Institut National des Sciences Appliquées

. . .

# **High Performance DSP Applications : Overview**

• Embedded system applications & High Performance computing applications

|                |                   | OMAP             | C6000          | C5000           |                    |                  | Stellaris 32-Bit  |
|----------------|-------------------|------------------|----------------|-----------------|--------------------|------------------|-------------------|
|                | Digital Media     | Applications     | Digital Signal | Digital Signal  | C2000              | MSP430           | ARM Cortex-M3     |
|                | Processors        | Processors       | Processors     | Processors      | Microcontrollers   | Microcontrollers | MCUs              |
| Audio          |                   |                  |                |                 |                    |                  |                   |
| Automotive     |                   |                  |                |                 |                    |                  |                   |
| Communications |                   |                  |                |                 |                    |                  |                   |
| Industrial     |                   |                  |                |                 |                    |                  |                   |
| Medical        |                   |                  |                |                 |                    |                  |                   |
| Security       |                   |                  |                |                 |                    |                  |                   |
| Video          |                   |                  |                |                 |                    |                  |                   |
| Wireless       |                   |                  |                |                 |                    |                  |                   |
| Key Epotare    | Complete tailored | Low power and    | High           | Power-efficient | Performance,       | Ultra-low power  | Open architecture |
|                | video solution    | high performance | performance    | performance     | integration for    |                  | software, rich    |
|                |                   |                  |                |                 | greener industrial |                  | communications    |
|                |                   |                  |                |                 | applications       |                  | options           |

Source: TI

# **High Performance DSP Applications : Overview**

• Typical types of operations / tasks / actors

low-pass, band-pass, high-pass and adaptive filtering (FIR and IIR filters) cross/auto, linear/circular correlation (similarity between signals) Convolution (equivalent to multiplication in Fourier domain) transformations between domains (Fast Fourier, DCT, Hadamard, wavelet, Hilbert, Wigner-Ville...) noise removal power computation independent component analysis expected signal detection and extraction data prediction (temporal, spatial) entropy coding

complex, vector and matrix operations

forward error correction

- There are many standardization organizations
- Famous standardization organizations regarding signal processing include:

ISO (International Organization for Standardization)IEEE (Institute of Electrical and Electronics Engineers)ITU (International Telecommunication Union)3GPP (Third Generation Partnership Project )

#### MPEG HEVC Video Compression Standard

developed by the ITU-T Video Coding Experts Group (VCEG) together with the ISO/IEC JTC1 Moving Picture Experts Group (MPEG)

#### 3GPP LTE Radio Telecommunication Standard

developed by the 3GPP (3rd Generation Partnership Project) Respecting (partially) the ITU-R organization IMT-Advanced specification



**39** 

#### **DSP Application : MPEG AVC and HEVC**

Maxime Pelcat - mpelcat@insa-rennes.fr - June 2013

#### **MPEG Standards**

- Mpeg-1 : (1992)
- Mpeg-2 : (1994)
- Mpeg-4 : (since 1998)

Example : Mpeg-4 Part 2 (DivX until v5,Xvid) Extension1: Mpeg-4 part 10 = H.264 (ITU-T) = AVC (Advanced Video Coding)

Each standard : better compression (HEVC: HD@4Mb/s)



### Advanced Video Coding Decoder



Bitstream

#### **AVC Profiles**



Multicore DSP

#### **HEVC Decoder**

#### VLC Decoding



#### **Inverse Transform**

#### **Source: Hervé Yviquel**

Multicore DSP





# **DSP Application : 4G**

Maxime Pelcat - mpelcat@insa-rennes.fr - June 2013



Institut National des Sciences Appliquées

**LTE Access Network** 

46



es Appliquées



Multicore DSP

inlinuées

#### **Frequency Allocations**

#### • 3GPP LTE Frequency Allocations in France (ACERP october 2011)



#### •Spectrum flexibility

In both downlink and uplink

| Bandwidth<br>(MHz)                           | 1.4 | 3   | 5   | 10   | 15   | 20   |
|----------------------------------------------|-----|-----|-----|------|------|------|
| OFDM FFT<br>size                             | 128 | 256 | 512 | 1024 | 1536 | 2048 |
| Number of<br>available<br>PRBs<br>(downlink) | 6   | 12  | 25  | 50   | 75   | 100  |
| Number of<br>available<br>subcarriers        | 72  | 144 | 300 | 600  | 900  | 1200 |



# Size (in complex values) of one symbol

# **3GPP LTE Rel. 9 Performance**

# High Data Rates

50Mbps(UpLink), 100Mbps (DownLink)

# High User Equipment Speed

Walking to bullet-train (optimized up to 120 km/h)

# Reduced Latency

Quick response time (under 5ms)

# Cheap Roll-out

Bandwidth flexibility

# Optimized for packet-switching

Good support for VoIP and data

- Up to 100km radius cells (35km for GSM macrocells)
- Up to 100 user per cell

# • Free to consult: search 36.211 and 36.212 in Google

# LTE and OSI Layers



transparent transfers between end-users packet flow, error correction...

connection-less transfers, variable-length data sequences, packet fragmentation, logical addressing...

physical layer control, error correction, point-to-point and point-to-multipoint control, data preparation for physical layer...

> connection, contention resolution, modulation, channel adaptation, physical signal manipulation, bit flow generation...

> > LTE Specific Implementation



7. Application Layer

6. Presentation Layer

ASCII...

5. Session Layer

4. Transport Layer

3. Network Layer

2. Data Link Layer

1. Physical Layer

Ethernet, PPP...

RS-232, 802.11a/b/g/n..

IP...

TCP, UDP, SSL, TLS...

SSH...

FTP, HTTP, SMTP, Telnet...

Multicore DSP

#### Application is naturally described by a dataflow graph



This application requires high DSP computing power



Sciences Appliquées

### LTE eNodeB DSP Programming



#### **Static versus Variable Algorithms**

54



Multicore DSP

# **Conclusion from Applications Part**

#### Applications are complex!

- A designer should not need to be an expert in both application and architecture
- Legacy code reuse between systems is absolutely needed
  - When a programmer has generated a functional efficient piece of code, he does want to reuse it
- A designer should not need to tweak his code for his target architecture
- Streaming applications are naturally broken down into dataflow actors

When we analyze an application, it is natural to use dataflow graphs





56

Languages and Models of Computation (MoCs) for Application Description

- Languages and MoCs
- Dataflow MoCs and Languages

Focus on PiMM and  $\pi$ SDF Focus on CAL

Practical Work Setup

Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013

#### Languages and Models of Computation

• Among the 10 most used languages, all 10 are imperative, 7 are object-oriented. They share semantics.

57

• Other semantics exist. A MoC specifies semantics independently from a language syntax



**Semantics and Syntax** 

58

- UML implements object-oriented semantics
- C++ implements object-oriented semantics
- They share semantics but not syntax

Sciences Appliquées

nstitut National des

# Models of Computation for DSP

#### Useful for

Specification (especially for standards (cf. MPEG RVC)) Simulation (performance metrics and constraints checking) Execution (functional description)

• Families of MoCs :

#### Finite State Machine MoCs

MoC of imperative languages (C/C++/Java...) States and transitions based on conditions Computation is executed on transitions Representing the behavior of a Turing machine



#### Discrete Event MoCs

Modules react to events by producing events

Events tagged in time, i.e. the time at which events are consumed and produced is essential and is used to model system behavior Modules share a clock

51

Used to model HDL behavior

#### Functional MoCs

No state, a program returns the result of composed mathematical functions: result = f g h(inputs)

Based on lambda calculus

Haskell, Caml, Scheme, XSLT

Functional languages are examples of declarative languages

# **Models of Computation for DSP**

#### Petri Nets

Close to state machines but able to represent parallelism Operations are done on transitions



#### Synchronous MoCs

Like in Discrete Events, modules react to events by producing new events Contrary to Discrete Events, time is not explicit and only the simultaneity of events and causality are important Language examples: Signal, Esterel, Lustre...

Sciences Appliquées nstitut National des

# Process Network MoCs

concurrent and independent modules (processes) communicate ordered tokens (data quanta) through First-InFirst-Out (FIFO) channels Include dataflow process networks

Untimed models: the time is abstracted

→What we naturally used to describe MPEG HEVC and 3GPP LTE processing

• • •

Any syntax can be used to express these semantics

Kahn Process Networks

- Computation is done by processes (= Actors)
- Actors communicate only though data infinite FIFOs
   Actors do not share a state and have no internal state
- FIFO reading is blocked until a data arrives



• DPN is a special case of Kahn Process Networks defining:

Firing rules: conditions on which an actor is ready to execute Actor « invocation » (=« firing »)



Example of firing rule for Set:

-Set consumes 3 tokens coming from Do and fires an action Set1, producing 1 token

-Set consumes 1 token on Get and one token on Do and fires action Set2 producing 2 tokens

# There Exist Many Dataflow MoCs

#### • Differences

Is the number of exchanged tokens fixed/variable? Is it even specified? Does it depend of parameters? Is there an external control flow? Are there actor states

#### Devil is in the details

Dynamic Dataflow (DDF) is not a DPN because it can « peek » FIFOs (look inside The FIFO without popping the token)



### **There Exist Many Dataflow MoCs**



specification?

Predictability

- The dataflow MoC defines a coordination language
- Actors are implemented by a host language

Any host language can be used As long as the host language implements the firing rules We can combine and make communicate IPs coded in VHDL High-level software actors written in Java Low-level software actors written in C Which MoC should we use to program Multicore DSPs?



#### Small grain, Locally to a core

<u>Imperative languages</u> such as C/C++ have shown their capacity to use cores efficiently, including with low-level parallelism (SIMD, VLIW...) Finite state machine MoC

#### Coarse grain, to combine the cores

<u>Dataflow MoCs</u> are good candidates because they decouple actors All possible communications between actors are specified, so they can be used to organize data flows between cores

# Which Dataflow Model for a Given Application ?

# 3GPP LTE For example: Preamble Detection Uplink Decoding

Downlink Encoding





**Synchronous Dataflow** 

 Actors and Data Ports FIFO Queues and Delays **x1** Actor\_B Actor\_A 3 Actor\_C 1

Lee and D. Messerschmitt, "Synchronous data flow", Proceedings of the IEEE, 1987.

Multicore DSP







#### **Synchronous Dataflow and Actor State**

71



Lee and D. Messerschmitt, "Synchronous data flow", Proceedings of the IEEE, 1987.

Multicore DSP





- By resolving the topology equation, we can check consistency
- By verifying that enough initial tokens have been set, we can check schedulability

Ability to come back to the initial FIFO states



É PS

73



74



# **PiMM** and $\pi$ **SDF**

- Meta-model developped at IETR
- in collaboration with UMD and TI
- Targeted Dataflow Model of Computation

#### becomes:

Hierarchical & Compositional Statically parameterizable Dynamically reconfigurable

### PiMM fosters:

Predictability Memory boundedness Parallelism Lightweight runtime overhead Developer-friendliness





76





# Hierarchical Interfaces



data-flow graphs" in SiPS Proceedings, 2009.

The more dynamism, the less predictability



# **Configuring Parameters**

79

- Dynamic Parameters
- Configuration Actors



Institut National des Sciences Appliquées



80

- PiMM (Parameterized and Interfaced dataflow Meta-Model)
  - Evolution of IBSDF and PSDF



Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013

# **CAL** and **DDF**

- CAL = CAL Actor Language, J. Ecker and J. Janneck
- It implements

the Dynamic Dataflow (DDF) MoC and a host code named CAL specifying firing rules

DDF graph gives no information on token flow

Firing rules are specified in the CAL language consumed tokens, guards, finite state machine, priorities





πSDF



# **CAL and DDF**



 Via classification, a CAL actor can be transformed into SDF, CSDF, and soon πSDF

In order to make the model more predictable

 The Orcc compiler, developed at IETR, is a compiler for CAL

Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013

# **CAL** and **DDF**

- From CAL, both hardware and software can be generated
- CAL → Preesm is already working for static actors
- CAL → PiSDF is a future theme of research



DDF





# **Example: Describing 3GPP LTE Base Station Physical Layer**

85

Maxime Pelcat - mpelcat@insa-rennes.fr - June 2013







 Applications are naturally broken down into dataflow actors

When we analyze an application, it is natural to use dataflow graphs



Many models exist with different semantics

- Dataflow models express parallelism in algorithms
- Syntax != Semantics

Semantics matter more than syntax

Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013

### **Multi-core DSP Programming**

90



Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013





### **High Performance DSP Architectures**

- Types of embedded processors
- DSP vs. GPP
- Multicore DSP architectures

Maxime Pelcat - mpelcat@insa-rennes.fr - June 2013

# **Von Neumann Architecture**

#### DSP : digital signal processors != GPP : General purpose processors → processors optimized and adapted to signal processing



### **Processor Generalities**

### • Processor types :

Microcontroller Digital Signal Processors (DSP) General purpose processors (GPP) GP-GPU (General-Purpose Processing on Graphics Processing Units)

### Choice factors:

Price (architecture complexity, production technology) Power consumption CPU Performances I/O performances Memory capacity (data/code)



Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013



Appliquées Sciences Institut National des

# **TI Processors and Applications**

|                |                                  | OMAP                              | C6000               | C5000                       |                                                                       |                  | Stellaris 32-Bit                                                 |
|----------------|----------------------------------|-----------------------------------|---------------------|-----------------------------|-----------------------------------------------------------------------|------------------|------------------------------------------------------------------|
|                | Digital Media                    | Applications                      | Digital Signal      | Digital Signal              | C2000                                                                 | MSP430           | ARM Cortex-M3                                                    |
|                | Processors                       | Processors                        | Processors          | Processors                  | Microcontrollers                                                      | Microcontrollers | MCUs                                                             |
| Audio          |                                  |                                   |                     |                             |                                                                       |                  |                                                                  |
| Automotive     |                                  |                                   |                     |                             |                                                                       |                  |                                                                  |
| Communications |                                  |                                   |                     |                             |                                                                       |                  |                                                                  |
| Industrial     |                                  |                                   |                     |                             |                                                                       |                  |                                                                  |
| Medical        |                                  |                                   |                     |                             |                                                                       |                  |                                                                  |
| Security       |                                  |                                   |                     |                             |                                                                       |                  |                                                                  |
| Video          |                                  |                                   |                     |                             |                                                                       |                  |                                                                  |
| Wireless       |                                  |                                   |                     |                             |                                                                       |                  |                                                                  |
| Key Feature    | Complete tailored video solution | Low power and<br>high performance | High<br>performance | Power-efficient performance | Performance,<br>integration for<br>greener industrial<br>applications | Ultra-low power  | Open architecture<br>software, rich<br>communications<br>options |

Source: TI

# Low cost System-on chip

#### Locosto (TCS2305)

#### Very cheap 65-nm technology

ARM7 CPU + c54x Minimal functionalities





AA

Y

8

# **Video Processing DSP**

#### Da Vinci TMS320DM644x

- 1 DSP tms320c64x+ (720MHz max) Video I/O for screen

- 1 ARM9
- 1 Video Accelerator



# **Advanced System-on chip**

#### OMAP™ 4 Platform: OMAP4430/OMAP4440

High power of computation Still low power consumption Interfaces to modern peripherals 2 General purpose ARM Cortex A9 cores (1GHz) IVA Supporting 1080p video encoding/decoding 3D graphics accelerator No DSP!!!



# **DSPs vs. GPPs**

- Optimization of the compilation and instruction set for signal processing
- Reduced Power consumption in DSPs
- Memory Management Unit (MMU) in the general purpose processor :
  - In DSPs, any memory is accessed by addresses: registers, stack, heap, OS memory...
  - Advanced OS (like Linux) need pagination: a virtual memory space in pages
  - The MMU converts the virtual addresses into the physical addresses of the hardware
  - The MMU can protect memory spaces from unwanted accesses DSPs use Real-Time OS: the c66x has SYS-BIOS
- Kalray VLIW core is an exception: a DSP with MMU

# **Power Dissipation**

• Less than 2W: no specific dissipation problem

 2 to 7W: heat sink and hot spot management

 7W+: heat sink, fan and complex hot spot management







# Why Go Multicore?



 Frequency increase → power consumption → heat heat → need for cooling, more faults, reduced longevity

Dynamic Power = Cte x capacity x voltage<sup>2</sup> x frequency But augmenting frequency augments leakage so voltage must be higher

Frequency x 1.5  $\rightarrow$  Power x2 on Freescale MPC8641 <sup>1</sup> Cores x2  $\rightarrow$  Power x1.3 on Freescale MPC8641 <sup>1</sup>

Moreover, augmenting frequency may require longer pipeline

### • To increase MIPS and limit Watts

Increase number of core instead than frequency

Source: Freescale « Embedded Multicore: An Introduction»

Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013

102



# Why Go Heterogeneous?



### 3 sources of heterogeneity

Non uniform cores implementing different instruction sets Combining software cores with hardware IPs Non uniform communication performance

### Heterogeneity improves performance

Repetitive and costly actors can be efficiently computed by hardware logic (ASIC) Actors with some control and reconfiguration needs are suited for DSPs Control tasks with many conditions are suited to be run on GPP

Source: Freescale « Embedded Multicore: An Introduction»

Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013

# TMS320TCI6488 (2007)

#### Towards multicore DSPs

#### TCI6488: Tri-core Telecommunication oriented DSP

- $\rightarrow$  Each core is programmed independantly
- $\rightarrow$  RSA: specific instruction set for CDMA operation (3G oriented)
- $\rightarrow$  I/O driven by EDMA





 $\rightarrow$  Up to 1GHz/24Gips

 $\rightarrow$  3MB L2 memory



104

# TMS320TCI6488 (2007)

10



### TMS320C6678 – Keystone I (2011)



# New c66x core: floating point arithmetic

107

#### **Representing real numbers**

Floating-point arithmetics : Sign - exponent - mantissa



# 108

# TMS320C6678 – Keystone I (2011)

TCI6678: Octo-core DSP  $\rightarrow$  Up to 1.25GHz/320 GMACs/160 GFlops

- $\rightarrow$  Each core is programmed independently  $\rightarrow$  8MB L2 memory
- $\rightarrow$  I/O driven by Multi-core Navigator to automate transfers between cores
- ightarrow Fixed and floating-point ALUs

Core c66x = Arithmetic and logic units (L & S), Data management units (D), Multiplication units (M)



Sciences Appliquées

#### TI TMS320TCI6636 - Keystone II (2013)

109



Multicor**Sparce: Tl** 



## TI TMS320TCI6636 – Keystone II (2013)

#### • 8 C66x @ 1.2GHz

38.4 GMacs/Core for Fixed Point19.2 GFlops/Core for Floating Point

#### • Memory

32KB L1P Per Core
32KB L1D Per Core
1MB Local L2 Per Core
6 MB MSM SRAM Memory Shared by 8 DSPs

## ARM Cortex A15 Quad Core Cluster @ 1.2GHz

32KB L1I Per Core
32KB L1D Per Core
4MB L2 Cache Memory Shared by Quad Core
AMBA 4.0 AXI Coherency Extension Master Port connected to MSMC

## Multicore Navigator

16k Multi-purpose Hardware Queues with hardware queue manager Packet-Based DMA for Zero-Overhead Transfers

#### Source: TI

## Freescale MSC8144 Starcore DSP



#### **Source: Freescale**

Multicore DSP





Multicore DSP



#### Levels of parallelism in Multicore DSP Architectures

113

#### **Architecture Levels of Parallelism**



nstitut National des Sciences Appliquées

#### Multicore DSP





116

#### Splitting ALU into subparts

Example: 2 16-bit additions on 1 32-bit adder



# **Very Long Instruction Word**



#### VLIW characteristic

Architecture made-up of parallel FU Several instructions par cycle embedded in a macro-instruction Homogeneous architecture : more orthogonal Close to RISC processor Uniform register file made-up of several registers Load-Store architecture

#### **VLIW Example : C64x**





#### WLIV Compiler: Loop unrolling



# **WLIV Compiler: Software pipelining**

 Improving the throughput at the cost of latency and memory



Processor usage rate

#### **Convolution C code for VLIW and SIMD**







#### **Core-Level Parallelism**

122

#### **Inter-Processor Communication and Architecture Models**





## **Types of Inter-Processor Communications**

- Transmitting one token via inter-processor communication or shared memory
- Direct Signaling

Interrupting a core from another core to either push or pull data

Indirect Signaling

Delegating the transmission to a DMA (Direct Memory Access) element

Atomic arbitration

Via hardware semaphores in case of shared memory



# **Direct Signaling Communication**



- Distributed or shared memory
- In demand-driven case, first interrupt can be avoided

if core A is constantly demanding data and core B cannot erase data befor e it is consumed (or if data can be discarded)

• Inter-Processor communication can be ethernet, SRIO...





Distributed or shared memory

DMA must be able to access the whole memory space DMA is another master on the memory bus Cores A and B are free to compute during DMA transfer



#### **Protecting shared memory accesses by semaphores**



#### 1-place FIFO on Shared memory

Possibility of N-places FIFO with a round buffer and read and write indexes

2-place FIFO = « ping-pong » buffer

#### More than a mutex

#### The 2 semaphores ensure alternate accesses from cores A and B Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013

**Communication Speed** 



L2 access: 5.3 GB/s DDR Access: 2.6 GB/s MSMC Access: 8 GB/s

For comparison: Raw HD video 1080p60 4:2:0 = 0,19 GB/s

 Inter-processor communication libs mask the complexity



#### **Modeling Architectures**

### **Models of Architecture**



#### SystemC

C++ templates and libraries used to simulate hardware modules

#### • AADL

Separating hardware and software but specifying both and oriented towards threads and processes

• IP-XACT

More a syntax for serialization than a real model

#### Custom models

In MAPS compiler, in SynDEx rapid prototyping tool, in PREESM...

### **Models of Architecture**



- Often oriented towards hardware design debug
- Often custom and with no precise semantics
- Often no real separation between application and hardware

## **System-Level Architecture Model**



#### **Processing Element**





#### **S-LAM Examples**





#### ulticore DSP

#### S-LAM of a Board with 2 TCI6488





DSP 2

Multicore DSP



## **Conclusion from Architecture Part**

Architectures become more an more complex to program

Parallelism rises at instruction and task level

Heterogeneity rises

Performance necessitates a correct use of DMA, cache, IPC

#### **Multi-core DSP Programming**







#### Automatic porting of Multicore DSP Applications

- Parallelism Laws
- Multicore Scheduling
- Multicore Tools

Overview IETR Tools





#### **Parallelism Laws**

Institut National des Sciences Appliquées

#### Amdahl's Law



- Developped in 1967 by Gene Amdahl
- It gives a generic performance metric for applications
   Simplifying problem assumption: x% of the code is sequential, the rest is
   perfectly parallel
   With 5% of sequential code, speedup is limited to 7.5 on 12 cores
   Speedup refers to the acceleration brought by adding cores





- For different parallel section percentages
  - Simplifying problem assumption: x% of the code is sequential, the rest is perfectly parallel

With 50% of sequential code, speedup is limited to 1.84 on 12 cores



Maxime Pelcat - mpelcat@insa-rennes.fr - June 2013

#### Amdahl's Law



- Amdahl's law has brought many doubts on multicores
   It does not take into account inter-process communication that worsens
   the speedup
- Why add more cores if the parallelism of applications limits speedups so much?

Gustavson's Law



- Developed by John Gustavson in 1988
- With a different hypothesis, Gustafson has shown the limits of Amdahl's law

Hypothesis: more cores imply more parallelism, the sequential section stays the same percentage of execution latency regardless the number of cores

in Amdahl's law, the percentage tends to 100% because the parallel section time reduces and the sequential section stays unmodified

With 5% of sequential code, speedup is limited to 11.89 on 12 cores

#### Gustavson's Law (1988)

• With a different hypothesis, Gustafson has shown the limits of Amdahl's law

Hypothesis: more cores imply more parallelism, the sequential section stays the same percentage of execution latency regardless the number of cores (in Amdahl's law, it tends to 100%)



Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013

#### **Dataflow Speedup**

- The maximum speedup of dataflow execution car be computed
  - It is limited by the critical path length

Example: ignoring communication times







It is limited by the critical path length

Example: ignoring communication times critical path length = 1 + 5 + 3 + 2 + = 11ms work = 19 ms max speedup = 19 / 11 = 1.72



# **Preesm Speedup Assessment Chart**



Multicore DSP



## **Preesm Speedup Assessment Chart**

Speedup Assessment Chart Limitations
 The speedup assessment chart considers only latency
 →No pipelining is taken into account

All cores are considered identical in the chart (main operator) All communications have the same speed (main communication node)

## How to add speedup

Redescribe the application to find more parallelism Add initial tokens (delays) to pipeline

Institut National des Sciences Appliquées

Task/Data/Pipeline Parallelism

Parallelism in a dataflow graph

There are 3 types of parallelism: task, data and pipeline parallelism Specific to stream processing applications

Data parallelism



Task/Data/Pipeline Parallelism

Parallelism in a dataflow graph

There are 3 types of parallelism: task, data and pipeline parallelism Specific to stream processing applications

Task parallelism



## Task/Data/Pipeline Parallelism

Parallelism in a dataflow graph

There are 3 types of parallelism: task, data and pipeline parallelism Specific to stream processing applications



Maxime Pelcat - mpelcat@insa-rennes.fr - June 2013

Parallelism in a dataflow graph



Parallelism in a dataflow graph







152

## **Scheduling Strategies**





Multicore DSP

# **Scheduling Strategies**



# More adaptivity

|                   | assignment | ordering | Timing  |
|-------------------|------------|----------|---------|
| fully dynamic     | run        | run      | run     |
| static-assignment | compile    | run      | run     |
| self-timed        | compile    | compile  | run     |
| fully static      | compile    | compile  | compile |

EA Lee Scheduling strategies for multiprocessor real-time DSP

More performance  $\checkmark$ 



# **Scheduling Strategies**



# More adaptivity

|                   | assignment | ordering | Timing  |
|-------------------|------------|----------|---------|
| fully dynamic     | run        | run      | run     |
| static-assignment | compile    | run      | run     |
| self-timed        | compile    | compile  | run     |
| fully static      | compile    | compile  | compile |

EA Lee Scheduling strategies for multiprocessor real-time DSP

More performance  $\psi$ 



- Assignment, ordering and timing
  - Part of « Operational Research »
    →How to organize a company
    →How to organize a project (Gantt chart...)
    →How to take decisions in general

## • NP hard problem

the verification that a possible solution of the problem is valid can be computed in polynomial time (verifying that a schedule is valid)

no polynomial time algorithm for NP-complete problems is known and it is likely that none exists.

When the problem grows (for example, the number of cores or actors), solving it is becoming more complex exponentially.

 Multicore scheduling is equivalent to quadratic assignment NP hard problem

N facilities, each pair of facilities (f,g) associated to a flow of communication

N locations to put the facilities, each pair of locations (l,m) associated to a distance

 $\rightarrow$  In which location (bijection) should we put each facility to minimize traffic (the sum of the distances multiplied by the corresponding flows)







• The real problem is more complex

M actors, N<M cores to put the facilities, each pair of locations (I,m) associated to a distance Heterogeneity: actors have different costs on different cores The objective function is not only communication minimization

latency, throughput...







The problems is solved by heuristics
 Exhaustive methods are useless
 Heuristics explore only parts of the given problem

#### Many heuristics exist

list scheduling, greedy scheduling FAST scheduling (Y-K Kwok) Hybrid flow-shop scheduling (J. Boutellier) Meta-heuristics (genetic algorithms, ant colonies...)

 It is not possible to predict the quality of the result of a heuristic

But models should contain enough information to take decisions

. . .



G

# **Heterogeneous Multicore Scheduling**

## List scheduling

Actors are scheduled in-order (topological order) The core that can finish actor execution first wins 1ms





# List scheduling

Actors are scheduled in-order (topological order) The core that can finish actor execution first wins 1ms







## Fast scheduling

Actors moved around to improve cost function (load balancing...) Critical path is more protected 1ms







## Genetic algorithm

Several mapping solutions are kept, they undergo mutations and cross-overs and worst solutions are regularly removed



## Load balancing

If no information is available on task / actor behavior

parallelism is brought by balancing the loads Execution is managed by task/job/actor queuing

Core 3

#### Unbalanced

Core 2

Core 1

Balanced







Load balancing without task behavior prediction
 Implemented over multi-threading

Great freedom in thread creation

n Work-queueing (Apple Grand Central Dispatch, OpenMP)

The shared task queue becomes the bottleneck





# Load balancing without task behavior prediction

Implemented over multi-threading

One task queue per core: No more bottleneck

Hard to predict performance

Job stealing (Cilk, Intel Threading Building Blocks)



Maxime Pelcat - mpelcat@insa-rennes.fr - June 2013



# Load balancing without task behavior prediction

Implemented over multi-threading

One task queue per core: No more bottleneck

Hard to predict performance

Job stealing (Cilk, Intel Threading Building Blocks)







Scheduling with task behavior prediction

No adaptivity to algorithm modifications

No decision overhead



**Execution Schemes** 





**Actors versus threads** 



- Why not use threads and processes instead of actors?
- Threads share memory within a process

Multicore thread execution necessitates shared memory between cores Shared memory is increasingly costly when number of cores grows This method of parallelizing is showing its limits already with 8 cores

## Threads are designed for resource sharing

Cores, memory... What we want is more resource combining

## Actors are special types of processes

With firing rules, i.e. computation is triggered by available data Actors are dedicated to stream processing. Threads and processes can implement control-oriented code





# **Multicore Tools**

Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013

# **Some Tools and Initiatives**





Maxime Pelcat – mpelcat@insa-rennes.fr - June 2013





- Implemented on GCC 4.2
- Implemented in TI compiler for Keystone I
- Adds pragmas (metadata) in C to tell the compiler which loops can be parallelized
- Example:

```
#pragma omp parallel for
For(i=0; i<10; i++){
    T[i] = f(i);
}</pre>
```



The compiler knows that the iterations are independent (data parallelism in this case) and can be parallelized.

# **Other Tools and Initiatives with Common Objectives**



## Model-Driven-Engineering (MDE)

Starting from UML meta-model and meta-model transformation Defines a whole methodology from specification to final system UML profiles like UML Marte have been defined for that purpose

## Synchronous languages

Lustre, Signal, Esterel... Specifying synchronizations between operators: close to dataflow

#### C extensions

OpenMP, OpenCL, OpenACC, OpenHMPP, Cilk Principal objective: a painless migration to coarse-grain parallelism

#### StreaMIT - MIT

A language for stream processing, close to dataflow MoCs, and a compiler for massively parallel machiles

# **Other Tools and Initiatives with Common Objectives**

175

- Task queueing software frameworks
   Intel Threading Building Blocks, Apple Grand Central Dispatch
   Not specifically oriented towards stream processing applications
- Task queueing software/hardware frameworks

**Open Event Machine and Multicore Navigator** 

→ Available on C6678

Preesm: Rapid Prototyping for Multi-core DSP

- Eclipse-based Open Source Rapid Prototyping Tool
- Collaboration with Texas Instruments Nice
   Communication Infrastructure Business Unit (CI BU)
   Rapid prototyping of base station algorithms (used for LTE)

#### Goals

Combine imperative actor (host) code and dataflow coordination code Latency, memory and energy study of embedded code execution Before target hardware is available With efficient automatic parallelization heuristics Generating static multi-core code Reuse legacy code

# **Rapid Prototyping using Preesm**

17



Multicore DSP

# Algorithm Model

Static Hierarchical SDF with C-like behaviour (IBSDF and PiSDF) Combined with <u>C Actor Code</u>

# Architecture Model

System-Level Architecture Model (S-LAM) of heterogeneous architectures

Focuses on important <u>contention points</u>

# Prototyping

Shared memory / Message passing / DMA transfers / Load balancing enhancement



## **Static Code Generation : Self-Timed**

17



Multicore DSP



180

# **General Conclusion**

- Applications and architectures are increasingly complex
   Model-based system design helps at several design stages
- To evaluate languages/models: focus on MoC
   MoCs offer « pure » semantics, free of syntax
- No one-fit-all solution to design Multicore DSP systems

Many solutions exist now, complex choices have to be made

Maxime Pelcat - mpelcat@insa-rennes.fr - June 2013



