System-on-a-Chip: A Case for Heterogeneous Architectures

Jan M. Rabaey BWRC University of California @ Berkeley http://bwrc.eecs.berkeley.edu

With contributions from Richard Newton and many others



#### Design at a Crossroad: Silicon technology tracking Moore's Law



#### Design at a Crossroad: Applications beat Moore's Law



#### Design at a Crossroad: The Productivity Gap



## Design at a Crossroad: System-on-a-Chip



- Embedded applications where cost, performance, and energy are the real issues!
- DSP and control intensive
- Mixed-mode
- Combines programmable and application-specific modules
- Software plays crucial role

#### The Distributed Approach to Information Processing



## The Changing Metrics

- Power and/or Energy have become dominant drivers
  - Limiting factor for performance and reliability in wall-plugged applications
  - Enabler for wide-spread use of distributed computing and data access

Energy reduction requires a joint optimization process between application and implementation

# The Changing Metrics

- Cost of fabrication facilities and mask making has increased significantly
  - NRE cost of new design has increased significantly
- Physical effects (parasitics, reliability issues, power management) are increasingly significant in the design process
  - These must now be considered explicitly at the circuit level
- Design complexity and "context complexity" are sufficiently high that design verification is a major limitation on time-to-market

# **Towards Fewer, but more Flexible and Reusable Silicon Platforms**

## The Changing Metrics



## The System-on-a-Chip Nightmare



"Femme se coiffant" Pablo Ruiz Picasso 1940

#### The System-on-a-Chip Nightmare



#### System-on-a-Chip: A Renaissance in Design

Convergence (Aart De Geus, DAC'99):

- Design Methodology: Hard + Soft
- Implementation Fabrics: Silicon substrate, Silicon fabrics
- Applications: Multimedia, Consumer, Communications




- Embedded ARM-8 microprocessor (hard IP)
- Tensilica synthesized and configurable µProcessor (soft IP)



Courtesy of ARM, Tensilica Inc

"Very-Short Instruction Word" Processors







Philips Nexperia NX-2700

- A programmable HDTV media processor
- Combines Trimedia VLIW with configurable media co-processors



## **Architectural Choices**



## The Energy-Flexibility Gap





#### **Programming the Platform**



## Fast Design Space Exploration



# A Case Study: The Integrated CMOS Radio





#### Trends in Wireless Systems

- Towards better spectrum utilization using aggressive signal and protocol processing
  - Examples: multi-user detection, multi-antenna arrays
- Towards adaptive, multi-functional networks
  - Example: IMT-2000 / UMTS (3G)
- Towards ubiquitous wireless networking
  - Examples: Bluetooth, HomeRF, FireFly

Resulting requirements: high performance, low energy, adaptivity, and flexibility

#### Issues in Single-Chip Radio Design



#### The Software Radio

Idea: Digitize the (wideband) signal at the antenna and use signal processing to extract the desired signal

(Figure: block diagram with A/D converter, DSP, and D/A converter)

- Leverages advances in technology, circuit design, and signal processing
- Software solution enables flexibility and adaptivity, but at a huge price in power and cost
- A 16-bit A/D converter at 2.2 GHz dissipates 1 to 10 W

# The Mostly Digital Radio



#### The Software-Definable Radio



## A Trend Towards GP DSPs?



#### Single-Chip DSPs are Lagging ...



#### And Computation Seems Almost for Free





#### 0.25 μm CMOS process

- Area: 0.18 mm<sup>2</sup>
- Perf: 25 MCMACs/sec @ 1 V
- Energy: 40 MCMACs/mW

# Energy Trends in DSPs



(Figure: DSP energy trend; axis in mW/MIPS)

## The Implementation Trade-off



#### Adaptive Multi-User Detection: A Direct Mapping Approach



- Power and area are dominated by MACs and multiplies
- Only 36% of the power of the DSP-processor solution goes into arithmetic

## Reconfigurable Computing: Merging Efficiency and Versatility

#### Spatially programmed connection of processing elements.

$$y = Ax^2 + Bx + C$$



"Hardware" customized to specifics of problem. Direct map of problem specific dataflow, control. Circuits "adapted" as problem requirements change.

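To make the idea of a direct spatial mapping concrete, here is a minimal Python sketch that models the dataflow of y = Ax^2 + Bx + C as a fixed network of multiplier and adder elements connected by named wires. The `ProcessingElement` class, the wire names, and the second (Horner-form) configuration are illustrative assumptions, not part of any real reconfigurable fabric or of the original slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ProcessingElement:
    """Hypothetical fixed-function unit: an operation and its two input wires."""
    op: Callable[[float, float], float]
    inputs: Tuple[str, str]


def mul(a: float, b: float) -> float:
    return a * b


def add(a: float, b: float) -> float:
    return a + b


# Direct map of y = A*x^2 + B*x + C: three multipliers and two adders.
# Each key is the output wire of one element; entries are listed in
# topological order so a single pass settles every wire.
DIRECT_CONFIG: Dict[str, ProcessingElement] = {
    "x2": ProcessingElement(mul, ("x", "x")),
    "ax2": ProcessingElement(mul, ("A", "x2")),
    "bx": ProcessingElement(mul, ("B", "x")),
    "sum": ProcessingElement(add, ("ax2", "bx")),
    "y": ProcessingElement(add, ("sum", "C")),
}

# "Reconfigured" network for the same function in Horner form,
# y = (A*x + B)*x + C, which needs only two multipliers and two adders.
HORNER_CONFIG: Dict[str, ProcessingElement] = {
    "ax": ProcessingElement(mul, ("A", "x")),
    "axb": ProcessingElement(add, ("ax", "B")),
    "axbx": ProcessingElement(mul, ("axb", "x")),
    "y": ProcessingElement(add, ("axbx", "C")),
}


def evaluate(config: Dict[str, ProcessingElement],
             inputs: Dict[str, float]) -> Dict[str, float]:
    """Propagate the input values through the configured network."""
    wires = dict(inputs)
    for out_wire, element in config.items():
        a, b = (wires[name] for name in element.inputs)
        wires[out_wire] = element.op(a, b)
    return wires


inputs = {"A": 2.0, "B": 3.0, "C": 4.0, "x": 5.0}
direct_y = evaluate(DIRECT_CONFIG, inputs)["y"]
horner_y = evaluate(HORNER_CONFIG, inputs)["y"]
print(direct_y, horner_y)  # 2*25 + 3*5 + 4 = 69.0 for both
```

Both configurations compute the same result; swapping the configuration table while keeping the same pool of elements is a rough software analogue of adapting the circuit as problem requirements change.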
#### A New Look at Architectures — Heterogeneous Reconfiguration



#### Multi-granularity Reconfigurable Architecture: The Berkeley Pleiades Architecture

(Figure: Pleiades block diagram. A control processor and satellite processors (arithmetic processors, dedicated arithmetic, configurable datapath, configurable logic) connect through network interfaces to a communication network, with a separate configuration bus.)

- Computational kernels are "spawned" to satellite processors
- Control processor supports RTOS and reconfiguration
- Order(s) of magnitude energy-reduction over traditional programmable architectures

#### Matching Computation and Architecture



#### **Example: Covariance Matrix Computation**



### **Reconfigurable Kernels for W-CDMA**



- Dominant kernel M(M<sup>T</sup>X) requires array of MACs and segmented memories
- Additional operations such as sqrt(x), 1/x, and trellis decoding may be implemented using an FPGA or CORDIC satellite
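As a rough illustration of why the M(M<sup>T</sup>X) kernel maps onto an array of multiply-accumulate units, the sketch below evaluates it as two back-to-back matrix-vector products and counts the MAC operations. It assumes X is a single vector and uses plain Python lists; the helper names `matvec`, `transpose`, and `kernel` are hypothetical and do not come from the Pleiades tool flow.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(matrix: Matrix, vector: Vector) -> Tuple[Vector, int]:
    """Return (matrix * vector, number of multiply-accumulates performed)."""
    result: Vector = []
    macs = 0
    for row in matrix:
        acc = 0.0
        for a, b in zip(row, vector):
            acc += a * b  # one MAC per matrix entry
            macs += 1
        result.append(acc)
    return result, macs


def transpose(matrix: Matrix) -> Matrix:
    return [list(column) for column in zip(*matrix)]


def kernel(m: Matrix, x: Vector) -> Tuple[Vector, int]:
    """Compute M (M^T x) as two matrix-vector products."""
    inner, macs_inner = matvec(transpose(m), x)
    outer, macs_outer = matvec(m, inner)
    return outer, macs_inner + macs_outer


# Small 3x2 example (values chosen only for illustration).
M = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
x = [1.0, 0.0, -1.0]

y, mac_count = kernel(M, x)
print(y, mac_count)  # [-12.0, -28.0, -44.0] 12
```

For an n x k matrix this ordering costs 2nk MACs per input vector, and every inner loop is a pure accumulate over a row read from memory, which is the access pattern an array of MACs with segmented memories serves well.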



### Data-driven Synchronization Based on Finite Streams

- "Smart" satellites able to handle data inputs of different types
- Support of multi-dimensional signal processing
- Introduction of data types: scalars, vectors, matrices



#### Impact of Architectural Choice



#### Adaptive Multi-User Detector for W-CDMA: Pilot Correlator Unit Using LMS



### Architecture Comparison

LMS correlator at 1.67 MSymbols/sec data rate

Complexity: 300 Mmult/sec and 357 Macc/sec

| Type         | Power (mW) | Area (mm<sup>2</sup>) |
|--------------|------------|-----------------------|
| TMS320C54*   | 460        | 1089                  |
| Pleiades     | 18.09      | 5.448                 |
| ASIC [Zhang] | 3          | 1.5                   |

16 Mmacs/mW!

\* Note: the TMS implementation requires 36 parallel processors to meet the data rate, so the validity of the comparison is questionable
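The arithmetic behind this comparison can be checked with a few lines of Python. The sketch below copies the power and area figures from the table and divides the 300 Mmult/sec workload by each power figure. Reading the "16 Mmacs/mW" callout as the Pleiades multiply rate per milliwatt is an assumption; the slide does not state how that number was derived.

```python
# Figures copied from the comparison table (power in mW, area in mm^2).
MULT_RATE_MMULT_PER_S = 300.0  # LMS correlator multiply rate

IMPLEMENTATIONS = {
    "TMS320C54": {"power_mw": 460.0, "area_mm2": 1089.0},
    "Pleiades": {"power_mw": 18.09, "area_mm2": 5.448},
    "ASIC": {"power_mw": 3.0, "area_mm2": 1.5},
}


def mmult_per_mw(power_mw: float) -> float:
    """Multiply throughput delivered per milliwatt (Mmult/sec per mW)."""
    return MULT_RATE_MMULT_PER_S / power_mw


def ratio_to_asic(metric: str) -> dict:
    """Express one metric for every implementation relative to the ASIC."""
    base = IMPLEMENTATIONS["ASIC"][metric]
    return {name: row[metric] / base for name, row in IMPLEMENTATIONS.items()}


efficiency = {name: mmult_per_mw(row["power_mw"])
              for name, row in IMPLEMENTATIONS.items()}
power_ratio = ratio_to_asic("power_mw")
area_ratio = ratio_to_asic("area_mm2")

for name in IMPLEMENTATIONS:
    print(f"{name:10s} {efficiency[name]:7.2f} Mmult/s per mW  "
          f"power x{power_ratio[name]:.1f}  area x{area_ratio[name]:.1f}")
```

Under these assumptions Pleiades lands at roughly 16.6 Mmult/sec per mW, about 6x the ASIC's power and about 25x below the 36-processor DSP solution.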

### Maia: Reconfigurable Baseband Processor for Wireless



- 0.25 μm technology: 4.5 mm x 6 mm
- 1.2 million transistors
- 40 MHz at 1 V
- 1 mW VSELP voice coder
- Hardware
  - 1 ARM-8
  - 8 SRAMs & 8 AGPs
  - 2 MACs
  - 2 ALUs
  - 2 In-Ports and 2 Out-Ports
  - 14x8 FPGA

## Fast Design Space Exploration: Interconnect Models



# **Design Methodology and Flow**

- **Requires architecture exploration over heterogeneous implementation fabrics**
- Should support refinement and codesign of hardware and software, as well as behavior and architecture
- Should consider all important metrics, and present a PDA (Power-Delay-Area) perspective

### Software Methodology Flow



### Hardware-Software Exploration

(Screenshot of the exploration tool in a browser window.)

VSELP energy breakdown (only function calls are shown; some entries are truncated in the source):

- main
- IIRfilter (5.131%)
- ConvertToReflection (0.680%)
- ConvertToDirectFor… (0.030%)
- QuantizeGains (18.4…)
- theta (0.030%)
- SearchCodebook (37.684%)
- ComputeLag (32.553%)
- dot_product
- IIRfilter (3.25…)

Dot_Product summary:

| # | Name              | Accesses | Domain        | Energy/Access (pJ) | Energy (µJ) |
|---|-------------------|----------|---------------|--------------------|-------------|
| 1 | Address_Generator | 325110   | inherit       | 3.9e+00            | 1.2e+00     |
| 2 | Memory            | 325110   | inherit       | 2.5e+01            | 8.2e+00     |
| 3 | MAC               | 162555   | 1.2um Library | 1.0e+02            | 1.6e+01     |
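The per-module energy figures in the Dot_Product summary appear to follow a simple model: energy equals access count times energy per access. The Python sketch below re-derives them from the three rows shown, reading the energy-per-access column in pJ and the energy column in µJ (an assumption about units that are garbled in the extracted screenshot). The function and variable names are illustrative, not those of the actual tool.

```python
# (name, access count, energy per access in pJ), copied from the summary rows.
ROWS = [
    ("Address_Generator", 325110, 3.9),
    ("Memory", 325110, 25.0),
    ("MAC", 162555, 100.0),
]

PJ_PER_UJ = 1e6  # 1 microjoule = 1e6 picojoules


def module_energy_uj(accesses: int, pj_per_access: float) -> float:
    """Energy of one module: accesses times energy per access, in microjoules."""
    return accesses * pj_per_access / PJ_PER_UJ


energy_uj = {name: module_energy_uj(n, e) for name, n, e in ROWS}
total_uj = sum(energy_uj.values())

for name, value in energy_uj.items():
    share = 100.0 * value / total_uj
    print(f"{name:18s} {value:6.2f} uJ  ({share:4.1f}% of dot_product)")
print(f"{'total':18s} {total_uj:6.2f} uJ")
```

The recomputed values (about 1.27, 8.13, and 16.26 µJ) agree with the displayed 1.2, 8.2, and 16 µJ to within what the two-digit rounding of the energy-per-access column allows.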

### Implementation Fabrics for Protocols



# Intercom TDMA MAC: Implementation Alternatives

|        | ASIC      | FPGA      | ARM8       |
|--------|-----------|-----------|------------|
| Power  | 0.26mW    | 2.1mW     | 114mW      |
| Energy | 10.2pJ/op | 81.4pJ/op | n*457pJ/op |

- ASIC: 1V, 0.25 μm CMOS process
- FPGA: 1.5 V 0.25 μm CMOS low-energy FPGA
- ARM8: 1 V 25 MHz processor; n = 13,000
- Ratio (ASIC : FPGA : ARM8): 1 : 8 : >> 400
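The quoted ratio can be reproduced from the table with a short Python sketch that normalizes each implementation to the ASIC. Only the power row and the two fully specified energy entries are used; the ARM8 energy entry depends on how the factor n is interpreted, so it is deliberately left out. Variable names are illustrative.

```python
# Values copied from the table above.
POWER_MW = {"ASIC": 0.26, "FPGA": 2.1, "ARM8": 114.0}
ENERGY_PJ_PER_OP = {"ASIC": 10.2, "FPGA": 81.4}  # ARM8 entry omitted (depends on n)


def normalize(values: dict, baseline: str = "ASIC") -> dict:
    """Express every entry relative to the baseline implementation."""
    base = values[baseline]
    return {name: value / base for name, value in values.items()}


power_ratio = normalize(POWER_MW)
energy_ratio = normalize(ENERGY_PJ_PER_OP)

print("power :", {k: round(v, 1) for k, v in power_ratio.items()})
print("energy:", {k: round(v, 1) for k, v in energy_ratio.items()})
```

Both rows put the FPGA at about 8x the ASIC, and the power row alone puts the ARM8 at roughly 440x, consistent with the ">> 400" on the slide.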

### The Software-Defined Radio



#### Summary and Perspective

- Technology scaling is redefining the term "complexity"
- System-on-a-Chip fosters a renaissance in processor architecture, opening the door for new models and combinations thereof: **Component and Communication Based Design**
- SOC is driven by a new set of metrics: how to simultaneously optimize flexibility, cost, energy, and performance?
- **Reconfigurable architectures provide a tantalizing combination of flexibility and efficiency**
- Numerous solutions exist for addressing the data-intensive component of the software-defined radio; the next challenge is control