

**PULP PLATFORM** Open Source Hardware, the way it should be!

## An Open-Source Platform for High-Performance Non-Coherent On-Chip Communication

Thomas Benz <tbenz@iis.ee.ethz.ch>













### **Motivation**

- Trend towards more complex ICs
  - Larger die sizes
  - Feature scaling (Intel 20A)
  - Increasing heterogeneity (ML accelerators)
- Huge amount of high-BW memory
  - On-chip: SRAM
  - Off-chip: HBM2E





Tesla D1: 450MiB on-chip SRAM





ETH Zürich

#### Need for high-BW point-to-point data transfers




## **Interconnect Standards**

- Intel: Ultra Path Interconnect
- AMD: Scalable Data Fabric
- IBM: Power9 on-chip interconnect
  - Not available for third parties (or only under royalties)
- ARM: Advanced eXtensible Interface (AXI) (and others)
  - Open standard that can be used without royalties

## **AXI Implementations**

- ARM: AMBA Products (CoreLink NIC-400, CCI-500, ...)
- Synopsys: DesignWare IP Solutions for AMBA Interconnect
- Cadence: VIP only
  - Proprietary, expensive
- Xilinx: LogiCORE IP Products (Interconnect, Data Width Converter, ...)
  - Licensed with Xilinx products, but FPGA only
  - Generated & adapted with IP Integrator

FOSS, technology-independent implementation?



# **ETH Zurich PULP platform AXI**

- FOSS, technology-independent
- AXI4 and AXI4-Lite synthesizable IPs in SystemVerilog
  - Written and optimized by hand
- Extensive verification infrastructure
  - UVM-compatible
- Full architectural description and extensive documentation
- Fully customizable and extensible
  - User signals are routed
  - Achieve best performance by customizing to the application

# **AXI Architecture / Terminology**

### 5 independent transaction channels

- Valid, ready, last handshake
- Components
  - Master, slave, interconnect
- Master initiates AXI operation to slave
  - Set of required operations -> transaction
  - Burst of individual data beats
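The handshake and burst terminology above can be illustrated with a small behavioural model. This is only a sketch in Python, not RTL: the channel names are the standard AXI ones, while `Beat`, `make_burst` and `transfer` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# The five AXI channels: write address, write data, write response,
# read address, read data.
AXI_CHANNELS = ("AW", "W", "B", "AR", "R")


@dataclass
class Beat:
    data: int
    last: bool  # asserted on the final beat of a burst


def make_burst(payload):
    """Turn a list of data words into beats; only the final beat sets `last`."""
    return [Beat(d, i == len(payload) - 1) for i, d in enumerate(payload)]


def transfer(beats, ready_pattern):
    """Simulate one channel cycle by cycle.

    The sender keeps `valid` high while it has a beat pending. A beat moves
    only in a cycle where the receiver's `ready` is high as well.
    Returns (received_beats, cycles_used).
    """
    received, idx, cycles = [], 0, 0
    for ready in ready_pattern:
        if idx == len(beats):
            break  # burst complete
        cycles += 1
        valid = True  # sender still has data pending
        if valid and ready:
            received.append(beats[idx])
            idx += 1
    return received, cycles
```

A receiver that deasserts `ready` for one cycle stretches a three-beat burst to four cycles; the data and the `last` flag arrive unchanged.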







# **AXI Multiplexer**

- Connect multiple slave ports to one master port
- Operation:
  - Multiplexing forward channel
  - Fair round-robin arbitration
  - Demultiplexing backward channel

### Complexity: backward channel

- Critical path: O(log S) (arbitration)
- Area: O(S) (arbitration)
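The operation listed above can be sketched as a behavioural model. Two assumptions are made here that the slides do not state: the backward channel is demultiplexed by prepending the slave port index to the transaction ID, and the arbiter is modelled as a linear scan (a hardware arbiter would use a tree to reach the O(log S) critical path). All names are hypothetical.

```python
class RoundRobinArbiter:
    """Fair round-robin arbiter over `num_ports` requesters."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.next_port = 0  # port with the highest priority in the next round

    def grant(self, requests):
        """Return the index of the granted port, or None if nobody requests.

        `requests` is a list of booleans, one per slave port.
        """
        for offset in range(self.num_ports):
            port = (self.next_port + offset) % self.num_ports
            if requests[port]:
                # The winner gets the lowest priority in the next round.
                self.next_port = (port + 1) % self.num_ports
                return port
        return None


def prepend_port(axi_id, port, id_width):
    """Forward channel: extend the ID with the index of the granted port."""
    return (port << id_width) | axi_id


def split_port(mux_id, id_width):
    """Backward channel: recover (port, original ID) from an extended ID."""
    return mux_id >> id_width, mux_id & ((1 << id_width) - 1)
```

With this scheme the response path needs no lookup table: the upper ID bits select the slave port directly.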





# **AXI Demultiplexer**

- Connects one slave port to multiple master ports
- Operation:
  - Externally select master port
  - Store id information to route reordered responses



- Complexity: keep ordering
  - Critical path: O(M), O(I)
  - Area:  $O(M), O(2^{I})$
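The ordering constraint above can be sketched with a per-ID table. This assumes that I denotes the ID width in bits (the slides do not define it) and that ordering is kept by stalling a request whose ID still has transactions in flight on a different master port. The class and method names are hypothetical.

```python
class DemuxIdTable:
    """Tracks, per AXI ID, which master port has transactions in flight."""

    def __init__(self, id_width):
        # One entry per possible ID, hence the 2^I area term.
        self.port = [None] * (1 << id_width)
        self.count = [0] * (1 << id_width)

    def can_issue(self, axi_id, master_port):
        """A request may go out if the ID is idle or bound to the same port."""
        return self.count[axi_id] == 0 or self.port[axi_id] == master_port

    def issue(self, axi_id, master_port):
        """Record a new transaction; return False if it has to stall."""
        if not self.can_issue(axi_id, master_port):
            return False  # responses with one ID could otherwise reorder
        self.port[axi_id] = master_port
        self.count[axi_id] += 1
        return True

    def complete(self, axi_id):
        """Retire one transaction and return the port its response came from."""
        assert self.count[axi_id] > 0, "response without outstanding request"
        self.count[axi_id] -= 1
        return self.port[axi_id]
```

Transactions with different IDs may target different master ports at the same time; only a port change within one ID has to wait until that ID has drained.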

# AXI Crossbar (X-bar)

- Connects S slave ports to M master ports
- Operation:
  - Address decoding (slave ports)
  - Master selection (demultiplexer)
  - Multiplexing
  - Optionally: add cuts, error slave
- Complexity:
  - Critical path: O(M + I) (demux)
  - Area:  $O(MS + 2^{I}S)$  (S demux, M mux)
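The address decoding and error-slave steps above can be sketched as follows. The rule format, the function names and the example address map are hypothetical and only meant to show the routing decision taken at each slave port.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    master_port: int
    start: int  # inclusive
    end: int    # exclusive


def decode(addr: int, rules) -> Optional[int]:
    """Return the master port whose address range contains `addr`, else None."""
    for rule in rules:
        if rule.start <= addr < rule.end:
            return rule.master_port
    return None


def route(addr, rules, error_port):
    """Routing decision for one request: unmapped addresses go to the error slave."""
    port = decode(addr, rules)
    return error_port if port is None else port


# Hypothetical address map with two mapped regions; port 2 is the error slave.
RULES = [Rule(0, 0x0000_0000, 0x0000_1000), Rule(1, 0x8000_0000, 0x8001_0000)]
```

Routing every unmapped address to an error slave guarantees that each request receives a response, so a stray access does not stall the initiating master.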






# Additional Design IPs

- ID remapper and serializer
- Data upsizer and downsizer



- Simplex and duplex on-chip SRAM controller
- AXI-attached last level cache (LLC)
- Multi-channel AXI DMA engine




# **Building large Systems from our IPs**

- Our IPs are written and optimized by hand in SystemVerilog
- Naturally: use SystemVerilog to create the AXI system
  - AXI has many signals -> Tedious and error-prone process
  - We provide a macro-based solution to create AXI types and connect buses
  - Even with SV generate constructs, fast exploration is limited
- One solution: use HLS to describe the topology (and generate Verilog code)
  - We would have to throw away our optimized IPs
- Solution: use template-based HLS strategy

# Solder: Template & IP-based HLS

### Python-based application

- Configuration file for:
  - Parameters, addresses
  - Other user-defined constants
- Mako templating for SV base
- Address maps and routes are propagated and sanity checks performed
- Interconnect, SoC, testbench, documentation, linker script, ... generation
- Generates understandable (modifiable) SystemVerilog
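One sanity check of the kind mentioned above is detecting overlapping address regions. The following is a minimal sketch of such a check, not Solder's actual code; the function name and the `(base, size)` representation are assumptions made for this example.

```python
def find_overlaps(address_map):
    """Return pairs of region names whose [base, base + size) ranges overlap.

    `address_map` maps a region name to a (base, size) tuple.
    """
    regions = sorted(address_map.items(), key=lambda kv: kv[1][0])
    overlaps = []
    for i, (name_a, (base_a, size_a)) in enumerate(regions):
        for name_b, (base_b, _) in regions[i + 1:]:
            if base_b >= base_a + size_a:
                break  # sorted by base: no later region can overlap name_a
            overlaps.append((name_a, name_b))
    return overlaps
```

Running such a check at generation time turns an address-map mistake into an early error message instead of a hard-to-find routing bug in simulation.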

Flow: config file (JSON) + template (SystemVerilog) -> solder.py -> output (SystemVerilog)
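The config-plus-template flow can be sketched in a few lines. To stay self-contained, this example uses the standard library's `string.Template` as a stand-in for Mako. The configuration keys and the generated module are hypothetical, since the slides do not show Solder's real schema.

```python
import json
from string import Template

# Hypothetical configuration; Solder's real schema is not shown in the slides.
CONFIG_JSON = """
{
  "name": "soc_xbar",
  "addr_width": 48,
  "data_width": 64,
  "num_slave_ports": 2,
  "num_master_ports": 3
}
"""

# string.Template stands in for Mako; placeholders use ${name} syntax.
SV_TEMPLATE = Template("""\
// Generated file, edit the template instead.
module ${name};
  localparam int unsigned AddrWidth  = ${addr_width};
  localparam int unsigned DataWidth  = ${data_width};
  localparam int unsigned NoSlvPorts = ${num_slave_ports};
  localparam int unsigned NoMstPorts = ${num_master_ports};
endmodule
""")


def generate(config_text, template):
    """Parse the JSON config and substitute its values into the template."""
    config = json.loads(config_text)
    return template.substitute(config)


sv_source = generate(CONFIG_JSON, SV_TEMPLATE)
```

Because the output is ordinary SystemVerilog text, it stays readable and can still be modified by hand after generation.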



NOCS 2021 - Special Session II - October 15, 2021



## **Future Development: Fast Exploration**

- Solder generates the Interconnect from
  - Config file and SystemVerilog template
- It has the full overview of the instantiated AXI IPs and their configuration
- From fitted models it is possible to:
  - Estimate the critical path of each IP
  - Estimate the size of each IP
- Estimate timing and area of the full AXI system
  - -> Do ultra-fast (automated) exploration
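A model-based exploration loop could look like the sketch below. It reuses the crossbar complexity expressions from the earlier slide, O(M + I) for the critical path and O(MS + 2^I S) for the area. The coefficients are hypothetical placeholders in arbitrary units, not fitted or measured values.

```python
from itertools import product

# Hypothetical fitted coefficients (arbitrary units), not measured values.
AREA_PER_CROSSPOINT = 1.0   # scales the M*S term
AREA_PER_ID_ENTRY = 0.5     # scales the 2^I * S term
DELAY_PER_PORT = 10.0       # scales the M term
DELAY_PER_ID_BIT = 5.0      # scales the I term


def xbar_area(m, s, i):
    """Area model following O(M*S + 2^I * S)."""
    return AREA_PER_CROSSPOINT * m * s + AREA_PER_ID_ENTRY * (2 ** i) * s


def xbar_critical_path(m, i):
    """Critical-path model following O(M + I)."""
    return DELAY_PER_PORT * m + DELAY_PER_ID_BIT * i


def explore(m_values, s_values, i_values, max_delay):
    """Return feasible (area, m, s, i) tuples, smallest area first."""
    results = []
    for m, s, i in product(m_values, s_values, i_values):
        if xbar_critical_path(m, i) <= max_delay:
            results.append((xbar_area(m, s, i), m, s, i))
    return sorted(results)
```

Evaluating closed-form models like these takes microseconds per configuration, which is what makes an automated sweep far faster than synthesizing every candidate.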

## **Future Development: AXI Extensions**

### Error correction

- Space-grade applications
- Use redundant data (e.g. parity)
- User Signals
- Create / check integrity at source / sink
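The source/sink integrity check can be sketched with a single even-parity bit carried alongside the data, for example in an AXI user signal. This is an assumed minimal scheme with hypothetical function names. Parity alone only detects an odd number of bit flips, so actual correction would need a stronger code.

```python
def even_parity(word):
    """Return 1 if `word` has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


def encode(data):
    """Source side: attach one parity bit that travels in the user signal."""
    return data, even_parity(data)


def check(data, user):
    """Sink side: True if the received data matches the parity bit."""
    return even_parity(data) == user
```

A single flipped data bit makes `check` fail at the sink, while two flipped bits cancel out and pass unnoticed.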

### Memory stream-based extensions

- Custom burst types
- Encode more complex memory streams
  - N-D transfers with regular strides
  - Scatter / gather operations with arbitrary memory streams
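The two kinds of memory streams above can be described by the address sequences they produce. The sketch below is a software model with hypothetical names. It says nothing about how such streams would be encoded in a custom burst type.

```python
from itertools import product


def nd_addresses(base, dims):
    """Yield the addresses of an N-D transfer with regular strides.

    `dims` lists (count, stride_in_bytes) pairs from the outermost
    dimension to the innermost one.
    """
    ranges = [range(count) for count, _ in dims]
    strides = [stride for _, stride in dims]
    for index in product(*ranges):
        yield base + sum(i * s for i, s in zip(index, strides))


def gather_addresses(base, offsets):
    """Addresses of a gather: an arbitrary list of byte offsets from `base`."""
    return [base + off for off in offsets]
```

A 2-D transfer of two rows with three 8-byte elements each and a row pitch of 0x100 is fully described by `[(2, 0x100), (3, 8)]`, whereas a gather needs one offset per element.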



**PULP in Space** 

## Conclusion

- We created a synthesizable FOSS AXI4 implementation that can compete with industry-grade solutions
  - Check it out at https://github.com/pulp-platform/axi
- Our AXI4 implementation is written in SystemVerilog, optimized by hand and fully characterized
- We have a template-based HLS approach to create SoCs and their AXI4 interconnects
  - Check out our reference RISC-V system: https://github.com/pulp-platform/snitch