#### Clock Gating for Power Optimization in ASIC Design Cycle: Theory & Practice

Jairam S, Madhusudan Rao, Jithendra Srinivas, Parimala Vishwanath, Udayakumar H, Jagdish Rao SoC Center of Excellence, Texas Instruments, India (sjairam, bgm-rao, jithendra, pari, uday, j-rao) @ti.com

## AGENDA

- Introduction
- Combinational Clock Gating
  - State of the art
  - Open problems
- Sequential Clock Gating
  - State of the art
  - Open problems
- Clock Power Analysis and Estimation
- Clock Gating In Design Flows



## Introduction

## **Clock Gating Overview**

System-level gating: turn off an entire block, disabling all of its functionality. The conditions for disabling are identified by the designer.

- Suspend clocks selectively
- No change to functionality
- Specific to circuit structure
- Possible to automate gating at **RTL or gate level**





- Clock network power consists of:
  - Clock tree buffer power
  - Clock tree dynamic power due to wires
  - CLK->Q sequential internal power
- Leaf levels drive the highest capacitance in the tree
- ~80% of the clock network dynamic power is consumed by the leaf driver stage
  - The clock pins of registers are considered as loads
  - Leaf cap = wire cap + (constant) pin cap
  - Good clustering during synthesis reduces wire cap
- Effective clock gating isolates these leaf-level buffers and their capacitance, providing large dynamic power savings
- Larger savings with CGs higher up in the tree
  - A trade-off with timing

The clock network consumes 30-50% of the total dynamic power of the chip.
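A back-of-the-envelope sketch of the leaf-stage bullets above, using the standard switched-capacitance model $P = C V^2 f$ for a net that toggles every cycle. The function names and all numeric values (capacitances, supply voltage, frequency, enabled fraction) are illustrative assumptions, not figures from this tutorial.

```python
def leaf_capacitance(wire_cap, pin_cap, num_registers):
    """Leaf cap = wire cap + pin cap; the pin cap is fixed by the register count."""
    return wire_cap + pin_cap * num_registers


def leaf_clock_power(cap, vdd, freq, enabled_fraction=1.0):
    """Dynamic power of a clock net: C * V^2 * f, scaled by how often it toggles."""
    return enabled_fraction * cap * vdd ** 2 * freq


# Hypothetical cluster: 32 registers, 1 fF per clock pin, 20 fF of wire.
cap = leaf_capacitance(wire_cap=20e-15, pin_cap=1e-15, num_registers=32)
ungated = leaf_clock_power(cap, vdd=1.0, freq=200e6)

# Assume the clock gate lets the clock through only 25% of the time.
gated = leaf_clock_power(cap, vdd=1.0, freq=200e6, enabled_fraction=0.25)
saving_percent = 100.0 * (ungated - gated) / ungated
```

In this model only the wire term can be reduced by better clustering, since the pin term depends only on how many registers hang off the leaf driver.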



### **Clock Gating and Power consumption**

- The power a flop dissipates due to clock toggles lies in its CLK-Q transition power arc
- Disable the clock to a flop when the D pin does not toggle
  - Disable the CLK-Q arc
  - Identify all the D-pin non-toggle scenarios
- Can non-toggling of a D pin be used to find gating scenarios across the clock boundary?
  - Multi-cycle scenarios


#### **Construction of a Clock Gate**
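One common construction, assumed here for illustration rather than taken from the slide, is a latch-based clock-gating cell: a latch that is transparent while the clock is low captures the enable, and an AND gate combines the latched enable with the clock. The Python sketch below models this at half-cycle granularity; the function name and the sample waveforms are hypothetical.

```python
def latch_based_clock_gate(clk_phases, en_phases):
    """Return the gated clock for per-half-cycle clock and enable samples."""
    latched_en = 0
    gated_clk = []
    for clk, en in zip(clk_phases, en_phases):
        if clk == 0:
            latched_en = en  # latch is transparent only while the clock is low
        gated_clk.append(clk & latched_en)
    return gated_clk


clk = [0, 1, 0, 1, 0, 1]
en = [1, 0, 0, 1, 0, 0]  # enable changes while the clock is high in phases 1 and 3
gclk = latch_based_clock_gate(clk, en)
```

Because the enable is held while the clock is high, the pulse in phase 1 is delivered whole and the late enable in phase 3 produces no partial pulse.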





## Combinational Clock Gating



## **Combinational CG: State-of-the-art - 1**

- Compile the logic (RTL or netlist) and detect structural scenarios that lead to data gating
  - Identify load-enable registers
- The most common is the mux-feedback loop (MFL) from the output of a flop back to its input
- Replacing the feedback mux with a clock gate reduces datapath delay and area




#### **Combinational CG: State-of-the-art - 2**

- Identify registers with low data activity
- Additional CGs cost area
  - Grouping registers and building an XOR tree introduces a single CG for the group
- To guarantee power reduction, the method should be based on placement information
  - Timing and congestion are affected
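A minimal Python sketch of the XOR-tree idea, assuming the group enable is the OR of the per-bit D XOR Q comparisons, so the single CG passes the clock only when at least one register in the group would change. The function names and the trace are hypothetical.

```python
def xor_enable(d_bits, q_bits):
    """Group enable: OR of per-register (D xor Q) comparisons."""
    return any(d ^ q for d, q in zip(d_bits, q_bits))


def gated_register_clockings(d_trace, q_init):
    """Count per-register clock events for a group sharing one XOR-enabled CG."""
    q = list(q_init)
    clockings = 0
    for d_bits in d_trace:
        if xor_enable(d_bits, q):
            q = list(d_bits)
            clockings += len(q)  # the whole group is clocked together
    return clockings, q


# Two-register group over four cycles (made-up data).
trace = [(0, 0), (1, 0), (1, 0), (1, 1)]
clockings, final_q = gated_register_clockings(trace, q_init=(0, 0))
ungated_clockings = len(trace) * 2
```

The wider the group, the more often some bit differs and the enable fires, which is one reason grouping trades CG area against achievable savings.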





# **Combinational CG: Open Problems**

- Activity-driven clock gating
  - Clock gating should be done only if it improves overall power, based on switching activity
  - There can be more than one scenario that needs to be optimized
  - Clock gating should not be done for registers with high switching activity
- Placement-driven optimization
  - Cloning/merging of clock gates
- Observability Don't Care (ODC)
  - Registers whose outputs are not observable during a clock cycle should be isolated
- Leakage/static power impact
  - All clock gating techniques should comprehend total power
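The activity-driven decision can be illustrated with a deliberately simple cost model, sketched below in Python. The model, the function names, and the numbers (arbitrary power units) are all assumptions: gating saves the register clock power for the fraction of time the enable is low, and costs the total power (dynamic plus leakage) of the added clock gate.

```python
def net_saving(num_regs, enable_prob, reg_clock_power, cg_power):
    """Power saved by gating a group, minus the power of the added clock gate."""
    saved = num_regs * reg_clock_power * (1.0 - enable_prob)
    return saved - cg_power


def should_gate(num_regs, enable_prob, reg_clock_power, cg_power):
    """Gate only when the net effect on total power is positive."""
    return net_saving(num_regs, enable_prob, reg_clock_power, cg_power) > 0


# Low-activity group: enabled 20% of the time.
low_activity = should_gate(num_regs=8, enable_prob=0.2,
                           reg_clock_power=1.0, cg_power=3.0)
# High-activity group: enabled 90% of the time.
high_activity = should_gate(num_regs=8, enable_prob=0.9,
                            reg_clock_power=1.0, cg_power=3.0)
```

With these numbers the low-activity group is worth gating and the high-activity group is not, which matches the bullet that high-switching-activity registers should not be gated.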



#### **An ODC Illustration**





## Sequential Clock Gating



#### **Sequential Gating : State-of-the-art - 1**

- The ability to 'observe' a logic path beyond a <u>clock-to-clock boundary</u>
- Scenarios
  - De-assert a data path if its forward stage is gated
  - De-assert the forward stage if the current stage is gated
- Advantages
  - Apart from sequential power savings, combinational logic cones can also be gated



## **Sequential Gating : State-of-the-art - 2**

**Observability based CG (Backward Traversal)** 





Source : Mitch Dale, http://www.chipdesignmag.com/display.php?articleId=915





### **Sequential Gating : State-of-the-art - 3**

**Input-Stability based CG (Forward Traversal)**

Figure: original RTL, with signals din\_1, vld\_1 and dout and blocks $f_1$ and $f_2$.



Source : Mitch Dale, http://www.chipdesignmag.com/display.php?articleId=915
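The forward-traversal idea can be sketched with a two-stage pipeline in Python. If the first register did not load in a cycle, its output is stable, so the second register would only reload the same value on the next edge and can be gated with a one-cycle-delayed copy of the valid signal. The pipeline structure, the function `f`, the delayed-valid register `vld_d`, and the stimulus are assumptions made for illustration.

```python
def f(x):
    """Hypothetical combinational logic between the two registers."""
    return 2 * x + 1


def simulate(din, vld, gate_stage2):
    """Stage 1 loads din when vld is high; stage 2 registers f(stage 1)."""
    r1, r2 = 0, 0
    vld_d = 1  # delayed valid, initialised to 1 so the first edge loads r2
    r2_edges, outputs = 0, []
    for d, v in zip(din, vld):
        en2 = vld_d if gate_stage2 else 1
        next_r2 = f(r1) if en2 else r2  # uses r1 from before this edge
        r2_edges += en2
        r1 = d if v else r1
        r2 = next_r2
        vld_d = v  # remember whether r1 changed on this edge
        outputs.append(r2)
    return outputs, r2_edges


din = [3, 4, 5, 6, 7, 8]
vld = [1, 0, 0, 1, 0, 0]
ref_out, ref_edges = simulate(din, vld, gate_stage2=False)
opt_out, opt_edges = simulate(din, vld, gate_stage2=True)
```

The outputs are identical while the second stage is clocked on three edges instead of six; the extra register for the delayed valid is part of the cost that the power analysis has to account for.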





## **Sequential Gating: The Next Leap**

- Pushing up the abstraction levels
  - The ESL platform
- Compilation paradigms for ESL to identify sequential opportunities at RTL
- Power-aware ESL coding styles to ease RTL clock gating
- Verification requirements: a critical enabler
  - Alteration to pipelines means a change in functionality
  - Hence the need to verify the optimized RTL
  - Formal approaches are gaining precedence over simulation-based methods



## Clock Power Analysis and Estimation



# **Power Estimation Methodology**

- Estimation needs to be performed at the RTL, netlist and physical design stages
- One constant input at every stage of estimation is the switching profile of the circuit
  - Ideally, a peak-power "testcase" switching profile is desired for both optimization and estimation
  - However, there could be multiple application scenarios that consume similar power but have different switching profiles
- Switching profiles are derived from simulating the circuit with appropriate testbenches, which is costly to repeat multiple times in the implementation cycle
- <u>Can the source RTL simulation activity for each scenario be used consistently at all stages?</u>



## **Capturing Simulation Data**



## **Clock Gate Analysis Metrics Formulation**

The metric should address the following concerns:

- How good is the current implementation?
  - Effectiveness of a clock gate
- How much is left on the table?
  - Granularity of sequential sinks
- How much can be obtained out of what is available?
  - Quality of a gating signal


## **Metric Definitions - 1**



**JS/BGM – ISLPED08** 

# **Metric Definitions - 2**

- Clock Gating Efficiency (CGE)
  - Length of time the CG is asserted to disable the clock
  - Average % of time each register is gated
- Data Non-Toggling Ratio (DNT)
  - Active clock is defined as the percentage of time the clock reaches a sequential sink
  - DNT is defined as the % of time data is non-active for an active clock
- Clustering Efficiency
  - Quality of the 'enable', in proportion to the correlation of the enable logic to the sequential cluster


## Clock Gating In Design Flows



# Additive CG Gain in RTL2GDSII

- Given the list of available methods, we need a design flow that:
  - Is additive in power savings
  - Provides a seamless interface for design tools
  - Can integrate (and also generate) switching scenarios at all design stages to enable activity-based optimization
  - Provides a power estimation framework at all design stages to aid optimization



- RTL Design
  - Apply sequential gating at RTL design stage
  - Verify RTL post sequential clock gating
  - Verify power savings
- Synthesis
  - Apply combinational clock gating
  - Apply cluster constraints based on fan-out/bit widths
  - Apply CG optimization based on activity
- Physical Design
  - Validate cluster efficiency based on layout
  - Add/Refine enable logic, based on cluster refinement

## Results

- The proposed methods were applied to a 65 nm data-flow-centric IP (~400K)
  - A very power-sensitive application needing optimization for different use modes
  - Optimization had to be performed across multiple use-case scenarios
- Analysis showed ~40% of total dynamic power consumption in the clock network
  - Hence the scope for power reduction through clock gating



#### **Incremental Power Savings**

| # | Stage     | Method                                  | Savings |
|---|-----------|-----------------------------------------|---------|
| 1 | Synthesis | Combinational (MFL)                     | 50%*    |
| 2 | RTL       | Sequential                              | 15%     |
| 3 | Placement | IO Exclusivity                          | 6%      |
| 4 | CTS       | Cluster Refinement & CTS Implementation | 4%      |

\* Savings reported over a non-clock-gated design. This can vary across designs.








## **THANK YOU**


