The criticality of performance per watt optimisation for AI chip development
Chip developers are seeing a sharp rise in demand for compute capability, driven by AI workloads. This increase in compute requirements drives a corresponding increase in power consumption.
For example, a ChatGPT query requires nearly 10 times as much electricity, on average, as a Google search [1].
Power has traditionally been treated as a secondary constraint, with performance taking precedence during development. It is no longer feasible to leave power optimization until the end of the design cycle. The performance per watt metric is now of critical importance for AI chips and chiplets and must be addressed throughout the development process. Hyperscalers now often revise their metrics to be “tokens/watt.”[2]
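As a rough illustration of the metric, throughput per watt is tokens per second divided by average power, which is the same quantity as tokens per joule. The sketch below compares two design points; the function name and all numbers are hypothetical and are not taken from any real chip.

```python
def tokens_per_joule(tokens_per_second: float, average_power_watts: float) -> float:
    """Throughput per watt: (tokens/s) / W, which equals tokens per joule."""
    if average_power_watts <= 0:
        raise ValueError("average power must be positive")
    return tokens_per_second / average_power_watts

# Hypothetical design points (illustrative numbers only, not measurements).
design_a = tokens_per_joule(tokens_per_second=12_000, average_power_watts=600)
design_b = tokens_per_joule(tokens_per_second=10_000, average_power_watts=400)

print(f"Design A: {design_a:.1f} tokens/J")
print(f"Design B: {design_b:.1f} tokens/J")
```

In this made-up comparison, design B delivers lower raw throughput but more tokens per joule, which is the kind of tradeoff the metric is meant to expose.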
Architecting for power efficiency at the earliest possible stage of the chip design process maximizes power-saving potential. Using shift-left methodologies, 30% to 50% power savings can be achieved at the software and hardware architecture phase, compared with single-digit percentage savings during the implementation and signoff stages [3].
An end-to-end silicon to systems solution enables designers to optimize power early in the design cycle while achieving performance goals.
Top Challenges in AI Chip Efficiency
AI chip developers face four main challenges:
- Power efficiency and thermal management
- Memory bandwidth and data movement
- Architecture analysis
- Optimizing hardware and software for the representative workload
Power efficiency and thermal management
AI and other demanding applications are driving the use of multi-die systems: semiconductor devices with multiple homogeneous or heterogeneous dies within a single package. This enables the rapid development of tailored silicon solutions for high-performance computing applications [4].
Heat dissipation is one of the main challenges in designing a multi-die AI system, creating thermal limitations. A well-planned architecture, developed iteratively, can alleviate thermal stress: exploring options at the front end avoids getting locked into a partitioning structure that later turns out to be sub-optimal from a power perspective.
System architecture teams can use modeling tools to abstract out pieces of a chip into models for performance/power analysis and finalize power tradeoffs before the design is locked into its partitions. By mapping a workload onto a multi-die system, the design team can determine the activity per processing element and per communication path. Modeling the hardware and software together is key to generating a robust and thermally efficient design, with scalable software across the die.
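A minimal sketch of the mapping step described above, assuming a toy workload expressed as tasks with operation counts and producer-to-consumer edges with byte counts. The task names, die names, and numbers are hypothetical, and this is not how any particular modeling tool represents a design; it only shows how a mapping turns into activity per processing element and per communication path.

```python
from collections import defaultdict

# Hypothetical workload: operations per task, and bytes sent along each edge.
task_ops = {"embed": 2e9, "attention": 8e9, "mlp": 6e9, "softmax": 1e9}
edges = [
    ("embed", "attention", 4e6),
    ("attention", "mlp", 16e6),
    ("mlp", "softmax", 2e6),
]

def activity(mapping):
    """Return (ops per die, bytes per die-to-die link) for a task-to-die mapping."""
    ops_per_die = defaultdict(float)
    bytes_per_link = defaultdict(float)
    for task, ops in task_ops.items():
        ops_per_die[mapping[task]] += ops
    for src, dst, nbytes in edges:
        a, b = mapping[src], mapping[dst]
        if a != b:  # only traffic that crosses a die boundary uses a link
            bytes_per_link[(a, b)] += nbytes
    return dict(ops_per_die), dict(bytes_per_link)

# Two candidate partitions of the same workload across two dies.
partition_1 = {"embed": "die0", "attention": "die0", "mlp": "die1", "softmax": "die1"}
partition_2 = {"embed": "die0", "attention": "die1", "mlp": "die1", "softmax": "die0"}

for name, mapping in [("partition_1", partition_1), ("partition_2", partition_2)]:
    ops, traffic = activity(mapping)
    print(name, "ops per die:", ops, "bytes per link:", traffic)
```

Comparing the two partitions shows the tradeoff: the first balances compute more evenly but pushes the largest transfer across the die boundary, while the second concentrates compute on one die and reduces die-to-die traffic.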
Memory bandwidth and data movement
AI applications thrive on high memory bandwidth, fast throughput, and low latency, but the growth of memory bandwidth has not kept up with the growth of compute. For chip designers, overcoming “The Memory Wall” - the gap between processor speed and memory bandwidth - is paramount in AI/ML chip design. For AI chips, data movement is one of the leading causes of power consumption, even more than compute. This includes the high-speed die-to-die communication needed to pass large data sets between dies within a chip.
Developers must analyze data movement early and identify solutions to optimize memory power and minimize data movement to achieve the highest performance per watt [3]. Solutions include high bandwidth memory, analog computing, custom compute units, compute in memory, resistive RAM structures, and algorithmic solutions, such as sparse algorithms, that eliminate unnecessary data movement.
Design teams should focus on identifying the right memory architecture to minimize data movement, analyzing these architectural changes at an early stage and implementing them.
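The effect of data movement on energy can be sketched with a first-order model that charges a fixed energy per multiply-accumulate and per byte moved. The per-operation and per-byte costs below are placeholders chosen only to make the arithmetic visible; real values depend on the process node, memory technology, and interface, and would come from characterization data.

```python
# Placeholder energy costs in picojoules (assumptions for illustration only).
E_MAC_PJ = 1.0                  # assumed energy per multiply-accumulate
E_ONCHIP_PJ_PER_BYTE = 5.0      # assumed energy per byte read from on-chip memory
E_OFFCHIP_PJ_PER_BYTE = 100.0   # assumed energy per byte read from off-chip memory

def energy_breakdown_uj(macs, offchip_bytes, onchip_bytes):
    """Return (compute, data movement) energy in microjoules."""
    compute_pj = macs * E_MAC_PJ
    movement_pj = (offchip_bytes * E_OFFCHIP_PJ_PER_BYTE
                   + onchip_bytes * E_ONCHIP_PJ_PER_BYTE)
    return compute_pj / 1e6, movement_pj / 1e6

# Same compute, two memory strategies: every operand fetched off-chip, versus
# most operands reused from an on-chip buffer.
no_reuse = energy_breakdown_uj(macs=1_000_000, offchip_bytes=2_000_000, onchip_bytes=0)
with_reuse = energy_breakdown_uj(macs=1_000_000, offchip_bytes=200_000, onchip_bytes=2_000_000)

print("no reuse   (compute, movement) uJ:", no_reuse)
print("with reuse (compute, movement) uJ:", with_reuse)
```

Under these assumed costs, data movement dominates compute in both cases, and keeping operands on chip cuts movement energy substantially without changing the amount of compute.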
Architecture analysis
Before the start of the power design cycle, it’s critical to architect a power analysis flow that analyzes power as early as the system architecture stage.
Synopsys Platform Architect™, a performance and power analysis tool, enables accurate simulation of system-level function in SystemC, providing crucial early insights into power-performance tradeoffs before hardware descriptions in Verilog have been written.
Platform Architect helps system designers explore and optimize the hardware-software partitioning and the configuration of the System-on-Chip (SoC) infrastructure, focusing on the global interconnect and memory subsystem to achieve the right system performance, power and cost. This process helps in deciding the efficient macro architecture of the system, including, but not limited to, Dynamic Voltage and Frequency Scaling (DVFS), power gating and Network-on-Chip (NoC) traffic. Using transaction-level simulation, Platform Architect reduces design time by predicting and optimizing architecture KPIs.
Optimizing hardware and software for the workload
Optimizing both the hardware and software for the specific workload is critical; therefore, developers must model, simulate, emulate and prototype chip performance prior to hardware returning from the fab.
Leveraging early architecture analysis and performance validation in emulation will result in efficient hardware/software partitioning and customized hardware/software for a very specific workload, such as Instagram workload-specific Application-Specific Integrated Circuits (ASICs).
Tools such as Synopsys ZeBu® Empower can be used for workload profiling across multiple vectors to identify the right window and workload for power analysis and optimization. An efficient combination of major benchmark workloads, such as Idle/Sustained/Inference/Training, will be the desired collateral for this analysis and optimization.
Synopsys Solutions for High-Performance, Power-Efficient AI Chips
The challenges faced by chip developers illustrate the need for power optimization throughout the design flow, with a strong emphasis on shift-left methodologies. The potential for power savings diminishes significantly by the implementation phase if not addressed throughout the process. Synopsys’ end-to-end power solutions allow for power analysis and optimization at every stage of development—architecture, emulation, functional prototyping, design and verification, implementation and test, engineering change order (ECO), and signoff.
Workload profiling
Profiling key workloads and identifying the right workload and window are essential. Synopsys PrimePower™ RTL, along with functional workloads from verification and simulation tools (VCS) or profiled workloads from ZeBu Empower, enables logic designers to design power-efficient RTL. Synopsys VCS is the primary verification solution used by the world’s top semiconductor companies. VCS provides the industry’s highest performance simulation and constraint solver engines, allowing users to easily speed up high-activity, long-cycle tests by allocating more cores at runtime.
ZeBu Empower analyzes multi-billion-cycle workloads from ZeBu emulation and identifies optimal smaller windows for power analysis and optimization. ZeBu Empower generates workloads for analysis and optimization, revealing opportunities to reduce dynamic and leakage power early in development and reducing the risks of power bugs and missed SoC power goals. ZeBu Empower enables multiple iterations per day with actionable power profiling in the context of the full design and its software workload, optimizing average power, peak power, IR drop, wasted clock pin power and others.
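The idea of narrowing a long workload down to a short window of interest can be illustrated with a generic sliding-window search over a per-interval activity trace. This is only a sketch of the concept, not a description of how ZeBu Empower selects windows; the trace values and the function name are hypothetical.

```python
def peak_window(activity, window):
    """Return (start_index, mean_activity) of the window with the highest mean."""
    if not 0 < window <= len(activity):
        raise ValueError("window must be between 1 and len(activity)")
    current = sum(activity[:window])
    best_sum, best_start = current, 0
    for start in range(1, len(activity) - window + 1):
        # Slide by one: add the sample entering the window, drop the one leaving.
        current += activity[start + window - 1] - activity[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start, best_sum / window

# Hypothetical toggle activity per interval for a long workload.
trace = [2, 3, 2, 9, 11, 10, 4, 3, 2, 8, 2, 1]
start, mean = peak_window(trace, window=3)
print(f"highest-activity window starts at interval {start}, mean activity {mean}")
```

A window found this way would be a candidate for detailed peak-power analysis; a window for average-power analysis would instead be chosen to be representative of the whole trace.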
Power analysis
Once RTL has been implemented, the PrimePower product family enables accurate power analysis for block-level and full-chip designs, beginning with RTL through the different stages of implementation, leading to power signoff.
Running power analysis on a flat design is challenging due to memory and runtime requirements. Power analysis and optimization must be performed hierarchically and must have a robust methodology in place.
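One way to picture a hierarchical approach is to analyze each block separately and roll the results up the design hierarchy. The sketch below assumes block-level power numbers are already available; the class, block names, and values are hypothetical and do not correspond to any tool's data model.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    name: str
    local_power_mw: float = 0.0  # power attributed to logic at this level only
    children: list["Block"] = field(default_factory=list)

    def total_power_mw(self) -> float:
        """Power of this block plus everything beneath it in the hierarchy."""
        return self.local_power_mw + sum(c.total_power_mw() for c in self.children)

# Hypothetical SoC hierarchy with made-up block-level results.
npu = Block("npu", 10.0, [Block("mac_array", 300.0), Block("sram", 80.0)])
soc = Block("soc", 5.0, [Block("cpu", 120.0), npu])

print(f"{npu.name}: {npu.total_power_mw()} mW")
print(f"{soc.name}: {soc.total_power_mw()} mW")
```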
Further, multi-die design and chiplets require many more connections than traditional monolithic chips, and this increased interconnect density creates power distribution challenges, requiring some advanced routing capabilities. Performing power integrity across heterogeneous components is more challenging in chiplets due to the complex geometry and relationship between power and temperature.
During implementation and signoff, PrimePower provides accurate gate-level power analysis reports for SoC designers, enabling timely optimization and power target achievement through various metrics such as average, peak, glitch, clock network, dynamic and leakage power and multi-voltage power. The workload for this activity can be derived either from simulation or emulation.
As designs become more complex, designers need a tool that pinpoints the major power sinks while suggesting modifications with the highest return on investment.
Synopsys’ PrimePower product family provides accurate power analysis for block-level and full-chip designs, improving power efficiency through each stage, shortening the design cycle.
Developers can leverage PrimePower RTL to assist in:
- Power estimation
- Power profiling and distribution
- Pointing designers to specific RTL lines for modification of the architecture and re-analysis
Fixing the glitch
Synopsys technologies address the problem of glitches: unnecessary signal transitions in a combinational circuit that drain power and can be a significant contributor to dynamic power.
At lower geometry nodes, due to an imbalance in delay between gate and net, most data path-sensitive designs will exhibit a higher percentage of glitches. The majority of glitch sources can be optimized at the architecture level, and the remaining glitches caused by imbalanced paths can be optimized during the ECO cycle.
Solutions to suppress glitches are needed at every step of the cycle:
- PrimePower RTL computes and identifies sources of glitch early in design cycles, which enables designers to rearchitect to reduce glitch. It can also point to the RTL source line of code generating the highest level of glitch. Optimizing glitch power for a tile can lead to high power savings at the SoC level
- The PrimePower solution offers delay-/glitch-aware vector generation using RTL simulation. The product can generate a Switching Activity Interchange Format (SAIF) with glitch annotation (Inertial Glitch and Transport Glitch) and a delay-aware SAIF from an RTL simulation on any given netlist
- The SAIF or Fast Signal Database (FSDB) can be used during implementation with Synopsys Fusion Compiler™ or during ECO with PrimeClosure to perform glitch-aware optimizations and reduce glitches following the timing phase
- Finally, PrimePower gate-level power analysis and golden power signoff perform glitch power analysis using timing-aware simulation correlating closely to SPICE power numbers
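To make the notion of a glitch concrete: within one clock cycle, a net needs at most one transition to reach its new settled value, so any additional transitions are wasted switching. The sketch below counts such extra transitions from finely sampled values of a single net. The sample data and function names are hypothetical, and this is a conceptual illustration rather than the method any Synopsys tool uses.

```python
def count_toggles(samples):
    """Number of value changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)

def glitch_toggles(cycles):
    """Return (total toggles, glitch toggles) for one net.

    Each entry in `cycles` holds the net's sampled values within one clock
    cycle, starting with the value settled at the end of the previous cycle.
    """
    total = functional = 0
    for samples in cycles:
        total += count_toggles(samples)
        # At most one transition is needed to reach the final settled value.
        functional += 1 if samples[0] != samples[-1] else 0
    return total, total - functional

# Hypothetical samples: three clock cycles of one combinational output.
cycles = [
    [0, 1, 0, 1],  # settles 0 -> 1, with two extra transitions
    [1, 1, 1, 1],  # stable
    [1, 0, 1, 1],  # ends where it started: both transitions are glitches
]
total, glitches = glitch_toggles(cycles)
print(f"{glitches} of {total} toggles are glitches")
```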
Optimisation during place and route
Developers can optimize their design for power during synthesis and the place-and-route stage of the design cycle using technologies such as clock gating, power-aware placement, clock tree synthesis, logic restructuring and many more power optimization technologies in Fusion Compiler.
Voltage plays a dominant role in dynamic power consumption through the CV² term. Because dynamic power scales with the square of the supply voltage, even a minor reduction in voltage yields a disproportionately large reduction in power. Fusion Compiler’s VoltOpt feature enables voltage-based analysis and optimization, and it can be used in the implementation phase to reveal the minimum voltage (Vmin) required to maintain the same timing/area. Once Vmin is identified, designers can redesign their logic to operate at this lower voltage, significantly reducing power. Tools such as Synopsys PrimeShield™ can be used for voltage slack/robustness analysis to ensure that the Vmin qualified by VoltOpt is optimal.
Timing/power ECO using PrimeClosure is the last piece of the puzzle, squeezing out the last nanowatts of power before tapeout.
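The quadratic dependence on voltage can be checked with the standard first-order expression for dynamic power, P = αCV²f, where α is the switching activity factor, C the switched capacitance, V the supply voltage and f the clock frequency. The parameter values below are hypothetical and chosen only to show the arithmetic.

```python
def dynamic_power_w(alpha, capacitance_f, voltage_v, frequency_hz):
    """First-order dynamic power: P = alpha * C * V^2 * f."""
    return alpha * capacitance_f * voltage_v ** 2 * frequency_hz

# Hypothetical operating point: same activity, capacitance and frequency,
# with the supply voltage lowered by 10% (0.80 V -> 0.72 V).
nominal = dynamic_power_w(alpha=0.2, capacitance_f=1e-9, voltage_v=0.80, frequency_hz=1e9)
reduced = dynamic_power_w(alpha=0.2, capacitance_f=1e-9, voltage_v=0.72, frequency_hz=1e9)

ratio = reduced / nominal
print(f"nominal: {nominal:.3f} W, reduced: {reduced:.3f} W")
print(f"dynamic power saving: {1 - ratio:.0%}")
```

With these numbers, a 10% voltage reduction gives roughly a 19% reduction in dynamic power at the same frequency, since 0.9² = 0.81.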
A signoff-level power analysis tool, such as PrimePower, is key to determining any available power recovery at the end of the cycle. This provides developers with a view of the power they will see in the final silicon.
Power optimisation through Silicon Lifecycle Management (SLM)
The performance of a silicon chip does not remain constant over its operating life. Aging effects in the silicon structures and factors in the system operating environment change the performance characteristics of the device over time. Silicon Lifecycle Management (SLM) provides a data collection, analysis and control environment to monitor these effects and implement corrective actions in the field at a unit device level. The result is a far more stable and secure performance of the silicon device and overall system over time. Synopsys’ integrated SLM family of products is built on a foundation of enriched in-chip observability, analytics and integrated automation and improves silicon health and operational metrics at every phase of the device lifecycle.
Through monitoring, deep insights are obtained from silicon to system, enabling meaningful data collection at every opportunity for continuous analysis and actionable feedback.
Summary
The escalating demands of AI workloads require a fundamental shift in AI chip development - one that prioritizes power efficiency from the inception of the architectural design. Power modeling is a key power analysis and optimization challenge for multi-die design. The limitations imposed by thermal constraints in multi-die systems, the energy drain of extensive data movement, and the imperative for workload-specific hardware and software optimization underscore the critical need for early and continuous power analysis and mitigation.
Synopsys’ comprehensive suite of tools and the tangible results achieved by innovators like SiMa.ai demonstrate that a proactive, shift-left methodology, coupled with robust power analysis and optimization throughout the design life cycle, is not merely advantageous but essential for achieving the high-performance, energy-conscious AI chips necessary in this new era of artificial intelligence.