User Stories – The OpenROAD Project

Implementation of RISCduino core using a Hierarchical Design Flow

sivaganesh — Thu, 05 Jan 2023 09:28:05 +0000

Dinesh Annayya is an ardent Open-Source EDA enthusiast and an expert user of OpenROAD and OpenLane. He developed a baseline RISCduino SoC, a single 32-bit RISC-V based controller compatible with the Arduino platform. He has submitted over 15 designs on Open MPW shuttles on sky130: https://github.com/dineshannayya/riscduino. Over the course of his design journey, he successively improved the design architecture for better performance and enhanced functionality. His main motivation for using Open-Source EDA tools is to gauge their quality of results and potential for commercial use.

A flat design approach forces the implementation into a single module, which increases runtime and design complexity. Dinesh uses a hierarchical design flow methodology to reduce runtime and memory usage, and to meet his design, performance and area goals for implementation on the Caravel top-level SoC.

Continuous Architectural and Design Improvement

The hierarchical design flow methodology using OpenROAD and OpenLane significantly reduces runtime and eases design complexity for the target user area die size (10 mm²) and pre-defined pin constraints of the Caravel GPIO. Dinesh implemented three derivatives of the main RISCduino core: single, dual and quad, as shown in the figure below.

Find details here: https://github.com/dineshannayya/riscduino#readme

For his last design iteration, Dinesh was able to achieve a significant performance improvement (100 MHz at the typical corner) with increasingly dense designs and high utilization. Shown below are the 21 blocks, or Macros, at the top-level SoC.

This allowed him to focus on good block-level implementations, which after hardening as Macros were easily integrated at top-level. This also vastly simplified top-level routing and timing closure.

Implementation Using a Hierarchical Flow

Dinesh employs a hierarchical instead of a flat design methodology to better manage block-level performance for faster runtimes and better usage of memory. He uses a combination of a top-down approach for design partitioning, time budgeting and top-level placement, and a bottom-up approach to harden Macros, perform SoC integration and achieve final design convergence.

Here are the main steps:

Design Partitioning and Block-level Constraints

- Design partitioning is based on functionality to minimize interconnect signals and combinational logic between the blocks. Here are some guiding rules that Dinesh used:

Rule-1: Too many design components fragment the floorplan, making it difficult to floorplan and close top-level timing. Group smaller and similar components of the functional units into blocks, each less than 0.5 mm².

Rule-2: A flat design approach leads to longer RTL-GDSII flow runtime and timing closure challenges at chip level. Logically partition the design into multiple blocks of around 0.5 mm² each.

Macro Placement

In this case a manual Macro placement is used to give better pin placement, block-level interconnects and feedthroughs for routing efficiency.

Rule-1: Manual Macro pin placement gives better global routing. Use OpenROAD to preview Macro connectivity and rearrange Macro pin placement.

Rule-2: Add feedthrough partitions to connect blocks to top-level I/O for congestion-free routing.

Feedthrough paths are defined from top-level I/O pins into and through the blocks to reduce long routes and congestion. Repeaters are added to these paths to maintain signal strength and avoid max slew and fanout violations.

These feedthrough partitions are manually inserted at 4 corners of the design. The partition I/O signals and position are based on the physical location of the corresponding top-level I/O ports and Macro pins. Finally, these partitions are hardened in met-3, so that they do not block global met-4/met-5 PDN stripes. Timing for feedthrough paths is analyzed by extracting the SPEF parasitics of the paths inside the partition and running timing analysis in OpenLane at top-level.

Time Budgeting and Setting Constraints

Dinesh estimated I/O budgets for each block using a good rule-of-thumb and subsequently re-adjusted the block-level SDC based on top-level hierarchical timing analysis.

Rule-1: Create a block-level SDC with I/O setup delay constraints at the Macro ports. Allocate: 60% for external delay with 40% total for block + 20% interconnect. Hold delay constraints: 1 ns external delay.

Rule-2: Run hierarchical timing analysis at top-level; if there are violations, try to re-adjust the I/O timing of the Macro SDC, re-harden it, and re-analyze the top-level timing. This is generally an iterative step until all constraints are met.
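The rule-of-thumb budget in Rule-1 can be turned into numbers with a few lines of Python. This is a minimal sketch, not Dinesh's script: it assumes the 60% external delay is a fraction of the clock period, uses the 100 MHz target (10 ns period) reported for the typical corner, and does not model how the remainder is split between block and interconnect. The clock name `core_clk` and the port pattern `data_in[*]` are hypothetical; `set_input_delay` is the standard SDC command.

```python
def io_budget(clock_period_ns, external_fraction=0.60, hold_external_ns=1.0):
    """Split a clock period into an external setup delay and the remainder.

    The 0.60 fraction and the 1.0 ns hold value follow Rule-1; treating the
    60% as a fraction of the clock period is an assumption of this sketch.
    """
    external = round(clock_period_ns * external_fraction, 3)
    remaining = round(clock_period_ns - external, 3)
    return {
        "setup_external_ns": external,
        "remaining_ns": remaining,
        "hold_external_ns": hold_external_ns,
    }


def sdc_lines(budget, clock="core_clk", ports="data_in[*]"):
    """Format the budget as SDC input-delay constraints (names are hypothetical)."""
    setup = budget["setup_external_ns"]
    hold = budget["hold_external_ns"]
    target = f"-clock [get_clocks {{{clock}}}] [get_ports {{{ports}}}]"
    return [
        f"set_input_delay -max {setup:.2f} {target}",
        f"set_input_delay -min {hold:.2f} {target}",
    ]


budget = io_budget(10.0)  # 10 ns period, i.e. 100 MHz
for line in sdc_lines(budget):
    print(line)
```

After a top-level timing run, the fraction can be adjusted per Macro and the SDC regenerated, which mirrors the iterative loop described in Rule-2.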

Hierarchical Timing Signoff

For MPW2-6 shuttle submissions, Dinesh used his custom top-level scripts, which combine Macro SPEF files with the standard-cell .lib files for hierarchical timing analysis. An example script is available at: https://github.com/dineshannayya/riscduino/blob/master/sta/scripts/caravel_timing.tcl

Note: Efabless MPW-2 silicon debug exposed an RCX extraction issue in the hierarchical design, and the Efabless team has since revised the timing script. From MPW-7 onwards Dinesh used the default timing script. Read this for more information:

https://caravel-user-project.readthedocs.io/en/latest/#running-timing-analysis-on-existing-projects

Flow Summary

Here is a summary of the flow steps:

Design partitioning
Time budgeting and defining initial constraints
Floorplanning
  1. Macro placement and feedthroughs
  2. Fine tuning SDC constraints
  3. Power network generation
Clock tree synthesis at top-level
  1. Balance clock skew
  2. Add repeaters as needed
Harden each Macro RTL-GDSII implementation in OpenLane
Top-level integration
  1. Hooking up Macros to top-level
Chip-level signoff
  1. Load Verilog files for all levels of hierarchy
  2. Load Macro and top-level SPEF files
  3. Run top-level, flat timing analysis for all three corners
  4. Make sure that there are no hold violations in 9 corners: library (Fast/Typical/Slow) vs. SPEF (Max/Nom/Min)
  5. Analyze the max timing margin for each clock domain across each corner.
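The 9 signoff corners are simply the cross product of the three library corners and the three SPEF corners. The sketch below enumerates them and flags any corner with negative hold slack. The corner naming scheme and the slack dictionary are assumptions for illustration; in practice the slack values would come from the timing reports.

```python
from itertools import product

LIBRARY_CORNERS = ("fast", "typical", "slow")
SPEF_CORNERS = ("max", "nom", "min")


def signoff_corners():
    """Return all library x SPEF combinations as 'library_spef' names."""
    return [f"{lib}_{spef}" for lib, spef in product(LIBRARY_CORNERS, SPEF_CORNERS)]


def failing_hold_corners(hold_slack_ns):
    """Return corners whose worst hold slack is negative or was not reported."""
    return [
        corner
        for corner in signoff_corners()
        if hold_slack_ns.get(corner, float("-inf")) < 0.0
    ]


print(signoff_corners())
```

Treating a missing report as a failure keeps an incomplete run from passing signoff silently.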

Key Design Strategies to achieve good PPA

Dinesh customizes the flow implementation to leverage many improvements to OpenROAD’s clock tree synthesis (CTS), router (DRT) and power network creation (PDN) to further improve productivity and QoR. Here are some interesting techniques he uses to achieve good PPA with the given flow capabilities:

Better Power management -- Multiple power regions lead to better use of routing resources

Step-1: Macro Power Pitch/Width changed from default 153um/1.6um to 100um/6.2um

Step-2: Reduce top-level PDN pitch from 153um to 100um and increase width from 1.6um to 6.2um to enable an efficient 9 multi-via hook-up from top-level to Macro.

PDN with 2 via vs 4 vias hookups

Macros are now connected through 9 multi-cut vias, compared to 2 vias previously, for better reliability and lower resistance, which resulted in lower IR drop.

Step-3: Ensure that the feedthrough partition is hardened within met-3 so that it does not create a blockage for top-level met-4 and met-5 power stripe routing.

Since a repeater partition needs distinct power hook-up requirements compared to the rest of the Macros, he defines a separate power-domain for the power connections. Here is the pdn script:

https://github.com/dineshannayya/riscduino/blob/master/openlane/user_project_wrapper/pdn_cfg.tcl

Power grid on side blocks has thicker straps hence less IR drop.

Clock Tree Balancing

Currently, the OpenLane flow does not automatically balance clock skews across the Macros. Shown below is an example where each Macro has a different clock latency. Dinesh defines each Macro with 16-tap adjustable clock skew buffers. Each Macro is hardened with its skew adjusted to close timing at the top-level Caravel design.
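One way to picture the tap selection is as padding the faster Macros up to the slowest one. The sketch below is an illustration only: the 16 taps come from the text, but the selection rule, the per-tap delay step and the latency numbers are assumptions, not Dinesh's actual procedure.

```python
def choose_taps(latency_ns, tap_step_ns, n_taps=16):
    """Pick a skew-buffer tap per Macro so that clock latencies line up.

    latency_ns maps a Macro name to its clock latency at tap 0.
    tap_step_ns is the (assumed) delay added by each tap.
    The result is clamped to the available range 0 .. n_taps - 1.
    """
    target = max(latency_ns.values())  # align everything to the slowest Macro
    taps = {}
    for macro, latency in latency_ns.items():
        tap = round((target - latency) / tap_step_ns)
        taps[macro] = min(max(tap, 0), n_taps - 1)
    return taps


# Hypothetical latencies, for illustration only.
print(choose_taps({"riscv": 2.0, "qspi": 1.5, "uart": 1.0}, tap_step_ns=0.25))
```

If a Macro hits the clamp at tap 15, the remaining skew has to be absorbed elsewhere, for example by re-hardening the Macro.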

Final Design Results

Using a customized, hierarchical flow, Dinesh was able to meet his design goals. The table below shows the user area utilization for the riscduino_qcore design, which has around 150K cells + 48 Kb SRAM (from the sky130 PDK). For the typical timing corner, the RISC-V Arduino core met timing at an fmax of 100 MHz.

Block	Total Cells	Combo	Seq	Utilization
RISC (4 Core)	94165	79675	14490	45%
QSPI	9038	7525	1513	42%
UART_I2C_USB_SPI	11880	9011	2869	42%
WB_HOST	6511	5359	1152	45%
WB_INTC	6674	5263	1411	20%
PINMUX	11923	9318	2605	35%
PERIPHERAL	5847	4786	1061	42%
BUS-REPEATER	922	922	0	20%

TOTAL	146960	121859	25101
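The TOTAL row can be cross-checked from the per-block counts. This is a minimal sketch that uses only the Total and Combo columns of the table and assumes every cell is either combinational or sequential, so that the sequential count is the difference of the two.

```python
# (total cells, combinational cells) per block, copied from the table.
BLOCKS = {
    "RISC (4 Core)": (94165, 79675),
    "QSPI": (9038, 7525),
    "UART_I2C_USB_SPI": (11880, 9011),
    "WB_HOST": (6511, 5359),
    "WB_INTC": (6674, 5263),
    "PINMUX": (11923, 9318),
    "PERIPHERAL": (5847, 4786),
    "BUS-REPEATER": (922, 922),
}


def summarize(blocks):
    """Sum the columns; sequential cells are derived as total minus combinational."""
    total = sum(t for t, _ in blocks.values())
    combo = sum(c for _, c in blocks.values())
    return {"total": total, "combo": combo, "seq": total - combo}


print(summarize(BLOCKS))
```

For these counts the sums come to 146960 total, 121859 combinational and 25101 sequential cells, matching the TOTAL row.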

Final Routed Design

The figure below shows the final routed GDSII for the RISCduino SCore, DCore and QCore designs.

Conclusion

Dinesh summarizes his usage experience of OpenROAD and OpenLane as below:

“I highly appreciate the time and effort taken by the OpenROAD team in developing a VLSI design flow based on open-source concepts. Each of the VLSI design stages from RTL-GDSII flow needs specialized technology knowledge and in depth implementation strategy with coordinated effort and strategy. I see a continuous improvement in the OpenLane tool over each MPW shuttle.

I am highly impressed by the OpenROAD team. Response to users in terms of tracking key GitHub issues and overall response time to fix is better than a commercial tool vendor support team. I look forward to a successful commercial tape-out through the OpenLane flow and wish the OpenROAD team the very best for their innovation and mission in OpenEDA flows- in particular in enabling this design flow success as a unix and arduino initiative.

In my technical career I have noticed multiple commercial tool vendors developed an automated RTL to GDS flow and none of these were successful. Main reason for the failure is that every company flow, project & user requirement are unique and each project needs some customization which cannot be easily mapped into a one single RTL to GDS flow. My suggestion to the OpenROAD team is to have clear industry standard handoffs between each stage so that users can effectively use Open-Source EDA tools with commercial tools and Custom Scripts in their commercial projects. I also would like to see missing functionality like Logic Equivalence checking (LEC) and DFT support (JTAG, MBIST, SCAN) added.”

About Dinesh Annayya

Dinesh is an expert SoC designer and has worked in the VLSI industry for more than 20 years, at companies including Cypress Semiconductor, Centillium and Transwitch. He currently works as a design manager at Intel India Bangalore Centre. His design work spans multiple foundries including TSMC, Intel, GlobalFoundries, UMC and SMIC, and multiple technology nodes including 180nm, 130nm, 90nm, 65nm, 55nm, 45nm, 22nm and 10nm. He has submitted 28 GitHub issues that have resulted in critical bug fixes and significant enhancements to tool features and quality.

In the future, he plans to extend RISCduino with other add-on chips for advanced functionality and fast interconnectivity interfaces such as QUAD SPI.

An OpenROAD based IC Design course for Spanish Learners

Safiq Ahamed — Thu, 15 Dec 2022 17:27:42 +0000

Professor Erick Carvajal teaches VLSI and microelectronics courses at the University of Costa Rica. In 2021, he was looking to set up his undergraduate Microelectronics course using conventional EDA tools. However, these were very expensive and difficult to set up for the class requirements, given software licensing constraints and limited server resources. He therefore started looking for alternatives, and learned about OpenROAD through a Google-sponsored presentation in 2021.
His goals were to provide easy-to-use, accessible EDA tools and course content so that his students could learn basic IC design concepts and flows, both collectively and independently, within a semester. Most importantly, he wanted the flexibility to tailor the course for his Spanish-speaking students, who did not have any other source of learning material. Erick also wanted to ensure that his students have a viable path to employment in the semiconductor industry beyond the opportunities in Costa Rica. Hence he chose OpenROAD as the OpenEDA application for his course, a choice particularly well suited to regionally underserved communities.
The usage of OpenROAD as a key OpenEDA source for VLSI education and curriculum for workforce development is rapidly growing since it provides easy, scalable and open access to the entire tool suite, flow control options and design cases on multiple technology platforms. This story features the exemplary work of Erick and his students who resourcefully took advantage of OpenROAD in this new era of open source based learning and in semiconductors.
OpenROAD actively engages with motivated Universities and researchers worldwide to provide training and curriculum support as part of its vision to democratize and spread the learning of VLSI and enable a path to skilled workforce development.

Course – Key Flow concepts with Hands-on learning

The semester-long course, from August to December, covered the entire RTL-GDSII IC design flow with key VLSI design concepts and open PDKs, using OpenLane (the flow controller by Efabless, based on OpenROAD) and SkyWater 130nm. The course ran for a total of 16 weeks, with 4 hrs/week allocated to lectures and lab discussions. Students completed labs online as homework. The prerequisite for this course was basic knowledge of semiconductor devices, as covered in the seminal textbook CMOS VLSI Design: A Circuits and Systems Perspective by N. Weste and D. Harris.
The course included three labs aimed at providing students with a depth of training on core concepts of the RTL-GDSII design flow using OpenLane. One such design example was a basic 8-bit adder.

Open Source learning offers limitless potential for creativity.

The current OpenLane flow does not contain an RTL schematic viewer. The team found a useful way to visualize the Yosys synthesized netlist using netlistsvg (https://github.com/nturley/netlistsvg). This allowed students to view the netlist as a schematic for design exploration, and especially to understand technology mapping visually. The tool takes a netlist in JSON format (which can be generated in Yosys using the write_json command) and outputs an SVG image of the circuit. The figure below shows a 4-bit adder. The values on the signals (e.g. A_96, B_75) were personal identifiers assigned to each student to validate their individual work.
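The two-step conversion described above (Yosys `write_json`, then netlistsvg) can be scripted. The sketch below only assembles and runs the two command lines; it assumes `yosys` and `netlistsvg` are installed and on the PATH, and the file and module names are hypothetical. The Yosys invocation follows the pattern given in the netlistsvg README and produces the pre-technology-mapping view.

```python
import shutil
import subprocess


def build_commands(verilog_file, top_module, json_file="netlist.json", svg_file="netlist.svg"):
    """Return the yosys and netlistsvg command lines as argument lists."""
    yosys_script = f"prep -top {top_module}; write_json {json_file}"
    return [
        ["yosys", "-p", yosys_script, verilog_file],
        ["netlistsvg", json_file, "-o", svg_file],
    ]


def render(verilog_file, top_module):
    """Run both tools in order; fail early if either one is not installed."""
    commands = build_commands(verilog_file, top_module)
    missing = [cmd[0] for cmd in commands if shutil.which(cmd[0]) is None]
    if missing:
        raise RuntimeError(f"tools not found on PATH: {', '.join(missing)}")
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For a hypothetical `adder.v` with top module `adder`, calling `render("adder.v", "adder")` would write `netlist.svg` to the current directory.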

Diagram before technology mapping *

*A bug in netlistsvg renders the two output port directions incorrectly as inputs

Diagram after technology mapping *

*A bug in netlistsvg renders the two output port directions incorrectly as inputs

Design Exploration enables PPA comparisons

In the labs, students explored multiple synthesis strategies and design configurations based on changes to pin configurations, floor plan aspect ratios and various power grids for a fixed die size.
Students learned to experiment, analyze design choices and arrive at the best possible PPA results (i.e., area and critical path delay) by comparing the generated layouts, all of which they could do in a matter of just a few hours. The detailed analysis and visualization features of the OpenROAD GUI allowed them to study the impact of design changes at various flow stages and thereby converge to the final layout.

GUI based Visualization for easy analysis and feedback

Here’s an example of a clock tree generated for a Huffman JPEG encoder displayed in the OpenROAD GUI: https://github.com/The-OpenROAD-Project/OpenLane/blob/master/designs/y_huff/src/y_huff.v

Clock Tree for a Huffman JPEG encoder

Macro Placement Exploration

Students also learned how to create a macro and then instantiate multiple instances of it at the top level. In the example here, a macro for a 4-bit adder was first created at block level and then instantiated in an 8-bit ALU. Students experimented with the pin positions of the macros and the top level, as well as various cases for good macro placement.

Sample Layout created by students for the 8 bit ALU with macros

What’s Next?

Prof Erick Carvajal plans to offer a full semester based on an OpenROAD based capstone project in addition to the current course. He is also eager to collaborate and share his work with other Spanish learners. Interested users can reach out to him directly via his email: erick.carvajalbarboza@ucr.ac.cr. Course materials will be made available sometime in January 2023. So stay tuned for updates.
Here’s a summary of his experience in his own words about how OpenROAD is a powerful aid to Open Source learning and contribution.
“OpenROAD was essential for the class I taught. The OpenLane based flow was easy to set up, so the students were not stuck at unnecessary steps, and the ramp-up was very fast. Students were able to learn and gain practical experience with EDA tools: analyze reports, debug errors, optimize the design, and have a free and safe environment for experimentation and exploration. The concepts we covered have already helped some of the students to get good jobs at physical design roles in big companies, and sparked the curiosity for VLSI in others, who are already trying to get involved in bigger projects using OpenROAD. One of my students already working in the industry expressed to me that he was able to gain a better understanding of the tasks he was performing at his job. OpenROAD is, without a doubt, democratizing VLSI education and spurring research opportunities–it gave me the chance to teach a class I couldn’t afford.”

About Professor Erick Carvajal

Prof Erick Carvajal received his Bachelors in Electrical Engineering at Universidad de Costa Rica in 2014, his Masters of Science in Electrical Engineering at The University of Texas at Austin in 2017 and his PhD in Computer Engineering from Texas A&M University in 2021, where his research was co-advised by Dr. Jiang Hu and Dr. Paul Gratz. His interests include the integration of Machine Learning techniques into the IC design flow to make designs faster and more efficient, as well as the application of innovative techniques for teaching engineering classes.

AE-AV1 Encoder implementation: Using OpenROAD to achieve Real-time Throughput

sivaganesh — Fri, 09 Dec 2022 07:40:39 +0000

Tulio Pereira Bitencourt

OpenROAD is increasingly used as the leading Open Source EDA solution by users in industry and academia who are starting to explore and build ASIC designs for a range of today's mainstream applications. Video-on-demand (VoD) is a rapidly growing market, accounting for >80% of current internet traffic. Video streaming applications demand fast performance to deliver real-time video at high quality, low latency and low design cost. AV1 supports higher video resolutions (e.g., 4K, 8K) to meet the requirements of larger video sizes and new video coding standards, but fails to meet real-time throughput.

The Problem

AV1, a video coding format from the Alliance for Open Media (AOMedia), delivers good compression rates, but given its high complexity it does not meet real-time execution and throughput in software-only implementations.

In order to develop the next generation of the encoder that meets the ultra-high performance needs (8K@120fps) for MRTR (Maximum Real Time Resolution), Tulio and his team at Informatics Institute, Federal University of Rio Grande do Sul, sought zero-cost OpenEDA solutions to explore and design enhanced design architectures to meet their design goals.

OpenROAD for AE-AV1 Arithmetic encoder design

OpenROAD enables free, open access to tools for RTL-GDS flows and open PDKs, with run times within 24 hours. This was important for Tulio, who needed to explore multiple design architectures to meet his design goals (high performance and low cost, i.e. small die area) in the fastest possible time at multiple technology nodes.

The open, royalty-free AE-AV1 encoder implements arithmetic coding as a lossless data compression algorithm that improves upon other codecs such as HEVC, VVC and VP9. It updates the key variables that describe a numeric interval (Low, Range) to encode incoming symbols into a reduced bitstream based on the probabilities of their appearance.
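The role of the (Low, Range) pair can be illustrated with textbook arithmetic coding. The sketch below is not the AE-AV1 algorithm: it uses exact fractions and fixed probabilities for clarity, whereas a hardware encoder works with fixed-width integers, which this sketch does not model. The alphabet and probabilities are made up for illustration.

```python
from fractions import Fraction


def narrow_interval(symbols, probabilities):
    """Return the final (low, rng) interval after encoding `symbols`.

    `probabilities` maps each symbol to its probability. Its iteration
    order defines the cumulative distribution and must be shared by the
    encoder and the decoder.
    """
    cumulative = {}
    running = Fraction(0)
    for sym, p in probabilities.items():
        cumulative[sym] = running  # probability mass of all earlier symbols
        running += Fraction(p)
    if running != 1:
        raise ValueError("probabilities must sum to 1")

    low, rng = Fraction(0), Fraction(1)
    for sym in symbols:
        low += rng * cumulative[sym]          # move Low to the symbol's sub-interval
        rng *= Fraction(probabilities[sym])   # shrink Range by the symbol's probability
    return low, rng


probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
print(narrow_interval("ab", probs))
```

Note that each Range update depends on the previous Range value, and each Low update depends on the current Range: likely symbols shrink the interval less and therefore cost fewer bits.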

The original AV1 lacked the ability to predict hardware implementation results since it relied heavily on dynamic arrays for an unknown set of input symbols. These unique and stringent requirements made OpenROAD the only viable solution to design AE-AV1 with good confidence in manufacturability.

Design Architecture

The team first developed a baseline design in RTL as a multi-stage pipeline, shown in the figure:

Stage 1: Receives symbols, the number of symbols in the alphabet and probabilities, and performs pre-calculation.

Stage 2: Updates Range and is the critical path. It is optimized by moving work out of it into Stage 1 (pre-calculations) and Stage 3 (Low updating).

Stage 2 could not be accelerated further because the Range variable feeds back into its own update.

Stage 4: The hardware-friendly stage that implements carry propagation and stores the compressed stream in output registers.

The reason behind separating the updating process of Range and Low in two different stages is to avoid increasing the critical path and, hence avoid adding additional delays into AE-AV1.

Ease-of-Use: Easy Installation, Configuration for Rapid Exploration

“OpenROAD installation is fast and easy- docker based installation encapsulates the complexity of required packages and libraries. It is fantastic how easy it is to just execute a command and have the entire toolset installed and configured all at once, without requiring any intermediary step,” says Tulio.

“The scripts used for running the entire OpenROAD flow are extremely easy to use and straightforward to configure. The majority of the work, when one wants to get quick results, is just related to adding the targeted design into the OpenROAD ‘designs’ folder and editing the configuration file. Furthermore, upon designing an architecture, it should be a great idea for any researcher to just use the open-source solutions developed by the OpenROAD team to find the best possible configuration for the design just created, as well as to acquire results quickly to optimize parameters. OpenROAD goes from an RTL input, in my case, a bunch of Verilog files, to GDSII without any extra step necessary aside from triggering the flow.”

“The OpenROAD tools are extremely easy to use and require a very low time to set up. If one considers that a conventional tool requires a lot of infrastructure just to handle licenses, and even more to process the different tasks it supports, it is easy to conclude that running state-of-the-art paid EDA tools in a normal laptop would be unbearable. When running the OpenROAD flow, I used an older generation Dell Inspiron, which is not powerful and could barely handle the AV1 reference software (I had to boot my Linux OS without GUI for that). For OpenROAD, however, I executed everything on the same computer using an external hard-drive, which deprecates the performance even more. My computer did not struggle to run, and in almost no time the analyses were completed.”

To advance computational efficiency, OpenROAD leverages cloud resources to efficiently parallelize key stages in the design flow and distribute processes across multiple machines and CPUs.

Meeting Design Goals- High Throughput, High performance, Low Area

Achieving high performance at the least cost was the design goal; power was not considered a key PPA metric for this version of the encoder. OpenLane was initially used to explore design configurations and flow. However, Tulio chose OpenROAD-flow-scripts for its support of ASAP7 along with the other open PDKs (sky130, Nangate45) needed for exploration across technology nodes. OpenROAD-flow-scripts delivers the complete RTL-GDSII flow, including Yosys for synthesis, OpenSTA for timing analysis and optimization, and KLayout for DRC checking.

Rapid Design Exploration for optimal Area and Performance

Tulio was able to run several design experiments based on targeted design configurations for multiple frequencies and process technologies, including SkyWater 130nm (HS, HD), Nangate 45nm and the ASAP7 predictive PDK.

OpenROAD supports design exploration through an OpenLane Python script that automatically runs multiple user-defined experiments based on different synthesis strategies to optimize area and performance. The table below depicts a sample experiment showing the results of optimizing gate count, area and worst path delay for a given process.
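The idea behind such an exploration run can be sketched independently of the OpenLane script itself: enumerate a grid of configurations, collect the reported metrics, and pick the smallest design that still meets the delay target. The sketch below does not call OpenLane; the strategy labels, clock periods and result field names are placeholders chosen for illustration.

```python
from itertools import product


def experiment_grid(strategies, clock_periods_ns):
    """Return one configuration per (synthesis strategy, clock period) pair."""
    return [
        {"synth_strategy": strategy, "clock_period_ns": period}
        for strategy, period in product(strategies, clock_periods_ns)
    ]


def best_by_area(results, max_delay_ns):
    """Pick the smallest-area result whose worst path delay meets the target.

    Each result is a dict with 'area_um2' and 'worst_path_delay_ns' keys
    (hypothetical field names). Returns None if nothing meets the target.
    """
    feasible = [r for r in results if r["worst_path_delay_ns"] <= max_delay_ns]
    return min(feasible, key=lambda r: r["area_um2"]) if feasible else None


# Placeholder labels and periods, for illustration only.
for config in experiment_grid(["AREA 0", "DELAY 0"], [5.0, 10.0]):
    print(config)
```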

Results

The final design implementation of AE-AV1 using ASAP7 shows a significant improvement in gate count and frequency over the baseline AV1, meeting the target MRTR (Maximum Real Time Resolution) goal of 8K@120fps for real-time processing at the maximum possible AV1 resolution. The table below shows the exploration and implementation results across multiple open PDKs.

ASAP7 delivered the best PPA, with improvements in area (24.48%) and frequency (82.8%) compared to the Nangate 45nm PDK. Significant area and performance improvements were possible only at 45nm and lower nodes. The post-layout gate count (i.e., area) for all technologies was calculated by dividing the actual area obtained for each circuit by the area of the smallest two-input gate available in the PDK (commonly a NAND-2 gate).
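The gate-count normalization described above is a single division, which makes areas comparable across PDKs with very different cell sizes. A minimal sketch follows; the area values in the usage line are placeholders, not results from the study.

```python
def gate_equivalents(design_area_um2, smallest_gate_area_um2):
    """Express a post-layout area as a count of the PDK's smallest two-input gate."""
    if smallest_gate_area_um2 <= 0:
        raise ValueError("reference gate area must be positive")
    return design_area_um2 / smallest_gate_area_um2


# Placeholder numbers: a 1000 um^2 design and a 0.5 um^2 reference gate.
print(gate_equivalents(1000.0, 0.5))
```

Because the reference gate area comes from each PDK's own library, the same design can be compared across nodes without the raw area difference of the nodes dominating the result.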

The final routed design implementation on ASAP7 is shown below.

Final Routed Design in ASAP7 in OpenROAD GUI

Tulio and his team were able to meet their design goals using OpenROAD-based flows, open PDKs and the GUI to explore and enhance the AV1 RTL design architecture and verify PPA at multiple technologies, all of which were available within a fully integrated, easy-to-use and open ecosystem. They achieved these results in a significantly shorter time than it would have taken with conventional EDA tools, and at zero tool and PDK cost. They published a paper showcasing this research (see References).

“The OpenROAD toolset has a very well-structured flow, which can be easily configured by adding a design and editing the configuration file, if one wants quick results, or changing multiple parameters for achieving better results. For someone who was not familiar with the OpenROAD flow, I was very happy to find out that it was extremely straightforward to use and to reach a RTL-to-GDSII flow. The way OpenROAD allows for certain parameters to be kept as default, or be changed according to the needs of the user is incredible and allows designers to reach impressive results with state-of-the-art PDKs,” concludes Tulio.

References

AE-AV1 publication: https://jics.org.br/ojs/index.php/JICS/article/view/564

Baseline AV1: https://ieeexplore.ieee.org/document/9800932