

# Vivado Adapt 2021 Design Closure

Methodology, tips, and tricks for achieving better Quality-of-Results (QoR)

Feb 11, 2021

© Copyright 2021 Xilinx

## **Design Closure Sessions**



#### Session 1

Methodology, tips, and tricks for achieving better Quality-of-Results

#### Session 2

Using Timing Closure Assistance tools to address tough timing issues

#### Session 3

Power Constraints, best practices for an accurate Report Power estimation



## Agenda - Methodology, Tips, and Tricks



- Vivado Tool and Methodology Updates
- Synthesis
  - Key Synthesis Features (2020.X)
  - Tips and Tricks
- Implementation Updates
  - Key Implementation Features (2020.X)
  - Versal Implementation Guidelines





# **Tool and Methodology Updates**



## **Vivado Compile Time Improvements**

- Synthesis speedup
  - 2020.1: Compile times for constant RTL functions reduced to a tiny fraction of the 2019.2 time
  - 2020.2: Average **20%** overall improvement compared to 2020.1 (Versal devices)
- Placer: **20%** average speedup on SSI designs with UltraThreads
  - Initial support (2020.2): enabled for the Default, RuntimeOptimized, and Quick directives
  - Planned improvements (2021.x): enabled for other directives using place\_design -ultrathreads
  - Use with general.maxThreads >= number of SLRs
- Router: **33%** average speedup in 2020.2 compared to 2019.2
  - Major initialization tasks are now completed offline
  - Improved SLR crossing routing algorithms

#### **Incremental Synthesis**

- Incremental Compile now includes Synthesis, which runs almost twice as fast
- Set up in Synthesis Options or use read\_checkpoint -incremental (see UG901)



## **New Methodology Checks**

- UltraFast Methodology checks are built into Vivado reports
  - Access under Reports menu or Tcl command report\_methodology
  - Automatically generated in Vivado projects
- Review and correct or waive warnings and critical violations!



#### Recently added checks in 2020

| Rule ID   | Severity         | Description                                                       |
|-----------|------------------|-------------------------------------------------------------------|
| XDCB-6    | Advisory         | Timing constraint pointing to hierarchical pins                   |
| TIMING-54 | Critical Warning | Scoped false path or clock group constraint between clocks        |
| TIMING-56 | Warning          | Missing logically or physically exclusive clock groups constraint |



## Vivado: Feedback, Discussions, and Blogs

#### Help -> Leave Feedback

#### Community Forums

- Discussions
  - Categories for all tools
  - Experts and other users
- Design Blogs
  - Written by Xilinx experts
  - Most new features are introduced in blogs
  - Leave comments at the end



## **Tool and Methodology Takeaways**

- Enable Incremental Synthesis with Incremental Compile to speed up iterations
- Review Methodology Reports and correct or waive warnings and critical violations
- Share feedback on Vivado, join online forum discussions, and review and comment on our blogs



# Key Synthesis Features (2020.X)



## **Expanded Language Support**

#### VHDL-2008 IEEE fixed-point and floating-point packages

- Both packages can now be referenced directly with ieee use clauses instead of through an intermediate file
  - use ieee.float\_pkg.all;
  - use ieee.fixed\_pkg.all;

#### SystemVerilog: constant strings

- Strings can be used as parameters/localparams where the size of the string is fixed (not to be used in logic)
- Supported methods: len(), getc(), toupper(), tolower(), compare(), atoi(), atohex(), atooct(), atobin(), atoreal()
- Improved mixed-language support for passing generics and parameters between VHDL and Verilog
  - Can handle multidimensional arrays/records/structs

## **Heterogeneous RAM Mapping**

Maps to a mix of LUTRAM, Block RAM, and UltraRAM for highest efficiency

- 2020.1: HDL attribute ram\_style=mixed added
- 2020.2: Pipeline register mapping
- 2021.1: Planned support for XPMs

Note: report\_ram\_utilization reports bit utilization percentage (depth x width util)



## **Logic Compaction Optimization**

Reduce slice utilization of low-precision arithmetic

- Available globally as a directive or per-hierarchy using BLOCK\_SYNTH cell property
- Supports both CARRY and LOOKAHEAD (Versal)

Versal example: 9x9 Multiply-Add, 3 stages

| Version                    | LUTs | LOOKAHEADs | Slices |
|----------------------------|------|------------|--------|
| Default (timing-optimized) | 240  | 12         | 49     |
| With Logic Compaction      | 186  | 27         | 40     |



## **Ease of Use Enhancements**

#### SRL\_STYLE as a global option

SRL\_STYLE for static shift registers becomes a global option. Additional usage now includes:

- Hierarchical cells

set\_property BLOCK\_SYNTH.SRL\_STYLE REG\_SRL [get\_cells mod\_inst]

- Tcl command

synth\_design -top <top\_name> -srl\_style reg\_srl\_reg ...

| Value       | Style           |
|-------------|-----------------|
| register    | no SRL, all FFs |
| srl         | SRL only        |
| srl_reg     | SRL->FF         |
| reg_srl     | FF->SRL         |
| reg_srl_reg | FF->SRL->FF     |
| block       | block RAM       |

#### Override KEEP and DONT\_TOUCH in RTL code

- Set KEEP or DONT\_TOUCH to false in XDC to allow the logic to be optimized away
- Useful when the RTL code cannot be modified
- If set in an XDC file, limit USED\_IN to synthesis. Otherwise Implementation issues critical warnings because the constraints refer to nets that were optimized away

HDL:

(\* KEEP="true" \*) reg [255:0] debug\_signals;

Synthesis XDC:

set\_property KEEP false [get\_nets debug\_signals\*]

## Synthesizing for Versal: RAM Mapping

- Fewer Block RAM aspect ratios: 8kx4, 16kx2, 32kx1 not supported
- UltraRAM initialization is supported, with more aspect ratios: 8kx16, 16kx8, 32kx4

Block RAMs vs array sizes

| Depth                 | Width | UltraScale+ | Versal |
|-----------------------|-------|-------------|--------|
| 2 <sup>10</sup> (1k)  | 32    | 1           | 1      |
| 2 <sup>11</sup> (2k)  | 16    | 1           | 1      |
| 2 <sup>12</sup> (4k)  | 8     | 1           | 1      |
| 2 <sup>13</sup> (8k)  | 4     | 1           | 2      |
| 2 <sup>14</sup> (16k) | 2     | 1           | 4      |
| 2 <sup>15</sup> (32k) | 1     | 1           | 8      |
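The Block RAM counts in the table can be reproduced with a small calculation. The sketch below is an illustrative Python model, not Vivado's mapping algorithm. It assumes one 36Kb block with the data aspect ratios listed (parity bits ignored), that Versal lacks the three deepest ratios as stated above, and that the array is tiled with a single aspect ratio. The names `ASPECT_RATIOS` and `block_rams_needed` are hypothetical.

```python
from math import ceil

# Data aspect ratios (depth, width) of one Block RAM, parity bits ignored.
# Assumption: Versal drops the 8kx4, 16kx2 and 32kx1 ratios, as stated above.
ASPECT_RATIOS = {
    "UltraScale+": [(1024, 32), (2048, 16), (4096, 8),
                    (8192, 4), (16384, 2), (32768, 1)],
    "Versal": [(1024, 32), (2048, 16), (4096, 8)],
}


def block_rams_needed(family, depth, width):
    """Fewest blocks needed to tile a depth x width array with one aspect ratio."""
    return min(
        ceil(depth / d) * ceil(width / w)
        for d, w in ASPECT_RATIOS[family]
    )


if __name__ == "__main__":
    arrays = [(1024, 32), (2048, 16), (4096, 8), (8192, 4), (16384, 2), (32768, 1)]
    for depth, width in arrays:
        print(depth, width,
              block_rams_needed("UltraScale+", depth, width),
              block_rams_needed("Versal", depth, width))
```

Under these assumptions, deep and narrow arrays cost more blocks on Versal because each block can be at most 4k deep, so the depth has to be built from several blocks.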

UltraRAMs vs array sizes





## Synthesizing for Versal: DSP Block Mapping

- Complex Multiplier: 18x18 in 2 DSP blocks (3 required for UltraScale+)
- Dot Product: a single DSP block holds three 9x8 signed multiply-adds
- Floating-point modes: DSPFP32 instantiation required

See Language Templates

[Screenshot: Vivado Language Templates, search for "versal" (28 matches). The Coding Examples listed are marked "Versal Only" and include Complex Multiplier (Balanced Pipeline, Combinatorial, Fully Pipelined), Complex Multiply Accumulate, Complex Multiply Adder, and Dot Product. The Dot Product preview uses parameters AWIDTH = 9 and BWIDTH = 8 and notes that three small multipliers (9x8 signed), a0b0+a1b1+a2b2, can be packed into a single DSP block.]

See AM004 for further details on Versal DSP





# Synthesis Tips & Tricks



## **Optimizing a Critical Multiplexer 1/2**

#### Multiplexer in the critical path

- Select path driven by counter->comparator
- Mux inputs are driven by adder-subtractor
- Can the multiplexer be optimized further?



Device: xcvc1902-vsva2197-1LP-i-S, Frequency: 500 MHz

    // Module header and port declarations are not shown on the slide
    reg [15:0] din0_dly;
    reg [15:0] din1_dly;
    reg [7:0]  counter;

    always @(posedge clk)
    begin
        if (en)
            counter <= counter + 1;
    end

    always @(posedge clk)
    begin
        din0_dly <= din0;
        din1_dly <= din1;
        if (counter == 144)
            dout <= din0_dly + din1_dly;
        else
            dout <= din0_dly - din1_dly;
    end

    endmodule

## **Optimizing a Critical Multiplexer 2/2**

Trick: replace counter - comparator with one-hot shift register

- No complex decode logic, single bit for mux selection
- Long shift register can be mapped to SRLs (LUTRAMs)



Device: xcvc1902-vsva2197-1LP-i-S, Frequency: 500 MHz

    // Module header and port declarations are not shown on the slide
    reg [15:0]  din0_dly;
    reg [15:0]  din1_dly;
    reg [255:0] counter = {255'b0, 1'b1};

    always @(posedge clk)
    begin
        if (en)
            counter <= {counter[254:0], counter[255]};
    end

    always @(posedge clk)
    begin
        din0_dly <= din0;
        din1_dly <= din1;
        if (counter[144] == 1'b1)
            dout <= din0_dly + din1_dly;
        else
            dout <= din0_dly - din1_dly;
    end

    endmodule

| Version  | LUTs | FFs | WNS       | Critical Path                   |
|----------|------|-----|-----------|---------------------------------|
| Original | 24   | 56  | -0.298 ns | FF -> LUT -> 2 LOOKAHEADs -> FF |
| Modified | 24   | 49  | 0.232 ns  | FF -> 2 LOOKAHEADs -> FF        |
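To see why the two versions select the same cycles, a behavioral model helps. The following is a minimal Python sketch, not part of the original slides. It assumes the 8-bit binary counter starts at 0 (the RTL shown gives it no initial value) and that the one-hot ring starts with bit 0 set, as in the modified RTL. The function name `select_signals` is hypothetical.

```python
def select_signals(enables, match=144):
    """Yield (original_select, modified_select) once per clock cycle."""
    counter = 0                 # 8-bit binary counter; start value 0 is an assumption
    ring = 1                    # 256-bit one-hot ring, bit 0 set as in the modified RTL
    ring_mask = (1 << 256) - 1
    for en in enables:
        # Original: full 8-bit compare. Modified: a single bit of the ring.
        yield counter == match, ((ring >> match) & 1) == 1
        if en:
            counter = (counter + 1) & 0xFF
            # Rotate left by one: {ring[254:0], ring[255]}
            ring = ((ring << 1) & ring_mask) | (ring >> 255)


if __name__ == "__main__":
    pairs = list(select_signals([1] * 600))
    print(all(a == b for a, b in pairs))
```

The model makes the trade-off visible: the modified version needs 256 state bits instead of 8, but the mux select becomes a single register bit with no decode logic in front of it.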

## **Optimizing Logical Comparisons 1/2**

Comparing two bit vectors: count and din

- Is count >= din?
- Critical path: add-sub -> 3-input mux -> comparator

How can the critical path be improved?



Device : xcvc1902-vsva2197-1LP-i-S Frequency : 500 MHz

```
reg [1:0] sel_dly;
reg [9:0] din_dly;
reg [9:0] count_next;
reg [9:0] count;

always @(*)
begin
    case (sel_dly)
        2'b10   : count_next = count + 1;
        2'b01   : count_next = count - 1;
        default : count_next = count;
    endcase
end

always @(posedge clk)
begin
    sel_dly <= sel;
    din_dly <= din;
    count   <= count_next;
    dout    <= count_next >= din_dly;
end
```



## **Optimizing Logical Comparisons 2/2**

Trick: move final comparison before 3-input mux

- 3 comparators in parallel, area tradeoff

[Figure: elaborated schematic. Cells tmp0\_i (RTL\_ADD) and tmp1\_i (RTL\_SUB) feed RTL\_SUB comparators; annotation: count\_next >= din.]

Device: xcvc1902-vsva2197-1LP-i-S, Frequency: 500 MHz

    reg [1:0]  sel_dly;
    reg [9:0]  din_dly;
    reg [9:0]  count_next;
    reg [9:0]  count;
    reg [10:0] dout_nxt;

    always @(*)
    begin
        case (sel_dly)
            2'b10   : count_next = count + 1;
            2'b01   : count_next = count - 1;
            default : count_next = count;
        endcase
    end

    wire [9:0] tmp0;
    wire [9:0] tmp1;
    assign tmp0 = count + 1;
    assign tmp1 = count - 1;

    // Bit 10 of {1'b1, a} - {1'b0, b} is set exactly when a >= b
    always @(*)
    begin
        case (sel_dly)
            2'b10   : dout_nxt = {1'b1, tmp0}  - {1'b0, din_dly};
            2'b01   : dout_nxt = {1'b1, tmp1}  - {1'b0, din_dly};
            default : dout_nxt = {1'b1, count} - {1'b0, din_dly};
        endcase
    end

    always @(posedge clk)
    begin
        sel_dly <= sel;
        din_dly <= din;
        count   <= count_next;
        dout    <= dout_nxt[10];
    end

| Version  | LUTs | FFs | WNS       | Critical Path                                |
|----------|------|-----|-----------|----------------------------------------------|
| Original | 25   | 23  | -0.501 ns | FF -> LOOKAHEAD -> LUT -> 2 LOOKAHEADs -> FF |
| Modified | 42   | 23  | 0.311 ns  | FF -> 2 LUTS -> LOOKAHEADs -> FF             |
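The WNS figures in the tables of this section can be turned into an estimated achievable clock frequency. The sketch below is a rough Python calculation, assuming a single-clock setup path where WNS is measured against the 500 MHz target period. The function name `fmax_mhz` is hypothetical, and the estimate is not a Vivado report value.

```python
def fmax_mhz(target_mhz, wns_ns):
    """Estimate the achievable frequency from the target clock and setup WNS."""
    period_ns = 1000.0 / target_mhz          # 500 MHz -> 2.0 ns
    return 1000.0 / (period_ns - wns_ns)     # negative WNS lengthens the period


if __name__ == "__main__":
    for label, wns in [("mux original", -0.298), ("mux modified", 0.232),
                       ("compare original", -0.501), ("compare modified", 0.311)]:
        print(f"{label}: {fmax_mhz(500, wns):.1f} MHz")
```

Under this approximation, the original comparison logic would close at roughly 400 MHz, while the modified version has margin beyond the 500 MHz target.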

[Figure: elaborated schematic of the modified design. Three parallel RTL\_SUB comparators (dout\_nxt0\_i, dout\_nxt0\_i\_0, dout\_nxt0\_i\_1) evaluate count + 1 >= din, count - 1 >= din, and count >= din. RTL\_MUX dout\_nxt\_i selects among them using sel\_dly\_reg[1:0] (2'b10, 2'b01, default) and drives dout\_reg. Registers shown: sel\_dly\_reg[1:0], count\_reg[9:0], din\_dly\_reg[9:0].]
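The modified RTL relies on an arithmetic identity: for unsigned values a and b, bit 10 of the 11-bit difference {1'b1, a} - {1'b0, b} is set exactly when a >= b. The Python sketch below models both versions so the equivalence can be checked exhaustively for a small width. It is an illustration under that reading of the RTL, and the function names are hypothetical.

```python
def ge_by_subtract(a, b, width=10):
    """a >= b, read from the MSB of {1'b1, a} - {1'b0, b} in width+1 bits."""
    diff = (((1 << width) | a) - b) & ((1 << (width + 1)) - 1)
    return (diff >> width) == 1


def dout_original(count, sel, din, width=10):
    """Mux first, then one comparator: count_next >= din."""
    mask = (1 << width) - 1
    if sel == 0b10:
        count_next = (count + 1) & mask
    elif sel == 0b01:
        count_next = (count - 1) & mask
    else:
        count_next = count
    return count_next >= din


def dout_modified(count, sel, din, width=10):
    """Three comparisons in parallel, then a mux on the single result bit."""
    mask = (1 << width) - 1
    results = {
        0b10: ge_by_subtract((count + 1) & mask, din, width),
        0b01: ge_by_subtract((count - 1) & mask, din, width),
    }
    return results.get(sel, ge_by_subtract(count, din, width))


if __name__ == "__main__":
    w = 5  # small width keeps the exhaustive check fast
    ok = all(
        dout_original(c, s, d, w) == dout_modified(c, s, d, w)
        for c in range(1 << w) for d in range(1 << w) for s in range(4)
    )
    print(ok)
```

The cost of the restructuring is the extra comparators (more LUTs in the results table), in exchange for the mux operating on one bit instead of a 10-bit bus ahead of the comparison.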

## **Key Synthesis Takeaways**

New features help you use device resources more efficiently

- Logic Compaction for low-precision arithmetic
- RAM\_STYLE = mixed for heterogeneous RAM mapping

Remember key differences when migrating to Versal

- Versal has more UltraRAM capabilities, fewer BRAM options
- Versal DSP block natively supports complex multiplication and dot products

To improve critical paths, look at the Elaborated Design and think of ways to improve the data flow



# Key Implementation Features (2020.X)



## Pblocks are Treated as Soft By Default

Soft Pblocks have been supported since Vivado 2019.1

- In 2020.2 Pblocks are treated as Soft by default and are honored until Physical-Synthesis-In-Placer in Global Placement
- Reduces need to update pblocks when changing design
- Reduces pblock restriction on congestion handling in rest of placer flow
- Allows <u>all physical optimizations (PSIP, phys\_opt\_design)</u>

#### DFX parent & child pblocks = hard by default

- HD.ISOLATED
- HD.RECONFIGURABLE
- HD.TANDEM
- HD.TANDEM\_IP\_PBLOCK
- HD.RECONFIGURABLE\_CONTAINER
- A user constraint IS\_SOFT=FALSE is carried forward in DCPs from previous Vivado releases when loaded in Vivado 2020.2

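[Figure: device views comparing placement with hard Pblocks and with soft Pblocks]

| Run                        | WNS (ns) | TNS (ns) | Elapsed  |
|----------------------------|----------|----------|----------|
| impl hard pblocks (active) | -1.134   |          | 00:12:06 |
| impl soft pblocks          | 0.192    | 0.000    | 00:08:01 |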

## Physical-Synthesis-In-Placer (PSIP) Improvements

#### Equivalent Driver Re-wire Optimization

- Loads are redistributed between logically-equivalent drivers based on their placements
- Helps reduce routing resource utilization and congestion
- After rewiring it is possible that some inputs of a LUT are connected to the same net and LUT reduction can result
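The idea of redistributing loads between equivalent drivers can be illustrated with a toy model. The Python sketch below is not Vivado's algorithm. It simply assumes that each load is reattached to the closest logically-equivalent driver by Manhattan distance, and it ignores timing, fanout limits, and legality. All names and coordinates are hypothetical.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) placement coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def rewire(drivers, loads):
    """Map each load to its nearest equivalent driver.

    drivers: {driver_name: (x, y)}, all assumed logically equivalent
    loads:   {load_name: (x, y)}
    """
    return {
        load: min(drivers, key=lambda name: manhattan(drivers[name], pos))
        for load, pos in loads.items()
    }


def total_wirelength(assignment, drivers, loads):
    """Sum of driver-to-load distances for a {load: driver} assignment."""
    return sum(manhattan(drivers[d], loads[l]) for l, d in assignment.items())


if __name__ == "__main__":
    drivers = {"ff_a": (0, 0), "ff_b": (10, 10)}
    loads = {"l0": (1, 0), "l1": (9, 10), "l2": (10, 8), "l3": (0, 2)}
    before = {"l0": "ff_a", "l1": "ff_a", "l2": "ff_b", "l3": "ff_b"}
    after = rewire(drivers, loads)
    print(total_wirelength(before, drivers, loads), "->",
          total_wirelength(after, drivers, loads))
```

In this toy case the connectivity is logically unchanged, but the estimated wirelength drops sharply, which is the kind of routing relief the optimization is described as targeting.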



Summary of Physical Synthesis Optimizations

| Optimization               | WNS Gain (ns) | TNS Gain (ns) | Added Cells | Removed Cells | Optimized Cells/Nets |
|----------------------------|---------------|---------------|-------------|---------------|----------------------|
| LUT Combining              | 0.000         | 6.524         | 0           | 2255          | 2255                 |
| Equivalent Driver Rewiring | 0.000         | 1120.810      | 2185        | 4271          | 121                  |

Elapsed times reported on the slide: 00:00:07 and 00:00:55.

## **PSIP** Replication Properties

- MAX\_FANOUT\_MODE and FORCE\_MAX\_FANOUT allow user to direct replication in PSIP
  - Works for FF and LUT
- For replication of drivers with far-apart loads
  - MAX\_FANOUT\_MODE values
    - MACRO (Block RAM, UltraRAM, DSP)
    - CLOCK\_REGION
    - SLR



Example: FORCE\_MAX\_FANOUT = 1, MAX\_FANOUT\_MODE = MACRO



## MAX\_PROG\_DELAY Capped For SLR Crossing Performance

#### Placer limits MAX\_PROG\_DELAY for UltraScale+ devices

- Minimizes clock skew on SLR crossing when balancing clock network delays
- USER\_MAX\_PROG\_DELAY property allows user to cap delays further if required

#### Clock Utilization Report shows programmable delays used for each clock

15. Device Cell Placement Summary for Global Clock g6

| Global Id | Driver Type/Pin | Driver Region (D) | Waveform (ns) | Loads |
|-----------|-----------------|-------------------|---------------|-------|
| g6        | BUFGCE/O        | X4Y11             | {0.000 2.463} | 86098 |

Load count per clock region. (D) marks the clock driver region and (R) the clock root region.

|     | X0   | X1  | X2   | X3    | X4     | X5   | X6    | X7   | HORIZONTAL PROG DELAY |
|-----|------|-----|------|-------|--------|------|-------|------|-----------------------|
| Y15 | 0    | 0   | 3459 | 1930  | 76     | 4319 | 7775  | 6087 | 0                     |
| Y14 | 4    | 0   | 15   | 14    | 8      | 2828 | 11153 | 5112 | 0                     |
| Y13 | 357  | 212 | 0    | 34    | 228    | 6990 | 10597 | 1025 | 1                     |
| Y12 | 22   | 107 | 0    | 0     | 14     | 5441 | 5802  | 428  | 2                     |
| Y11 | 0    | 346 | 71   | 1     | (D) 22 | 572  | 744   | 200  | 3                     |
| Y10 | 0    | 165 | 328  | 0     | 0      | 2    | 0     | 0    | 4                     |
| Y9  | 0    | 129 | 390  | 0     | 0      | 0    | 0     | 0    | 5                     |
| Y8  | 0    | 238 | 296  | (R) 0 | 0      | 0    | 0     | 0    | 5                     |
| Y7  | 0    | 556 | 78   | 2     | 1      | 0    | 0     | 0    | 4                     |
| Y6  | 16   | 469 | 8    | 0     | 0      | 0    | 0     | 0    | 3                     |
| Y5  | 185  | 526 | 0    | 0     | 0      | 0    | 0     | 0    | 2                     |
| Y4  | 362  | 76  | 0    | 0     | 0      | 0    | 0     | 0    | 1                     |
| Y3  | 506  | 3   | 0    | 0     | 0      | 0    | 0     | 0    | 0                     |
| Y2  | 1621 | 7   | 28   | 0     | 0      | 0    | 0     | 0    | 0                     |
| Y1  | 3498 | 586 | 29   | 0     | 0      | 0    | 0     | 0    | 0                     |
| Y0  | 0    | 0   | 0    | 0     | 0      | 0    | 0     | 0    | 0                     |






# **Versal Implementation Guidelines**



## **Versal Changes to Fabric**

#### Versal has a uniform fabric

- Half of the LUTs in every CLB are LUTRAM/SRL capable
- Even Block RAM and UltraRAM distribution
- Simplified CLB architecture
  - No F7/F8/F9 muxes
  - CARRY8 replaced by LOOKAHEAD8 and LUTCY
  - Fast LUT cascade
  - More LUT combining options
- To take full advantage of the architectural changes, re-synthesize the design
  - Remove instantiated legacy primitives and synthesis attributes
  - Retargeting a netlist from a prior architecture results in a sub-optimal implementation

## **Comparing CARRY8 vs. LOOKAHEAD8/LUTCY**

- UltraScale+ uses 8 logic levels, 6 routes with CARRY8s
  - Datapath delay = 3.822 ns



- Versal uses 10 logic levels but still only 6 routes with LOOKAHEAD8s
  - Datapath delay = 3.635 ns



The results above are for the following RTL and require re-synthesis:

    wr_ptr <= resize(do1(5 downto 3) * unsigned(img_size_x), log2(C_MAX_LINE_WIDTH*C_NUM_LINES)) +
              resize(do1(15 downto 6) * 8, log2(C_MAX_LINE_WIDTH*C_NUM_LINES));




## **Versal Multi-Clock Buffer - MBUFG**

Versal supports a new Multi-Clock Buffer (MBUFG) that generates up to 4 output clocks from a single input clock

- Output clocks are /1, /2, /4, and /8 versions of the input clock
- MBUFG versions exist for BUFGCE, BUFGCE\_DIV, BUFG\_PS, BUFG\_GT, and BUFGCTRL
- MBUFG is a logical buffer with 4 outputs (O1, O2, O3, O4)
  - Physical implementation uses BUFG\* and BUFDIV\_LEAF leaf clock dividers
    - BUFDIV\_LEAF buffers are driven by horizontal clock distribution and are the final clock buffer for fabric loads (CLB, DSP, BRAM, URAM) and most hard-IP blocks (GTYP\_QUAD, MRMAC, etc.)
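A quick way to spot clocks that could share one MBUFG is to check which frequencies are /1, /2, /4, or /8 of the fastest clock. The Python sketch below performs only that frequency check. It is an illustration, not how Vivado decides, since the tool also has to consider the clock source, phase, and timing constraints. The function name `mbufg_candidates` is hypothetical, and the frequencies in the usage example mirror the Clocking Wizard slide in this deck.

```python
from fractions import Fraction

MBUFG_DIVIDERS = (1, 2, 4, 8)  # divide ratios of the four outputs, per the list above


def mbufg_candidates(freqs_mhz):
    """Return {clock: divider} for clocks that are /1, /2, /4 or /8 of the fastest."""
    fastest = max(freqs_mhz.values())
    result = {}
    for name, freq in freqs_mhz.items():
        ratio = Fraction(fastest) / Fraction(freq)   # exact, avoids float rounding
        if ratio in MBUFG_DIVIDERS:
            result[name] = int(ratio)
    return result


if __name__ == "__main__":
    clocks = {"clk_out1": 400, "clk_out2": 150, "clk_out3": 100,
              "clk_out4": 50, "clk_out5": 200}
    print(mbufg_candidates(clocks))
```

In this example the 150 MHz clock is left out because 400/150 is not one of the supported divide ratios, so it would still need its own buffer.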



## **Versal MBUFG For Synchronous CDC**

#### MBUFGCE

- Common node closer to path
  - Inter Clock Skew ~ 0.174 ns
  - Inter Clock FMAX > 600 MHz

#### Parallel BUFGCE

- Common node at driver
  - Can be far away if the driver is in an XPIO clock region or GT column
  - Inter Clock Skew > 0.500 ns
  - Inter Clock FMAX < 500 MHz

## **MBUFG Transform In Logical Optimization Phase**

opt\_design -mbufg\_opt for global transformation of parallel BUFG -> MBUFG

- MBUFG\_GROUP property allows for targeted transformation
  - Sets precedence over which BUFG\* buffers get converted to MBUFG\*

set\_property MBUFG\_GROUP group1 [get\_nets -of [get\_pins {u\_buf0a/O u\_buf1a/O u\_buf2a/O u\_buf3a/O }]]

Transformations are prevented if timing constraints could result in mismatch



## **Clocking Wizard Support for MBUFG**

 $\mathbf{\lambda}$ Clocking Wizard (1.0) Ocumentation = IP Location C Switch to Defaults Component Name clk wizard 0 IP Symbol Resource Inferred MBUFGCE Show disabled ports Clocking Features Output Clocks MMCM Settings Optional Ports Summary The phase is calculated relative to clk\_out1 1. Select Output Output Freq (MHz) Phase (Degrees) Duty Cycle (%) Output Clock Port Name Drives Clock Grouping PI Control Requested Actual Requested Actual Requested Actual Frequencies that clk out1 clk\_out1 400.000 400.00000 0.000 50.000 Buffer with CE Auto None 0.000 50.0 Clk\_out2 clk\_out2 Buffer with CE 150.000 150.00000 0.000 0.000 50.000 50.0 👻 Auto None are /1, /2, /4, /8 Clk out3 clk out3/4 100.000 100.00000 0.000 0.000 50.000 50.0 Buffer with CE Auto None reset 2. Select "Buffer" or clk\_out4/8 50.000 Buffer with CE 🖌 clk out4 Auto None 50.00000 0.000 0.000 50.000 50.0 ~ locked 🗕 clk1 clr n clk\_out5/2 200.000 Clk out5 Buffer with CE Auto None 200.00000 0.000 0.000 50.000 50.0 clk\_out1 "Buffer with CE" clk1 ce clk\_out6 600.000 MBUFGCE 🖌 clk out6 600.00000 0.000 0.000 50.000 50.0 Auto None clk out2 clk out7 clk out7 100.000 N/A 0.000 N/A 50.000 N/A BUFG Auto None 🗕 clk in1 as clock driver clk out3 Calculate Actual values clk out1 ce clk out4 clk out2 ce clk out5 Phase Shift Mode ZHOLD Settings clk out3 ce Instantiated MBUFGCE clk out6 o1 ○ WAVEFORM ● LATENCY clk out4 ce clk out6 o2 🗖 Select "MBUFGCF" 1 clk out5 ce clk out6 o3 **CE TYPE for BUFGCE / MBUFGCE** CE TYPE for BUFGCE DIV clk out6 ce as clock driver clk out6 o4 🗖 CE and CLR SYNC CIRCUIT EXTERNAL TO CORE ● SYNC ○ ASYNC ○ HARDSYNC ● SYNC ● HARDSYNC clk out6 clr n Deskew Network Enable Deskewl Delay Enable Deskew2 Delay



ОK

Cancel

## **Real World Example Of QoR Improvement with MBUFG**

#### WNS went from -1.737 ns to timing closed

- impl\_1\_AIE2PLFP: WNS = -1.737 ns
  - Default strategy implementation
- impl\_2: WNS = 0.024 ns
  - Default strategy implementation
  - MBUFG transform using MBUFG\_GROUP constraint

| Name            | Constraints | Status                 | WNS   | TNS   | WHS   | THS   |
|-----------------|-------------|------------------------|-------|-------|-------|-------|
| synth_1         | constrs_1   | Synthesis Out-of-date  |       |       |       |       |
| impl_2 (active) | constrs_1   | route_design Complete! | 0.024 | 0.000 | 0.032 | 0.000 |

## **Top Takeaways**

- Run methodology reports, then review and fix critical violations and warnings
- Synthesis has many options to drive improved results. In addition, you can develop your own bag of tricks to fine-tune critical logic
- Vivado placement has very comprehensive replication to improve QoR, both automatic and user-driven
- Versal architecture brings many improvements over prior architectures; be aware of key differences
  - Re-synthesize for optimal results, and recode if necessary
  - Take advantage of new capabilities like MBUFG, URAM initialization, and DSP complex and dot-product modes

## Where to Find More Information

- User Guides on xilinx.com
  - UG901 Synthesis
  - UG904 Implementation
  - UG906 Design Analysis & Closure
  - UG949 UltraFast Methodology

- Xilinx Community Forums
  - Vivado RTL Development
  - Blogs: Design and Debug Techniques



# **Thank You**


