# SAINT-S: 3D SRAM Stacking Solution based on 7nm TSV technology

Kyoungsun Cho, Jinhong Park, Billy Koo, Sunkyoung Seo, Yoonjae Hwang, Sungcheol Park and Mijung Noh

Samsung Electronics

1 July 2020

## [Abstract]

Data movement in emerging devices has hit the memory wall. AI devices require higher memory bandwidth and capacity, and AR/VR devices require the lowest possible latency. To meet these requirements, Samsung Foundry proposed a three-dimensional (3D) static random access memory (SRAM) stacking solution, SAINT-S (Samsung Advanced INterconnection Technology with SRAM). The solution is implemented in the Samsung 7LPP process and uses through-silicon via (TSV) stacking to achieve a high-bandwidth, low-latency interface between the logic die and the SRAM die in a small form factor. Implementation results show read/write memory latency of 7.2/2.6 ns and 24.3 GB/s memory bandwidth at 0.156 W average power consumption per channel at 760 MHz. The memory bandwidth per power of SAINT-S is 6.2x higher than that of GDDR6 and 2.2x higher than that of HBM2e.

## [Motivation]

#### AI devices require higher memory bandwidth and low latency with small form factor.



To meet these memory requirements, Samsung Foundry proposed a three-dimensional (3D) static random access memory (SRAM) stacking solution, SAINT-S (Samsung Advanced INterconnection Technology with SRAM).


## [Architecture]

- The logic die and SRAM die are connected using TSVs
- The logic die includes custom 3D SRAM controllers, as well as a CPU and DMA to verify performance and power
- 3D SRAM controller
  - Accesses the SRAM (64 Mbit) on the SRAM die through TSVs
    - 256 bit @ 760 MHz per channel
  - Source-synchronous interface and asynchronous FIFO
  - Double data rate (DDR) conversion to reduce the number of signals
  - Controllable delay lines to compensate for clock-to-data and data-to-data skew
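The per-channel bandwidth quoted elsewhere in the deck (24.3 GB/s) follows from the 256-bit width and 760 MHz clock listed above. The sketch below reproduces that arithmetic. The function names are illustrative, and the DDR signal-count helper assumes that DDR conversion carries two bits per signal per clock cycle; the slides do not state the resulting signal count explicitly.

```python
# Per-channel figures taken from the architecture slide.
BITS_PER_CYCLE = 256   # data width per channel at the controller
CLOCK_HZ = 760e6       # interface clock frequency


def channel_bandwidth_gb_per_s(bits_per_cycle: int, clock_hz: float) -> float:
    """Peak bandwidth in GB/s, using 1 GB = 1e9 bytes."""
    bytes_per_cycle = bits_per_cycle / 8
    return bytes_per_cycle * clock_hz / 1e9


def ddr_signal_count(bits_per_cycle: int) -> int:
    """Assumption: with DDR, each signal carries two bits per clock cycle."""
    return bits_per_cycle // 2


bandwidth = channel_bandwidth_gb_per_s(BITS_PER_CYCLE, CLOCK_HZ)
print(f"{bandwidth:.2f} GB/s per channel")
print(f"{ddr_signal_count(BITS_PER_CYCLE)} data signals under the DDR assumption")
```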



## [Horizontal Design View]

#### Source Synchronous DDR Interface







#### Low Power Features for TSV IO

|                      | Current work | Future work |
|----------------------|--------------|-------------|
| Driver: energy       | 0.07 pJ/bit  | 0.05 pJ/bit |
| Driver: techniques   | <ul> <li>Eliminating the on-die termination (ODT)</li> <li>Minimizing the number of blocks using VDDQ power</li> <li>Using source termination in the main driver to improve signal integrity</li> </ul> | <ul> <li>Unterminated low-voltage-swing signaling</li> </ul> |
| Receiver: energy     | 0.03 pJ/bit  | 0.02 pJ/bit |
| Receiver: techniques | <ul> <li>Uses only the core logic supply (VDD) and is designed to minimize the number of stages</li> <li>No level shifter is needed in the receiver path</li> </ul> | <ul> <li>CMOS-based receiver scheme instead of a single-ended receiver</li> </ul> |
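As a rough illustration of what the per-bit energies in the table imply, the sketch below converts energy per bit into IO power for one channel. It reads the table as 0.07 pJ/bit for the driver and 0.03 pJ/bit for the receiver, and assumes that every bit of a 256-bit, 760 MHz channel pays the full energy on every cycle. That assumption is not from the slides, and real power depends on switching activity, so the result is an estimate only.

```python
# Current-work energies as read from the table (pJ per bit).
DRIVER_PJ_PER_BIT = 0.07
RECEIVER_PJ_PER_BIT = 0.03

# Channel figures from the architecture slide.
BITS_PER_CYCLE = 256
CLOCK_HZ = 760e6


def io_power_watts(pj_per_bit: float, bits_per_cycle: int, clock_hz: float) -> float:
    """IO power in watts, assuming every bit is transferred on every cycle."""
    bits_per_second = bits_per_cycle * clock_hz
    return pj_per_bit * 1e-12 * bits_per_second


link_power = io_power_watts(
    DRIVER_PJ_PER_BIT + RECEIVER_PJ_PER_BIT, BITS_PER_CYCLE, CLOCK_HZ
)
print(f"Estimated driver + receiver power: {link_power * 1e3:.1f} mW per channel")
```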

## [Vertical Design View]

#### Any die in the stack can be selected as the master via MASTER\_SEL



## [Package Structure]

#### CoW Process

#### Package Image



- Chip-on-wafer (CoW) bonding
- Based on mass-manufacturing infrastructure with Cpk > 1.33
- Passed the JEDEC package reliability standard
- Package flexibility: multi-memory stacking (≥ 2H)

< Package options: memory side-by-side bonding, face-to-face bonding >



## [Implementation & Design Methodology] 3D IC Interface STA Method

#### Budget-based Flow

- [Constraint Budgeting] Tier1: set delay constraints on register-to-IO input paths
- [Constraint Budgeting] TSV: IO-to-IO path
- [Constraint Budgeting] Tier2: set delay constraints on IO-to-register paths
- Add the constraints and run STA
- Run SPICE simulation on the TSV paths and check the constraints
- Running STA on each die concurrently optimizes turnaround time (TAT) and resources as in a conventional design, while maintaining SPICE-level accuracy for jitter/DCD in the TSVs
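The budgeting steps above can be sketched as a simple bookkeeping check: split one timing window into Tier1, TSV, and Tier2 budgets, then confirm that a SPICE-derived TSV delay fits its share. Everything below is hypothetical (the class, the function names, and all delay values), and treating the path as a single clock-period window is a simplification that an actual source-synchronous STA setup would refine.

```python
from dataclasses import dataclass


@dataclass
class PathBudget:
    """Hypothetical delay budgets (ns) for one cross-die path."""
    tier1_reg_to_io_ns: float   # Tier1 register-to-IO budget
    tsv_io_to_io_ns: float      # TSV IO-to-IO budget
    tier2_io_to_reg_ns: float   # Tier2 IO-to-register budget

    def total_ns(self) -> float:
        return self.tier1_reg_to_io_ns + self.tsv_io_to_io_ns + self.tier2_io_to_reg_ns


def slack_ns(budget: PathBudget, window_ns: float) -> float:
    """Positive slack means the three budgets fit inside the timing window."""
    return window_ns - budget.total_ns()


def tsv_within_budget(budget: PathBudget, spice_tsv_delay_ns: float) -> bool:
    """Check a SPICE-simulated TSV delay against its budgeted share."""
    return spice_tsv_delay_ns <= budget.tsv_io_to_io_ns


# Illustrative numbers only: one clock period at 760 MHz as the window.
window = 1e3 / 760  # ns
budget = PathBudget(tier1_reg_to_io_ns=0.4, tsv_io_to_io_ns=0.3, tier2_io_to_reg_ns=0.4)

print(f"slack: {slack_ns(budget, window):.3f} ns")
print("TSV delay 0.25 ns within budget:", tsv_within_budget(budget, 0.25))
```

Because each die is checked only against its own budget, the per-die STA runs can proceed independently, which is the property the last bullet relies on.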



## [Implementation & Design Methodology] PSI Analysis on TSV Interconnects

#### IO decap design guideline for power noise at the pre-layout stage

- Criterion: duty cycle distortion (DCD) below 5%
- Insertion guideline: 30 nF to enable the 760 MHz interface
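A minimal sketch of how the 5% criterion could be checked against a simulated clock waveform. The slides do not define DCD, so this assumes it is the deviation of the measured duty cycle from an ideal 50%, in percentage points; the function names and the example high times are hypothetical.

```python
def duty_cycle_distortion_pct(high_time_ns: float, period_ns: float) -> float:
    """DCD as the absolute deviation from a 50% duty cycle, in percentage points."""
    duty_pct = high_time_ns / period_ns * 100.0
    return abs(duty_pct - 50.0)


def meets_dcd_criterion(dcd_pct: float, limit_pct: float = 5.0) -> bool:
    """Criterion from the slide: DCD must stay below the limit."""
    return dcd_pct < limit_pct


period = 1e3 / 760  # clock period in ns at 760 MHz

# Hypothetical measured high times for two decap configurations.
for high_time in (0.70, 0.75):
    dcd = duty_cycle_distortion_pct(high_time, period)
    print(f"high time {high_time:.2f} ns -> DCD {dcd:.1f}%, pass: {meets_dcd_criterion(dcd)}")
```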

#### PSI Analysis for 1-stack SRAM at 800MHz

Simulating only 5 coupled lanes (total IO: 256) has the same effect as considering the SSO noise from all of the other 251 IO lanes.



< Channel simulation & TSV modeling >






## [Implementation & Design Methodology] 3D IC IREM Analysis Flow

#### Concurrent multi-die analysis flow

- Shows how hot spots on one die affect the other die, with both dies analyzed concurrently
- Based on this result, the power meshes of both the logic die and the SRAM die were reinforced

#### Difference from single chip IREM

- Chip to chip connection
- TSV modeling





## [Implementation & Design Methodology] 3D-IC Auto P&R Flow

#### 3D-IC P&R Challenges

- New TSV-related rules restrict floorplanning and placement
- The design contains more than 1,000 TSVs
- An efficient TSV signal/power placement architecture is needed to minimize design overhead

#### P&R Solutions

- Custom scripts for automated placement and routing



## [Results & Future work]

#### Chip Summary

| Item               | Specification                                                             |
|--------------------|---------------------------------------------------------------------------|
| Process            | 7nm CMOS                                                                  |
| Die size           | 9 x 9 mm<sup>2</sup> (SRAM die),<br>9.5 x 9.5 mm<sup>2</sup> (logic die) |
| Package size       | 12 x 12 mm<sup>2</sup>                                                    |
| SRAM capacity      | 64 Mbit per die                                                           |
| Supply voltage     | 0.85 V (Logic/SRAM),<br>1.0 V (TSVIO),<br>1.8 V (GPIO)                    |
| Frequency          | ~760 MHz                                                                  |
| Number of channels | 128 bit x 2 ch                                                            |
| Power consumption  | 0.156 W (SRAM+TSVIO)                                                      |
| Bandwidth          | 24.3 GB/s per channel                                                     |
| Latency            | 7.2 ns (read), 2.6 ns (write)                                             |

#### Performance Comparison

|         | Memory bandwidth per power (GB/s/W) | Memory latency (ns)    |
|---------|-------------------------------------|------------------------|
| GDDR6   | 25                                  | 45                     |
| HBM2e   | 70                                  | (70% column hit)       |
| SAINT-S | 156.1                               | Read 7.2 / Write 2.6   |
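The 6.2x and 2.2x figures in the abstract can be cross-checked against this table, and the SAINT-S entry against the chip summary (24.3 GB/s at 0.156 W). The short sketch below does both; the variable names are illustrative, and the small gap between the derived value and 156.1 presumably comes from rounding of the published inputs.

```python
# Bandwidth per power (GB/s/W) as listed in the comparison table.
bw_per_watt = {"GDDR6": 25.0, "HBM2e": 70.0, "SAINT-S": 156.1}

saint_s = bw_per_watt["SAINT-S"]
ratios = {
    name: saint_s / value
    for name, value in bw_per_watt.items()
    if name != "SAINT-S"
}

# Cross-check against the chip summary: bandwidth and power per channel.
derived_saint_s = 24.3 / 0.156

for name, ratio in ratios.items():
    print(f"SAINT-S vs {name}: {ratio:.1f}x")
print(f"Derived from chip summary: {derived_saint_s:.1f} GB/s/W")
```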







**SAINT-S Test board** 


