# Baidu Kunlun: An AI processor for diversified workloads

Jian Ouyang<sup>1</sup> (ouyangjian@baidu.com), Mijung Noh<sup>2</sup>, Yong Wang<sup>1</sup>, Wei Qi<sup>1</sup>, Yin Ma<sup>1</sup>, Canghai Gu<sup>1</sup>, SoonGon Kim<sup>2</sup>, Ki-il Hong<sup>2</sup>, Wang-Keun Bae<sup>2</sup>, Zhibiao Zhao<sup>1</sup>, Jing Wang<sup>1</sup>, Peng Wu<sup>1</sup>, Xiaozhang Gong<sup>1</sup>, Jiaxin Shi<sup>1</sup>, Hefei Zhu<sup>1</sup>, Xueliang Du<sup>1</sup>

<sup>1</sup>Baidu, Inc. <sup>2</sup>Foundry Business, Samsung Electronics





### The diversified AI applications





#### The diversified AI scenarios





## Design AI chip products from industry perspectives

- Target the mainstream market
- Address as much market volume as possible
- Support as many AI applications and scenarios as possible



## But, the challenge

- Large variety of compute and memory access patterns
  - Up to a thousand operators in mainstream frameworks
  - Mix of tensor, vector and scalar operations
  - Both sequential and random memory access
- Rapid change in algorithms and applications
- Developers face a high barrier to adopting new hardware



## Baidu Kunlun's product vision

- Large variety of compute and memory access patterns
- Rapid change in algorithms and applications
- High barrier for developers adopting new hardware

- Generic
- Flexible
- Usable and programmable
- High performance



## The history of Baidu Kunlun



- Move from FPGA to ASIC
- Evolve from full customization to full programmability
- SDA: software-defined accelerator

- XPU: the X processor unit for diversified workloads
- Baidu Kunlun: the name of Baidu's first AI chip; Kunlun is a famous mountain in China



## The overview of Baidu Kunlun



- Samsung Foundry 14 nm, 2.5D package
- 2 x HBM, 512 GB/s
- PCIe 4.0 x8
- 150 W, 256 TOPS



# The overview of Baidu Kunlun board

| Model                | Baidu Kunlun K200                                        |
|----------------------|----------------------------------------------------------|
| Architecture         | XPU                                                      |
| Precision            | INT4/8, INT/FP16, FP32                                   |
| Computing capability | INT8: 256 TOPS<br>INT/FP16: 64 TOPS<br>INT/FP32: 16 TOPS |
| HBM memory size      | 16 GB                                                    |
| HBM bandwidth        | 512 GB/s                                                 |
| Host interface       | PCIe Gen4 x8                                             |
| Process              | 14 nm                                                    |
| Thermal cooling      | Passive                                                  |
| Package              | 2.5D                                                     |
| TDP                  | 150 W                                                    |
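As a back-of-the-envelope reading of the board figures, the sketch below divides peak throughput by HBM bandwidth to estimate how many operations a kernel must perform per byte fetched before it stops being bandwidth-bound (a roofline-style estimate), and divides peak throughput by TDP. The derived numbers are plain arithmetic on the table values, not measurements from the slides; the function names are illustrative.

```python
# Figures copied from the K200 table; everything derived below is arithmetic only.
PEAK_TOPS = {"INT8": 256, "INT/FP16": 64, "INT/FP32": 16}
HBM_BANDWIDTH_GBPS = 512
TDP_WATTS = 150


def ops_per_byte(peak_tops: float, bandwidth_gbps: float) -> float:
    """Operations available per byte of HBM traffic when both run at peak."""
    return (peak_tops * 1e12) / (bandwidth_gbps * 1e9)


def tops_per_watt(peak_tops: float, tdp_watts: float) -> float:
    """Peak throughput divided by board TDP."""
    return peak_tops / tdp_watts


for precision, tops in PEAK_TOPS.items():
    intensity = ops_per_byte(tops, HBM_BANDWIDTH_GBPS)
    efficiency = tops_per_watt(tops, TDP_WATTS)
    print(f"{precision}: {intensity:.2f} ops/byte, {efficiency:.2f} TOPS/W")
```

Under these peak figures an INT8 kernel needs roughly 500 operations per byte of HBM traffic to be compute-bound, which is one reason on-chip memory matters for operators with low data reuse.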





# The overview of Baidu Kunlun architecture



- XPU v1, FPGA-based: presented at Hot Chips 2017
- Customized logic for tensor and vector
- Tiny cores for scalar



- XPU v2
- With the same design methodology
- More powerful than FPGA version



# The overview of Baidu Kunlun architecture



- Two units, each with
  - 8 GB HBM, 256 GB/s
  - 16 MB on-chip memory
  - 4 XPU-SDNN and 4 XPU-Cluster
- XPU-SDNN
  - Software-defined Neural Network engine
  - Aims at tensor and vector operations
- XPU-Cluster
  - Aims at scalar and vector operations
  - With SIMD instructions
  - 16 tiny cores per cluster
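The per-unit figures can be rolled up to chip totals and compared with the board specification. The sketch below is only a bookkeeping model; the class and field names are illustrative and not part of any Kunlun tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One of the two units on the chip, using the per-unit figures from the slide."""
    hbm_gb: int = 8
    hbm_bandwidth_gbps: int = 256
    on_chip_memory_mb: int = 16
    sdnn_engines: int = 4
    clusters: int = 4
    cores_per_cluster: int = 16


def chip_totals(units):
    """Sum per-unit resources across all units."""
    return {
        "hbm_gb": sum(u.hbm_gb for u in units),
        "hbm_bandwidth_gbps": sum(u.hbm_bandwidth_gbps for u in units),
        "on_chip_memory_mb": sum(u.on_chip_memory_mb for u in units),
        "sdnn_engines": sum(u.sdnn_engines for u in units),
        "clusters": sum(u.clusters for u in units),
        "tiny_cores": sum(u.clusters * u.cores_per_cluster for u in units),
    }


totals = chip_totals([Unit(), Unit()])
for name, value in totals.items():
    print(f"{name}: {value}")
```

The totals (16 GB HBM at 512 GB/s) agree with the K200 board figures; the sums of 8 XPU-SDNN engines, 8 clusters and 128 tiny cores follow from the per-unit counts.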



# The overview of Baidu Kunlun software stack



- Support multiple frameworks with a graph compiler
  - PaddlePaddle, TensorFlow, PyTorch
- Support new operators through user-written kernels
  - XPU C/C++ programming language
- Deep learning library
  - APIs for common operators used in deep learning networks
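The split between library operators and user-written kernels can be pictured as an operator lookup with a fallback. The sketch below is a plain-Python analogy only: `LIBRARY_OPS`, `USER_KERNELS`, `register_kernel` and `run_graph` are hypothetical names invented for illustration and do not reflect the actual Kunlun graph compiler, library, or XPU C/C++ interfaces.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a deep learning library of common operators.
LIBRARY_OPS: Dict[str, Callable[[List[float]], List[float]]] = {
    "relu": lambda xs: [max(0.0, x) for x in xs],
    "neg": lambda xs: [-x for x in xs],
}

# Hypothetical registry for operators the library does not provide.
USER_KERNELS: Dict[str, Callable[[List[float]], List[float]]] = {}


def register_kernel(name: str):
    """Decorator that adds a user-written kernel under the given operator name."""
    def decorator(fn):
        USER_KERNELS[name] = fn
        return fn
    return decorator


def resolve(op_name: str):
    """Prefer the library implementation, then fall back to user kernels."""
    if op_name in LIBRARY_OPS:
        return LIBRARY_OPS[op_name]
    if op_name in USER_KERNELS:
        return USER_KERNELS[op_name]
    raise KeyError(f"no implementation for operator '{op_name}'")


@register_kernel("leaky_relu")
def leaky_relu(xs: List[float], slope: float = 0.1) -> List[float]:
    # A "new" operator supplied by the user rather than the library.
    return [x if x > 0 else slope * x for x in xs]


def run_graph(graph: List[str], xs: List[float]) -> List[float]:
    """Run a linear chain of unary operators in order."""
    for op_name in graph:
        xs = resolve(op_name)(xs)
    return xs


result = run_graph(["neg", "leaky_relu"], [1.0, -2.0])
print(result)
```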



#### Inference performance – micro benchmark





### Inference performance – YoloV3

QPS: queries per second



- YOLOv3 with Darknet-53, input size 608
- Baidu Kunlun: INT16; T4: TensorRT FP16. Both match FP32 accuracy
- TensorRT INT8 accuracy is 5% to 8% lower than FP32, so FP16/INT16 is used for the benchmark
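To illustrate why a 16-bit integer format can track FP32 more closely than an 8-bit one, the sketch below applies generic symmetric linear quantization to random values and compares the worst-case round-trip error. This is a textbook scheme chosen for illustration; it is not the quantization method used by Kunlun or TensorRT, and it does not reproduce the accuracy figures quoted above.

```python
import random


def quantize_dequantize(values, bits):
    """Symmetric linear quantization to signed `bits`-bit integers and back."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # integer code
        restored.append(q * scale)                   # back to real values
    return restored


def max_abs_error(values, bits):
    """Largest absolute difference after a quantize/dequantize round trip."""
    restored = quantize_dequantize(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]
err8 = max_abs_error(weights, 8)
err16 = max_abs_error(weights, 16)
print(f"INT8 max error:  {err8:.6f}")
print(f"INT16 max error: {err16:.8f}")
```

With the same value range, the INT16 step size is about 258 times smaller than the INT8 step size (32767 versus 127 levels per side), so the rounding error shrinks by a similar factor.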



## Inference performance – BERT




- Bert\_Base\_Uncased: 12 layers, heads\_num = 12, hidden\_size = 768, sequence length = 128
- GPU: TensorRT FP16; Kunlun: INT16
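For a sense of the work behind each query, the sketch below estimates the matrix-multiply operations in one forward pass of this configuration. It assumes the standard BERT-Base feed-forward width of 4 x hidden\_size (3072), which is not stated on the slide, counts a multiply-add as two operations, and ignores embeddings, softmax, layer normalization and biases. The head count does not change the total because heads x head size equals hidden\_size.

```python
def encoder_matmul_ops(layers: int, hidden: int, seq_len: int, ffn_mult: int = 4) -> int:
    """Approximate matmul operations for one sequence through a Transformer encoder."""
    # Q, K, V and output projections: four (seq_len x hidden) @ (hidden x hidden) products.
    projections = 4 * 2 * seq_len * hidden * hidden
    # Attention scores (Q @ K^T) and weighted sum (scores @ V), summed over all heads.
    attention = 2 * 2 * seq_len * seq_len * hidden
    # Feed-forward network: hidden -> ffn_mult * hidden -> hidden (assumed width).
    ffn = 2 * 2 * seq_len * hidden * (ffn_mult * hidden)
    return layers * (projections + attention + ffn)


total_ops = encoder_matmul_ops(layers=12, hidden=768, seq_len=128)
print(f"{total_ops / 1e9:.1f} G operations per sequence")
```

Under these assumptions a single 128-token query costs about 22.3 G operations, so peak throughput alone does not determine QPS; memory traffic and utilization also matter.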



### Inference performance – real models in search engine



Note: model1 and model3 are NLP models; model2 is a vision model.



## Inference performance – customized MaskRCNN




- GPU: CUDA capability 7.5, driver API version 10.1, runtime API version 10.0, cuDNN version 7.5
- Input size: 920x1120



- K200 was used in a customized machine for smart industry
- Running a series of models, including MaskRCNN

## Conclusion

- Baidu Kunlun is an AI processor for diversified workloads
  - 256 TOPS INT8 and 64 TOPS INT16/FP16
  - 512 GB/s memory bandwidth
  - Samsung Foundry 14 nm process, TDP 150 W
- Proven in real applications
  - Large collection of models: NLP, vision, speech, etc.
  - Wide-ranging scenarios from data center to big edge
- It is available now!
  - Can be accessed via Baidu Cloud

