

[12 Nov 2020] • [AoE]



**PMBS20 Workshop** 





## Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX

Christie Alappat, Jan Laukemann, Thomas Gruber, Georg Hager, Gerhard Wellein, Nils Meyer, Tilo Wettig



Universität Regensburg



### NEWS

**A64FX** 

# Japan Captures TOP500 Crown with Arm-Powered Supercomputer

June 22, 2020

FRANKFURT, Germany; BERKELEY, Calif.; and KNOXVILLE, Tenn.—The 55th edition of the TOP500 saw some significant additions to the list, spearheaded by a new number one system from Japan. The latest rankings also reflect a steady growth in aggregate performance and power efficiency.



The new top system, Fugaku, turned in a High Performance Linpack (HPL) result of 415.5 petaflops, besting the now second-place Summit system by a factor of 2.8x. Fugaku is powered by Fujitsu's 48-core A64FX SoC, becoming the first number one system on the list to be powered by ARM processors. In single or further reduced precision, which are often used in machine learning and AI applications, Fugaku's peak performance is over 1,000 petaflops (1 exaflops). The new system is installed at RIKEN Center for Computational Science (R-CCS) in Kobe, Japan.

#### Source : https://www.top500.org/news/japan-captures-top500-crown-arm-powered-supercomputer/








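[Figure: A64FX node layout. One core memory group (CMG) contains 12 cores (P), each with a private L1D cache, sharing an L2 cache that connects through the memory interface to main memory.]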

#### 1 CMG









#### Core

- Clock : 1.8 GHz
- Instruction set : Armv8.2-A+SVE
- Maximum VL : 512 bit (8 double)

For example, with GCC use: `-msve-vector-bits=512 -march=armv8.2-a+sve`
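The core parameters above determine how many doubles fit in one vector and, together with the number of FMA pipelines, a rough per-core peak. The short sketch below does that arithmetic. The count of two FMA pipelines per core is an assumption (it is not stated in the list above), and the result is a theoretical upper bound rather than a measured value.

```python
CLOCK_GHZ = 1.8   # clock frequency from the core specification
VL_BITS = 512     # maximum SVE vector length in bits
FMA_PIPES = 2     # assumption: two FMA-capable pipelines per core

# Number of double-precision (64-bit) elements per vector register.
doubles_per_vector = VL_BITS // 64

# One FMA counts as two flops (multiply and add) per vector lane.
peak_gflops_per_core = CLOCK_GHZ * FMA_PIPES * doubles_per_vector * 2

print(doubles_per_vector)
print(round(peak_gflops_per_core, 1))
```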











#### L1D cache

- Size : 64 KiB
- Topology : Private cache
- Cache line size : 256 bytes










#### L2 cache

- Size : 8 MiB
- Topology : Shared within 1 CMG
- Cache line size : 256 bytes










#### Main memory

- Size : 4 x 8 GiB
- Type : HBM2



#### **Motivation**





[Figure: measured bandwidth in Gbyte/s.]



Clear memory bandwidth saturation for STREAM TRIAD (a[i] = b[i] + s\*c[i]).

But why not for SUM (s += a[i]) and SpMV (b = Ax)?
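One way to reason about this question is a naive scaling model: bandwidth grows linearly with the number of active cores until it hits the saturation limit. The sketch below is an illustration under that assumption, not the analysis from the talk. The function names and all bandwidth numbers are hypothetical. It shows that a kernel with low single-core bandwidth may need more cores than the 12 available in one CMG.

```python
def saturated_bandwidth(n_cores, b_single, b_sat):
    """Bandwidth with n_cores active: linear scaling, capped at b_sat."""
    return min(n_cores * b_single, b_sat)

def cores_to_saturate(b_single, b_sat):
    """Smallest core count whose linear scaling reaches b_sat."""
    n = 1
    while n * b_single < b_sat:
        n += 1
    return n

# Hypothetical numbers in Gbyte/s, for illustration only.
B_SAT = 200.0
print(cores_to_saturate(50.0, B_SAT))  # kernel with high single-core bandwidth
print(cores_to_saturate(15.0, B_SAT))  # kernel with low single-core bandwidth
```

With these made-up values, the first kernel saturates with 4 cores while the second would need 14.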






Thread pinning : Compact









The Execution-Cache-Memory (ECM) model helps us to understand and analyze single-core performance.

It has 3 major components:

1) In-core execution
2) Data transfer through the memory hierarchy
3) Overlap hypothesis: can these transfers be overlapped or not?



#### ECM model → In-core





[Figure: A64FX core block diagram. The L1I cache (64 KiB, 4-way) feeds the instruction buffer (48 entries, 8x6) at 32 bytes/cy, followed by a 4-way decode stage. The reservation stations RSE0, RSE1, RSA0, RSA1, and RSBR issue µOPs to the execution ports FLA, PR, EXA, FLB, EXB, EAGA, EAGB, and BR. Register files: GPR 32 (96 physical), FPR 32 (128 physical), PPR 16 (48 physical). The data paths to the L1D cache are 64 B/cy each.]

<sup>1</sup>horizontal recursive add


[Figure: reservation stations (RSE0 and RSE1: 20 entries each; RSA0 and RSA1: 10 entries each; RSBR: 19 entries) and execution ports (FLA, PR, EXA, FLB, EXB, EAGA, EAGB, BR), annotated for the TRIAD loop: FMA 1 cy; LD, LD, ST 2 cy.]



a[i] = b[i] + s \* c[i]

| .L18:                             |
|-----------------------------------|
| ld1d z4.d, p5/z, [x21, x9, lsl 3] |
| ld1d z5.d, p5/z, [x20, x9, lsl 3] |
| fmad z5.d, p5/m, z2.d, z4.d       |
| st1d z5.d, p5, [x19, x9, lsl 3]   |
| add x8, x9, 8                     |
| whilelo p5.d, w8, w7              |
| b.any .L18                        |
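The loop body above contains two vector loads, one FMA, and one vector store per 8 iterations. The sketch below estimates the register-to-L1 transfer time for one such loop iteration. It assumes 128 B/cy for loads and 64 B/cy for stores between L1 and registers, and it assumes that loads and stores do not overlap. Both are assumptions made for this illustration.

```python
VECTOR_BYTES = 64  # one 512-bit SVE register

# Instruction counts per loop iteration, read off the listing above.
loads, stores = 2, 1

# Assumed L1 <-> register bandwidths in bytes per cycle.
L1_LOAD_BPC = 128
L1_STORE_BPC = 64

t_load = loads * VECTOR_BYTES / L1_LOAD_BPC     # cycles spent on loads
t_store = stores * VECTOR_BYTES / L1_STORE_BPC  # cycles spent on stores
t_l1_reg = t_load + t_store  # assumption: loads and stores do not overlap

# Each vector iteration processes 8 doubles, i.e. 8 scalar loop iterations.
scalar_iterations = VECTOR_BYTES // 8
print(t_l1_reg, t_l1_reg / scalar_iterations)
```

Under these assumptions the data movement costs 2 cycles per vector iteration, or 0.25 cycles per scalar iteration.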





#### ECM model → Memory hierarchy






#### Machine model FX700














#### ECM model → Memory hierarchy + In-core





Machine model **FX700**, bandwidths between adjacent levels:

- Registers ↔ L1: 64 B/cy and 128 B/cy
- L1 ↔ L2: 64 B/cy and 32 B/cy
- L2 ↔ MEM: 117 B/cy and 64 B/cy

Application model: TRIAD a[i] = b[i] + s\*c[i]
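Combining the machine model with the application model gives a transfer time for every link in the hierarchy. The sketch below computes these times for one cache line of TRIAD work (32 iterations). Two things are assumptions of this illustration: in each bandwidth pair, the larger value is taken as the load direction and the smaller as the store direction, and no write-allocate transfer is counted for the store stream.

```python
CACHE_LINE = 256  # bytes, from the cache specification

# (load B/cy, store B/cy) per link; the load/store assignment is an assumption.
machine = {
    "L1-Reg": (128, 64),
    "L2-L1": (64, 32),
    "MEM-L2": (117, 64),
}

# TRIAD traffic per cache line of work (32 iterations):
# two lines loaded (b, c), one line stored (a), no write-allocate assumed.
load_bytes, store_bytes = 2 * CACHE_LINE, 1 * CACHE_LINE

def transfer_cycles(machine, load_bytes, store_bytes):
    """Return {link: (load cycles, store cycles)} for the given traffic."""
    return {link: (load_bytes / ld, store_bytes / st)
            for link, (ld, st) in machine.items()}

cycles = transfer_cycles(machine, load_bytes, store_bytes)
for link, (t_ld, t_st) in cycles.items():
    print(f"{link}: load {t_ld:.1f} cy, store {t_st:.1f} cy")
```

How these per-link times combine into a single runtime prediction depends on the overlap hypothesis.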



ECM prediction TRIAD on FX700





How do these boxes overlap?






- Hypothesis 1: No overlap
- Hypothesis 2: Full overlap
- Hypothesis 3: Full overlap + half-duplex
- Hypothesis 4: L1L2 overlap + half-duplex at MEM
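The four hypotheses can be expressed as different ways of combining per-link transfer times. The sketch below is one possible reading, not the formulation used in the talk. It assumes that a half-duplex link adds its load and store times while a full-duplex link takes the maximum, and that overlapping links contribute their maximum while the others add up. The function names, the interpretation of hypotheses 3 and 4, and the cycle counts are all assumptions.

```python
def link_time(t_load, t_store, half_duplex):
    """Time on one link: loads and stores add up if the link is half-duplex."""
    return t_load + t_store if half_duplex else max(t_load, t_store)

def ecm_predict(times, overlapping, half_duplex=()):
    """times: {link: (t_load, t_store)} in cycles.

    Links in `overlapping` run concurrently with each other;
    all remaining links serialize and are added on top.
    """
    t = {k: link_time(ld, st, k in half_duplex) for k, (ld, st) in times.items()}
    t_overlap = max((t[k] for k in overlapping), default=0.0)
    t_serial = sum(v for k, v in t.items() if k not in overlapping)
    return t_overlap + t_serial

# Hypothetical per-link cycle counts, for illustration only.
times = {"L1-Reg": (2.0, 1.0), "L2-L1": (4.0, 2.0), "MEM-L2": (5.0, 3.0)}
links = set(times)

h1 = ecm_predict(times, overlapping=())                        # no overlap
h2 = ecm_predict(times, overlapping=links)                     # full overlap
h3 = ecm_predict(times, overlapping=links, half_duplex=links)  # full overlap + half-duplex
h4 = ecm_predict(times, {"L1-Reg", "L2-L1"}, half_duplex={"MEM-L2"})
print(h1, h2, h3, h4)
```

The same input data leads to four clearly different predictions, which is what makes a comparison with measurements meaningful.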






#### ECM model → Overlap hypothesis





There are numerous combinations. How do we find the correct one?

Compare measurements with predictions.



A systematic way of identifying the overlap hypothesis is presented in: Hofmann et al., 2020, Bridging The Architecture Gap: Abstracting Performance-relevant Properties Of Modern Server Processors, https://doi.org/10.14529/jsfi200204
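Comparing measurements with predictions can be automated in a simple way: compute the prediction of every candidate hypothesis for several kernels and keep the one with the smallest average deviation. The sketch below illustrates this idea only. It is not the procedure from Hofmann et al., and the hypothesis names, kernel names, and cycle counts are hypothetical.

```python
def relative_error(predicted, measured):
    """Relative deviation of a prediction from the measurement."""
    return abs(predicted - measured) / measured

def best_hypothesis(predictions, measurements):
    """predictions: {hypothesis: {kernel: cycles}}, measurements: {kernel: cycles}.

    Returns the hypothesis with the smallest mean relative error.
    """
    def mean_error(per_kernel):
        errs = [relative_error(per_kernel[k], m) for k, m in measurements.items()]
        return sum(errs) / len(errs)
    return min(predictions, key=lambda h: mean_error(predictions[h]))

# Hypothetical cycle counts, for illustration only.
measurements = {"triad": 12.5, "sum": 4.2}
predictions = {
    "no overlap": {"triad": 17.0, "sum": 6.0},
    "full overlap": {"triad": 8.0, "sum": 3.0},
    "partial overlap": {"triad": 12.0, "sum": 4.0},
}
print(best_hypothesis(predictions, measurements))
```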

#### The best hypothesis for FX700



#### ECM model → Insights





#### Unrolling plays an important role



Unrolling factor=8
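Why unrolling matters can be seen with a reduction such as SUM: each add depends on the previous one, so a single accumulator is limited by the floating-point latency. Several independent accumulators break this dependency chain. The sketch below models this effect and shows the unrolled summation pattern. The latency of 9 cycles and the two floating-point pipelines are assumptions of this illustration, and the function names are hypothetical.

```python
def cycles_per_vector_op(latency, pipes, unroll):
    """Cycles per vector FP operation with `unroll` independent accumulators.

    The dependency chain limits each accumulator to one operation per
    `latency` cycles; the pipelines limit throughput to `pipes` per cycle.
    """
    return max(latency / unroll, 1.0 / pipes)

def sum_unrolled(a, unroll):
    """Sum with `unroll` independent partial sums, combined at the end."""
    partial = [0.0] * unroll
    for i, v in enumerate(a):
        partial[i % unroll] += v
    return sum(partial)

LATENCY, PIPES = 9, 2  # assumed values
for u in (1, 8, 18):
    print(u, cycles_per_vector_op(LATENCY, PIPES, u))
print(sum_unrolled([float(i) for i in range(100)], 8))
```

In this simple model an unrolling factor of 8 already reduces the cost from 9 cycles to about 1.1 cycles per vector operation.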









Sparse Matrix-Vector Multiplication (SpMV): b=Ax



In Compressed Row Storage (CRS) format

```
for i = 0:nrows-1                        // long outer loop
    for j = row_ptr[i]:row_ptr[i+1]-1    // short inner loop
        b[i] = b[i] + A[j] * x[col_idx[j]]
```
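A runnable version of the CRS kernel above can be sketched as follows, assuming that `b` starts at zero. The helper `vector_iterations` is a hypothetical addition that shows how few inner-loop trips a short row yields when 8 nonzeros are processed per vector instruction. The small test matrix is made up for illustration.

```python
import math

def spmv_crs(row_ptr, col_idx, val, x):
    """Compute b = A x for a matrix stored in CRS format."""
    nrows = len(row_ptr) - 1
    b = [0.0] * nrows
    for i in range(nrows):                           # long outer loop over rows
        s = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):  # short inner loop over nonzeros
            s += val[j] * x[col_idx[j]]
        b[i] = s
    return b

def vector_iterations(nnz_in_row, lanes=8):
    """Inner-loop trips when `lanes` nonzeros are handled per vector instruction."""
    return math.ceil(nnz_in_row / lanes)

# 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 5, 6]]
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
val = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(spmv_crs(row_ptr, col_idx, val, [1.0, 2.0, 3.0]))
print(vector_iterations(27))  # e.g. a row of a 27-point stencil matrix
```

A row with 27 nonzeros needs only 4 vector iterations, so loop overhead and the final reduction weigh heavily.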














Assembly of the short inner-loop

| .L6:                                |
|-------------------------------------|
| ld1sw z0.d, p0/z, [x17, x20, lsl 2] |
| ld1d z2.d, p0/z, [x18, x20, lsl 3]  |
| ld1d z3.d, p0/z, [x30, z0.d, lsl 3] |
| add x20, x20, 8                     |
| fmla z1.d, p0/m, z3.d, z2.d         |
| whilelo p0.d, x20, x14              |
| b.any .L6                           |
| faddv d4, p1, z1.d                  |








#### SpMV → SELL-C-σ





A vector-friendly SpMV data format and kernel is required for A64FX → SELL-C-σ<sup>1</sup>

#### Benefits:

- Vectorization and unrolling along the chunk size (C) → long, tunable loop
- No costly horizontal add (faddv)

<sup>1</sup>M. Kreutzer et al., A Unified Sparse Matrix Data Format For Efficient General Sparse Matrix-vector Multiplication On Modern Processors With Wide SIMD Units, SIAM SISC 2014, DOI: 10.1137/130930352
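The idea behind the format can be sketched in a few lines: rows are sorted by length within windows of σ rows, grouped into chunks of C rows, padded to the longest row of each chunk, and stored column-major inside the chunk. The code below is a minimal, unoptimized illustration of this layout and the matching kernel. It is not the implementation used for the measurements, and all function and variable names are hypothetical.

```python
def crs_to_sell(row_ptr, col_idx, val, C, sigma):
    """Convert CRS arrays to a simple SELL-C-sigma layout."""
    nrows = len(row_ptr) - 1
    length = lambda r: row_ptr[r + 1] - row_ptr[r]
    perm = []
    for start in range(0, nrows, sigma):   # sort rows within each sigma window
        window = list(range(start, min(start + sigma, nrows)))
        window.sort(key=length, reverse=True)
        perm.extend(window)
    chunk_ptr, chunk_len, sell_val, sell_col = [0], [], [], []
    for c in range((nrows + C - 1) // C):
        rows = perm[c * C:(c + 1) * C]
        width = max(length(r) for r in rows)
        chunk_len.append(width)
        for j in range(width):             # column-major order inside the chunk
            for lane in range(C):
                if lane < len(rows) and j < length(rows[lane]):
                    k = row_ptr[rows[lane]] + j
                    sell_val.append(val[k])
                    sell_col.append(col_idx[k])
                else:                      # zero padding
                    sell_val.append(0.0)
                    sell_col.append(0)
        chunk_ptr.append(len(sell_val))
    return perm, chunk_ptr, chunk_len, sell_val, sell_col

def spmv_sell(perm, chunk_ptr, chunk_len, sell_val, sell_col, x, C):
    """Compute b = A x; the lane loop corresponds to the SIMD lanes."""
    nrows = len(perm)
    b = [0.0] * nrows
    for c in range(len(chunk_len)):
        acc = [0.0] * C                    # one accumulator per lane, no horizontal add
        for j in range(chunk_len[c]):
            for lane in range(C):
                k = chunk_ptr[c] + j * C + lane
                acc[lane] += sell_val[k] * x[sell_col[k]]
        for lane in range(C):
            row = c * C + lane
            if row < nrows:
                b[perm[row]] = acc[lane]   # undo the row permutation
    return b

# 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 5, 6]]
row_ptr, col_idx = [0, 2, 3, 6], [0, 2, 1, 0, 1, 2]
val = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sell = crs_to_sell(row_ptr, col_idx, val, C=2, sigma=4)
print(spmv_sell(*sell, [1.0, 2.0, 3.0], 2))
```

Each lane keeps its own accumulator for one row, so no horizontal reduction is needed at the end of a row. The price is the zero padding, which sorting within σ rows keeps small.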








= 3.4 Gflop/s ≈ 22 Gbyte/s





Can we saturate now?

For the HPCG matrix (dimension 128<sup>3</sup>): yes, but it needs almost all cores.








Can we saturate now ?




















Matrices from SuiteSparse Matrix Collection : https://suitesparse-collection-website.herokuapp.com



#### Conclusion





- High single-core performance is crucial.
- The ECM model was established and used to analyze single-core performance.
- The partially overlapping memory hierarchy allows for high single-core memory bandwidth.
- Proper single-core optimizations are needed to hide the long floating-point latency and inefficiencies in out-of-order (OoO) execution.
- For SpMV we were able to saturate the bandwidth with the SELL-C-σ format.




# Thank you

# Questions ?

