Why Skinny & Tall Matrix Multiplication Is Memory-Bound

Jun 6, 2021

The more Skinny & Tall Matrix Multiplications a deep learning workload contains, the more they affect the performance of the algorithm.

GEMM (GEneral Matrix Multiplication)

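Linear algebra kernels are widely used in scientific simulation, big data analytics, and machine learning. Among them, GEMM is the most central operation. Depending on the shape of its inputs, a GEMM operation can be classified as either a computation-bound or a memory-bound operation.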

Computation-bound operation: the application's performance is limited by computation power
Memory-bound operation: the application's performance is limited by memory bandwidth
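
One common way to make these two definitions quantitative is the roofline model. It is not part of the original post and is used here only as an illustration; the symbols P_peak (peak compute throughput), B_mem (memory bandwidth), and I (arithmetic intensity) are names introduced for this sketch.

```latex
\[
P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \cdot B_{\text{mem}}\right),
\qquad
I = \frac{\text{FLOPs performed}}{\text{bytes moved to and from memory}}
\]
```

Under this model, an operation is memory-bound when I is smaller than P_peak / B_mem, and computation-bound otherwise.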

GEMM is generally a computation-bound operation. Under the following two conditions, however, it becomes bandwidth-bound and is classified as a memory-bound operation, and as a result CPU/GPU utilization is low.

(1) The input is a tall & skinny Matrix Multiplication (n >> k, n > 10000, k < 100)
(2) The large input matrix resides in main memory
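
A rough way to see the effect of condition (1) is to compare FLOPs per byte for a square GEMM and a tall & skinny one. The sketch below assumes, purely for illustration, a large n x n matrix multiplied by an n x k matrix, double-precision elements, and a best case in which every matrix crosses the memory bus exactly once. The function names and the machine numbers (PEAK_FLOPS, MEM_BANDWIDTH) are hypothetical and do not describe any specific CPU or GPU.

```python
def gemm_arithmetic_intensity(m, k, n, bytes_per_element=8):
    """FLOPs per byte for C(m x n) = A(m x k) @ B(k x n).

    Assumes each matrix is read or written exactly once, which gives an
    upper bound on the intensity a real kernel can reach.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical machine, chosen only to have concrete numbers.
PEAK_FLOPS = 7.0e12      # FLOP/s
MEM_BANDWIDTH = 900.0e9  # bytes/s
machine_balance = PEAK_FLOPS / MEM_BANDWIDTH  # FLOP/byte needed to saturate compute


def is_memory_bound(intensity, balance=machine_balance):
    """An operation is memory-bound when it offers fewer FLOPs per byte
    than the machine needs to keep its compute units busy."""
    return intensity < balance


square = gemm_arithmetic_intensity(10_000, 10_000, 10_000)
tall_skinny = gemm_arithmetic_intensity(10_000, 10_000, 16)  # n = 10000, k = 16

print(f"machine balance: {machine_balance:.2f} FLOP/byte")
print(f"square GEMM:     {square:.2f} FLOP/byte, memory-bound: {is_memory_bound(square)}")
print(f"tall & skinny:   {tall_skinny:.2f} FLOP/byte, memory-bound: {is_memory_bound(tall_skinny)}")
```

Under these assumptions the intensity of the tall & skinny case is roughly k/4 FLOP/byte, so it grows only with the small dimension k, while the square case grows with n.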

Data is generally stored in memory in row-based (row-major) order. A tall & skinny matrix, however, has low locality, so the probability of finding the needed data in the cache drops. This is why TSMM (Tall & Skinny Matrix Multiplication) is a memory-bound operation.

TSMM(Tall & Skinny Matrix Multiplication)

In datacenters, DL inference tasks often satisfy both of the following conditions, so their GEMM operations are frequently TSMMs.

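DL inference queries have strict latency constraints that require small batches. The corresponding GEMM operation in an FC layer therefore consists of multiplying a large weight matrix by a small input matrix.
MLP weights are found only in main memory, because 1) the total size of the MLP parameters exceeds the cache capacity, or 2) multiple models are co-located on a single node. (Co-locating multiple models on a single node is common practice, as it improves system efficiency and reduces multi-model query latency.)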
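
The two conditions above can be illustrated with a back-of-the-envelope calculation for a single FC layer. This is a minimal sketch under stated assumptions: fp32 elements, a hypothetical 4096 x 4096 layer, a hypothetical 32 MiB last-level cache (CACHE_BYTES), and weights that are streamed from main memory once per query batch. The function name fc_layer_stats is made up for this example.

```python
def fc_layer_stats(batch, in_features, out_features, bytes_per_element=4):
    """Return (weight_bytes, FLOPs per byte) for y = x @ W with
    x of shape (batch, in_features) and W of shape (in_features, out_features)."""
    weight_bytes = bytes_per_element * in_features * out_features
    activation_bytes = bytes_per_element * batch * (in_features + out_features)
    flops = 2 * batch * in_features * out_features
    return weight_bytes, flops / (weight_bytes + activation_bytes)


CACHE_BYTES = 32 * 1024**2  # hypothetical last-level cache size

for batch in (1, 8, 64, 512):
    weight_bytes, intensity = fc_layer_stats(batch, 4096, 4096)
    fits = weight_bytes <= CACHE_BYTES
    print(
        f"batch={batch:4d}  weights={weight_bytes / 1024**2:.0f} MiB  "
        f"fits_in_cache={fits}  intensity={intensity:.2f} FLOP/byte"
    )
```

With fp32 weights the intensity is roughly batch/2 FLOP/byte while the batch is small, so a latency-constrained query with a handful of inputs does very little work per byte of weights fetched, and the weights in this example do not fit in the assumed cache.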

References

[1] Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators

[2] Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs

[3] TSM2: Optimizing Tall-and-Skinny Matrix-Matrix Multiplication on GPUs

Written by daewoo kim
