GTC 2022 Session Highlights: Data Center (Networking, Virtualization, Cloud)

daewoo kim

13 min read · Apr 16, 2022

GTC 2022 featured roughly 1,000 sessions in total. This series summarizes the highlights by major topic. This second post covers Data Center: Networking, Virtualization, and Cloud.

1. LLM (Large Language Model) Training

2. Data Center: Networking/Virtualization/Cloud

3. HPC: Supercomputing

1. Accelerate your AI Journey on Google Cloud & NVIDIA

[Google Cloud's Competitive Strengths]

Google Cloud is a leader in AI infrastructure and Cloud AI Developer Services.

[Cloud AI & Industry Solution Portfolio]

[Google Cloud NVIDIA GPU Portfolio]

[A2 VM family with NVIDIA A100s]

Scale up: 16 A100-40GB GPUs via 2 NVLink inter-connected HGX boards
Workload optimized: NVLink 3.0 all-to-all GPU communication with peak bandwidth of 9.6 TBps, plus 100 Gbps networking with an optimized NCCL
Mega performance: 10 PFLOPS (FP16) or 20 POPS (INT8, when using NVIDIA's new sparsity feature) in a single VM
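The headline A2 numbers can be sanity-checked with simple arithmetic. The sketch below multiplies per-GPU peak figures by the 16 GPUs in one VM. The per-GPU values are assumptions taken from NVIDIA's public A100 datasheet (they were not stated in the session summary), and both the FP16 and INT8 values are the with-sparsity peaks.

```python
# Per-GPU peak figures for the A100 (assumption: from NVIDIA's public
# A100 datasheet, with structured sparsity enabled; not from the talk).
GPUS_PER_VM = 16
FP16_SPARSE_TFLOPS = 624       # FP16 Tensor Core peak, with sparsity
INT8_SPARSE_TOPS = 1248        # INT8 Tensor Core peak, with sparsity
NVLINK3_GBPS_PER_GPU = 600     # NVLink 3.0 total bandwidth per GPU, in GB/s


def vm_totals(gpus=GPUS_PER_VM):
    """Aggregate the per-GPU peaks over all GPUs in a single VM."""
    return {
        "fp16_pflops": gpus * FP16_SPARSE_TFLOPS / 1000,
        "int8_pops": gpus * INT8_SPARSE_TOPS / 1000,
        "nvlink_tbps": gpus * NVLINK3_GBPS_PER_GPU / 1000,
    }


totals = vm_totals()
for name, value in totals.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the totals come out at roughly 10 PFLOPS, 20 POPS, and 9.6 TBps, which lines up with the figures quoted on the slide.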

[Vertex AI]

A unified development and deployment platform for data science and ML
Increases the productivity of ML engineers and data scientists

[Why Google Kubernetes Engine for AI/ML?]

2. How To Modernize Your Enterprise Data Center To Be AI-Ready

[NVIDIA End-to-End AI SW Stack]

[NVIDIA AI Enterprise SW]

[NVIDIA RAPIDS]

RAPIDS runs end-to-end data science and analytics pipelines on GPUs.
RAPIDS uses NVIDIA CUDA primitives for low-level compute optimization, and exposes GPU parallelism and high-bandwidth memory through user-friendly interfaces such as Apache Spark or Dask.
With Spark or Dask, RAPIDS can scale out to multi-node, multi-GPU clusters to power big data processing.

[NVIDIA RAPIDS: Lightning-Fast End-To-End Performance]

16 A100s: more computing power than 100 CPU nodes
70x: faster performance than a similar CPU configuration
20x: more cost-effective than a similar CPU configuration

[NVIDIA TAO Toolkit]

[Triton Inference Server]

Inference on GPUs and CPUs
Multiple query types: real-time, offline batch, video/audio streaming, ensemble
Model Analyzer optimizes for application constraints
Distributed multi-GPU and multi-node inference
RAPIDS FIL (Forest Inference Library) backend for inference on tree-based models (e.g., XGBoost, scikit-learn random forest)

[Delivering AI Workloads with NVIDIA AI Enterprise]

[VMware Cloud Director 10.3.2]

[Recommended Accelerators For NVIDIA AI Enterprise]

3. Tuning Virtualized GPUs for Optimal Performance on ML/AI Workloads

[Scaling of Training Performance with NVLink Connected vGPUs]

Two GPUs connected with NVLink outperform four GPUs without NVLink

vGPUs perform on par with, or slightly better than, passthrough/bare metal

[vGPU vs. MIG with vGPU with A100]

(1) vGPU Only

Allows up to 10 VMs per GPU with vGPU
Memory is split equally among the vGPUs
Compute is shared on a time-sliced basis

(2) MIG with vGPU

Allows up to 7 VMs per GPU
Both memory and compute are partitioned among the vGPUs
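As a rough illustration of what the two modes mean for per-VM memory, the sketch below does the division for one GPU. It assumes a 40 GB A100 and relies on the A100 exposing 8 memory slices but only 7 compute slices under MIG; the function names are hypothetical, and real vGPU profiles come in fixed sizes rather than arbitrary fractions.

```python
A100_MEMORY_GB = 40       # assumption: the 40 GB A100 variant
MIG_MEMORY_SLICES = 8     # an A100 exposes 8 memory slices under MIG
MIG_MAX_INSTANCES = 7     # but only 7 compute slices, hence at most 7 VMs


def vgpu_only_memory_gb(num_vms, total_gb=A100_MEMORY_GB):
    """Time-sliced vGPU: memory is split equally, compute is shared."""
    if not 1 <= num_vms <= 10:
        raise ValueError("vGPU-only mode allows 1 to 10 VMs per GPU")
    return total_gb / num_vms


def mig_vgpu_memory_gb(total_gb=A100_MEMORY_GB):
    """MIG-backed vGPU, smallest profile: one memory slice per VM."""
    return total_gb / MIG_MEMORY_SLICES


print(vgpu_only_memory_gb(10))  # per-VM memory at the 10-VM maximum
print(mig_vgpu_memory_gb())     # per-VM memory for each of up to 7 VMs
```

With these assumptions, 10 time-sliced VMs get 4 GB each, while each MIG-backed VM gets a dedicated 5 GB slice together with its own compute slice.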

[vGPU vs. MIG with vGPU]

(1) Experimental setup

Case 1) MIG with vGPU: 1 to 7 VMs run concurrently. Each VM has a vGPU with the MIG 1-5c profile (1 compute slice, 5 GB memory)
Case 2) vGPU only: 1 to 7 VMs run concurrently. Each VM has a vGPU with the 5c profile (all compute shared on a best-effort basis, 5 GB memory)
Workload: Mask R-CNN (batch size = 2)

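(2) ML training performance comparison

Training time: MIG with vGPU is 4% to 96% better
Throughput: MIG with vGPU is 7% to 128% better
For light loads, increase the number of VMs per GPU to maximize ML training performance

(3) ML inference performance comparison

Inference time: MIG with vGPU is 1% to 119% better
Throughput: MIG with vGPU is up to 151% better
For light loads, increase the number of VMs per GPU to maximize ML inference performance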

[vGPU vs. MIG with vGPU: Sizing the ML Workload]

(1) Experimental setup (Light Load vs. Moderate Load vs. Heavy Load)

(2) ML training performance by workload

Light & Moderate Load: MIG is better
Heavy Load: vGPU is slightly better

(3) ML inference performance by workload

Light & Moderate Load: MIG is better
Heavy Load: vGPU is slightly better

[MIG or vGPU — Summary]

4. Running Cloud-native Apps in NVIDIA AI Enterprise

[NVIDIA End-to-End AI SW Suite]

[Data Pre-processing]

In a GPU ML pipeline, the data never leaves the GPU
cuDF offers a pandas-like API for data wrangling
Additional helper functions for tokenization on the GPU are 30x faster than preprocessing on the CPU, and scale to multiple GPUs with Dask

Static: NVTabular

Streaming: cuStreamz

[Model Training]

(1) Horovod & MPI

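Horovod is a distributed DL training framework for TensorFlow, Keras, PyTorch, and MXNet
Horovod's goal is to make distributed deep learning fast and easy to use
MPI (with NCCL) launches processes on remote systems that can communicate with each other over TCP sockets, or over InfiniBand using NCCL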

(2) Multi-GPU & Multi-node Training

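Magnum IO: removes I/O bottlenecks in AI training, HPC, and data science
NCCL: implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking
MOFED: the driver that enables InfiniBand and RoCE for multi-node communication
GDRDMA (GPUDirect RDMA): accesses the address space of a remote GPU without CPU involvement
GDS (GPUDirect Storage): reads data directly from storage into the GPU without CPU involvement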

(3) Kubernetes operator support for MPI

With the MPI Operator, all-reduce style distributed training can be run easily on Kubernetes
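To make "all-reduce style" concrete, here is a minimal pure-Python simulation of a ring all-reduce, the communication pattern commonly used for gradient summation in this kind of training. It is an illustrative sketch only: it does not use the Horovod, MPI, or NCCL APIs, the function name is hypothetical, and "workers" are simply lists held in one process.

```python
def ring_allreduce(worker_grads):
    """Sum equal-length vectors across workers using a simulated ring.

    Phase 1 (reduce-scatter) leaves each worker with one fully summed
    chunk; phase 2 (all-gather) circulates those chunks to everyone.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold vectors of equal length")

    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [(length * k) // n for k in range(n + 1)]
    data = [list(g) for g in worker_grads]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    def exchange(chunk_for_rank, combine):
        # Snapshot every outgoing message first so that all sends in a
        # step happen "simultaneously", as they would on a real ring.
        outgoing = []
        for rank in range(n):
            c = chunk_for_rank(rank)
            outgoing.append((c, [data[rank][i] for i in chunk(c)]))
        for rank, (c, values) in enumerate(outgoing):
            dst = (rank + 1) % n
            for i, v in zip(chunk(c), values):
                data[dst][i] = combine(data[dst][i], v)

    # Phase 1: reduce-scatter. Afterwards rank r owns the complete sum
    # for chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda r: (r - step) % n, lambda old, new: old + new)

    # Phase 2: all-gather. Completed chunks are forwarded around the ring
    # and overwrite the partial values.
    for step in range(n - 1):
        exchange(lambda r: (r + 1 - step) % n, lambda old, new: new)

    return data


grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
result = ring_allreduce(grads)
print(result[0])  # every worker ends with the same summed vector
```

Each worker sends and receives only one chunk per step, which is why the ring pattern keeps per-link traffic roughly constant as the number of workers grows.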

[Model Inference]

(1) Triton Inference Server

(2) Triton Inference Server on Kubernetes

Triton Inference Server can be deployed as a Kubernetes service inside the cluster

Latency vs Batching

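Triton supports batch inference by letting an individual inference request specify a batch of inputs
Inference for a batch of inputs is performed at the same time on the GPU, which can greatly increase inference throughput
In many use cases, individual inference requests are not batched, so they do not get the throughput benefit of batching
Triton Inference Server includes several scheduling and batching algorithms that support many model types and use cases
Latency and throughput requirements must be balanced, and the right values depend on the individual use case
Dynamic batching combines inference requests on the server so that batches are created dynamically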
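The latency/throughput trade-off behind dynamic batching can be sketched with a small simulation. This is not Triton's scheduler: the `Request` class and `dynamic_batches` function are hypothetical, and the two knobs (a maximum batch size and a maximum queue delay in microseconds) are only loosely modeled on the kind of settings a dynamic batcher exposes. A batch is dispatched when it is full, or when its oldest request would otherwise wait longer than the delay budget.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: int
    arrival_us: int  # arrival time in microseconds


def dynamic_batches(requests, max_batch_size, max_queue_delay_us):
    """Group single requests into batches on the server side."""
    batches = []
    current = []
    for req in sorted(requests, key=lambda r: r.arrival_us):
        # If the oldest queued request has exceeded its delay budget by
        # the time this one arrives, dispatch what is queued so far.
        if current and req.arrival_us - current[0].arrival_us > max_queue_delay_us:
            batches.append(current)
            current = []
        current.append(req)
        # A full batch is dispatched immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 10, 20, 500, 510, 520, 530, 540]
requests = [Request(i, t) for i, t in enumerate(arrivals)]
batches = dynamic_batches(requests, max_batch_size=4, max_queue_delay_us=100)
batch_ids = [[r.request_id for r in b] for b in batches]
print(batch_ids)
```

A larger delay budget tends to produce fuller batches and higher throughput at the cost of added latency for the first request in each batch; a smaller one does the opposite.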

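(3) Auto-scaling Triton Inference Server on Kubernetes

Typically, a DevOps administrator measures server load (requests per second), adds resources as the load grows, and scales them back when the load drops
Kubernetes automates this workflow with the Horizontal Pod Autoscaler (HPA)
The Kubernetes HPA automatically adjusts the number of Pods in a Deployment, ReplicationController, or ReplicaSet based on metrics such as CPU utilization
By feeding custom metrics such as GPU utilization or duty cycle to the HPA, Triton Inference Server Pods can be auto-scaled on demand based on those metrics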
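The scaling decision itself follows the rule given in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a small tolerance of 1. The sketch below applies that rule to an average GPU utilization metric. The function name, the replica bounds, and the example numbers are assumptions for illustration; the 0.1 tolerance is the documented default.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Apply the HPA scaling rule to a single averaged metric."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical values here).
    return max(min_replicas, min(max_replicas, desired))


# Average GPU utilization (%) across Triton Pods vs. a 60% target.
scale_up = desired_replicas(2, current_metric=90, target_metric=60)
scale_down = desired_replicas(4, current_metric=20, target_metric=60)
steady = desired_replicas(3, current_metric=63, target_metric=60)
print(scale_up, scale_down, steady)
```

In this example two Pods at 90% utilization scale up to three, four Pods at 20% scale down to two, and three Pods at 63% stay put because the ratio is inside the tolerance band.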

References

[1] Accelerate your AI Journey on Google Cloud & NVIDIA

Accelerate Your AI and HPC Journey on Google Cloud (Presented by Google Cloud) | NVIDIA On-Demand

At Google, our goal is to offer the best GPU-accelerated solutions that are easy to use, scale as needed, and fit your…

www.nvidia.com

[2] How To Modernize Your Enterprise Data Center To Be AI-Ready

How to Modernize your Enterprise Data Center to be AI-Ready | NVIDIA On-Demand

While enterprise use of AI has grown by 270% over the past several years, many still struggle to make their AI…

www.nvidia.com

[3] Tuning Virtualized GPUs for Optimal Performance on ML/AI Workloads

Tuning Virtualized GPUs for Optimal Performance on ML/AI Workloads | NVIDIA On-Demand

VMware and NVIDIA have partnered to democratize AI for every enterprise by combining NVIDIA AI software and GPUs with…

www.nvidia.com

[4] Running Cloud-native Apps in NVIDIA AI Enterprise

Running Cloud-native Apps in NVIDIA AI Enterprise | NVIDIA On-Demand

This session continues the technical deep dive and will cover the various technological components for scaling training…
