GTC 2022 Session Highlights: LLM (Large Language Model) Training

daewoo kim

Apr 10, 2022

About 1,000 sessions were presented at GTC 2022. This series summarizes the highlights by major topic, starting with LLM training.

1. LLM (Large Language Model) Training

2.Data Center: Networking/Virtualization/Cloud

3. HPC: Supercomputing

1. How to Avoid the Staggering Cost of Training SOTA LLMs [S41904]

[Impact of High Cost of Training LLMs]

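Training large models is expensive, which has the following consequences.

Economic impact: training hundreds of billions of parameters costs millions of dollars
Environmental impact: CO2 emissions
Barrier to entry: startups and small universities cannot access SOTA models
Resource contention on large clusters: many teams and projects share compute clusters under limited compute budgets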

[Why LLMs Are Cheaper to Train]

(1) Sample Efficiency

(2) Compute Efficiency

(3) Significantly Better Few-shot Performance

(4) Smaller Quantization & Pruning Errors

[What Drives the Cost of Training LLMs]

Dataset size (# of words)
Model size (# of parameters)
Training volume (# of tokens processed during pre-training)
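
As a rough illustration of how model size and training volume combine, a widely used rule of thumb (not stated in the session) puts the compute for training a dense transformer at about 6 FLOPs per parameter per token. The sketch below applies that rule. The GPU count and sustained per-GPU throughput are hypothetical values chosen only for the example.

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute with the common 6 * N * D rule of thumb."""
    return 6.0 * num_params * num_tokens


def training_days(num_params, num_tokens, num_gpus, flops_per_gpu):
    """Wall-clock days, assuming every GPU sustains `flops_per_gpu` (an assumption)."""
    seconds = training_flops(num_params, num_tokens) / (num_gpus * flops_per_gpu)
    return seconds / 86_400


# Hypothetical inputs: 175B parameters, 300B tokens,
# 1024 GPUs at a sustained 150 TFLOP/s each.
flops = training_flops(175e9, 300e9)
days = training_days(175e9, 300e9, num_gpus=1024, flops_per_gpu=150e12)
print(f"{flops:.2e} FLOPs, about {days:.1f} days")
```

Because the estimate is linear in both parameters and tokens, doubling either one doubles the compute bill, which is why both appear in the list of cost drivers.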

[How to Reduce the Cost of Training LLMs]

(1) Improvements in training techniques

Algorithmic improvements: DeepSpeed-MoE for NLG, Primer, etc.
Better optimizers: ZeRO, ZeRO-Infinity
Better memory utilization: gradient and activation checkpointing, etc.
Better parallelism and distributed training: Megatron-Turing NLG 530B
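
To make the ZeRO entry concrete, here is a minimal sketch of the per-GPU memory accounting described in the ZeRO paper (not in the session itself). It assumes mixed-precision Adam, with 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer states. Activations and temporary buffers are left out.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Model-state memory per GPU for mixed-precision Adam under ZeRO stages 0-3."""
    weights = 2 * num_params   # fp16 parameters
    grads = 2 * num_params     # fp16 gradients
    optim = 12 * num_params    # fp32 master weights, momentum, variance
    if stage >= 1:
        optim /= num_gpus      # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus      # stage 2 also partitions gradients
    if stage >= 3:
        weights /= num_gpus    # stage 3 also partitions parameters
    return weights + grads + optim


# Example values: a 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```

The function only covers model states, so real memory use will be higher once activations are counted. That gap is what activation checkpointing targets.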

(2) HW innovation

Tensor Cores with NVLink & NVSwitch, HDR InfiniBand Networking
Targeting Structured Sparsity present in the networks

(3) Framework-level optimization

PyTorch DDP(DistributedDataParallel)

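However, current training methods do not address the inefficiencies in the training process itself. These come from the slow, manual hyperparameter searches and the multiple exploratory runs needed to meet target performance and scaling efficiency.

[What Increases the Cost of Training LLMs?]

(1) Hidden costs of training LLMs

On top of direct costs such as data size, model size, and training volume, there is the cost of the experiments needed to arrive at the optimal model.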

(2) Multiple runs needed to train a meaningful model

A. Multiple runs to reach a stable hyperparameter configuration
B. Multiple runs to achieve high scaling efficiency during training
C. Multiple inference tests to achieve high throughput and low latency
D. Repeating steps A through C several times for each model size
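
The exploratory runs in A through D add up quickly. The sketch below simply counts them. The trial counts are purely hypothetical and do not come from the session.

```python
def exploratory_runs(hparam_trials, scaling_trials, inference_trials, model_sizes):
    """Total experiments if stages A-C are repeated for each model size (stage D)."""
    per_size = hparam_trials + scaling_trials + inference_trials
    return per_size * model_sizes


# Hypothetical counts: 10 hyperparameter runs, 6 scaling runs,
# 8 inference tests, repeated for 3 model sizes.
total = exploratory_runs(hparam_trials=10, scaling_trials=6,
                         inference_trials=8, model_sizes=3)
print(total)  # 72
```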

(3) Additional considerations

What model size is right for my HW?
Which hyperparameters is convergence most sensitive to?
How can training throughput be improved?
Which hyperparameters need to change when adding GPUs?
How do I optimize my compute?

(4) NeMo-Megatron Hyperparameter Search tool example

5B GPT3 Models & DGX A100 GPUs
A poorly chosen hyperparameter config can slow model training by up to several days

[NeMo-Megatron Tooling]

Choosing the optimal model configuration for your HW reduces the indirect cost of model training.

Step 1: Find the optimal model size from a set of training and inference constraints.
Step 2: Provide a good config for the given model size (learning rate, weight initialization, optimizer, weight decay, dropout, data type [fp16, bf16], global batch size).
Step 3: Starting from the Step 2 config, run a grid search to find the best way to train the model (Tensor/Pipeline/Data Parallelism, Micro Batch Size, # of Gradient Checkpointing Layers).
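
The Step 3 grid search can be sketched roughly as follows. This is not the NeMo-Megatron tool's actual code. It is a minimal illustration that relies on two standard constraints: TP x PP x DP must equal the GPU count, and the global batch size must be divisible by MBS x DP. The function names, the choice to keep tensor parallelism within one node, and the `toy_step_time` cost model are all assumptions made for the example. A real search would time a short training run for each candidate. The number of gradient checkpointing layers is omitted for brevity.

```python
from itertools import product


def candidate_configs(num_gpus, global_batch_size, gpus_per_node=8,
                      micro_batch_sizes=(1, 2, 4, 8)):
    """Enumerate (TP, PP, DP, MBS) combinations that fit the cluster and batch size."""
    configs = []
    for tp, pp, mbs in product(range(1, gpus_per_node + 1),
                               range(1, num_gpus + 1),
                               micro_batch_sizes):
        if num_gpus % (tp * pp) != 0:
            continue  # TP * PP * DP must equal the number of GPUs
        if gpus_per_node % tp != 0:
            continue  # assumption: keep tensor parallelism inside one node
        dp = num_gpus // (tp * pp)
        if global_batch_size % (mbs * dp) != 0:
            continue  # global batch = MBS * DP * gradient-accumulation steps
        configs.append({"tp": tp, "pp": pp, "dp": dp, "mbs": mbs})
    return configs


def grid_search(configs, measure_step_time):
    """Return the config with the lowest measured time per global step."""
    return min(configs, key=measure_step_time)


def toy_step_time(cfg):
    # Placeholder cost model, NOT a real benchmark.
    return 0.5 * cfg["tp"] + 0.3 * cfg["pp"] + 16.0 / (cfg["mbs"] * cfg["dp"])


configs = candidate_configs(num_gpus=16, global_batch_size=64)
best = grid_search(configs, toy_step_time)
print(len(configs), "valid configs; best under the toy model:", best)
```

Filtering out invalid combinations before any run is launched is where much of the saving comes from, since every remaining candidate costs real GPU time to measure.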

[Performance Gains]

(1) 5B GPT3 (training speedup & inference throughput/latency)

(2) 175B GPT3 (training speedup & inference throughput/latency)

[Optimal Configurations (175B GPT-3 Model)]

(1) Optimal training config

TP=8, PP=6, MBS=1, Activation Checkpointing Layers=1
33.92s per global step
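
To put the per-step time in perspective, the sketch below converts it into end-to-end training time. Only the 33.92 s figure comes from the session. The token budget, global batch size, and sequence length are assumptions picked for illustration.

```python
def total_training_days(step_seconds, total_tokens, global_batch_size, seq_length):
    """Convert a measured time per global step into end-to-end training days."""
    tokens_per_step = global_batch_size * seq_length
    steps = -(-total_tokens // tokens_per_step)  # ceiling division
    return steps * step_seconds / 86_400


# 33.92 s/step is the reported figure; the other three values are assumptions.
days = total_training_days(33.92, total_tokens=300_000_000_000,
                           global_batch_size=1536, seq_length=2048)
print(f"about {days:.1f} days")
```

At this scale, even a few percent shaved off the step time translates into roughly a day of cluster time.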

(2) Optimal inference config

# Highest throughput

TP=8, PP=4, MBS=64 (29.1 samples/sec @ 2121ms of latency)
TP=8, PP=8, MBS=64 (25.6 samples/sec @ 2009ms of latency)

# Lowest latency

TP=8, PP=8, MBS=4 (4 samples/sec @ 971 ms of latency)
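
Picking among these configs is a throughput-versus-latency trade-off. The sketch below uses the three measurements listed above and selects the highest-throughput config that fits a latency budget. The `pick_config` helper and the budget values are hypothetical.

```python
# Measurements reported in the session for the 175B GPT-3 model.
INFERENCE_CONFIGS = [
    {"tp": 8, "pp": 4, "mbs": 64, "samples_per_sec": 29.1, "latency_ms": 2121},
    {"tp": 8, "pp": 8, "mbs": 64, "samples_per_sec": 25.6, "latency_ms": 2009},
    {"tp": 8, "pp": 8, "mbs": 4, "samples_per_sec": 4.0, "latency_ms": 971},
]


def pick_config(configs, max_latency_ms=None):
    """Highest-throughput config whose latency fits the budget, or None if none fits."""
    eligible = [c for c in configs
                if max_latency_ms is None or c["latency_ms"] <= max_latency_ms]
    return max(eligible, key=lambda c: c["samples_per_sec"], default=None)


print(pick_config(INFERENCE_CONFIGS))                       # no latency budget
print(pick_config(INFERENCE_CONFIGS, max_latency_ms=1000))  # tight latency budget
```

With no budget the TP=8, PP=4, MBS=64 config is selected. Under a 1000 ms budget only the MBS=4 config qualifies, at a much lower throughput.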

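[Advantages of the Hyper-Parameters Tool]

Supports GPT3, T5, and mT5 (more models coming soon)
Determines the model size based on the user's HW constraints
Provides a heuristically good baseline config for each model size: learning rate, weight initialization, optimizer, weight decay, dropout, data type, global batch size
Finds the optimal training and inference configs within minutes (small models) to hours (large models)
With the optimal config, training a model to convergence can be launched within minutes

[Conclusion]

Savings in time, compute, and money
Quick iteration for faster experimentation and research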

2. Faster Neural Network Training, Algorithmically

