A Practical Approach for AI Services (2): Model Lightweight

8 min read, Aug 7, 2021

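There are two ways to make a model lightweight: use a lightweight deep learning algorithm, or compress the model.

In the previous post, I decided to build the chatbot service on a T4 GPU. Because the chatbot service requires 23GB of GPU memory, I plan to make the model lightweight so that it fits within the T4's 16GB of GPU memory.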
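As a rough sanity check on that target, the arithmetic can be sketched in a few lines. The 23GB and 16GB figures come from the text; the helper name `required_reduction` is made up for illustration, and the calculation ignores any headroom the runtime itself may need.

```python
# Figures taken from the text above.
required_gb = 23   # GPU memory the chatbot service needs
available_gb = 16  # GPU memory available on a T4


def required_reduction(required: float, available: float) -> float:
    """Fraction of the memory footprint that must be removed to fit."""
    if required <= available:
        return 0.0
    return 1.0 - available / required


reduction = required_reduction(required_gb, available_gb)
print(f"Footprint must shrink by at least {reduction:.1%}")
```

Under these numbers, the service has to shed roughly 30% of its memory footprint before it fits on a single T4.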

A Practical Approach for AI Services (1): Memory Capacity Limitation

As NLP models grow, the GPU memory capacity limit is becoming a problem not only for training but also for inference. Let's look at practical ways to solve this problem.

moon-walker.medium.com

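Why is model lightweighting needed for inference?

When training a model, the main goal is to raise its performance (strictly speaking, its accuracy). In NLP, the Scaling Law established the empirical rule that performance improves as model size grows. OpenAI's 175B-parameter GPT-3 appeared in 2020, and in June 2021 BAAI announced Wu Dao 2.0, a 1750B-parameter model 10 times larger than GPT-3.

However, if such a huge model is commercialized as an inference service, running inference on limited computing resources is very difficult and impractical.

This is because a production-level inference service has to respond within a fixed latency budget, but inference time grows with model size, so service latency inevitably increases.

Simply cutting the number of model parameters, on the other hand, means giving up the performance that was so hard to achieve, and designing a new model architecture that uses fewer parameters is a difficult task.

For this reason, the usual choice is model lightweighting (Model Lightweight): shrinking the model used for inference to reduce inference time while keeping the performance of the trained model. Model lightweighting methods fall into two broad groups.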

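1. Lightweight Deep Learning Algorithm

This approach changes the basic structure of the model into one that uses fewer parameters or fewer operations while maintaining performance.

2. Model Compression

This approach reduces the redundancy of a trained model. It includes Quantization, Pruning, Weight Factorization, Knowledge Distillation, and Weight Sharing.

Developing a lightweight deep learning algorithm requires understanding the structure of an existing algorithm and then modifying it, so a new method has to be developed for each specific algorithm.

Therefore, to get a fast inference service, I will use model compression rather than develop a new lightweight deep learning algorithm.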

Types of Model Compression

Quantization

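Quantization is the most widely used model compression method. It represents model parameters in low precision to speed up computation and memory access. Quantization is broadly divided into PTQ (Post Training Quantization) and QAT (Quantization Aware Training).

PTQ: As the name says, it quantizes a model that has already been trained. It is quick to apply, but accuracy drops at low precision of INT8 and below.
QAT: It performs quantization while the model's weights and activations are being trained. It is slower to apply, but it can preserve accuracy at low precision of INT8 and below.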
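To make the idea of PTQ concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. The scheme (one scale per tensor, values clamped to [-127, 127]) and the function names are assumptions chosen for illustration; real toolchains typically add calibration data, per-channel scales, and activation quantization.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 values back to approximate floats."""
    return [v * scale for v in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # already-trained weights (toy values)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The small weight 0.003 rounds to 0 here, which illustrates where the accuracy loss of low-precision PTQ comes from: values smaller than half the scale step are lost.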

Pruning

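Pruning removes unnecessary parts of a neural network after training, and it is inspired by how the real brain learns.

Weights with |weight| < threshold are removed
Pruning is applied repeatedly so that the network learns the effective connections
Comparing Conv and FC layers, Conv layers are more sensitive to pruning
Sparsity can speed up inference through an accelerator's zero-skipping feature
Pruning can reduce parameters and FLOPs while maintaining accuracy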
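The threshold rule above can be sketched as follows. This is a single magnitude-pruning pass on a flat list of toy weights; the function names and the threshold value are illustrative assumptions, and the retraining step between pruning rounds is omitted.

```python
def magnitude_prune(weights, threshold):
    """Zero out every weight whose absolute value is below `threshold`."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.8, -0.05, 0.02, -0.6, 0.01, 0.3]  # toy values
pruned = magnitude_prune(weights, threshold=0.1)
print(pruned, sparsity(pruned))
```

In an iterative setup, this pass would alternate with fine-tuning of the surviving weights, with the threshold (or target sparsity) raised gradually.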

Knowledge Distillation

Knowledge Distillation (KD) reduces model size by "distilling" the knowledge learned by a large model (teacher model) into a small model (student model). It was proposed by Professor Geoffrey Hinton and colleagues in the paper "Distilling the Knowledge in a Neural Network".

The teacher model acts as a generator of soft labels
With KD, the student is generally known to perform better than when the student model is trained on its own
It is used in NLP as well as in Computer Vision
DistilBERT, a smaller model with half as many layers (6) as its teacher BERT-base (12 layers), shows an accuracy gap of only about 3% compared with BERT-base

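KD proceeds in the following order.

1) First, train the teacher model
2) Train the student model with the teacher model, using as the loss the sum of the loss on the actual training labels and a loss that reduces the difference between the student's and the teacher's predictions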
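The combined loss in step 2 can be sketched as below. The text only says the two losses are summed; the softmax temperature and the temperature-squared scaling of the soft term are conventions from the distillation literature that I am assuming here, and all function names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Loss against the ground-truth class index."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label, temperature=2.0):
    # Hard term: student prediction vs. the actual training label.
    hard = cross_entropy(softmax(student_logits), label)
    # Soft term: student vs. teacher on temperature-softened distributions.
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # Temperature**2 scaling is an assumed convention, not from the text.
    return hard + (temperature ** 2) * soft


teacher = [4.0, 1.0, 0.5]  # toy logits from a trained teacher
student = [2.0, 1.5, 0.2]  # toy logits from the student
print(distillation_loss(student, teacher, label=0))
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the ordinary label loss remains.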

The Pruning, Quantization, and KD methods above can also be combined with each other, as follows.

Pruning + Knowledge Distillation
Quantization + Knowledge Distillation
Combining Knowledge Distillations

References

[1] Compressing Large-Scale Transformer-Based Models: A Case Study on BERT (arxiv.org)

[2] Model Compression Story — Lightning Talk

[3] All The Ways You Can Compress BERT (mitchgordon.me)

[4] Compressing Large-Scale Transformer-Based Models: A Case Study on BERT (arxiv.org)

[5] Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (arxiv.org)

[6] 8 Neural Network Compression Techniques For ML Developers (analyticsindiamag.com)

Written by daewoo kim
