Taming Computational Challenges with Quantization, DDP and FSDP


Introduction:

Large Language Models (LLMs) are at the forefront of natural language processing, but they come with a significant computational burden. Simply storing an LLM is memory-intensive: at 32-bit full precision, each parameter takes 4 bytes, so a one-billion-parameter model occupies around 4 GB. Training is far more demanding, because on top of the weights you must also hold Adam optimizer states, gradients, activations, and temporary buffers, which can easily push the footprint to around 24 GB per billion parameters. This memory challenge is a serious roadblock in the world of LLMs.
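To see where those numbers come from, here is a quick back-of-the-envelope calculation in Python. The 4 bytes per parameter for FP32 weights is exact; the extra ~20 bytes per parameter for optimizer states, gradients, activations, and temporary buffers is only a rough rule of thumb, and the real overhead depends on the optimizer, batch size, and sequence length.

```python
# Rough memory estimate for a 1-billion-parameter model.
params = 1_000_000_000

bytes_per_param_storage = 4        # FP32 weights only
bytes_per_param_training = 4 + 20  # weights + Adam states, gradients, activations (rule of thumb)

storage_gb = params * bytes_per_param_storage / 1e9
training_gb = params * bytes_per_param_training / 1e9

print(f"Storing the model:  ~{storage_gb:.0f} GB")   # ~4 GB
print(f"Training the model: ~{training_gb:.0f} GB")  # ~24 GB
```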

Quantization:

To optimize memory usage, one powerful technique is quantization. Quantization stores the model's parameters at a lower numerical precision, for example moving from 32-bit floats down to 16-bit floats or 8-bit integers. This can cut memory usage by a factor of two to four, at the cost of some numerical precision, which is often an acceptable trade-off.
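As a concrete illustration, here is a minimal PyTorch sketch. The 4096-wide Linear layers are arbitrary stand-ins for an LLM block, not a real model, and the dynamic-quantization step at the end targets inference rather than training.

```python
import torch
import torch.nn as nn

# A single large Linear layer standing in for one block of an LLM.
layer_fp32 = nn.Linear(4096, 4096)

def param_megabytes(module: nn.Module) -> float:
    return sum(p.numel() * p.element_size() for p in module.parameters()) / 1e6

print(f"FP32 weights: {param_megabytes(layer_fp32):.1f} MB")

# Halve the memory by casting 32-bit floats to 16-bit bfloat16.
# The values lose precision, but the model usually remains usable.
layer_bf16 = nn.Linear(4096, 4096).to(torch.bfloat16)
print(f"BF16 weights: {param_megabytes(layer_bf16):.1f} MB")

# Going further, dynamic quantization stores Linear weights as INT8
# and de-quantizes them on the fly at inference time.
layer_int8 = torch.quantization.quantize_dynamic(
    nn.Linear(4096, 4096), {nn.Linear}, dtype=torch.qint8
)
```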

Distributed Data Parallel (DDP):

Quantization is effective, but if you're aiming for top performance and training speed, leveraging multiple GPUs can be a game-changer. Distributed Data Parallel (DDP) is a multi-GPU strategy that does exactly that: a full replica of the model is placed on every GPU, the training data is split, and each GPU processes its own split. After the backward pass, the gradients from all GPUs are synchronized and averaged so that every replica applies the same weight update, making it a highly efficient parallel-training technique.
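Here is a minimal PyTorch DDP sketch. The toy Linear model and random data stand in for a real LLM and dataset, and the script is assumed to be launched with torchrun so that each GPU gets its own process.

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # Launch with: torchrun --nproc_per_node=<num_gpus> ddp_train.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model: a full replica lives on every GPU.
    model = nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # DistributedSampler hands each GPU its own split of the data.
    dataset = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    loader = DataLoader(dataset, batch_size=32, sampler=DistributedSampler(dataset))

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()   # gradients are all-reduced across GPUs here
        optimizer.step()  # every replica applies the same update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```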

However, DDP has a drawback: because every GPU holds a full replica, the entire model (plus its optimizer states and gradients) must fit in a single GPU's memory. That replication is also redundant, since the same state is duplicated on every device. To overcome these issues, you can turn to Fully Sharded Data Parallel (FSDP), which builds on the ZeRO (Zero Redundancy Optimizer) memory-optimization technique.

Fully Sharded Data Parallel (FSDP):

ZeRO operates in three stages, each sharding progressively more training state across the GPUs. Stage 1 shards only the optimizer states; stage 2 shards optimizer states and gradients; stage 3 goes all-in and shards optimizer states, gradients, and the model parameters themselves.
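ZeRO is implemented in libraries such as DeepSpeed, where the stage is selected through a configuration dictionary. The sketch below is a minimal example under stated assumptions: the toy model, batch size, and learning rate are placeholders, and the script would normally be launched with the deepspeed launcher.

```python
import torch.nn as nn
import deepspeed

# The "stage" field of zero_optimization picks how much state is sharded:
#   stage 1 -> optimizer states
#   stage 2 -> optimizer states + gradients
#   stage 3 -> optimizer states + gradients + model parameters
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},  # set to 3 for full sharding
}

model = nn.Linear(1024, 1024)  # toy stand-in for an LLM

# DeepSpeed wraps the model and builds the optimizer with the chosen ZeRO stage.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```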

Combining ZeRO's sharding with DDP's data parallelism gives you FSDP. Instead of storing the entire LLM on each GPU, the model is sharded across the GPUs, and each GPU still receives its own split of the data. During the forward and backward passes, each GPU gathers the parameter shards it needs for the current layer, runs the computation, and frees them again. The final step synchronizes gradients (via reduce-scatter) so that each GPU updates only the shard of weights and optimizer state it owns.
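Here is a minimal sketch using PyTorch's FullyShardedDataParallel. The toy model and random batch stand in for a real LLM and data split, FULL_SHARD is chosen as the most aggressive strategy (roughly ZeRO stage 3), and the script is launched with torchrun as in the DDP example.

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def main():
    # Launch with: torchrun --nproc_per_node=<num_gpus> fsdp_train.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    # Toy stand-in for an LLM.
    model = nn.Sequential(
        nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)
    ).to(device)

    # FULL_SHARD: parameters, gradients, and optimizer states are all sharded
    # and gathered only when a layer actually needs them.
    model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    # This rank's split of the data.
    x = torch.randn(32, 1024, device=device)
    y = torch.randn(32, 1024, device=device)

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)  # parameter shards are all-gathered per layer
    loss.backward()              # gradients are reduce-scattered across GPUs
    optimizer.step()             # each rank updates only the shard it owns

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```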

With quantization, ZeRO, and FSDP in your toolbox, you can efficiently address the computational challenges posed by today's largest language models. These techniques open up new possibilities in natural language processing, allowing you to tackle more complex tasks and push the boundaries of LLM applications.
