Unlocking Efficiency with PEFT: The Ultimate Guide to Parameter Efficient Fine Tuning for LLMs
Introduction:
In the realm of Large Language Models (LLMs), fine-tuning has become the preferred method for tailoring models to specific tasks, reducing computational cost, time, and dataset requirements compared to pretraining. However, even fine-tuning comes with challenges, especially around memory usage. Training an LLM requires substantial memory not only for its weights but also for optimizer states, gradients, forward activations, and temporary buffers, which together can demand 12 to 20 times more memory than the model weights alone.
Enter PEFT, or Parameter Efficient Fine-Tuning, a family of techniques that addresses these memory concerns. PEFT significantly reduces the memory overhead of LLM fine-tuning. In this blog post, we delve into the concept of PEFT and its three key approaches: Selective, Reparameterization, and Additive.
The Memory Challenge
Before we explore PEFT, it's crucial to understand the memory challenge associated with fine-tuning LLMs. The significant memory allocation for weights, optimizer states, gradients, activations, and temporary storage can lead to resource-intensive operations.
Introducing PEFT
PEFT, or Parameter Efficient Fine-Tuning, provides an innovative way to fine-tune LLMs with memory efficiency in mind. It comprises three primary approaches:
Selective
Selects a subset of initial LLM parameters to fine-tune.
Involves significant trade-offs between parameter efficiency and memory or compute efficiency.
Reparameterization
Reparameterizes model weights using a low-rank representation.
LoRA, discussed below, is the best-known example of this approach.
Additive
Keeps the original LLM frozen as an untouched core.
Adds new trainable layers or parameters on top of the frozen model (a minimal sketch follows below).
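To make the additive idea concrete, here is a minimal PyTorch sketch (the class and layer names are illustrative, not taken from any particular library): every original parameter is frozen, and only a small newly added module receives gradient updates.

```python
import torch
import torch.nn as nn

class AdapterWrapper(nn.Module):
    """Additive PEFT in miniature: a frozen base layer plus a small trainable adapter."""

    def __init__(self, base_layer: nn.Module, hidden_dim: int, bottleneck: int = 16):
        super().__init__()
        self.base_layer = base_layer
        # Freeze the original weights; they stay exactly as pretrained.
        for param in self.base_layer.parameters():
            param.requires_grad = False
        # Only this small bottleneck adapter is trained.
        self.adapter = nn.Sequential(
            nn.Linear(hidden_dim, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.base_layer(x)
        # Residual form: frozen output plus a small trainable correction.
        return h + self.adapter(h)
```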
How LoRA Works:
The LoRA approach, in particular, freezes the original weights and injects a pair of small low-rank matrices alongside the weights of the self-attention layers of the LLM. The product of these two matrices has the same dimensions as the original weight matrix, and only the small matrices are trained. Research suggests that adapting the attention-layer weights alone is often sufficient, although LoRA can also be applied to the feed-forward layers. During inference, the following steps occur:
Multiply the two low-rank matrices: \(B A \in R^{d \times k}\).
Add the result to the original frozen weights: \(W + B A\).
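Here is a minimal PyTorch sketch of those two steps (the class and variable names are mine, not from the LoRA paper or any library), assuming a frozen weight matrix \(W\) of shape \(d \times k\):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen weight matrix W plus trainable low-rank matrices B and A."""

    def __init__(self, d: int, k: int, r: int = 8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d, k), requires_grad=False)  # frozen W
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # trainable, r x k
        self.B = nn.Parameter(torch.zeros(d, r))         # trainable, d x r, starts at zero

    def merged_weight(self) -> torch.Tensor:
        # Step 1: multiply the low-rank matrices -> a d x k update.
        # Step 2: add the update to the original frozen weights.
        return self.weight + self.B @ self.A

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.merged_weight()
```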
For example, consider a base Transformer weight matrix \(W \in R^{512 \times 64}\), where \(d = 512\) and \(k = 64\), giving 32,768 trainable parameters. With LoRA and rank \(r = 8\), we instead create \(A \in R^{8 \times 64}\) with 512 trainable parameters and \(B \in R^{512 \times 8}\) with 4,096 trainable parameters. Their product \(B A \in R^{512 \times 64}\) matches the original Transformer dimensions. This approach yields an impressive 86% reduction in trainable parameters.
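The arithmetic behind that figure is easy to check:

```python
d, k, r = 512, 64, 8

full_params = d * k          # 32,768 parameters in the original weight matrix
lora_params = r * k + d * r  # 512 (A) + 4,096 (B) = 4,608 trainable parameters

print(f"reduction: {1 - lora_params / full_params:.1%}")  # -> about 86%
```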
The Benefits of PEFT
PEFT, particularly LoRA, lets you train small matrices for individual tasks and then add them to the frozen weights. This makes multi-task use straightforward: we only train one pair of matrices per task and add the appropriate pair to the base weights at inference time, as required (see the sketch below).
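In practice, libraries such as Hugging Face peft handle this bookkeeping. The sketch below shows roughly what per-task LoRA adapters could look like; the model name and adapter paths are placeholders, and the exact API may differ between peft versions:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, TaskType, get_peft_model

# Training: wrap a frozen base model and train only the LoRA matrices.
base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
# ... train, then: model.save_pretrained("adapters/summarize")

# Inference: reload the base once and attach per-task adapters as needed
# ("adapters/summarize" and "adapters/qa" are hypothetical local paths).
base = AutoModelForCausalLM.from_pretrained("gpt2")
multi_task = PeftModel.from_pretrained(base, "adapters/summarize", adapter_name="summarize")
multi_task.load_adapter("adapters/qa", adapter_name="qa")
multi_task.set_adapter("qa")  # switch tasks without touching the frozen base weights
```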
The result is an 86% reduction in the resources required for training while largely preserving quality, with only a 3.2% decrease in performance compared to full fine-tuning.
Selecting the appropriate rank for the decomposition matrix may require some experimentation, but researchers have found that ranks between 4 and 32 typically yield favorable results, making 4 or 8 a good starting point.
Soft Prompts: Prompt Tuning
Another promising additive PEFT technique is prompt tuning. Unlike prompt engineering, which relies on manually crafting the prompt text, prompt tuning prepends soft prompts to the actual prompt and learns their values through supervised learning. These soft prompts can be thought of as virtual tokens: they don't correspond to words in the LLM's vocabulary, but they live in the same embedding space as real tokens and are trained to steer the frozen model toward the task.
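A bare-bones PyTorch sketch of the idea, assuming access to the model's token embeddings (the names and shapes are illustrative):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable virtual tokens that are prepended to the real prompt embeddings."""

    def __init__(self, num_virtual_tokens: int, embed_dim: int):
        super().__init__()
        # These vectors live in the embedding space but match no vocabulary word.
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, embed_dim) from the frozen LLM's embedding layer
        batch_size = token_embeddings.size(0)
        soft = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        # The frozen LLM then processes the soft prompt followed by the real prompt.
        return torch.cat([soft, token_embeddings], dim=1)
```

Only the `prompt` parameter is updated during training; all of the LLM's weights remain frozen.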
As with LoRA, prompt tuning supports multiple tasks: we can train a separate soft prompt for each task and simply prepend the appropriate one at inference time. The key difference from full fine-tuning is that full fine-tuning updates all of the LLM's weights, whereas prompt tuning leaves the model frozen and only learns the soft prompt.
As the size of the model increases, soft prompt tuning yields results similar to full fine-tuning, making it a valuable tool for a wide range of use cases.
Conclusion:
In conclusion, PEFT techniques such as LoRA and soft prompt tuning provide innovative solutions to the memory and fine-tuning challenges associated with LLMs. They enable memory-efficient fine-tuning while preserving high performance, making them indispensable tools for leveraging the full potential of LLMs in various applications. Whether you are looking to save on computational resources or adapt models to specialized tasks, PEFT offers a powerful approach to unlocking the efficiency of Large Language Models.