Understanding Low-Rank Adaptation (LoRA) in Fine-Tuning LLMs
May 24, 2024

This blog post will go into detail about how LoRA works to fine-tune LLMs, following the methodology set out in the “LoRA: Low-Rank Adaptation of Large Language Models” paper [1].

Fine-tuning is perhaps one of the most discussed technical aspects of working with Large Language Models. Most people understand that training these models is expensive and requires significant capital investment, so it is exciting to see that you can create a model that is somewhat unique by taking a pre-existing model and fine-tuning it with your own data.

There are multiple methods to fine-tune a model, but one of the most consistently popular is the LoRA method (short for Low-Rank Adaptation), introduced in the “LoRA: Low-Rank Adaptation of Large Language Models” paper [1].

Before we dive into the mechanics behind LoRA, we need some matrix background and some of the basics of fine-tuning a machine learning model.

Matrix Background Terminology

Practically all machine learning models store their weights as matrices. Consequently, having some understanding of linear algebra is helpful to get intuition on what is happening.

Beginning with some basics and building from there: a matrix is organized into rows and columns.

Image by Author

Naturally, the more rows, columns, or both that you have, the more space your matrix takes up. Sometimes there exists a mathematical relationship between the rows and/or columns that can be used to reduce the space needed. This is similar to how a function takes far less space to write down than all of the coordinate points it represents.

See the example below for a matrix that can be reduced to just one row: every row of the original 3×3 matrix is a multiple of a single row, so the matrix has a rank of 1.

Image by Author

When a matrix can be reduced like the one above, we say that it has a lower rank than a matrix that cannot be reduced this way. Any matrix of lower rank can be expanded back into the larger matrix, as shown below:

Image by Author
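
To make this concrete, here is a minimal sketch in NumPy (the specific matrix is made up for illustration): a rank-1 matrix can be stored as the product of one column and one row, using 6 numbers instead of 9.

```python
import numpy as np

# A 3x3 matrix whose rows are all multiples of [1, 2, 3]
W = np.array([[1, 2, 3],
              [2, 4, 6],
              [3, 6, 9]])

print(np.linalg.matrix_rank(W))  # 1

# Because the rank is 1, W can be stored as the product of a
# 3x1 column and a 1x3 row: 6 numbers instead of 9.
col = np.array([[1], [2], [3]])
row = np.array([[1, 2, 3]])

print(np.array_equal(col @ row, W))  # True: the factors rebuild W exactly
```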

Fine-Tuning Background

To fine-tune a model, you need a quality dataset. For example, if you wanted to fine-tune a chat model on cars, you would need a dataset with thousands of high-quality dialogue turns about cars.

After creating the dataset, you run each example through your model to get an output. This output is then compared to the expected output in your dataset, and we calculate the difference between the two. Typically, a function like cross entropy (which quantifies the difference between two probability distributions) is used to measure this difference.

Image by Author
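
As a minimal sketch of the idea (using PyTorch, with a made-up five-token vocabulary), cross entropy scores how far the model’s predicted distribution is from the token the dataset expected:

```python
import torch
import torch.nn.functional as F

# Toy setup: a vocabulary of 5 tokens and a single prediction step.
# `logits` are the model's raw scores for the next token; `target`
# is the token id the dataset says should come next.
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0, 0.3]])
target = torch.tensor([0])

# Cross entropy measures how far the predicted distribution is from
# putting all of its probability on the target token.
loss = F.cross_entropy(logits, target)
print(loss.item())  # lower is better; near 0 means a near-perfect prediction
```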

We now take the loss and use it to modify the model weights. We can think of this as creating a new ΔW matrix that holds all of the changes we want the Wo matrix to learn. We take the weights and determine how to change them so that they give us a better result on our loss function, and we figure out those adjustments via backpropagation.

If there is sufficient interest, I’ll write a separate blog post on the math behind backpropagation as it is fascinating. For now, we can simply say that the compute necessary to figure out the weight changes is costly.
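
For intuition, here is a minimal sketch of what one weight update looks like, assuming a single toy weight matrix and a stand-in loss rather than a real training loop:

```python
import torch

# A frozen snapshot of a pre-trained weight matrix (made-up 4x4 shape)
W0 = torch.randn(4, 4)
W = W0.clone().requires_grad_(True)   # the copy we will fine-tune

x = torch.randn(4)        # a toy input
target = torch.randn(4)   # the output we wish the layer produced

loss = ((W @ x - target) ** 2).mean()  # a stand-in loss function
loss.backward()                        # backpropagation computes dloss/dW

with torch.no_grad():
    W -= 0.1 * W.grad     # one gradient step nudges every weight

delta_W = W - W0          # everything fine-tuning changed, as one matrix
```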

LoRA Methodology

LoRA revolves around one critical hypothesis: while the weight matrices of a machine learning model are of high rank, the weight updates created during fine-tuning are of low intrinsic rank. Put another way, we can fine-tune the model with a far smaller matrix than we would need to use if we were training it from scratch and not see any major loss of performance.

Consequently, we set up our basic equation like so:

h = Wox + ΔWx = Wox + BAx

Equation 3 from the paper

Let’s look at each of the variables above. x is the input to the layer, and h is the layer’s output after fine-tuning. Wo and ΔW are the same as before, but the authors have created a new way to define ΔW as the product of two matrices, A and B. A has the same number of columns as Wo and begins filled with random Gaussian noise, while B has the same number of rows as Wo and is initialized to all 0s. These dimensions are important because when we multiply B and A together, the product BA has exactly the same dimensions as ΔW.

Figure 1 from the paper

The rank of A and B is a hyperparameter set during fine-tuning. This means we could choose rank 1 to speed up training as much as possible (while still making a change to Wo), or increase the rank to potentially improve performance at a greater cost.
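
Putting the equation into code, here is a minimal sketch of what a LoRA-style linear layer could look like (in PyTorch; the class name, shapes, and the 0.01 init scale are illustrative choices, not the paper’s exact implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Computes h = Wo x + BA x, freezing Wo and training only A and B."""

    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.W0 = nn.Linear(d_in, d_out, bias=False)
        self.W0.weight.requires_grad_(False)  # pre-trained weights stay frozen

        # A: rank x d_in, random Gaussian init; B: d_out x rank, all zeros.
        # Because B starts at zero, BA = 0 and training begins exactly at Wo.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        return self.W0(x) + x @ self.A.T @ self.B.T  # Wo x + (BA) x

layer = LoRALinear(d_in=8, d_out=8, rank=1)
h = layer(torch.randn(8))  # identical to Wo x at initialization
```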

Fine-Tuning with LoRA

Returning to our image from before, let’s see how the calculation changes when we use LoRA.

Remember that fine-tuning means creating a ΔW matrix that holds all of our changes to the Wo matrix. As a toy example, let’s say that the rank of A and B is 1 and that Wo is 3×3. Thus, we have a picture like the one below:

Image by Author

As each cell in the matrix contains a trainable weight, we see immediately why LoRA is so powerful: we have radically reduced the number of trainable weights we need to compute. Consequently, while the calculation for each individual trainable weight typically remains the same, we perform it far fewer times, saving a ton of compute and time.
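
The savings are easy to quantify. In the toy 3×3 example with rank 1, full fine-tuning would train 9 weights, while A (1×3) and B (3×1) together hold only 6. At realistic scales the gap becomes dramatic; here is a quick back-of-the-envelope calculation assuming a hypothetical 4096×4096 weight matrix and rank 8:

```python
d = 4096   # a hypothetical 4096x4096 weight matrix
r = 8      # the chosen LoRA rank

full = d * d          # trainable weights under full fine-tuning
lora = r * d + d * r  # trainable weights in A (r x d) plus B (d x r)

print(full)                   # 16777216
print(lora)                   # 65536
print(f"{lora / full:.2%}")   # 0.39% of the full count
```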

Conclusion

LoRA has become an industry-standard way to fine-tune models. Even companies with tremendous resources have recognized LoRA as a cost-effective way to improve their models.

As we look to the future, an interesting area of research is finding the optimal rank for the LoRA matrices. Right now the rank is a hyperparameter, but a principled way to choose an ideal value would save even more time. Moreover, as LoRA still requires high-quality data, another promising area of research is the optimal data mix for the LoRA methodology.

While the money flowing into AI has been immense, high spending is not always correlated with a high payoff. In general, the farther companies are able to make their money go, the better products they can create for their customers. Consequently, as a very cost-effective way to improve products, LoRA has deservedly become a fixture in the machine learning space.

It is an exciting time to be building.

[1] Hu, E., et al., “LoRA: Low-Rank Adaptation of Large Language Models” (2021), arXiv

[2] Hennings, M. et al., “LoRA & QLoRA Fine-tuning Explained In-Depth” (2023), YouTube
