
LoRA: Why does fine-tuning work?

We often say that when we train an ML system, it spends most of its time learning basic concepts and how to compose them. A computer vision net first has to learn what edges and gradients are, and then what circles are, and only much, much later does it learn to recognise faces. A language model first has to learn which words are nouns and which are adjectives, and only then can it learn that “fire” (a noun) is “hot” (an adjective). With this intuition in hand, we notice that you don’t need a labelled dataset to teach a network about edges and gradients, because edges and gradients show up everywhere. A simple self-supervised objective like “which of these four images is not upside down” or “predict the next word in this document” works fine. Once the network has learned all the basics, you can retrain it on a small labelled dataset for image classification or question answering and it will perform surprisingly well, sometimes better than training from scratch on a big labelled dataset!

That’s the story, but how do we know it’s true? In this session, Blaine Rogers will use the LoRA paper as a jumping-off point to talk about what evidence we have for the “first learn the basics” theory, why fine-tuning works, and how it relates to over-parameterisation and distillation.
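As a purely illustrative picture of the pretrain-then-fine-tune recipe described above, here is a minimal PyTorch sketch: it takes a torchvision ResNet-18 pretrained on a generic objective, swaps in a new classification head for a hypothetical 10-class task, and trains only that head. The choice of model, class count, and freezing policy are assumptions for illustration, not something the session or the paper prescribes.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a backbone pretrained on a generic objective, so it already
# "knows" edges, gradients, textures and shapes.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the final classification head with one sized for the new task
# (an illustrative 10-class problem).
backbone.fc = nn.Linear(backbone.fc.in_features, 10)

# Freeze the pretrained layers and train only the new head; this often works
# surprisingly well on tiny labelled datasets. (Fine-tuning everything is the
# other common option.)
for name, param in backbone.named_parameters():
    param.requires_grad = name.startswith("fc.")

optimizer = torch.optim.Adam(
    [p for p in backbone.parameters() if p.requires_grad], lr=1e-3
)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of labelled images.
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))
loss = loss_fn(backbone(images), labels)
loss.backward()
optimizer.step()
```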

An important paradigm of natural language processing consists of large-scale pretraining on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than finetuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.

http://arxiv.org/abs/2106.09685
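For readers who want to see the core idea in code, here is a minimal sketch (not Microsoft’s released package linked above) of how LoRA wraps a frozen linear layer with a trainable low-rank update in PyTorch; the rank r, scaling factor alpha, and layer sizes are illustrative choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update.

    The effective weight is W + (alpha / r) * B @ A, where W is the frozen
    pretrained weight and A, B form the rank-r decomposition of the update.
    """

    def __init__(self, pretrained: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights

        in_f, out_f = pretrained.in_features, pretrained.out_features
        # A starts with small random values, B with zeros, so the adapted
        # model is initially identical to the pretrained one.
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


# Usage: wrap an existing projection and train only A and B.
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = [p for p in layer.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # 12,288 vs ~590k for the full layer
```

Because B starts at zero the adapted model behaves exactly like the pretrained one at the start of training, and at deployment the product BA can be merged into W, which is why LoRA adds no inference latency.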
