You can now train a 70b language model at home

We’re releasing an open source system, based on FSDP and QLoRA, that can train a 70b model on two 24GB GPUs.
Published

March 6, 2024

Summary

Today, we’re releasing Answer.AI’s first project: a fully open source system that, for the first time, can efficiently train a 70b large language model on a regular desktop computer with two or more standard gaming GPUs (RTX 3090 or 4090). This system, which combines FSDP and QLoRA, is the result of a collaboration between Answer.AI, Tim Dettmers (U Washington), and Hugging Face’s Titus von Koeller and Sourab Mangrulkar.

This system will help the open source community release better models. Teknium, the creator of the extremely popular OpenHermes models and datasets, with over half a million downloads, said:

With this capability we can take huge models to new heights locally, and gigantic, hundreds of billions of parameter models are now accessible by small labs.

At Answer.AI we made this our first project because it’s a key foundation of our north star: helping make useful AI available to everyone. Just being able to use other people’s models is not enough. We want everyone to be able to create their own personalized models, so that they are in control of their own AI systems.

Background

The big idea

There are two very different levels of hardware used to train deep learning models. There is the data center class hardware, such as H100s and A100s, costing hundreds of thousands of dollars. Then there are desktop computers containing gaming GPUs, such as dual 4090s, costing under $10,000 (and which can be assembled from 2nd hand parts for less than half the price of a pre-built system).

But here’s the key point: the gaming GPUs have similar performance to the data center GPUs that cost over 10x more! It would be great if we could use these 10x cheaper (but nearly as fast) cards to train large language models, but we can’t, because they have much less memory. The best currently available data center cards have 80GB RAM, whilst gaming cards max out at 24GB RAM. Since only the largest models produce the best results, creating the best models has been largely inaccessible to most people.

We realized that there’s actually no intrinsic reason for this. The super fast hardware is there, waiting to be used – we just need a way to feed it with the model and the data in a way that meets its memory constraints. The obvious question is: why hasn’t this been done then? All the big industry labs have the 10x more expensive hardware already, so they don’t really have the incentive to figure this out.

The big idea here is simple: figure out how to use these cheaper, lower-memory gaming GPUs to train the best available open source models. So the goal is this: train a 70 billion parameter (70b) model using only gaming GPUs, which means our per-GPU memory will be at most 24GB. It’ll be a challenge, because each parameter normally takes 16 bits (2 bytes), so that’s 70*2=140GB to even store the weights – and that’s without including all the other data such as activations, gradients, and optimization state!

Why this project?

Answer.AI is a very unusual type of organization – a for-profit R&D lab closer in spirit to 19th century electricity labs than to today’s AI research groups. Figuring out how to make large model training inexpensive and accessible is just the kind of thing Eric Ries and Jeremy Howard hoped we’d be able to do when the organization was launched at NeurIPS last year.

Solving this problem is hard. It requires understanding many separate libraries (e.g bitsandbytes, PEFT, Transformers, Accelerate, and PyTorch), and computer science and math concepts (e.g discretization, distributed computing, GPU programming, linear algebra, SGD concepts such as gradient checkpointing), and how they all interact.

Academia is full of brilliant people that solve hard problems. But academia hasn’t solved this particular problem. That’s because it’s difficult for university researchers to justify spending time on this kind of work. Combining existing tools and techniques together isn’t generally considered “novel” enough to result in publication in a high impact journal, but that’s the currency that academics need. Furthermore, academics are generally expected to become highly specialized within their field, making it challenging to bring together so many pieces into a single solution.

And, of course, big tech companies are also full of brilliant people that solve hard problems. But this particular problem, training models with consumer GPUs, isn’t a problem they need to solve – they’ve already bought the big expensive GPUs! Many startups are also full of brilliant people that solve hard problems! But, as Eric Ries explains, “today’s financial market forces businesses to prioritize short-term gains over everything else”. It’s extremely hard for a startup to justify to investors why they’re spending their funds on open source software and public research.

Whilst academia, big tech, and startups had good reasons for not solving this problem, these are the exact reasons that this problem was a great fit for Answer.AI. Everyone who works at the company has built the kinds of systems that we had to work with on this problem, so we were able to understand how all the pieces fit together. People who love to both deeply understand the foundations of software and AI, and also love to hack at fun and interesting end-to-end systems are the kinds of people who are drawn to Answer.AI, and vice versa.

The problems we choose to solve together are selected by the same people that will do the solving. So we tend to pick up projects that involve bringing together multiple ideas together to create practically useful solutions. And because we’re a public benefit company with a charter to produce long term benefit from AI, open source software and public research are directly in line with our mission.

QLoRA: Train bigger models on a single GPU

Two projects have been released recently that took the first critical steps towards making this a reality: QLoRA (by Tim Dettmers et al), and FSDP (by Meta’s PyTorch team).

QLoRA is a simple but brilliant combination of two critically important advances in modern neural networks: quantization, and LoRA. Quantization is a technique where, instead of using 16 or even 32 bits to store the weights of a neural network, 4 (or even fewer) bits are used. There are only 16 possible values of a 4 bit number, but Dettmers and Zettlemoyer showed that this can be enough in the large language models that are popular today. Tim Dettmers made these 4-bit “quantized” models easy to create, thanks to his bitsandbytes library, and recently Hugging Face has stepped in to help maintain and document this library, particularly thanks to the initiative of Titus von Koeller.

Unfortunately, once a model is quantized, it can not be trained any further with regular approaches – with just 16 possible values, the gradient descent method used for model training will observe zero gradients nearly everywhere, so it can’t make any updates to the quantized weights. This is a major problem, because it means that quantization can only be used for inference, not for continued pre-training or fine-tuning. Whilst inference is useful and important, it’s really just consuming models. But we want everybody to be able to contribute to creating models!

The trick to avoiding this limitation is to use LoRA – “Low-Rank Adaptation of Large Language Models”. LoRA doesn’t train the whole large language model at all, but instead adds “adaptors”, which are very small matrices (generally smaller than 1% of the full model) that are trained, whilst keeping the rest of the model constant. If you’ve played with models like Stable Diffusion, you will have probably seen these adapters many times; it’s how those models are generally shared, and why they are so small and fast to download.

Tim realized that LoRA can be combined with quantization: use a quantized base model, which is not changed at all by the training, and add trainable LoRA adaptors that are not quantized. This combination is QLoRA. Tim’s team was able to use this to, for the first time, train a model that (unquantized) is larger than the GPU: they trained a 65b model (which is 130GB unquantized) on a 48GB card.

Hugging Face stepped in once again here, creating the PEFT library, which made LoRA training far simpler, and also integrating it directly with bitsandbytes to allow anyone to use QLoRA with just a few lines of code. The Hugging Face team has been working tirelessly behind the scenes to ensure that the open source community can use these technologies to train their models. If you’ve ever used Transformers to load a 4-bit model using a single function argument, then you’ve got them to thank (and even if you haven’t, you’ve almost certainly used the work of folks that have built their model with this ecosystem).

QLoRA didn’t quite slay the problem we set out to solve, to train a 70b model on 24GB cards, but it got closer than anything before. When quantized to 4 bits (which is 0.5 bytes), the 70b model takes 70/2 = 35 GB, which is larger than the 24GB gaming GPUs we want to use.

There are other limitations to QLoRA. A 48GB card is very expensive, and training a 65b model only just fits on such a card. That can be a problem, because we need to store lots of other things too, including the activations, gradients, and optimization state of the model during training. If there’s not much memory left over after loading the model weights, there’s not enough working memory to support training.

For instance, one of the benefits of language models is that we can use them to “chat” with, or understand, or analyze long documents or conversations. To make models that can handle long sequences like that, we need to show them examples of long sequences during training. The longest sequence used in training is called the “sequence length”. Trying to use anything but a short sequence length will cause an error when training a 65b QLoRA model on a 48GB card, because there isn’t enough memory to store all the information about the sequence; nearly all the memory is used just to store the model itself.

Furthermore, if the model can only look at a single sequence at a time, it’s going to take a really long time to get through all the data in our training set. So instead we want to be able to “batch” a few sequences together at a time. The number of sequences included is the “batch size”. When there’s very little space left on the GPU after loading the model weights, we can only use very small batch sizes, resulting in extremely slow training.

FSDP: Scale training to multiple GPUs

One obvious solution to the problem of the RAM limitations of a single consumer GPU, is to use more than one GPU! A very common approach in the open source community is to simply place a few layers of the model on each card. So then to train, you run the first few layers on the first GPU, then the next few on the second GPU, and so forth. For instance, a 70b (140GB) model could be spread over 8 24GB GPUs, using 17.5GB on each. There’s even a convenient setting in Hugging Face Transformers, device_map=’auto’, which you may well have used; that’s what this is actually doing behind the scenes. This does the job, but there’s a giant downside: only one GPU is ever active at a time, as all the others wait for their “turn”. That means that ⅞ of the compute is wasted.

Distributed Data Parallel (DDP) was previously the gold standard approach to training models across multiple GPUs efficiently. This requires keeping the full model on each GPU – if you have a small model (e.g. a 2b model, which takes 4GB RAM) you can simply load the whole thing onto each GPU separately, and have each GPU then churn through training examples in parallel. So for instance, if you had 4 GPUs, that’s a 4x training speedup. But DDP doesn’t work if the model doesn’t fit onto a GPU, with enough room to spare for the data needed for the training process.

So we need something that can split a model across GPUs (like device_map=’auto’) and also use them in parallel (like DPP). This is where Meta’s Fully Sharded Data Parallel (FSDP) library comes in. It “shards” a large model, by splitting its parameters across multiple GPUs, allowing all the GPUs to be used simultaneously. When a layer of the neural network is calculated on a particular GPU during training, all the required shards are copied there. Then, the calculation is made, and finally the copied parts are deleted from that GPU. Whilst this sounds terribly inefficient, actually by being smart about copying the data of the next layer at the same time the current layer is busy calculating, it’s possible for this approach to result in no slowdown compared to DDP.

FSDP’s ability to bring the performance of DDP to models that are larger than any one GPU has been a revelation. For instance, a 70b (70 billion parameter) unquantized model takes 140GB of RAM (because each parameter is stored as 16 bits, which is 2 bytes), but even NVIDIA’s H100 card (which costs around $40,000 for a single card!) falls short of what’s needed, with its 80GB RAM. But with FSDP, four H100 GPUs can be combined for a total of 320GB RAM.

(Mind you, such a machine is going to set you back around $150,000…)

Bringing FSDP and QLoRA together

At Answer.AI our north star is making useful AI more accessible. $150,000 to create your own high-quality personalized model definitely doesn’t count as accessible! So the first project we embarked on was to make it possible to use a desktop with consumer gaming GPUs to efficiently train a 70b model. We figured that if we could use QLoRA to reduce the size of a model by around 400% (so a 70b model would fit into 35GB RAM), and then we used FSDP to shard that across two or more 24GB consumer cards, that would leave enough RAM left over to train a model.

First steps

Jeremy and Tim in late 2023 discussed the idea of bringing FSDP and QLoRA together. Tim connected Jeremy with Titus von Koeller, and Jeremy and Titus worked together to try, explore, understand, and document the issues that occurred when the two libraries were combined.

Answer.AI’s Johno Whitaker put together an important first step: a simple standalone test script which allowed us to more deeply understand the problem, and test solutions. A key breakthrough came in early 2024 when Answer.AI’s Benjamin Warner and Titus independently came up with a key idea: store the quantized parameters in a selectable data type, where that storage data type is the same data type as the “computation type” of the model.

Benjamin had this prototyped within 24 hours of developing the idea, but then we discovered another problem: FSDP was not copying the quantization information needed for each shard to use the model! That’s because FSDP is quite opinionated on the subset of data it will sync between GPUs. We realized that if we quantized the model on each GPU the missing metadata would remain untouched on all GPUs. Furthermore, we had to move the “quantization state” (the information necessary to (de)quantize the parameters) from the parameters into the layers, in order to ensure they were not removed when FSDP moved shards.

Once we had those issues resolved, we were able to successfully train our first batch of data with a quantized model using FSDP! Benjamin and Answer.AI’s Kerem Turgutlu were able to package this up with all the tests and refactoring needed into a pull request for bitsandbytes. We’re extremely grateful to the maintainers of the bitsandbytes project, who were very responsive in shepherding our PR through their processes.

Mission accomplished, nearly

At this point, we once again figured that we’d have things tied up pretty quickly, and once again we under-estimated the complexity of the task! The first thing we realized was that it still wasn’t actually possible to load a quantized model that’s larger than a single GPU, since the loading and quantization process itself required the whole model to be put on one GPU.

Jeremy had spent a few weeks carefully studying Meta’s fantastic Llama-Recipes project, which was the best complete FSDP fine tuning implementation he had found, and by closely tracking how it worked together with bitsandbytes, along with Hugging Face’s PEFT, Transformers, and Accelerate projects, he managed to construct a minimal standalone script which manually complete all the steps needed to fine tune a model.

Benjamin realized that with some tweaks it would be possible to do the loading and discretization one layer at a time, thus avoiding the need to ever have the whole model on a single GPU. He also figured out how to prevent the PEFT library from moving the quantization state to CPU. Kerem wrote a custom implementation of the LoRA algorithm so that it could work with Benjamin’s changes.

And with that, we were able to fine tune a 70b model on dual 3090 gaming GPUs for the first time!

To make this work, we were benefiting not just from FSDP and QLoRA, but also from a vast array of clever techniques developed over the last couple of years by the academic and open source communities. We used:

  • Gradient checkpointing (also known as activation checkpointing) to avoid storing full gradients, instead saving activations at a number of ‘checkpoints’ throughout the model and then re-computing gradients by re-running the forward computation step as needed
  • CPU offloading to store weights in CPU RAM rather than on the GPU when they are not in use, drastically reducing the GPU memory required. This technique isn’t very useful to the “GPU rich” using H100 GPUs, which have highly optimized ways to pass weights to each other. But for our use case, it’s absolutely necessary, since gaming GPUs and motherboards don’t have these systems
  • Flash Attention 2 to efficiently calculate attention using a memory-optimized Cuda kernel.

Making these work together with FSDP and QLoRA wasn’t always straightforward. For instance, when using CPU offloading, Benjamin discovered and fixed a problem in bitsandbytes. Each time an “offloaded” weight was copied back to the GPU, it was being automatically re-quantized, which effectively turning the pretrained model into random weights! We made a pull request to bitsandbytes that kept track of which parameters had already been quantized, so we could avoid the redundant computation.

After all this work, we were very pleased to see that we could train large models with consumer GPUs. Jeremy had run detailed benchmarking of the original llama-recipes across a range of GPU hardware, and Kerem developed a comprehensive benchmarking system for the new project. In comparing the two we realized that we were still not able to use the sequence lengths or batch sizes we’d hoped – for some reason more memory was being used than we expected.

When we looked closely, it turned out that it wasn’t due to our FSDP/QLoRA integration at all – but actually as we increased seqlen in bitsandbytes even without FSDP, the memory usage went up super-linearly, eventually resulting in even higher memory usage than without quantization! It turns out that we’re not the first people to discover this problem. We don’t have a bitsandbytes solution yet (but it’s being investigated), but it did lead us to an exciting discovery…

Discovering HQQ

We love collaborating with like minded people, so when we saw the amazing work being done by Daniel Han on Unsloth we wanted to learn more, and see whether we might be able to help each other out. We asked Daniel if there were other interesting projects in this space that we should be watching, and he pointed us to HQQ.

To explain HQQ, we’ll need to give a bit of background first… The 4-bit quantization done by bitsandbytes uses a simple, fast, and clever approach where each group of parameters is normalized to a consistent range, and then each parameter is placed in a bucket, where the bucket breakpoints are based on an assumption that the parameters are normally distributed. This has the benefit that quantization is nearly instant, but because real model parameters will not exactly fit the assumed distribution, accuracy can suffer.

Other approaches such as GPTQ and the more recent AWQ go in a different direction, where the quantization parameters are optimized based on the actual behavior of the model when representative data is passed to it. These methods tend to produce more accurate models, potentially with even fewer than 4 bits per parameter; but they have the downside that the optimization process can take hours or even days for each model.

HQQ combines the best of both worlds. It is 50x faster to process a 70b model compared to GPTQ, yet is more accurate than it. Kerem decided to investigate whether HQQ would work well with FSDP.

He discovered that making HQQ and FSDP work well together took nearly the exact same steps as was required for bitsandbytes, and as result he had a complete working example completed within days. The mobius.ml folks couldn’t have been more responsive and helpful in ensuring that our PR was successfully merged – so we are now delighted to announce that FSDP works with HQQ too!

How to use FSDP/QLoRA

To use FSDP, you will of course need more than one GPU. If you don’t have access to such a system, you can rent a dual 3090 box from the Runpod Community Cloud for around $0.60/hour. There are many other providers to choose from; cloud-gpus is a great place to see what’s on offer.

You’ll need to install the latest version of Transformers, PEFT, and bitsandbytes (and HQQ if you’re using that). Then, clone our repo and follow the README there. Running python train.py --help will show the available options. To train llama2-7b on the included alpaca dataset on two 24GB cards you might run:

python train.py --train_type qlora --dataset alpaca --batch_size 8 --gradient_accumulation_steps 2 --output_dir qlora_output --log_to wandb

We’ve crammed everything required into this single file to make it easier to see what is going on and modify things if necessary.

You should treat this script as an alpha/preview release. Whilst we’ve used it to successfully train a variety of practically useful models on a range of hardware, it’s still early days. If you’re not comfortable with testing and debugging models, we’d suggest holding off for a few months whilst the community more fully tests the approach.

Update: We’re excited to see that support is already underway in the Hugging Face ecosystem via changes to Accelerate, Transformers, TRL and PEFT. Our code has also been incorporated into the Axolotl finetuning library and used to train Mixtral and other models.

A first step

Originally we’d planned to show some benchmarks in this article, along with guidance based on the benchmark results about how to best take advantage of FSDP/QLoRA. However, we had to keep delaying posting this, because we kept making major improvements every few days! So we’ll follow up with a benchmarking and recommendations article in the coming weeks.

What we’ve shown here is just the first step. We have lots of ideas for improvements, and we’re sure that the open source community will have many other ideas we haven’t even thought of yet!

We hope that seeing this proof of concept, that it’s possible to scale resource-efficient QLoRA training across inexpensive gaming GPUs, will help bring more attention to the problem of bringing down the cost of model training. It’s in everyone’s interest to make AI more accessible – and to enable more people to not only consume, but also build, valuable models.

Footnotes

  1. Throughout this article “training” can refer to either pre-training, or fine-tuning.↩︎

  2. Strictly speaking, we don’t generally store the actual activations, but rather just the intermediate pieces needed to recalculate them on demand by using gradient checkpointing.↩︎

  3. FSDP only supports floating point types but most quantization libraries store quantized weights in integer types. A selectable storage datatype resolved this discrepancy.↩︎

  4. FSDP only sypports syncing PyTorch parameters and buffers, while most quantization libraries store “quantization state” metadata in dictionaries.↩︎

  5. This is very important as calibration data bias is another major issue one can face using these data dependent methods.↩︎