
Best GPUs For Local LLMs


https://techtactician.com/best-gpu-for-local-llm-ai-this-year

Out of all the things that you might want in a GPU used for both training AI models and model inference, four matter most: the amount of available video memory (VRAM), the maximum clock speed, cooling efficiency, and overall benchmark performance. Of these, VRAM is arguably the most important.


VRAM – For training AI models, fine-tuning, and efficiently running AI calculations on large batches of data, you need as much VRAM as you can get. In general, you don't want less than 12GB of video memory, and cards with 24GB are ideal if you can afford them. When training LoRA models for Stable Diffusion, having less VRAM simply makes the process slower, since you can't process larger data batches at once; with open-source large language models, however, a lack of VRAM can completely lock you out of many larger, higher-quality text generation models.
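If you want to put rough numbers on this, here is a minimal back-of-the-envelope sketch in Python for estimating how much VRAM an LLM needs just to hold its weights at a given quantization level. The flat overhead figure for the KV cache, activations and CUDA buffers is purely an illustrative assumption, not a measurement.

```python
# Rough, back-of-the-envelope VRAM estimate for loading an LLM for inference.
# All numbers here are illustrative assumptions, not measurements.

def estimate_vram_gb(n_params_billion: float, bits_per_weight: int,
                     overhead_gb: float = 1.5) -> float:
    """Estimate VRAM needed for the model weights plus a flat overhead
    (assumed ~1.5 GB here) for the KV cache, activations and CUDA buffers."""
    weights_gb = n_params_billion * 1e9 * (bits_per_weight / 8) / 1024**3
    return weights_gb + overhead_gb

# Example: a 13B model quantized to 4 bits vs. kept in 16-bit precision.
for bits in (4, 16):
    print(f"13B @ {bits}-bit: ~{estimate_vram_gb(13, bits):.1f} GB of VRAM")
```

On this rough estimate, a 4-bit 13B model fits comfortably on a 12GB card, while the same model in 16-bit precision would not even fit in 24GB – which is exactly why both the VRAM amount and the quantization level matter so much.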


Max clock speed – This is pretty self-explanatory: the faster the GPU can make calculations, the faster you get through the model training process. However, this takes us straight to the next point.


Cooling efficiency – Fast calculations generate a large amount of heat that needs to be dissipated. Once the GPU hits its thermal limit and can't cool itself quickly enough, thermal throttling kicks in – the core clock slows down so as not to cook your precious piece of smart metal.

My take is: always aim for triple-fan GPUs with benchmark-proven cooling. In my experience, with the graphics cards mentioned in this list (which are pretty much top shelf), cooling is hardly ever an issue.
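If you want to check whether your own card is throttling under load, one simple approach is to watch its temperature and core clock while a model is generating. Below is a minimal monitoring sketch using the pynvml bindings (assuming an NVIDIA GPU and the nvidia-ml-py package installed) – if the clock drops while the temperature sits pinned at its limit, you are being thermally throttled.

```python
# Minimal sketch: sample GPU temperature, core clock and load once per second.
# Assumes an NVIDIA GPU and `pip install nvidia-ml-py`.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU in the system

try:
    for _ in range(10):  # sample for ~10 seconds while the GPU is busy
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        print(f"temp: {temp} C | core clock: {clock} MHz | load: {util}%")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```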

Overall benchmark performance – Once again, whenever in doubt, check the benchmark tests, which luckily are available all over the internet these days. These are useful mostly when comparing two high-end cards with each other.

https://techtactician.com/best-gpu-for-local-ai-software-this-year



1. NVIDIA GeForce RTX 4090 24GB – Performance Winner


https://www.amazon.com/s?k=NVIDIA+GeForce+RTX+4090

https://www.ebay.com/sch/i.html?_from=R40&_nkw=NVIDIA+GeForce+RTX+4090

For: Unmatched speed and the highest amount of VRAM (24GB) currently available on a consumer GPU.
Against: Very high price – arguably overpriced.

The NVIDIA GeForce RTX 4090 is right now unbeatable when it comes to speed, and it comes with 24GB of VRAM on board, which is currently the highest value you can expect from consumer GPUs. However.

It's pricey, and as some (including me) would say, overpriced. This is just how it is with NVIDIA graphics cards these days, and sadly it's not likely to change anytime soon. With that out of the way, it has to be said that if you have the money to spare, this is the card to go for if you want the absolute best performance on the market, including the highest amount of VRAM accessible by regular means.

With the right configuration, the 4090 can give you near-instant generations with most basic locally hosted LLMs and local Stable Diffusion based image generation software, as well as the best performance for model training and fine-tuning.

Performance-wise, it’s the best card on this list. Cost-wise, well, some things never change. Regardless, it wouldn’t be right not to begin with the RTX 4090, which is currently the most sought-after NVIDIA GPU on the market. With that said, let’s now move to some more affordable options, as there are quite a few to choose from here!

2. NVIDIA GeForce RTX 4080 16GB


For: Performance still above average. Price is just a little bit lower. You can find good second-hand deals on this one.
Against: The price is still far from ideal. Noticeably worse benchmark scores compared to the 4090. “Only” 16GB of VRAM on board.

When it comes to the NVIDIA GeForce RTX 4080, in terms of sheer performance it takes second place on my grand GPU top list, right after the 4090. It does so, however, with a much lower amount of available video memory on board.

The 16GB of GDDR6X VRAM is the most video memory you can get when buying the 4080, and while it certainly isn't the dreaded 8GB of older graphics cards, you can do better at a much lower price point – for instance with the RTX 3090 Ti 24GB, which I'm going to talk about in a short while.

Again, if you have the money to spare and you're already looking at GPUs in the higher price range, I would rather go for the 4090 – or scroll down this list for more cost-effective options which, as it turns out, don't fall that far behind these two top models in terms of computational abilities.

Overall, the 4080 is still a card very much worth looking into, even more so if you can find it used for a good price (for instance here on Ebay). But there is one more proud representative of the 4xxx series I want to show you – one that goes head to head with the models already mentioned while being much more reasonably priced.

3. NVIDIA GeForce RTX 4070 Ti 12GB


For: Still a pretty good choice performance-wise. There are a lot of great deals for it out there.
Against: 12GB of VRAM can be too little for some use cases and applications. Less efficient than the 4080 and the 4090 computation-wise.

Now comes the time for the NVIDIA GeForce RTX 4070 Ti, taking third place on the podium. Is 12GB of VRAM too little? Here is what I have to say about it.

As I've already said, if you're planning to train AI models and in general make your GPU do complex math on large data batches, you want access to as much video memory as possible. What also matters here, however, is computation speed – and quite surprisingly, the 4070 Ti again doesn't fall very far behind the first two cards on this list, ending up roughly 20% slower than the GeForce RTX 4080.

While 12GB of VRAM certainly isn’t as much as the 24GB you can get with the 4090 or the 3090 Ti, it doesn’t really disqualify the 4070 Ti here. In my area, you can get the 4070 Ti for much less than both the 4080 and 4090, and that’s primarily what made me consider picking it up for my local AI endeavors. Still there are a few other options out there, including my personal pick, which is the next card on this list.


Best GPUs For Local LLMs In 2024

Updated: May 13, 2024

Oobabooga WebUI, koboldcpp, in fact, any other software made for easily accessible local LLM model text generation and chatting with AI models privately have similar best-case scenarios when it comes to the top consumer GPUs you can use with them to maximize performance. Here is my benchmark-backed list of 6 graphics cards I found to be the best for working with various open source large language models locally on your PC. Read on!

Contents
1 What Are The GPU Requirements For Local AI Text Generation?
2 How Much VRAM Do You Really Need?
3 Is 8GB Of VRAM Enough For Playing Around With LLMs?
4 Should You Consider Cards From AMD?
5 Can You Run LLMs Locally On Just Your CPU?
5.1 1. NVIDIA GeForce RTX 4090 24GB
5.2 2. NVIDIA GeForce RTX 4080 16GB
5.3 3. NVIDIA GeForce RTX 4070 Ti 12GB
5.4 4. NVIDIA GeForce RTX 3090 Ti 24GB – Most Cost-Effective Option
5.5 5. NVIDIA GeForce RTX 3080 Ti 12GB
5.6 6. NVIDIA GeForce RTX 3060 12GB – The Best Budget Choice

And here you can find the best GPUs for general AI software use – Best GPUs For AI Training & Inference This Year – My Top List

Note: The cards on the list are ordered by their price. Read the descriptions for info regarding their performance!


What Are The GPU Requirements For Local AI Text Generation?

[Image: A basic chat conversation with an AI model using the OobaBooga text generation WebUI.]

Contrary to popular belief, for basic AI text generation with a small context window you don't really need the absolute latest hardware – check out my tutorial here! Running open-source large language models locally is not only possible, but extremely simple. If you've come across my guides on the topic, you already know that you can run them on GPUs with less than 8GB of VRAM, or even without having a GPU in your system at all! But just running the models isn't quite enough. In an ideal world you want to get responses as fast as possible. For that, you need a GPU that is up to the task.

So, what are the things you should be looking for in a graphics card that is to be used for AI text generation with LLMs? One of the most important answers to this question is – a high amount of VRAM.

VRAM is the memory located directly on your GPU which is used when your graphics card processes data. When you run out of VRAM, the GPU has to “outsource” the data that doesn’t fit in its own memory to the main system RAM. And this is when trouble begins.

While your main system RAM is also fast, shuttling data back and forth between the GPU and system RAM takes time, and that transfer is what causes extreme slowdowns once the VRAM on your graphics card runs out.
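Before loading a model, it's worth checking how much VRAM you actually have free. Here is a quick sketch using PyTorch (assuming a CUDA-enabled install); other tools such as nvidia-smi report the same numbers.

```python
# Minimal check of free vs. total VRAM before loading a model.
# Assumes PyTorch with CUDA support is installed.
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # (free, total) in bytes
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"free VRAM:  {free_bytes / 1024**3:.1f} GB")
    print(f"total VRAM: {total_bytes / 1024**3:.1f} GB")
else:
    print("No CUDA-capable GPU detected - inference would fall back to CPU and system RAM.")
```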

Running out of VRAM is not only a problem that you might encounter when using LLMs, but also when generating images with Stable Diffusion, doing AI vocal covers for popular songs (see my guide for that here), and many other activities involving locally hosted artificial intelligence models.

There are also many other variables that count here. The number of tensor cores, the amount and speed of cache memory, and the memory bandwidth of your GPU are also crucial. However, you can rest assured that all of the GPUs listed below meet the conditions that make them top-notch choices for use with various AI models. If you want to learn even more about the technicalities involved, check out this neat explainer article here!

How Much VRAM Do You Really Need?

[Image: NVIDIA RTX 2070 SUPER with the OobaBooga WebUI – my generation speeds on my old NVIDIA RTX 2070 SUPER, reaching up to 20 tokens/s using the OobaBooga text generation WebUI.]

The straightforward answer is: as much as you can get. The fact, however, is that when it comes to consumer-grade graphics cards, there aren't really many models with more than 24GB of VRAM on board for now. If you want the absolute best, these are the cards you should aim for. An example of such a card on the high end would be the NVIDIA GeForce RTX 4090, which I'll cover in a short while.

The only other viable way to get more operational VRAM is to connect multiple GPUs to your system, which requires both some technical skills and the right base hardware. In general though, 24GB of VRAM will handle most larger models you throw at it and is more than enough for most applications!
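As a rough illustration of the multi-GPU route, Hugging Face transformers (with accelerate installed) can split a model's layers across whatever GPUs it finds via device_map="auto". This is only a sketch – the model name below is an example placeholder, not a specific recommendation:

```python
# Sketch: spread one model across all available GPUs with transformers + accelerate.
# Assumes `pip install transformers accelerate torch` and at least one CUDA GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example model - swap in your own

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # let accelerate place layers across the available GPUs
    torch_dtype="auto",  # keep the checkpoint's native precision
)

prompt = "Explain in one sentence why VRAM matters for local LLMs."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```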

Is 8GB Of VRAM Enough For Playing Around With LLMs?

Yes, you can run some smaller LLM models even on an 8GB VRAM system. As a matter of fact, I did exactly that in this guide on running LLM models for local AI assistant roleplay chats, reaching speeds of up to around 20 tokens per second with a small context window on my old trusted NVIDIA GeForce RTX 2070 SUPER (a short 2-3 sentence message generated in just a few seconds). You can find the full guide here: How To Set Up The OobaBooga TextGen WebUI – Full Tutorial

While you certainly can run some smaller and lower-quality LLMs even on an 8GB graphics card, if you want higher output quality and reasonable generation speeds with larger context windows, you should really only consider cards having between 12 and 24GB of VRAM – and these are exactly the cards I’m about to list out for you!
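As a concrete illustration of squeezing a model into limited VRAM, here is a minimal llama-cpp-python sketch that offloads only part of a quantized 7B model's layers to the GPU. The GGUF file path and the layer count are assumptions you would adjust for your own setup:

```python
# Sketch: run a quantized 7B GGUF model on a card with limited VRAM by
# offloading only some layers to the GPU. Assumes `pip install llama-cpp-python`
# (built with CUDA support) and a GGUF model file you downloaded yourself.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # example path - adjust
    n_ctx=2048,        # a small context window keeps the KV cache cheap
    n_gpu_layers=20,   # offload ~20 layers to the GPU; raise until VRAM is full
)

out = llm("Q: Name three uses of a local LLM.\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```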

Should You Consider Cards From AMD?

[Image: Our NVIDIA RTX 3070 Ti testing unit.]

In most cases, especially if you're a beginner when it comes to local AI and deep learning, it's best to pick a graphics card from NVIDIA rather than AMD. Here is why. This might be a tricky question for some. While AMD cards are certainly cheaper than the ones sold by NVIDIA (in most cases anyway), they are also known for certain driver and support issues that you will want to avoid when dabbling in locally hosted AI models, not to mention the lack of CUDA support, which makes AMD cards substantially slower in many AI-related applications. They are simply not great for an out-of-the-box experience with what we're doing here, at least in my honest opinion.

Moreover, you should also know that many pieces of software such as the Automatic1111 WebUI, or the OobaBooga WebUI for text generation (and more), have different installation and configuration paths for AMD GPUs, and their support for the graphics cards other than the ones manufactured by NVIDIA is oftentimes rather bad. If you’re afraid of spending a lot of time troubleshooting your new setup, it’s best to stick with NVIDIA – trust me on this one.

Can You Run LLMs Locally On Just Your CPU?

[Image: Intel i7-13700KF processor installed on a motherboard, closeup shot. GPT4All lets you run many open-source LLM models on your CPU – in that case the models are loaded directly into the main system RAM, and in most cases CPU inference is slower than using a GPU.]

Yes! And one of the easiest ways to do that is to use the free, open-source GPT4All software, which lets you generate text with AI without even having a GPU installed in your system.

Of course, keep in mind that for now, CPU inference with larger, higher quality LLMs can be much slower than if you were to use your graphics card for the process. But yes, you can easily get into simpler local LLMs, even if you don’t have a powerful GPU.
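For reference, a minimal sketch of CPU-only generation with the GPT4All Python bindings might look like the snippet below – the model name is just an example of one of the smaller models GPT4All offers, and it will be downloaded on first use:

```python
# Sketch: CPU-only text generation with the GPT4All Python bindings - no GPU needed.
# Assumes `pip install gpt4all`; the model is downloaded automatically if missing.
from gpt4all import GPT4All

model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")  # example small model that fits in RAM

with model.chat_session():
    reply = model.generate("Explain what VRAM is in two sentences.", max_tokens=120)
    print(reply)
```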

Now let’s move on to the actual list of the graphics cards that have proven to be the absolute best when it comes to local AI LLM-based text generation. Here we go!

1. NVIDIA GeForce RTX 4090 24GB

[Image: NVIDIA GeForce RTX 4090 24GB graphics card.]

For now, the NVIDIA GeForce RTX 4090 is the fastest consumer-grade GPU your money can get you. While it's certainly not cheap, if you really want top-notch hardware for messing around with AI, this is it.

The 24GB version of this card is without question the absolute best choice for local LLM inference and LoRA training if you have the money to spare. It can offer amazing generation speeds of up to around 30-50 t/s (tokens per second) with the right configuration. This guy over on Reddit even chained 4 of these together for his ultimate rig for handling even the most demanding LLMs. Check the current prices of this beautiful beast here!
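If you'd rather measure your own tokens-per-second figure than take anyone's word for it, a rough timing sketch (here using llama-cpp-python with full GPU offload; the model path is an assumption) could look like this – note it also counts prompt processing time, so treat the result as a ballpark:

```python
# Rough tokens-per-second measurement for a local GGUF model.
# Assumes `pip install llama-cpp-python` with CUDA and a downloaded model file.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # example path - adjust
    n_gpu_layers=-1,  # -1 = offload all layers to the GPU
)

start = time.perf_counter()
out = llm("Write a short paragraph about graphics cards.", max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
```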

With the clear and rather unsurprising winner out of the way, let’s move on to some more affordable options, shall we?

See The RTX 4090 On Amazon! See The RTX 4090 On Ebay!

2. NVIDIA GeForce RTX 4080 16GB


[Image: NVIDIA GeForce RTX 4080 16GB graphics card.]

The NVIDIA GeForce RTX 4080 comes right after the 4090 when it comes to performance. Where it lacks, however, is the VRAM department.

While there is a pretty notable performance gap between the 4080 and the 4090, the most important difference between these two cards is that the GeForce RTX 4080 maxes out at 16GB of GDDR6X VRAM, which is significantly less than the 4090 has to offer.

As we've already established, for running large language models locally you ideally want as much VRAM as you can possibly get. For that reason alone, the RTX 4080 would not be my first choice when picking a graphics card for this purpose on a decent budget. Still, the 4080 offers great, way-above-average performance and can yield surprisingly good text generation speeds. It just won't fit some larger LLM models which you could run without trouble on its bigger brother.

See The RTX 4080 On Amazon! See The RTX 4080 On Ebay!

3. NVIDIA GeForce RTX 4070 Ti 12GB


[Image: NVIDIA GeForce RTX 4070 Ti 12GB graphics card.]

The NVIDIA GeForce RTX 4070 Ti, while having even less VRAM than the 4080, is just a little bit more affordable than my first two picks, and it's still one of the best performing GPUs on the market as of now.

This card, still making the overall top list of GPUs you can get this year, offers about two times the performance of the RTX 3060, and it does so for a pretty good price. If you can make do with 12GB of VRAM, this might just be a good choice for you.

This card, to be perfectly honest, is in a bit of a weird place when it comes to LLM use. It doesn't give you a large amount of VRAM, it visibly falls behind the 4080 and the 4090 in both benchmark and real-life performance, and sadly its price doesn't seem to reflect that yet. If you're looking for a better price/performance ratio, consider checking out the 3xxx series cards that I'm about to show you.

See The RTX 4070 Ti On Amazon! See The RTX 4070 Ti On Ebay! Want the absolute best graphics cards available this year? – I’ve got you covered! – Best GPUs To Upgrade To These Days (My Honest Take!)

4. NVIDIA GeForce RTX 3090 Ti 24GB – Most Cost-Effective Option


[Image: NVIDIA GeForce RTX 3090 Ti 24GB graphics card.]

With the NVIDIA GeForce RTX 3090 Ti, we're stepping down in price even more, but surprisingly, without sacrificing much performance. The 3090, alongside the 3080 series, is still among the most commonly chosen GPUs for LLM use.

In my personal experience, confirmed by recorded user benchmarks, the 3090 Ti performance-wise comes right after the already mentioned 4070 Ti. When it comes to price, this last GPU of the NVIDIA 3xxx series is probably one of the best pieces of hardware on this list. The newest 4xxx generation of NVIDIA cards is still pretty overpriced, but the older models have already been slowly dropping in price since the end of last year.

Learn more about this GPU here: NVIDIA GeForce 3090/Ti For AI Software – Is It Still Worth It?

So in other words, both the original 3090 (offering just a tad less performance and the same amount of video memory) and the 3090 Ti are the most cost-effective graphics cards on this list. If you absolutely don't want to overpay, you can also get one of these second-hand – you can find quite a few 3090s on Ebay for a very good price!

See The RTX 3090 Ti On Amazon! See The RTX 3090 Ti On Ebay!

5. NVIDIA GeForce RTX 3080 Ti 12GB


[Image: NVIDIA GeForce RTX 3080 Ti 12GB graphics card.]

After the 3090 Ti quite naturally comes its predecessor, the NVIDIA GeForce RTX 3080 Ti. This GPU, while having only 12GB of VRAM on board, is still a pretty good choice if you're able to find a good deal on it.

The 3080 Ti and the 3090 Ti are really close together when it comes to their specs and real-world performance. When it comes to the on-board VRAM, however, the 3090 Ti easily comes off as the better choice. Given the small performance difference (and a roughly 100-watt gap in TDP between the two), the 3080 Ti is in my eyes only worth it if you can find it used for cheap.

If you can, grab the 3090 Ti or a base 3090 instead. Unless the price is substantially better, there is no good reason to stick with the previous model, mainly because of the smaller amount of VRAM it has to offer. Now let's move on to the real budget king you might have been waiting for!

See The RTX 3080 Ti On Amazon! See The RTX 3080 Ti On Ebay!

6. NVIDIA GeForce RTX 3060 12GB – The Best Budget Choice


[Image: NVIDIA GeForce RTX 3060 12GB graphics card.]

The NVIDIA GeForce RTX 3060, with 12GB of VRAM on board and a pretty low current market price, is in my book the absolute best tight-budget choice for local AI enthusiasts, both for LLMs and for image generation.

I can already hear you asking: why is that? Well, the prices of the RTX 3060 have already fallen quite substantially, and its performance, as you might have guessed, did not. In most benchmarks this card is placed right after the RTX 3060 Ti and the 3070, and you will be able to run most 7B or 13B models with moderate quantization on it at decent text generation speeds. With the right model and the right configuration you can get almost instant generations in low to medium context window scenarios!

As always, you can also look at some used GPU deals on Ebay when it comes to previous-gen graphics cards like this one! Finding the right one can make your purchase even more budget-friendly!

See The RTX 3060 12GB On Amazon! See The RTX 3060 12GB On Ebay!

You might also like: Best GPUs For AI Training & Inference This Year – My Top List
