A local LLM is not one product. It is a stack: model weights, a file format, an inference engine, a desktop app or server, and the hardware ceiling underneath all of it. Once you see that, the landscape stops looking like a noisy list of apps and starts looking like a set of sane engineering choices: what model can this machine hold, what runtime can execute it, how do I want to interact with it, and am I only doing inference or also training?
Why run an LLM locally
The most concrete reason to run a local LLM is that the tokens never have to leave your machine. If you are summarizing internal documents, searching a private codebase, or building a tool that touches customer data, local inference can remove an entire class of “should we upload this?” questions. That does not magically solve security: your endpoint still needs to be secured, and the model can still leak data to logs or downstream tools. It does, however, remove a third-party API from the path.
Cost is the second big reason, and it shows up differently depending on usage. For occasional chat, hosted APIs are often cheaper because your laptop and time are not free. For repeated workloads (coding help all day, document extraction across a large archive, a small internal assistant used by one team), the economics can flip. Our piece on local LLM coding covered one extreme example: a quantized 14B Qwen model on a single RTX 5060 Ti, wrapped in the ATLAS orchestration pipeline, reached 74.6% pass@1 on one LiveCodeBench slice with no API calls. The exact benchmark comparison there is not apples-to-apples, but the useful lesson is simpler: for some workloads, a modest local setup can be good enough often enough that the API premium starts to look like a convenience fee.
Offline use matters more than people expect. A local LLM works on a plane, in a lab with restricted connectivity, in an on-prem environment, or during an API outage. It also avoids rate limits and vendor-side product changes. If you have ever had a workflow quietly worsen because a hosted interface changed prompts, model routing, or context limits, local control is appealing partly because it is boring. You pick the model version. It stays there until you change it.
Latency can also improve, but only in a specific band. For small and medium quantized models that fit comfortably in GPU memory, local response can feel snappy because there is no network round trip and no shared remote queue. For larger models on weak hardware, the opposite is true: local can be painfully slow. A small model on your MacBook may answer before a frontier API. A too-large model paging between RAM and storage will make you question your life choices.
The honest counterpoint is that hosted models still win plenty of comparisons. Frontier APIs are usually stronger on reasoning, multimodal breadth, tool integrations, long context, reliability under load, and plain old convenience. They also avoid the friction of drivers, model downloads, storage management, and format compatibility. Running locally is not “better.” It is better when privacy, control, offline use, predictable cost, or system customization matter enough to justify the setup.
There is also a less obvious advantage: local setups teach you where LLM performance actually comes from. You stop asking “what’s the best app?” and start asking more useful questions, such as:
- What model size can my hardware support?
- Do I need a GUI, a CLI, or an API endpoint?
- Am I using GGUF quantized weights for inference or safetensors for training?
- Is my bottleneck weights, KV cache, CPU offload, or plain storage I/O?
That is the right mental model for the rest of this guide.
The local inference stack
The local inference stack has five layers, and mixing them up causes most beginner confusion.
Layer 1 is the model itself: Llama-family models, Qwen, Gemma, Mistral, DeepSeek derivatives, coding models, instruction-tuned variants, and so on. This is the learned behavior in the weights.
Layer 2 is the file format those weights are stored in. The big split you will run into is GGUF versus safetensors. Hugging Face’s documentation describes GGUF as a binary format optimized for fast loading and saving for inference, designed for GGML-style executors and carrying both tensors and standardized metadata. safetensors, by contrast, is a tensor-focused format widely used in PyTorch ecosystems and training pipelines.
Layer 3 is the inference engine that can execute those weights. The reference point here is llama.cpp, which its maintainers position as an inference framework for Llama and other models on a wide range of hardware. A remarkable amount of the laptop-scale local space is built on top of it or borrows its assumptions. If you want maximum control, this is where you end up.
Layer 4 is the user-facing runtime: the thing you actually launch. This is where tools such as Ollama, LM Studio, Jan, and GPT4All live. Ollama’s own site presents it as the easiest way to run open models locally. LM Studio’s site presents it as a local desktop app for running models privately, with a built-in server. Those are vendor descriptions, but they match the broad shape of the tools.
Layer 5 is the serving layer: exposing the model to other software, usually over an OpenAI-compatible HTTP API. For a personal laptop workflow, the app and serving layer are often the same thing. For production, they are often separated, which is where tools like vLLM enter the picture. The vLLM documentation describes it as an inference and serving engine aimed at high-throughput LLM serving.
A few concrete examples make the stack click:
- Ollama is the simplest default for many people. It is CLI-first, runs models locally, and exposes a local server.
- LM Studio is the easiest GUI-first path. The useful bit is the built-in model browser, chat interface, and local server. It is beginner-friendly and polished. It is also closed-source.
- llama.cpp is for people who want to know what is happening and are willing to trade convenience for control. You choose backends, quantizations, launch flags, batching, and more.
- Jan and GPT4All are open-source alternatives in the desktop-app bucket. They matter if you want a GUI but do not want to commit to LM Studio’s closed-source approach.
- vLLM is not really a hobbyist desktop runner. It is a serving engine for higher-throughput inference, especially in production environments where batching, scheduling, and efficient GPU utilization matter.
A dry but useful observation: people often compare these tools as if they are substituting for each other at the same layer. They are not. Comparing LM Studio to llama.cpp is like comparing a dashboard to an engine. Comparing Ollama to vLLM is comparing a local convenience layer to a serving system designed for throughput.
That stack view also explains why a tool can “support the same model” and still behave differently. The weights may match, but quantization, prompt formatting, context settings, backend acceleration, batching policy, and chat template handling can all change the outcome.
If you only need one practical recommendation, it is this:
- Beginner: start with Ollama or LM Studio.
- Power user: learn llama.cpp.
- Production serving: evaluate vLLM.
- Training plus inference in one place: look at Unsloth and our coverage in “Unsloth Studio packs local LLM training into one app.”
What hardware you actually need
Hardware discussions get fuzzy very quickly, so it helps to separate four resources: VRAM, system RAM, storage, and compute throughput. For local inference, VRAM is usually the binding constraint on GPU systems. If the model and its runtime state fit in VRAM, life is good. If they spill into system RAM, performance drops. If they spill further and page aggressively, performance falls off a cliff.
The first thing to account for is weights. A 7B parameter model in 16-bit precision is much larger than the same model quantized to 4-bit. That is why quantization exists at all: fewer bits per weight means less memory and a smaller download. A second chunk of memory goes to the KV cache, which stores attention state for the tokens already processed. Longer contexts need larger KV caches. The third chunk is runtime overhead: temporary buffers, framework allocations, and whatever the backend needs.
That means “can I run this model?” is not answered by a single file size.
A useful mental shortcut is:
- weights decide whether the model loads at all;
- KV cache decides whether your chosen context stays practical;
- runtime overhead decides whether the whole thing remains comfortable.
For rough memory math, a quantized model’s on-disk file size is often a decent first approximation for the in-memory weight footprint, but it is only a first approximation. Runtime choice, backend, batch size, and context length all add overhead. That is why “fits” and “practical” are different questions.
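The rough memory math above can be sketched in a few lines. This is a back-of-the-envelope estimate, not a measurement: the function name and the bits-per-weight figures are illustrative assumptions, and real quantized files mix precisions across tensors, so actual sizes will differ.

```python
def estimate_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes), ignoring all overhead."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumed effective bits per weight; real quantization schemes vary.
for label, bits in [("16-bit", 16.0), ("Q4-like, ~4.5 bits", 4.5)]:
    print(f"7B at {label}: ~{estimate_weight_gb(7, bits):.1f} GB of weights")
```

Under these assumptions a 7B model drops from roughly 14 GB at 16-bit to just under 4 GB at a Q4-like setting, before any KV cache or runtime overhead is added.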
Here is a tighter, approximate guide for common local model classes:
| Model class | Typical quantization | Approx weight footprint | Where it usually fits | When it is practical |
|---|---|---|---|---|
| 7B class | Q4 | ~4GB to 5GB | 8GB GPU, 16GB Mac, many CPU-only laptops | Good starter tier for chat, coding snippets, document tasks |
| 8B class | Q5 | ~5GB to 6.5GB | 8GB GPU, 16GB+ Mac, 16GB+ RAM CPU systems | Often the sweet spot for first serious local use |
| 14B class | Q4 | ~8GB to 10GB | 12GB to 16GB GPU, 32GB+ unified memory Macs | Very practical daily-driver tier if context stays reasonable |
| 14B class | Q5 | ~10GB to 12GB+ | 16GB GPU, 32GB+ Macs | Better quality retention, less headroom for long context |
| 32B class | Q4 | ~18GB to 22GB+ | 24GB GPU, larger unified memory systems, servers | Usually only pleasant on roomy hardware |
| 32B class | Q5/Q8 | ~22GB to 35GB+ | 24GB+ GPU to workstation/server | More of a workstation or server decision than a casual laptop one |
Those ranges are deliberately approximate. They depend on runtime, quantization scheme, context length, batch size, and allocator overhead. A Q4_K_M build in one runner is not exactly the same operationally as another 4-bit variant in a different stack.
A second table makes the fits versus practical distinction more explicit:
| Hardware tier | Example fit range | Usually practical range | What breaks first |
|---|---|---|---|
| CPU-only laptop, 16GB RAM | 3B to 7B GGUF, sometimes 8B | 3B to 7B with modest context | Prompt ingestion speed and long generations |
| Apple Silicon, 16GB unified memory | 7B Q4/Q5, some 8B variants | 7B to 8B for chat and light coding | Long context and throughput under sustained use |
| GPU with 8GB VRAM | 7B Q4, some 8B Q4/Q5 | 7B to 8B with reasonable context | 14B models may load only with compromises and then feel slow |
| GPU with 12GB to 16GB VRAM | 8B Q5, 14B Q4/Q5 | 8B to 14B daily-driver tier | Context growth and concurrent requests |
| GPU with 24GB VRAM | 14B roomy, 32B Q4 possible | 14B to smaller 32B quantized models | Long context, larger batch sizes, serving more than one user |
| Workstation / server | 32B+ quantized and higher throughput | Team use, multi-user serving, larger contexts | Ops complexity, not raw fit |
A worked example is more useful than another hand-wavy table.
Example 1: 8GB VRAM GPU.
Think RTX 4060-class territory. This is where 7B or 8B models in Q4 or Q5 quantization make sense. A Qwen2.5 7B or Gemma-class 7B/8B model in a compact quantization is realistic. You can run coding or chat models locally and have a good time, but 14B models become much more conditional. Maybe they fit with aggressive quantization and modest context. Maybe they spill and become miserable. This is the tier where “the model technically launches” and “the model is practical” diverge.
Example 2: 16GB VRAM GPU.
This is a sweet spot. 14B Q4/Q5 models become realistic, and smaller models have room for longer contexts or better quantization. Qwen2.5-Coder 14B in a sensible quantization is the kind of thing people actually use all day on this class of hardware. If your goal is serious daily use for coding, private document work, or an internal assistant, this tier is where local inference stops feeling like a stunt and starts feeling like infrastructure.
Example 3: Apple Silicon with 32GB unified memory.
This is different because the CPU and GPU share the same memory pool, with GPU work accelerated through Apple’s Metal stack. Apple’s Metal documentation is the canonical source for the acceleration layer itself, not for “this Mac runs that model at X tokens per second.” In practice, though, Apple Silicon has become a very real path for local inference because a Mac with 32GB or 64GB unified memory can accommodate larger quantized models than a discrete-GPU spec sheet alone might suggest. The catch is throughput: fitting is not the same as being fast. Apple laptops can be excellent local LLM machines, but you still have to match the model to the workload.
The KV cache is the part many first-time users miss. A model that feels fine at 4k context may become awkward at 32k because the cache grows with prompt length. This is why a setup that is comfortable for chat can feel much worse for document Q&A, codebase indexing, or agent loops that keep long histories around.
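To make the growth concrete, here is a minimal sketch of the usual KV cache estimate for a single sequence: keys and values for every layer, KV head, and token. The model configuration below is hypothetical, and the sketch assumes a 16-bit cache with no cache quantization or sliding-window tricks, which some runtimes do apply.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Keys plus values (the factor of 2) for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical config: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
for ctx in (4_096, 32_768):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB of KV cache")
```

With this assumed configuration the cache grows from 0.5 GiB at 4k tokens to 4 GiB at 32k, which is the difference between comfortable and cramped on an 8GB card that is already holding the weights.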
CPU-only fallback matters too. If you have lots of RAM but no useful GPU, you can still run quantized GGUF models through llama.cpp, Ollama, LM Studio, or GPT4All. It will often be slower, especially on prompt ingestion and long generations, but it works. For occasional use, that can be enough.
Storage is the part people forget. Model files are large. A few 7B and 14B variants plus quantizations can consume tens or hundreds of gigabytes surprisingly quickly. Fast SSD storage helps with downloads and load times, even when it does not fix inference throughput.
Phones are real, but constrained. Our article on a local AI node on the Xiaomi 12 Pro covered a claimed headless phone setup serving Gemma through Ollama on a Snapdragon 8 Gen 1 device. The hardware is real; the exact setup details were not independently verified. The useful takeaway is not “your phone is a server now.” It is that small local inference is possible on modern mobile hardware if you accept aggressive thermal limits, memory constraints, and operational weirdness.
The catch: if you are wondering why people obsess over VRAM instead of TOPS marketing numbers, this is why. Marketing numbers do not tell you if the model fits. Memory usually does.
Model formats and quantization
If you download open model weights from Hugging Face, the most important distinction is usually GGUF for inference versus safetensors for training and framework-native use.
Hugging Face documents GGUF as a format optimized for quick loading and saving, designed for inference executors such as llama.cpp, and carrying standardized metadata alongside the tensors. That metadata is more useful than it sounds. It can include tokenizer information, architecture details, quantization parameters, and chat-template-relevant details that help runners load the model correctly.
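As a small illustration of that self-describing design, the sketch below reads only the fixed-size header at the start of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later (magic bytes, a 32-bit version, then 64-bit tensor and metadata counts); the function name is hypothetical, and parsing the metadata entries themselves is left out.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header (assumes little-endian, version >= 2)."""
    with open(path, "rb") as f:
        raw = f.read(24)  # 4 magic + 4 version + 8 tensor count + 8 KV count
    if len(raw) < 24 or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:])
    return {"version": version, "tensors": tensor_count, "metadata_keys": kv_count}
```

Calling `read_gguf_header("model.gguf")` on a downloaded file (the filename is a placeholder) gives a quick sanity check that the file is intact and carries metadata before you hand it to a runner.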
safetensors is the format you will often see in training-oriented repositories and PyTorch workflows. It is a safer and simpler tensor container for framework-native weights. If you are doing fine-tuning, merging adapters, or using serving stacks built around framework-native weights, safetensors is common. vLLM, for example, generally lives in that framework-native serving world rather than the GGUF desktop-inference world.
Here is the compact version:
| Format | Primary use | Common runtimes | Metadata behavior | Typical training use | Typical serving use | Usual conversion direction |
|---|---|---|---|---|---|---|
| GGUF | Local inference artifact | llama.cpp, Ollama, LM Studio, GPT4All | Rich standardized metadata for tokenizer/config/quantization | Rarely the source format for training | Common for desktop and edge inference | Usually exported from safetensors/framework-native weights |
| safetensors | Source weights for training and framework-native inference | PyTorch ecosystems, Transformers, many vLLM deployments | Tensor-focused container; config/tokenizer often travel as separate files in the repo | Common base format for LoRA/QLoRA and adapter workflows | Common in server stacks and framework-native inference | Often converted to GGUF for laptop-friendly inference |
That gives you a simple rule:
- Want easy local inference on a laptop? Start with GGUF.
- Want training, adapter work, or framework-native serving? Expect safetensors.
Why do these spaces split this way? Because they are optimizing for different things.
The GGUF world is trying to make single-file, portable, quantized inference easy across local runners. It is optimized around practical loading, quantization, and compatibility with llama.cpp-style executors. It is the format you hand to a local app when the goal is “run this on my machine.”
The safetensors world is trying to make training and framework-native serving sane. Training stacks want direct compatibility with PyTorch and Hugging Face tooling, adapters, checkpoint management, and the surrounding config files that define tokenizer and model behavior. It is the format you keep when the goal is “modify this model” or “serve it in a framework-native stack.”
That is why GGUF is usually an exported inference artifact, while safetensors often remains the source format for training and many serving stacks.
The next concept is quantization. This is the practice of representing model weights with fewer bits so the model uses less memory. In normal conversation, people say things like Q4, Q5, and Q8.
Very roughly:
- Q4 means around 4 bits per weight. Smallest memory footprint, usually easiest to fit, but with the most quality loss risk.
- Q5 uses more bits than Q4. It is a common compromise because it often preserves quality better while staying relatively compact.
- Q8 uses still more memory but tends to retain more of the original model quality.
The exact quantization scheme matters. A Q4_K_M file is not identical in behavior to every other “4-bit” file. Different methods preserve different tensors more carefully. This is why the filename soup on Hugging Face can look ridiculous until you realize it is encoding engineering trade-offs.
What do these mean in practice?
For many everyday uses, a good 7B or 8B model at Q4 or Q5 is much more useful than a larger model that barely fits. The larger model might have better raw capability, but if it runs at painful speeds, starves the rest of your machine, or forces you to cut context aggressively, the better benchmark number is not helping you. This is one of the central gotchas in local LLM land.
A rough practical read on common quantization choices:
- Q4: good starting point for limited VRAM. Useful on 8GB GPUs and lower-memory Macs.
- Q5: often the default “slightly nicer if you can afford it.”
- Q8: for smaller models when you want to preserve quality, or for larger-memory systems.
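One way to turn that rule of thumb into a quick check is to pick the highest-precision level whose weights fit a memory budget. The sketch below is only that: the bits-per-weight values and the 2 GB reserve for KV cache and overhead are assumptions chosen for illustration, not properties of any specific quantization scheme.

```python
# Assumed effective bits per weight; illustrative, not exact for any scheme.
ASSUMED_BITS = {"Q8": 8.5, "Q5": 5.5, "Q4": 4.5}


def pick_quantization(params_billions, memory_gb, reserve_gb=2.0):
    """Return the highest-precision level whose weights fit, or None."""
    budget = memory_gb - reserve_gb  # leave room for KV cache and overhead
    for name, bits in ASSUMED_BITS.items():  # ordered from most to fewest bits
        weights_gb = params_billions * bits / 8
        if weights_gb <= budget:
            return name
    return None


print(pick_quantization(7, 8))    # 7B model, 8 GB of memory
print(pick_quantization(14, 8))   # 14B model, 8 GB of memory
```

A `None` result is the useful signal here: it means no level fits under these assumptions, so a smaller model is the better move than a more aggressive quantization.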
Where do you get these files? Hugging Face’s GGUF model filter is the cleanest starting point for GGUF models. The model card and file list usually tell you which quantizations are available. For safetensors, the regular model repository files will usually include them.
One underrated feature in Hugging Face’s GGUF support is visibility into metadata and tensor info. That matters because local model files are not all equally well packaged. Good metadata reduces guesswork.
A practical gotcha: chat quality differences are not always from the quantization alone. They can come from using the wrong chat template, system prompt handling, context settings, or sampler defaults. People often blame “Q4 ruined the model” when the actual problem is the runtime loaded it with an incompatible template.
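A minimal sketch shows why the template matters: the same message list becomes a very different prompt string depending on the layout the runtime applies. Both renderers below are simplified for illustration (no BOS tokens, no system prompt handling in the second one); real templates ship with the model and should be taken from there.

```python
def render_chatml(messages):
    """ChatML-style layout, used by several model families."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    return "".join(parts) + "<|im_start|>assistant\n"


def render_inst(messages):
    """Simplified [INST]-style layout (single user turn, no system prompt)."""
    user = next(m["content"] for m in messages if m["role"] == "user")
    return f"[INST] {user} [/INST]"


messages = [{"role": "user", "content": "Summarize this log file."}]
print(repr(render_chatml(messages)))
print(repr(render_inst(messages)))
```

A model trained on one layout and prompted with the other sees tokens it was never taught to treat as turn boundaries, which can look exactly like quantization damage.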
Run your first local model
The shortest path to success is to pick a small instruct-tuned model that definitely fits your machine and run it through a tool with a low-friction default. Do not start with the biggest model you have heard of. Start with the one that will answer before you get annoyed.
For most readers, that means Ollama or LM Studio.
The Ollama happy path
Ollama’s appeal is that the install-to-first-output loop is tiny. After installation, the core workflow is basically:
ollama run gemma3
You can substitute any other supported model tag. Ollama will download the model if needed, start it, and drop you into an interactive session.
That one command is why Ollama is such a common recommendation. It hides a lot of fiddly setup. It also runs a local service, which becomes useful later when you want to connect apps to it.
A more realistic first session looks like this:
ollama run qwen2.5-coder:7b
Then prompt it with something concrete:
Write a Python script that scans a directory, finds .log files larger than 100MB, and prints the top 10 by size.
That is a better first test than “hello” because it tells you whether the model is following instructions, whether latency feels acceptable, and whether your chosen model is remotely appropriate for your use case.
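For a yardstick when judging the model's answer, here is one hand-written sketch of what a reasonable solution to that prompt could look like. It is not the only correct answer; it assumes "100MB" means 100 MiB and that unreadable files should simply be skipped.

```python
import os

THRESHOLD_BYTES = 100 * 1024 * 1024  # treating "100MB" as 100 MiB


def largest_logs(root, threshold=THRESHOLD_BYTES, limit=10):
    """Return up to `limit` (size, path) pairs for .log files above threshold."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".log"):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size > threshold:
                found.append((size, path))
    return sorted(found, reverse=True)[:limit]


# Example use: for size, path in largest_logs("/var/log"): print(size, path)
```

If the model's version handles recursion, the size threshold, and the top-10 sort about as cleanly as this, it is probably adequate for everyday scripting help.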
If you want to list what you have downloaded:
ollama list
If you want to remove one because model hoarding is real:
ollama rm model-name
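The same inventory is available programmatically, which helps once scripts need to know what is installed. The sketch below assumes an Ollama server on its default port and the `/api/tags` endpoint from Ollama's API documentation; the function names are hypothetical, and the network call is kept separate from the formatting so the latter can be checked without a server.

```python
import json
import urllib.request


def summarize_models(payload: dict) -> list[tuple[str, float]]:
    """Turn a tags-style payload into (name, size in GB) pairs, largest first."""
    rows = [(m["name"], m.get("size", 0) / 1e9) for m in payload.get("models", [])]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_local_models(base_url: str = "http://localhost:11434") -> dict:
    """Assumes an Ollama server is running and serving GET /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)
```

With a server running, `summarize_models(fetch_local_models())` gives a quick view of which downloads are eating the most disk.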
The LM Studio happy path
LM Studio is the GUI version of the same first-run idea. You install it, browse models, download a GGUF variant, load it, and chat. For someone who does not enjoy the terminal, this is genuinely useful.
The nice part is discoverability. LM Studio’s interface makes model selection less mysterious because you can see variants, download options, and serving controls in one place. The tradeoff is that you learn less about the underlying stack, because the app is intentionally doing that work for you.
If you are choosing between the two:
- pick Ollama if you like the terminal and expect to script things;
- pick LM Studio if you want a visual catalog and a lower-stress first hour.
There are open-source GUI alternatives too, notably Jan and GPT4All, if you want the GUI path without using LM Studio.
The actual “run locally” moment is anticlimactic in the best way. You install a runner, download a compatible model, and ask it to do work. The main failure modes are also boring:
- model too large for memory;
- wrong format for the chosen tool;
- poor defaults for context or template;
- unrealistic expectations from a too-small model.
That last one is worth stating plainly. A local 7B coding model can be extremely useful. It is not secretly a frontier reasoning model. The fastest way to enjoy local inference is to ask it to do the work it can do well.
Serve a local model over an API
The most important local LLM pattern after “run it on my laptop” is “make it look like an API so my existing tools can use it.” This is where the stack becomes much more powerful.
Many local runners expose an OpenAI-compatible API. Ollama documents a local API on its site, and LM Studio documents a local server mode as part of its app. That means they run a server on localhost, often on a port like 11434 (Ollama’s default) or 1234 (LM Studio’s default), and accept request shapes similar to OpenAI’s chat or completions endpoints. The exact endpoint paths and compatibility details vary, but the practical outcome is simple: a lot of software that expects an OpenAI-style API can be pointed at your machine instead.
This is why local LLMs are not just toys for chat windows. They can plug into editors, agent frameworks, internal tools, document pipelines, and eval harnesses without each tool needing a bespoke integration.
The shape usually looks like this:
# local server running on your machine
http://localhost:11434/v1
Then in your code or tool config, you set:
- base_url to your local server
- api_key to a dummy value if required by the client
- model to whatever local model name the server expects
A minimal Python example with an OpenAI-compatible client often looks roughly like:
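from openai import OpenAI

# Point the client at the local server instead of a hosted API.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",  # placeholder; the client requires a value
)

# The model name must match what the local server has loaded.
response = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[
        {"role": "user", "content": "Write a bash script to back up a Postgres database."}
    ],
)
print(response.choices[0].message.content)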
The exact model ID and port depend on your runner, but the pattern is stable.
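If you would rather see the raw request shape without a client library, the same call can be sketched with the standard library. This assumes the server exposes an OpenAI-style `/chat/completions` path under the base URL and returns the usual `choices` structure; the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="not-needed"):
    """Build (but do not send) a POST to <base_url>/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the assistant text from the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Keeping the request builder separate from the sender makes it easy to log or inspect exactly what a tool is sending when a local server rejects it.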
This is also where the differences between local tools matter more.
Ollama is popular because it offers this bridge by default and keeps deployment friction low.
LM Studio also provides a local server, which is one reason it is useful beyond the chat UI.
Unsloth Studio, according to Unsloth’s product documentation, can also expose an OpenAI-compatible API as part of a broader run-train-export workflow.
vLLM belongs here too, but in a different class: its documentation focuses on production-style serving, including efficient batching and high-throughput inference.
If you are building something that serves more than one user or has to survive real traffic, the gotcha is straightforward: a local desktop runner and a production serving engine are not interchangeable. A laptop app exposing an API is perfect for development, prototypes, and single-user workflows. It is not automatically a robust multi-user backend.
A very practical use case is coding assistants. Many editor plugins and toolchains can be pointed to a local OpenAI-compatible endpoint instead of a cloud provider. That is the path from “I can chat with a local model” to “my editor, scripts, and agent tools all use a local model.” Our local LLM coding piece is useful here because it shows how much performance can come from the system around the model, not just the model weights.
A second gotcha: “OpenAI-compatible” rarely means “drop-in identical forever.” The things that tend to break first are predictable:
- Tool-calling schemas. One server may expect slightly different JSON shapes for tools or function calls than another.
- Embeddings endpoints. Some local servers support them, some do not, and some implement only parts of the request surface.
- Structured outputs. JSON mode, response formatting, or schema-constrained output can behave differently across runtimes.
- Streaming behavior. Event shapes and chunk timing can diverge, which matters if your client assumes a specific stream format.
- Model naming conventions. A client expecting gpt-4o-mini-style names may need a local name like qwen2.5-coder:7b instead.
For simple chat completions, compatibility is often good enough. For advanced features, test the exact path you care about.
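One lightweight way to do that testing is a small probe table run against whichever server you plan to use. In the sketch below, `call` is a hypothetical function you supply that posts a payload to your server and returns the parsed JSON reply; the probes and check functions are illustrative, and `response_format` is the OpenAI-style JSON mode parameter that a given local server may or may not honor.

```python
import json


def is_chat_reply(reply):
    return isinstance(reply["choices"][0]["message"]["content"], str)


def is_json_reply(reply):
    json.loads(reply["choices"][0]["message"]["content"])  # raises if not JSON
    return True


PROBES = {
    "basic_chat": ({"messages": [{"role": "user", "content": "Say hi."}]},
                   is_chat_reply),
    "json_mode": ({"messages": [{"role": "user", "content": "Reply with a JSON object."}],
                   "response_format": {"type": "json_object"}},
                  is_json_reply),
}


def run_probes(call, probes):
    """Run each probe through `call` and record whether its check passed."""
    results = {}
    for name, (payload, check) in probes.items():
        try:
            results[name] = bool(check(call(payload)))
        except Exception:  # an error or malformed reply counts as unsupported
            results[name] = False
    return results
```

Extending the table with a tool-calling or embeddings probe follows the same pattern: one payload, one check, one boolean per feature you depend on.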
The single-user versus production split is worth making explicit:
| Scenario | Good default | Why |
|---|---|---|
| One developer, one laptop, local tools | Ollama or LM Studio | Fastest path to a local server and easy model switching |
| Small internal prototype | Ollama, LM Studio, or a lightweight self-hosted runner | Good enough if traffic is low and failure is tolerable |
| Multi-user app or team service | vLLM or another production-grade serving stack | Throughput, batching, scheduler behavior, and ops matter more than convenience |
This is one of those cases where local simplicity is real, but only in the first tier. Once you move from localhost to “other people depend on this,” you are doing infrastructure again.
Fine-tuning versus inference
Running a model locally and changing a model locally are different jobs.
Inference means using existing weights to generate outputs. This is what Ollama, LM Studio, and most first local LLM setups are doing. You are loading a model, maybe choosing a quantized variant, and prompting it.
Fine-tuning means adapting the model using additional training data so it behaves differently. This is a deeper rabbit hole, both technically and operationally. You need datasets, evals, training settings, export paths, and enough hardware to make the process tolerable.
The most important fine-tuning concepts for local work are LoRA and QLoRA.
LoRA, short for Low-Rank Adaptation, adds a small number of trainable adapter weights on top of a base model instead of retraining all parameters. This is why local fine-tuning became feasible for ordinary hardware in the first place: you are updating a compact adapter, not the entire model.
QLoRA pushes this further by using quantized base weights during fine-tuning to reduce memory demands even more. The high-level idea is simple even if the internals are not: keep the big base model memory footprint down, train small adapters, and recover a practical path to customization on consumer GPUs.
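The size difference is easy to see with a little arithmetic. For a frozen weight matrix of shape d_out by d_in, a LoRA adapter of rank r adds two small matrices with r * (d_out + d_in) trainable parameters in total. The dimensions and rank below are hypothetical, picked only to show the scale.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA adds B (d_out x rank) and A (rank x d_in) beside a frozen W."""
    return rank * (d_out + d_in)


d_out = d_in = 4096   # hypothetical projection size
rank = 16             # hypothetical adapter rank
full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, rank)
print(f"full matrix:  {full:,} params")
print(f"LoRA adapter: {lora:,} params ({lora / full:.2%} of the matrix)")
```

For this one matrix the adapter is under 1% of the original parameter count, which is why gradients and optimizer state for LoRA fit on hardware that could never hold a full fine-tune.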
This is where safetensors starts showing up more often than GGUF. Training pipelines commonly work with framework-native weights and adapters first, then export to other formats later if needed. Unsloth’s official site says its tools support local training, LoRA and QLoRA optimization, and export to both safetensors and GGUF for use with llama.cpp, vLLM, Ollama, and others. That export path is important because it connects the training world back to the inference world.
A minimal fine-tuning workflow usually looks like this:
- Choose a base model in safetensors.
- Prepare a dataset that matches the behavior you actually want.
- Train a LoRA or QLoRA adapter with settings that fit your hardware.
- Evaluate on held-out tasks that were not in the training set.
- Export or merge the result depending on how you plan to serve it.
- Optionally convert to GGUF if the final destination is laptop-style local inference.
That fourth step is where many hobby fine-tunes go wrong. If you do not hold out tasks for evaluation, it is very easy to produce a model that merely imitates the training data format while getting worse at the real job.
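A minimal sketch of the hold-out discipline: split the data once, deterministically, before any training run touches it. The function name, the 20% fraction, and the seed are assumptions for illustration. Splitting by example is also the weakest form of hold-out; if your data contains near-duplicates, holding out whole task types is safer.

```python
import random


def split_holdout(examples, holdout_fraction=0.2, seed=0):
    """Shuffle deterministically, then reserve a slice the trainer never sees."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


train, heldout = split_holdout([f"task-{i}" for i in range(10)])
assert not set(train) & set(heldout)  # no overlap between the two sets
print(len(train), "training examples,", len(heldout), "held out")
```

Fixing the seed means the held-out set stays the same across runs, so scores from different adapters remain comparable.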
A good way to think about local fine-tuning is that it is worth it when one of three things is true:
- you need style or behavior consistency that prompting alone does not reliably deliver;
- you need the model to specialize on a narrow domain or format;
- you need to compress a workflow into a smaller local model for privacy, cost, or latency reasons.
It is not usually worth it just because you want “better answers.” In many cases, a better base model, better prompts, retrieval, structured tools, or a runtime pipeline will give more value than a weekend spent producing a mediocre adapter.
This is one of those areas where systems thinking matters. The ATLAS example from our coding coverage is useful precisely because it showed a huge performance jump from orchestration around a frozen model. No fine-tuning at all. That does not make fine-tuning pointless; it just means you should be suspicious of using training to solve what is really a workflow problem.
Bad datasets usually hurt more than they help. A small, clean, task-shaped dataset can improve behavior. A noisy or poorly matched dataset can drag the model toward the wrong style, reduce general usefulness, and give you a model that feels oddly worse in day-to-day use.
If you want a GUI-first path into this world, our piece “Unsloth Studio packs local LLM training into one app” is worth reading. If you want weird hardware inspiration, the Optane local LLM build article is a good reminder that memory architecture choices can matter a lot once you move beyond the simplest setups.
The catch: local fine-tuning is where people most often confuse “possible” with “worth doing.” A LoRA run completing on your GPU is not the same as producing a model that outperforms a better prompt and a retrieval step.
How to choose the right setup
The simplest way to choose a local LLM setup is to decide in this order:
- Goal
- Hardware ceiling
- Inference or training
- Format
- Runtime, app, or serving layer
That order saves a lot of wasted time. It also keeps this from turning into “pick the coolest app.”
Here is the decision matrix version:
| Goal | Hardware ceiling | Inference or training | Start with format | Start with runtime |
|---|---|---|---|---|
| Private chat / document Q&A | CPU-only laptop, MacBook, small GPU | Inference | GGUF | LM Studio or Ollama |
| Local coding help | 8GB to 16GB GPU, or roomy Mac | Inference | GGUF | Ollama first, then API integration |
| Local API for tools or agents | One-user laptop vs multi-user server | Inference / serving | GGUF for desktop, safetensors/framework-native for server stacks | Ollama/LM Studio for localhost, vLLM for production |
| Fine-tuning a specialized model | GPU with enough memory for adapter training | Training | safetensors | Unsloth or another LoRA/QLoRA stack |
| Throughput-oriented serving | Workstation/server | Inference / serving | usually safetensors or framework-native weights | vLLM |
The logic inside the table matters more than the names.
Path 1: private document Q&A on a MacBook Air
Start with the goal: you want private chat over documents.
Then hardware: say a MacBook Air with 16GB unified memory.
Then mode: inference, not training.
Then format: GGUF, because you want a laptop-friendly inference artifact.
Then runtime: LM Studio if you want a GUI, or Ollama if you want the shortest CLI path.
A sensible first setup is a 7B or 8B instruct model in Q4 or Q5. That is the kind of system that loads quickly, stays within memory, and is good enough to prove whether the workflow is useful. If you immediately jump to a larger model because benchmark charts made it look smarter, you may just buy yourself slower prompts and worse battery life.
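A quick back-of-the-envelope check makes the "stays within memory" claim concrete. The sketch below assumes roughly 4.5 and 5.5 effective bits per weight for Q4 and Q5 variants (the real figure depends on the quantization variant) and an illustrative 8B-class shape of 32 layers, 8 KV heads, and head dimension 128 for the KV cache term. These numbers are assumptions for illustration, not measurements of any particular model.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size: one key and one value vector per layer, head, and token."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 2**30


# Assumed effective bits per weight; real GGUF files vary by quantization variant.
ASSUMED_BITS = {"Q4": 4.5, "Q5": 5.5}

cache = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192)
for quant, bits in ASSUMED_BITS.items():
    weights = weight_gib(8, bits)
    print(f"8B {quant}: ~{weights:.1f} GiB weights + ~{cache:.1f} GiB KV cache at 8k context")
```

Under these assumptions an 8B model lands at roughly 4 to 5 GiB of weights plus about 1 GiB of cache at an 8k context, which leaves room on a 16GB machine for the operating system and other apps. Runtime overhead is not modeled here, so treat the result as a floor, not a guarantee.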
Path 2: private coding on a 16GB GPU desktop
Start with the goal: local coding help in an editor.
Then hardware: 16GB VRAM.
Then mode: inference.
Then format: GGUF if you want fast local deployment through desktop runners.
Then runtime: Ollama is a good first move because it gives you both CLI use and a local API endpoint.
This is where 14B Q4/Q5 coding models become realistic daily drivers. The stack choice is not “best coding app.” It is:
- a coding-tuned base model that fits;
- a quantization that leaves some headroom;
- a local runtime with an API;
- an editor or toolchain pointed at localhost.
That path is why local tooling feels coherent once you stop thinking app-first.
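As a sketch of what "pointed at localhost" looks like from a script, the example below talks to Ollama's local HTTP API. It assumes Ollama is running on its default port (11434) and that a model has already been pulled; the model tag `qwen2.5-coder:14b` is only a placeholder for whichever coding model you actually installed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint
PLACEHOLDER_MODEL = "qwen2.5-coder:14b"  # illustrative tag; use the model you pulled


def build_request(prompt: str, model: str = PLACEHOLDER_MODEL) -> urllib.request.Request:
    """Build a non-streaming generate request for the local server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = PLACEHOLDER_MODEL, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server up, `generate("Write a Python function that reverses a string.")` returns the completion as a string. Editor plugins do essentially the same thing, just with more context stuffed into the prompt.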
Path 3: local API for an internal prototype
Start with the goal: existing tools should call a local model.
Then hardware: maybe one machine used by a small team.
Then mode: inference and serving.
Then format: GGUF is fine if this is still really a desktop-scale deployment; safetensors/framework-native weights become more relevant if you are moving toward a real serving stack.
Then runtime: Ollama or LM Studio for one-user or low-traffic use, vLLM if concurrency, batching, and throughput matter.
The important distinction is not “local versus cloud.” It is localhost convenience versus multi-user service design.
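One practical consequence is that the client side can often stay the same while the serving layer changes, because both Ollama and vLLM expose an OpenAI-compatible chat endpoint. The sketch below assumes default ports (11434 for Ollama, 8000 for vLLM) and no API key. The `BACKENDS` mapping and the function names are hypothetical, and the model identifier you pass will usually differ between the two servers.

```python
import json
import urllib.request

# Assumed default base URLs; adjust to your deployment.
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}


def chat_request(backend: str, model: str, user_message: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the chosen backend."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        f"{BACKENDS[backend]}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(backend: str, model: str, user_message: str, timeout: float = 120.0) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(chat_request(backend, model, user_message), timeout=timeout) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

Keeping the backend behind one small function like this means a prototype can start on a single localhost runner and move to a batching server later without rewriting every caller.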
Path 4: fine-tuning for a narrow task
Start with the goal: change model behavior, not just run it.
Then hardware: enough GPU memory for LoRA or QLoRA training.
Then mode: training.
Then format: safetensors.
Then runtime: a training stack such as Unsloth, followed by export to the format you need for inference.
This is the clearest example of why format comes before app choice. If the job is training, starting from GGUF is usually the wrong mental model. GGUF is often where you end up for deployment, not where you begin.
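To put a number on why adapter training fits on hardware that full fine-tuning does not, here is the basic LoRA arithmetic: a rank-r adapter on a d_out × d_in weight matrix adds r · (d_in + d_out) trainable parameters. The shapes below (hidden size 4096, 32 layers, four square projections adapted per layer, rank 16) are illustrative assumptions, not a description of any specific model or of Unsloth's defaults.

```python
def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def adapter_total(hidden: int, n_layers: int, matrices_per_layer: int, rank: int) -> int:
    """Total adapter parameters, assuming square hidden x hidden projections."""
    return n_layers * matrices_per_layer * lora_params(hidden, hidden, rank)


# Illustrative shapes only; real models have differently sized projections.
total = adapter_total(hidden=4096, n_layers=32, matrices_per_layer=4, rank=16)
print(f"{total:,} trainable adapter parameters")
print(f"{total / 8e9:.3%} of an 8B-parameter base model")
```

The trainable fraction is tiny, but the frozen base weights and activations still have to fit in memory during training, which is the part QLoRA addresses by keeping the base model in 4-bit form.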
If your hardware is the main constraint
Pick by ceiling, not ambition.
- Weak laptop / CPU only: GGUF, small models, llama.cpp-based tools.
- 8GB VRAM GPU: 7B/8B Q4/Q5, Ollama or LM Studio.
- 16GB VRAM GPU: 8B/14B local daily-driver tier.
- 24GB+ VRAM: more room for 14B+ models, longer contexts, or multi-user serving.
- Apple Silicon: excellent general local path if you choose models conservatively.
- Phone / edge device: novelty and experimentation first, infrastructure second. The local AI node on Xiaomi 12 Pro article shows why this is interesting and why thermal reality shows up quickly.
If you are still unsure, here is the boring but strong default:
Start with Ollama, a 7B or 8B GGUF model that fits easily, and the assumption that your first success matters more than your theoretical maximum.
That setup teaches the stack without demanding too much from the hardware or the operator.
When hosted APIs still win
Hosted APIs still win when you need the best available capability, minimal setup, strong multimodal support, very long contexts, vendor-managed uptime, or features that are awkward to reproduce locally.
The first case is simply model quality. If the task needs the strongest reasoning, the best tool use, or the most robust handling of messy real-world edge cases, frontier hosted models are often ahead. Local open models have improved enormously, but “good enough locally” and “state of the art” are not the same thing.
The second case is time. Installing runners, choosing quantizations, debugging templates, and managing storage are all very solvable problems. They are also still problems. If your actual job is not “learn local inference,” the friction can dominate the savings.
The third case is elasticity and concurrency. Serving one user locally is easy. Serving a team, or a product with bursty demand, is much easier with a hosted provider unless you are prepared to operate real inference infrastructure. vLLM narrows that gap for self-hosting, but it does not erase the ops burden.
Hosted APIs also win on feature surface. Tool calling, audio, image generation, image understanding, embeddings, eval integrations, and enterprise controls are often much more polished in vendor platforms. Some local stacks can approximate parts of this, and Unsloth claims support for tool-calling, web search, and multimodal inputs in its local environment, but the integration burden remains yours.
There is also the question of quality drift in the other direction. People sometimes imagine local models as static and therefore always reliable. They are static only if you keep the rest of the system static. Change the quantization, prompt template, runtime, or sampling defaults and the model can feel different quickly. Hosted systems have platform drift; local systems have configuration drift.
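One lightweight guard against configuration drift is to record the settings that shape behavior and fingerprint them, so a change is at least visible. This is a minimal sketch: the `LocalSetup` fields and every value in it are hypothetical placeholders, and a real setup may need more fields (context length, system prompt, seed).

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LocalSetup:
    """The knobs that can change how a 'static' local model feels."""
    model: str
    quantization: str
    runtime: str
    prompt_template: str
    temperature: float
    top_p: float


def fingerprint(setup: LocalSetup) -> str:
    """Stable short hash of the setup; any changed field changes the hash."""
    canonical = json.dumps(asdict(setup), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Placeholder values for illustration only.
baseline = LocalSetup("example-8b-instruct", "Q4_K_M", "runner-1.0", "chatml", 0.7, 0.9)
changed = LocalSetup("example-8b-instruct", "Q5_K_M", "runner-1.0", "chatml", 0.7, 0.9)

print("baseline:", fingerprint(baseline))
print("changed: ", fingerprint(changed))
```

Logging the fingerprint next to saved outputs or eval results makes it easy to tell later whether two runs were actually produced by the same configuration.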
A final honest limit is evaluation. It is easy to fool yourself with local setups because the model is private, fast enough, and under your control. That can make mediocre systems feel better than they are. Benchmark numbers can mislead, and cherry-picked demos are cheap. The only good antidote is to test on your actual tasks.
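Testing on your actual tasks does not require heavy tooling. A minimal sketch of the idea: pair each real prompt with a check on the output and compute a pass rate. The `fake_model` function and the two cases are stand-ins for illustration; in practice `generate` would call your local runtime and the cases would come from your own workload.

```python
from typing import Callable, List, Tuple

# One case = (prompt, check applied to the model's output).
Case = Tuple[str, Callable[[str], bool]]


def pass_rate(generate: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases whose output passes its check."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your local runtime."""
    return "4" if "2 + 2" in prompt else "I am not sure."


cases: List[Case] = [
    ("What is 2 + 2? Answer with a number.", lambda out: out.strip() == "4"),
    ("Name the capital of France.", lambda out: "Paris" in out),
]

print(f"pass rate: {pass_rate(fake_model, cases):.0%}")
```

Running the same cases against two setups (for example, before and after a quantization change, or local versus hosted) gives a comparison that is harder to fool yourself with than a handful of hand-picked chats.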
That is why the best way to think about local LLMs is not as a replacement religion for cloud AI. It is a deployment option with a specific shape:
- great for privacy-sensitive single-user or small-team workflows,
- strong for fixed tasks and cost-aware iteration,
- excellent for experimentation and developer control,
- weaker for peak capability, effortless scaling, and managed complexity.
If you choose the stack in that order, you usually end up with something useful instead of something impressive-looking and annoying.
Key Takeaways
- A local LLM is a stack of choices (model, format, runtime, app, and serving layer), not a single app to rank.
- VRAM is usually the binding constraint for useful local inference, but context length, KV cache growth, and runtime overhead decide whether a model is merely loadable or actually practical.
- GGUF is the practical default for laptop and desktop inference, while safetensors is more common in training and framework-native serving workflows.
- Ollama and LM Studio are the easiest starting points, llama.cpp is the control-first engine underneath much of the space, and vLLM is the production-serving path documented for high-throughput inference.
- LoRA and QLoRA make local fine-tuning feasible on consumer hardware, but many performance gains come from better prompts, retrieval, or runtime orchestration instead of training.
- Hosted APIs still win when you need frontier capability, minimal setup, broad feature support, or reliable multi-user scale.
Further Reading
- Ollama, Canonical site for the CLI-first local model runner and local server.
- LM Studio, Canonical site for the beginner-friendly GUI local model app and local server.
- llama.cpp, Reference implementation for portable local inference across a wide range of hardware.
- GGUF · Hugging Face Docs, Documentation for the GGUF format, metadata, and supported tooling.
- vLLM Documentation, Reference docs for production-grade serving of open models.
- Unsloth, Official site for local training, LoRA/QLoRA optimization, and export paths.
- Apple Metal, Apple’s GPU acceleration documentation for the Mac hardware path.
- Hugging Face GGUF model filter, Useful hub entry point for finding GGUF checkpoints and quantized variants.
