Microsoft this week presented Foundry Local as a way to ship AI models inside apps that run entirely on a user’s device, packaging a workflow that local-LLM users already know from tools such as Ollama and LM Studio into a smaller, more app-facing runtime.
In its microsoft/Foundry-Local repository, the company says the runtime is about 20 MB and handles model download, caching, inference, and hardware selection through ONNX Runtime, with native SDKs for C#, JavaScript, Python, and Rust.
That is the part that is actually new. Foundry Local is not “Microsoft invents local AI”; it is Microsoft trying to make the familiar local stack look turnkey for developers who want to embed it in a product rather than run a sidecar tool. The repo’s headline features are a curated model catalog, automatic selection of NPU, GPU, or CPU, and OpenAI-compatible request and response formats.
“Ship on-device AI inside your app,” the README says. It describes Foundry Local as an end-to-end local AI solution in which user data stays on device, the app works offline, and developers do not need API keys, backend infrastructure, or an Azure subscription.
Hugging Face’s current local-app documentation shows why Microsoft has chosen this lane. The Hub now treats local inference as a normal path: users can filter for supported local apps on a model page, pick a target app from a “Use this model” menu, and copy the command to run it. The supported tools Hugging Face names include llama.cpp, Ollama, Jan, and LM Studio. That is a quiet but important shift: local use is no longer framed as a weird advanced setup.
LM Studio and Ollama still look more familiar for people already running models on their own machines. Hugging Face describes Ollama as easy to install with direct Hub integration, while LM Studio is presented as a desktop app for downloading, running, and experimenting with local LLMs through a GUI. In practice, those tools are the path many users already take, as we covered in our look at the local LLM stack: pick a model, pull it down, run it locally, then point another app at a local API if needed.
Foundry Local changes the packaging more than the core idea. Instead of asking developers to tell users to install Ollama or LM Studio first, Microsoft is offering an embeddable runtime plus a curated catalog of optimized models, including chat models such as GPT OSS, Qwen, DeepSeek, Mistral, and Phi, and transcription models such as Whisper. The catch is that these are repo claims, not independent benchmarks, and the README does not provide a head-to-head table against Ollama or LM Studio for speed, memory use, or model coverage.
The broader ecosystem also makes Microsoft’s timing easy to read. Hugging Face’s documentation normalizes desktop local apps, but recent GitHub projects show the same pattern spreading well beyond laptops. The jegly/OfflineLLM Android app says it runs GGUF models fully on-device through llama.cpp, has no internet permission in the manifest, and lets users import their own model files. NVIDIA’s jetson-copilot reference app runs an Ollama server and Streamlit UI in Docker on Jetson hardware. OpenBMB’s MiniCPM-V-Apps repo ships fully offline multimodal demos for iOS, Android, and HarmonyOS through llama.cpp.
If you are wondering what all of these have in common, it is basically the same pipeline wearing different clothes: a local runtime, compressed model weights, device-specific acceleration, and some way to wrap that in an app. That is also the direction behind projects such as Unsloth Studio and the current push toward local LLM coding. The interesting question is not whether local AI exists. It plainly does. The question is who makes it easiest to ship.
There are still practical caveats. OfflineLLM supports arm64-v8a only. Jetson Copilot needs internet on first launch to pull containers and default models. MiniCPM-V-Apps requires platform-specific setup or prebuilt packages, and its repo spells out minimum hardware for different models. Foundry Local promises automatic hardware acceleration and model selection, but those claims will matter most once developers have tested it across the untidy mix of NPUs, GPUs, and drivers real users actually have.
Microsoft is positioning Foundry Local as that abstraction layer now. The repo is already public on GitHub, where the company is documenting the runtime, SDKs, and model catalog.
Key Takeaways
- Microsoft says Foundry Local is an on-device AI runtime for apps with a footprint of about 20 MB.
- The main difference from Ollama and LM Studio is packaging: Foundry Local is designed to be embedded inside an app with automatic hardware selection and curated models.
- Hugging Face now treats local model use as a standard workflow and explicitly supports Ollama, LM Studio, Jan, and
llama.cpp. - Recent repos show the same local-first pattern extending to Android, iOS, HarmonyOS, and Jetson devices.
- The biggest open question is not capability but real-world performance across varied consumer hardware.
Further Reading
- GitHub – microsoft/Foundry-Local, Microsoft’s Foundry Local README and feature list for its on-device AI runtime, SDKs, model catalog, and hardware acceleration.
- Use AI Models Locally · Hugging Face, Hugging Face Hub docs explaining local model workflows and naming Ollama, LM Studio, llama.cpp, and Jan as supported local apps.
- GitHub – jegly/OfflineLLM, Android offline AI chat app repo showing on-device llama.cpp inference, importable GGUF models, and no network permission.
- GitHub – NVIDIA-AI-IOT/jetson-copilot, Reference app for a local AI assistant on Jetson that uses Ollama and a Streamlit app in Docker.
- GitHub – OpenBMB/MiniCPM-V-Apps, On-device multimodal demos for iOS, Android, and HarmonyOS NEXT using llama.cpp.
