November 28, 2024

AI-Powered Local Development

Exploring Apple's MLX framework for running large language models locally on Apple Silicon

Why Run LLMs Locally?

I've been using cloud-based AI tools like Claude Code for a while now (see my previous post on that), but I kept wondering: how capable are the models you can run right on your own machine? Apple's M-series chips have unified memory architectures that make them surprisingly well-suited for machine learning workloads, and Apple released an open-source framework called MLX specifically designed to take advantage of this hardware. I wanted to find out what these local models could actually do, how fast they could run, and where the limits are. So I built a toolkit to test it all out.

There's also a privacy angle worth mentioning. Running models locally means your prompts never leave your machine. For certain use cases—working with proprietary code, personal data, or just preferring to keep things offline—that matters. It's not the main reason I started this project, but it's a nice benefit that comes along for free.

The Setup: MLX on Apple Silicon

Apple's MLX framework is purpose-built for Apple Silicon. It takes advantage of the unified memory architecture on M-series chips, meaning the GPU and CPU share the same memory pool—no copying data back and forth between them like you'd have to on a traditional setup with a discrete GPU. For running language models, this is a big deal. Models that would normally require dedicated GPU VRAM can instead just use your regular system memory.

I built my toolkit on top of mlx-lm, a Python library that wraps MLX for language model tasks. The toolkit includes 11 example scripts covering everything from basic text generation and streaming output to a full REST API server, a Gradio web interface, fine-tuning with LoRA, prompt caching, and model quantization. Think of it as a comprehensive test bench for seeing what local LLMs can do on a Mac. You can run individual scripts directly or use the interactive menu system to try each feature.
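The basic generation path in mlx-lm is only a few lines. A minimal sketch of what the simplest script in the toolkit does (the model name here is one of the mlx-community 4-bit builds; running this requires an Apple Silicon Mac with `pip install mlx-lm` and a model download on first use):

```python
# Minimal text generation with mlx-lm (Apple Silicon only).
from mlx_lm import load, generate

# Any 4-bit model from the mlx-community collection on Hugging Face works here.
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Apply the model's chat template so instruct-tuned models respond properly.
messages = [{"role": "user", "content": "Explain unified memory in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(text)
```

The streaming and chat scripts build on the same `load`/`generate` pair, swapping in `stream_generate` to yield tokens as they're produced.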

Model Quantization: Fitting Big Models on Small Machines

The most interesting technical piece of this project was working with quantized models. Full-precision language models are huge—a 7 billion parameter model can easily eat 14GB of memory in its native form. Most people don't have that kind of headroom to spare. Quantization compresses the model weights from 16-bit or 32-bit floating point numbers down to 4-bit integers. You lose some precision, but in practice the quality drop is often barely noticeable for conversational and coding tasks.
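The memory arithmetic behind those numbers is simple enough to check on the back of an envelope. A quick sketch (weights only; real models carry extra overhead for activations, the KV cache, and quantization scales):

```python
def model_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory for a model, ignoring runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # a "7B" model
print(f"fp16:  {model_memory_gb(n, 16):.1f} GB")  # 14.0 GB
print(f"4-bit: {model_memory_gb(n, 4):.1f} GB")   # 3.5 GB before overhead
```

The 4-bit figure lands at 3.5GB for the raw weights; the quantization scales and zero-points, plus runtime buffers, are what push real-world usage to the ~4-5GB observed in practice.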

The models I tested are all 4-bit quantized versions from the mlx-community collection on Hugging Face. A 7B parameter model that would normally need ~14GB of memory fits comfortably in ~4-5GB after 4-bit quantization. That's the difference between "won't run on my machine" and "runs with room to spare." The toolkit includes a model conversion script that lets you take any Hugging Face model, convert it to MLX format, and quantize it—so you can experiment with whatever models interest you.
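The conversion step wraps mlx-lm's own convert utility. Roughly, a conversion looks like this (the model name is illustrative, and flags can vary between mlx-lm versions, so check `--help` against your install):

```shell
# Convert a Hugging Face model to MLX format with 4-bit quantization.
python -m mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    --mlx-path ./mistral-7b-4bit \
    -q --q-bits 4
```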

The Benchmarks: Model vs. Model

Here's the part I was most curious about. I tested several popular open-source models on my M2 Pro with 16GB of RAM to see how they compare in speed, memory usage, and overall usability. All models are 4-bit quantized:

Model          Size   Load Time   Tokens/sec   Memory
Llama 3.2 3B   ~2GB   3-5s        40-60        ~3GB
Mistral 7B     ~4GB   5-8s        25-35        ~5GB
Qwen 2.5 7B    ~4GB   6-10s       20-30        ~5GB

The Llama 3.2 3B model is the speed king. At 40-60 tokens per second, responses feel nearly instant—you're getting output faster than you can read it. It loads in a few seconds and only uses about 3GB of memory, leaving plenty of room for everything else. The tradeoff is that it's a smaller model, so its responses aren't as nuanced as the 7B options for complex reasoning tasks. For quick Q&A, summarization, and straightforward coding questions, it's excellent.

Mistral 7B hits a nice sweet spot. It's noticeably more capable than the 3B model for longer, more thoughtful responses, and 25-35 tokens per second is still very usable—you won't be waiting around. It's my go-to for general-purpose local inference. Qwen 2.5 7B performs similarly in terms of speed and memory, but I found it slightly better for multilingual tasks and structured outputs. Both 7B models are comfortable to run on a 16GB machine, though you probably don't want to run them alongside memory-hungry apps.

What I Built With It

The toolkit isn't just benchmarks—it's a collection of practical tools I actually use. The chat interface runs in the terminal with conversation history, so I can have multi-turn conversations with a local model without any internet connection. The FastAPI server exposes a REST API that mimics the structure of cloud AI APIs, which means you can point existing scripts or tools at localhost:8000 instead of a cloud endpoint. The Gradio web interface gives you a browser-based chat with parameter controls if you want to experiment with temperature, top-p, and other generation settings without touching code.
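Swapping an existing script over to the local server is mostly a matter of changing the base URL. A hedged sketch of the client side, using only the standard library (the endpoint path and payload shape follow the common OpenAI-style convention; the toolkit's actual routes may differ):

```python
import json
import urllib.request

def chat_request(base_url: str, prompt: str, model: str = "local") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at a local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("http://localhost:8000", "Summarize this repo in one line.")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
# urllib.request.urlopen(req) would send it once the server is running.
```

Because the request shape matches what cloud APIs expect, any tooling that lets you override the base URL can talk to the local model unmodified.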

I also built a fine-tuning pipeline using LoRA (Low-Rank Adaptation), which lets you customize a model's behavior with a small dataset without retraining the whole thing. The training datasets I put together include customer service conversations, instruction-following examples, and some more creative ones. LoRA adapters are tiny—just a few megabytes—so you can have multiple specialized versions of a model without duplicating the base weights.
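The "just a few megabytes" claim falls straight out of the LoRA math: instead of updating a full weight matrix, you train two low-rank factors B (d×r) and A (r×d) per adapted matrix. A rough size estimate (the layer count and hidden dimension here are illustrative, roughly Llama-3.2-3B-shaped):

```python
def lora_adapter_mb(d_model: int, rank: int, n_matrices: int, bytes_per_param: int = 2) -> float:
    """Size of LoRA adapters: each adapted matrix adds two d*rank factors."""
    params = n_matrices * 2 * d_model * rank
    return params * bytes_per_param / 1e6

# e.g. rank-8 adapters on two projections in each of 28 layers
print(f"{lora_adapter_mb(d_model=3072, rank=8, n_matrices=28 * 2):.1f} MB")  # 5.5 MB
```

A few megabytes per adapter against several gigabytes of base weights is why keeping multiple specialized variants around is essentially free.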

Prompt caching was another useful addition. If you're repeatedly querying a model with the same system prompt or context prefix, caching the key-value pairs from that prefix means the model doesn't have to reprocess it every time. For workflows where you're asking many questions about the same document or codebase, the speedup is significant.

Where Local Models Fall Short

I want to be honest about the limitations. These local models are not replacements for cloud services like Claude or GPT-4 for complex tasks. A 7B parameter model running locally is fundamentally less capable than a model with hundreds of billions of parameters running on a datacenter's worth of hardware. Long-form reasoning, nuanced code generation, and tasks that require extensive world knowledge are areas where the gap is most obvious. I still reach for Claude Code when I need the heavy lifting done.

Memory is the other constraint. On a 16GB machine, a 7B model is about the practical ceiling if you want to keep using your computer for other things. The M-series chips with 32GB or more open up bigger models, but at 16GB you're making tradeoffs. That said, for the use cases where local models work well—quick lookups, offline coding assistance, data processing without sending data to the cloud—they work really well.

Key Takeaways

After spending time building and testing this toolkit, here's what I've learned:

  • Apple Silicon is legit for ML - The unified memory architecture makes running quantized LLMs practical on consumer hardware. MLX takes full advantage of it.
  • 4-bit quantization is the enabler - Without quantization, most useful models wouldn't fit. With it, a 7B model runs comfortably on a 16GB MacBook.
  • 3B models are underrated - If speed matters more than depth, Llama 3.2 3B at 40-60 tokens/sec is hard to beat for everyday tasks.
  • 7B is the sweet spot at 16GB - Mistral 7B gives you a good balance of capability and speed without starving the rest of your system.
  • Local and cloud aren't either/or - Use local models for quick, private, offline tasks. Use cloud models for the heavy stuff. They complement each other well.

The full toolkit is open source on GitHub if you want to try it yourself. All you need is a Mac with Apple Silicon and Python installed.
