Don’t Sleep on Local and Specialized Models

A lot of the AI conversation still revolves around the biggest cloud models.

Claude.
ChatGPT.
Gemini.
Codex.

And to be clear, I get it.

I still use Claude and Codex for most of my serious coding work. When I need deep reasoning, architecture help, debugging, planning, or a strong general-purpose collaborator, the frontier models are still where I spend most of my time.

But I think there’s another part of the AI stack that people are underestimating.

Local models.
Specialized models.
Narrow tools that are really good at one specific part of the workflow.

Not because they are going to replace the frontier labs.

They probably won’t.

At least not broadly.

OpenAI, Anthropic, Google, and others have enormous advantages. They have more compute, more research talent, more infrastructure, and more ability to push the edge of general intelligence.

But in actual product workflows, that may not matter as much as people think.

Because most products don’t need one model to do everything.

They need the right model for the right job.

And increasingly, I think the future of AI product development looks less like this:

Pick the smartest model and send everything to it.

And more like this:

Use expensive frontier models for the hard thinking, then use local and specialized models for execution.

That split is already starting to show up in my own workflows.

The model is not the workflow

When people talk about AI tools, they often talk about them like you have to pick one.

Which coding tool do you use?

Which image model do you use?

Which voice model do you use?

Which chatbot do you use?

I think that’s the wrong framing.

The model is not the workflow.

The workflow is the system you build around the models.

I still use Claude and Codex heavily for coding. They are great at reasoning through messy problems, shaping architecture, writing code, reviewing edge cases, and pushing through ambiguity.

But once I get into content production, media generation, and repeatable execution work, the stack starts to look very different.

Some models are better for transcription and timing.

Some are better for voice.

Some are better for image generation.

Some are better for character consistency.

Some are better for cleanup, background removal, or formatting.

Some are better for animation or short-form video.

And some are only worth using when the quality premium justifies the cost.

That’s the important part.

The goal is not to find one perfect AI tool.

The goal is to build a workflow where each model does the thing it is best at.

The exact stack matters less than the system

The specific tools are going to keep changing.

That is part of the point.

Six months from now, some of the models I use today will be better. Some will be cheaper. Some will be replaced entirely. Some new model nobody is paying attention to right now may become the obvious choice for one step of the process.

So I don’t want to build a workflow that depends too heavily on one vendor, one model, or one assumption about what will be best.

The architecture is the durable part:

  • Use frontier models where judgment matters.
  • Use local models where iteration speed matters.
  • Use specialized models where a narrow task has a better tool.
  • Use hosted APIs where the quality premium is worth the cost.
  • Build the workflow so pieces can be swapped out as better options emerge.
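As a rough illustration of that architecture, here is a minimal Python sketch of a task router. Everything in it is hypothetical: the tier names, the task names, and the stub "models" are placeholders for whatever clients you actually use, not any vendor's API. The only point is that the workflow talks to a routing table, so a piece can be replaced without touching the rest.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Tier(Enum):
    FRONTIER = "frontier"        # judgment-heavy work
    LOCAL = "local"              # fast, cheap iteration
    SPECIALIZED = "specialized"  # narrow task with a better tool
    HOSTED = "hosted"            # premium quality when it is worth the cost


# Assumption: a "model" is just a callable from prompt text to output text.
Model = Callable[[str], str]


@dataclass
class Router:
    routes: Dict[str, Tier]   # task name -> tier that should handle it
    models: Dict[Tier, Model]  # tier -> the model currently filling that slot

    def run(self, task: str, prompt: str) -> str:
        """Send the prompt to whichever model currently serves this task's tier."""
        tier = self.routes[task]
        return self.models[tier](prompt)

    def swap(self, tier: Tier, model: Model) -> None:
        """Replace the model behind a tier without changing any task routing."""
        self.models[tier] = model


def stub(name: str) -> Model:
    """Stand-in for a real model client; it only labels the prompt."""
    return lambda prompt: f"[{name}] {prompt}"


# Hypothetical task names, chosen only for illustration.
router = Router(
    routes={
        "plan": Tier.FRONTIER,
        "draft_voice": Tier.LOCAL,
        "remove_background": Tier.SPECIALIZED,
        "final_voice": Tier.HOSTED,
    },
    models={tier: stub(tier.value) for tier in Tier},
)

print(router.run("plan", "outline the video"))

# A better local model shows up: swap it in, and the tasks stay the same.
router.swap(Tier.LOCAL, stub("new-local"))
print(router.run("draft_voice", "read this line"))
```

In a real system each stub would wrap an actual client, and the routing table would probably live in configuration rather than code.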

That is the part I think more teams should be thinking about.

The exact stack matters less than the philosophy behind it.

I want a system that can evolve.

Cost changes behavior

For us, the target is production that is both repeatable and cost-efficient.

That matters a lot.

If we are still iterating on scripts, I don’t want every draft, test, and small change to create a new premium rendering cost.

Voice is a good example.

If every voice generation runs through a high-end hosted API, it changes the creative process. You start locking scripts earlier. You become more careful before rendering. Every iteration has a cost attached to it.

That might be fine at the final production stage.

But it is not ideal during exploration.

When local TTS is good enough for a big chunk of the workflow, the economics change.

Now we can experiment more freely.

We can tweak scripts.
Regenerate lines.
Try different pacing.
Adjust tone.
Test narration.
Throw away bad versions.

Without worrying that every small change creates another bill.
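To make the economics concrete, here is a small back-of-the-envelope sketch in Python. All of the numbers are made up for illustration: the per-character price is not any real vendor's rate, and treating local renders as free ignores hardware and electricity. The shape of the result is what matters, not the figures.

```python
def render_cost(renders: int, chars_per_render: int, price_per_1k_chars: float) -> float:
    """Total cost in USD of rendering the same script length `renders` times."""
    return renders * (chars_per_render / 1000) * price_per_1k_chars


# Hypothetical inputs, not real pricing.
HOSTED_PRICE = 0.25   # USD per 1,000 characters (assumed)
SCRIPT_CHARS = 5000   # characters in one narration script (assumed)
DRAFT_RENDERS = 40    # exploratory renders while the script is still changing
FINAL_RENDERS = 1     # the render that actually ships

# Option 1: every draft and the final go through the hosted API.
all_hosted = render_cost(DRAFT_RENDERS + FINAL_RENDERS, SCRIPT_CHARS, HOSTED_PRICE)

# Option 2: drafts run on local TTS (marginal cost assumed to be zero),
# and only the final render uses the hosted API.
split = render_cost(FINAL_RENDERS, SCRIPT_CHARS, HOSTED_PRICE)

print(f"all hosted: ${all_hosted:.2f}")
print(f"local drafts + hosted final: ${split:.2f}")
```

Under these assumed numbers, nearly all of the spend comes from drafts that get thrown away, which is exactly the part a local model can absorb.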

That is a big deal.

Not because local is always better.

It isn’t.

Sometimes the best model is still the expensive cloud model. Sometimes the quality gap matters. Sometimes the hosted API is more reliable, more scalable, or easier to operationalize.

But sometimes “best” means something else.

Sometimes best means cheap enough to use constantly.

Sometimes it means fast enough to stay in the creative flow.

Sometimes it means private enough to run close to your own data.

Sometimes it means customizable enough for your specific use case.

Sometimes it means good enough that the premium option stops being worth it for that step of the workflow.
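One way to express "good enough for this step" is to pick the cheapest option that clears a quality bar set per step. The sketch below assumes you have some quality score from your own evaluation; the option names, scores, and costs are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Option:
    name: str
    quality: float       # 0..1, from your own evaluation (assumed to exist)
    cost_per_run: float  # USD per run (hypothetical)


def cheapest_good_enough(options: List[Option], min_quality: float) -> Optional[Option]:
    """Return the lowest-cost option whose quality meets the bar, or None."""
    candidates = [o for o in options if o.quality >= min_quality]
    return min(candidates, key=lambda o: o.cost_per_run, default=None)


# Invented numbers for illustration only.
tts_options = [
    Option("local-tts", quality=0.80, cost_per_run=0.00),
    Option("hosted-tts", quality=0.95, cost_per_run=0.40),
]

draft_choice = cheapest_good_enough(tts_options, min_quality=0.75)
final_choice = cheapest_good_enough(tts_options, min_quality=0.90)

print(draft_choice.name if draft_choice else "no option meets the bar")
print(final_choice.name if final_choice else "no option meets the bar")
```

The same pool of options gives different answers depending on the step: a low bar during exploration favors the local model, while a high bar for the final render still selects the hosted one.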

That’s where local and specialized models get interesting.

Frontier models for thinking, specialized models for doing

The pattern I keep coming back to is pretty simple:

Use frontier models for thinking.

Use specialized models for doing.

That doesn’t mean the smaller models are dumb. It just means they don’t have to be world-class at everything.

They need to be good at their part of the system.

One model plans.

Another generates.

Another transcribes.

Another cleans up.

Another creates.

Another removes.

Another varies.

Another reviews.

Another packages.

The frontier model may still be the brain.

But the rest of the system may be made up of smaller, cheaper, narrower tools that are better suited for execution.

That architecture also makes the whole system more flexible.

As new tools emerge, you can swap them in.

If a better TTS model shows up, replace that part.

If a better image variation model appears, plug it in.

If local video generation gets good enough, move that piece closer to your own infrastructure.

If a hosted model becomes too expensive, route around it.

If a niche model gets better at one task than the general-purpose model, use it.

That is the real advantage.

Not just using AI.

Not just picking the “best” AI tool.

But building a workflow that can evolve as the model landscape changes.

Because this space is moving too fast to hard-code your entire process around one vendor, one model, or one assumption about what will be best six months from now.

The coding version of this

The next place I want to test this more seriously is coding.

Right now, I still rely heavily on Claude and Codex for software work. But even there, I’m starting to wonder if the same pattern applies.

Maybe the most expensive reasoning model doesn’t need to do every step.

I already use heavier models for planning and architecture, then cheaper or faster models for more of the implementation.

In Claude terms, Opus may do the thinking while Sonnet does more of the build work.

The question is whether a local coding model can replace some of that implementation layer.

Can Claude or Codex create the plan, then hand the scoped execution to a local model?

Can the local model make the code changes, run the tests, fix the obvious issues, and get close enough to what I’d expect from Sonnet?

I don’t know yet.

But that’s the experiment.

And if it works, the implications are meaningful.

Because agentic development is not one prompt and one answer.

It’s loops.

Plan.
Implement.
Review.
Fix.
Test.
Restart.
Try again.

If every loop runs through the most expensive model, the cost compounds with every iteration.

But if the highest-cost model only handles the highest-judgment parts of the process, and local models handle more of the execution, the operating model changes.

That could make agentic development cheaper, faster, and easier to scale.

Not because local coding models are suddenly better than Claude or Codex at reasoning.

But because they may not need to be.

They may just need to be good enough at following a well-scoped plan.

That is a very different bar.
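A minimal sketch of that loop, assuming a planner, a cheap executor, an expensive executor, and a check such as a test run. All four are stubbed here as plain functions, and the escalation rule (retry locally a couple of times, then hand the step to the expensive model) is one possible policy, not an established method.

```python
from dataclasses import dataclass
from typing import Callable, List

Planner = Callable[[str], List[str]]   # goal -> list of scoped steps
Executor = Callable[[str], str]        # step -> candidate change
Checker = Callable[[str, str], bool]   # (step, change) -> did it pass?


@dataclass
class LoopResult:
    changes: List[str]
    cheap_calls: int
    expensive_calls: int


def run_loop(goal: str, plan: Planner, cheap: Executor, expensive: Executor,
             check: Checker, max_cheap_attempts: int = 2) -> LoopResult:
    """Plan once with the expensive model, execute with the cheap one,
    and escalate a step only after the cheap model keeps failing the check."""
    steps = plan(goal)
    expensive_calls = 1  # the planning call
    cheap_calls = 0
    changes: List[str] = []
    for step in steps:
        accepted = None
        for _ in range(max_cheap_attempts):
            cheap_calls += 1
            candidate = cheap(step)
            if check(step, candidate):
                accepted = candidate
                break
        if accepted is None:
            # Escalate. This sketch trusts the expensive result without re-checking.
            expensive_calls += 1
            accepted = expensive(step)
        changes.append(accepted)
    return LoopResult(changes, cheap_calls, expensive_calls)


# Stubs standing in for real models and a real test run (all hypothetical).
def plan_stub(goal: str) -> List[str]:
    return ["add endpoint", "tricky migration"]

def local_stub(step: str) -> str:
    return f"local:{step}"

def frontier_stub(step: str) -> str:
    return f"frontier:{step}"

def check_stub(step: str, change: str) -> bool:
    return "tricky" not in step  # pretend the local model fails on tricky steps

result = run_loop("ship feature", plan_stub, local_stub, frontier_stub, check_stub)
print(result)
```

In this toy run the expensive model is called twice (once to plan, once for the step the local model could not finish), while the cheap model absorbs the three remaining attempts. Whether real local coding models clear the check often enough to make this worthwhile is the open question.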

The best model depends on the job

The teams that win probably won’t be the ones who use the most expensive model for everything.

They’ll be the ones who understand the job to be done at each step, then choose the right model for that job.

Sometimes that will be a frontier reasoning model.

Sometimes it will be a coding model.

Sometimes it will be a local model.

Sometimes it will be a hosted API.

Sometimes it will be an open-source tool.

Sometimes it will be some weird little model that most people haven’t heard of yet, but that happens to be perfect for one part of the pipeline.

That is where this gets exciting.

The future of AI products probably isn’t one model doing everything.

It’s a stack.

Frontier models for thinking.

Specialized models for doing.

Local models where cost, speed, privacy, iteration, and customization matter.

And a workflow flexible enough to keep swapping in better pieces as the space matures.

That’s why I wouldn’t sleep on local models.

They may never beat Claude or ChatGPT at general reasoning.

They don’t have to.

They just have to be good enough at the execution layer.

And in a lot of real product workflows, the execution layer is where the volume is.

That’s where the cost is.

That’s where the iteration happens.

That’s where the bottlenecks show up.

That’s where customization matters.

And that’s where a stitched-together system of specialized models may end up being far more powerful than one expensive model trying to do everything.
