Self-Hosting AI: Running Whisper Locally on Mac and Windows

Running AI models locally used to be a niche pursuit for ML engineers. But with tools like Whisper, Ollama, and modern GPU hardware, self-hosting AI is practical for everyday developers. Here's how to set up local Whisper transcription on Mac and Windows.

Why Self-Host AI?

  • Privacy: Your data never leaves your machine
  • Cost: No API fees, no per-minute billing
  • Reliability: No outages, no rate limits
  • Learning: Understanding how AI actually works

Mac Setup (Apple Silicon)

Option 1: whisper.cpp (Fastest)

# Install
brew install whisper-cpp

# Download a model (base is good for most use cases)
curl -L -O https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin

# Transcribe
whisper-cpp -m ggml-base.en.bin -f audio.mp3 --output-txt

whisper.cpp uses Metal acceleration on Apple Silicon. On my M2 MacBook Pro, I get ~10-20x realtime transcription speed.
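If you have a folder of recordings, the CLI call above is easy to script. Here's a minimal Python sketch, assuming the Homebrew `whisper-cpp` binary is on your PATH; `build_cmd` and `transcribe_folder` are hypothetical helper names, not part of whisper.cpp:

```python
import subprocess
from pathlib import Path

def build_cmd(model: str, audio: str) -> list[str]:
    # Mirrors the CLI invocation above: model file, input file, text output
    return ["whisper-cpp", "-m", model, "-f", audio, "--output-txt"]

def transcribe_folder(folder: str, model: str = "ggml-base.en.bin") -> None:
    # Transcribe every .mp3 in a folder; whisper.cpp writes a .txt next to each file
    for audio in sorted(Path(folder).glob("*.mp3")):
        subprocess.run(build_cmd(model, str(audio)), check=True)
```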

Option 2: Python + Whisper (Easier API)

# Install
pip install openai-whisper

# Use
whisper audio.mp3 --model base --language en

This uses PyTorch with Metal backend. Slower than whisper.cpp but more flexible.
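That flexibility comes from the Python API (`whisper.load_model` / `model.transcribe`), which returns segment-level timestamps the CLI hides. A sketch that renders those segments as a timestamped transcript; the helper names here are mine, not part of the library:

```python
def stamp(seconds: float) -> str:
    # mm:ss label for a segment boundary
    m, s = divmod(int(seconds), 60)
    return f"{m:02d}:{s:02d}"

def segments_to_text(segments: list[dict]) -> str:
    # Render Whisper's segment dicts ({'start', 'end', 'text'}) as a readable transcript
    return "\n".join(
        f"[{stamp(seg['start'])} -> {stamp(seg['end'])}]{seg['text']}" for seg in segments
    )

def transcribe(path: str, model_name: str = "base") -> str:
    # Requires `pip install openai-whisper`; imported here so the helpers above
    # work without it installed
    import whisper
    model = whisper.load_model(model_name)
    result = model.transcribe(path)  # returns {'text', 'segments', 'language'}
    return segments_to_text(result["segments"])
```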

Option 3: Ollama (summarization, not transcription)

A note on a common misconception: as of this writing, Ollama does not run Whisper models. It's built on llama.cpp and serves text LLMs, so its place in this workflow is after transcription, summarizing the transcript with a local model:

# Install
brew install ollama

# Pull a model and summarize a transcript
ollama pull llama3.2
ollama run llama3.2 "Summarize this meeting transcript: $(cat transcript.txt)"

Windows Setup

GPU Acceleration

For best performance, you need GPU acceleration via CUDA or Vulkan:

  • NVIDIA: Install CUDA toolkit, use whisper.cpp with CUDA support
  • AMD/Intel: Use whisper.cpp with Vulkan support
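If a tool needs to decide between the two at runtime, a quick heuristic is to check for an NVIDIA driver. This is an illustrative sketch, not whisper.cpp's own logic; `pick_backend` is a hypothetical helper:

```python
import shutil

def pick_backend() -> str:
    # nvidia-smi on PATH implies an NVIDIA driver, so prefer CUDA;
    # otherwise fall back to Vulkan, which covers AMD and Intel GPUs
    if shutil.which("nvidia-smi"):
        return "cuda"
    return "vulkan"
```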

Option 1: whisper.cpp

# Download pre-built binary
# https://github.com/ggerganov/whisper.cpp/releases

# Download model
Invoke-WebRequest -Uri "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin" -OutFile "ggml-base.en.bin"

# Transcribe
.\main.exe -m ggml-base.en.bin -f audio.mp3 --output-txt

Option 2: Python + Whisper

# Install Python 3.11+ if needed
# Then:
pip install openai-whisper

# Run
whisper audio.mp3 --model base

Model Selection Guide

Model      Size    Speed     Accuracy   Use Case
tiny       39MB    Fastest   Good       Real-time, quick notes
base       74MB    Fast      Better     Default choice
small      244MB   Medium    Great      When accuracy matters
medium     769MB   Slower    Excellent  Professional use
large-v3   1.5GB   Slowest   Best       Maximum accuracy

For meetings, base or small is usually sufficient. I use base.en (English-only) for faster processing.

Integration Tips

Voice Activity Detection (VAD)

Don't transcribe silence. Use Silero VAD to detect speech:

import torch

# Load Silero VAD and its helper functions from torch hub
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

wav = read_audio('audio.wav')  # loads as 16 kHz mono, which the model expects
speech_timestamps = get_speech_timestamps(wav, model)  # [{'start': ..., 'end': ...}] in samples
# Only transcribe the speech segments, skip the rest

VAD can reduce transcription time by 60-80%.
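The timestamps Silero returns are in samples, and it helps to pad and merge them before handing chunks to Whisper so words at segment edges aren't clipped. A small sketch; the helper names and padding value are my own choices:

```python
def merge_with_padding(timestamps: list[dict], pad: int, total_len: int) -> list[dict]:
    # Pad each speech segment by `pad` samples and merge any that now overlap,
    # so Whisper gets a little context around each utterance
    merged = []
    for ts in timestamps:
        start = max(0, ts["start"] - pad)
        end = min(total_len, ts["end"] + pad)
        if merged and start <= merged[-1]["end"]:
            merged[-1]["end"] = max(merged[-1]["end"], end)
        else:
            merged.append({"start": start, "end": end})
    return merged

def speech_fraction(timestamps: list[dict], total_len: int) -> float:
    # Fraction of the file that contains speech; the rest can be skipped entirely
    return sum(t["end"] - t["start"] for t in timestamps) / total_len
```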

GPU Memory Management

Large models need GPU VRAM:

  • small: ~1GB VRAM
  • medium: ~2.5GB VRAM
  • large-v3: ~5GB VRAM

On Mac, unified memory means these requirements are less strict. On Windows with discrete GPUs, check your VRAM.
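You can turn those numbers into a simple picker that selects the largest model fitting your VRAM budget. The figures and headroom factor below are rough assumptions, not official requirements:

```python
# Approximate VRAM needs in GB, largest first (from the list above)
VRAM_GB = [("large-v3", 5.0), ("medium", 2.5), ("small", 1.0)]

def pick_model(vram_gb: float, headroom: float = 0.2) -> str:
    # Choose the largest model that fits, keeping headroom for activations and the OS
    for name, need in VRAM_GB:
        if need * (1 + headroom) <= vram_gb:
            return name
    return "base"  # small enough to run comfortably on CPU
```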

What I Built

I packaged all of this into Clearminutes – a meeting assistant that uses local Whisper for transcription. No setup required, no command line, just download and run.

Under the hood, Clearminutes uses:

  • whisper.cpp (whisper-rs bindings) for transcription
  • Metal acceleration on Mac
  • CUDA/Vulkan on Windows
  • Silero VAD for speech detection
  • Ollama for local summarization

If you want local AI without the setup hassle, give it a try.
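If you'd rather wire the summarization step up yourself, Ollama exposes a small HTTP API (`POST /api/generate`). A minimal sketch, assuming `ollama serve` is running locally and you've pulled a model such as `llama3.2`:

```python
import json
import urllib.request

def build_payload(transcript: str, model: str = "llama3.2") -> dict:
    # Request body for Ollama's /api/generate endpoint
    return {
        "model": model,
        "prompt": f"Summarize this meeting transcript in bullet points:\n\n{transcript}",
        "stream": False,  # return one complete response instead of a token stream
    }

def summarize(transcript: str, host: str = "http://localhost:11434") -> str:
    # Requires a running Ollama server; returns the model's summary text
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(transcript)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```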

The Future is Local

We're at an inflection point. Local AI is becoming practical for everyday use. The models are good enough, the hardware is fast enough, and the tools are mature enough.

The question isn't whether to self-host AI. The question is: why would you want your data anywhere else?