Local Speech Detection Filters Clips Before AI Processing

Before Bitcut sends any audio to the server for AI transcription, it first runs a quick speech detection pass right on your device. This local analysis determines which clips contain speech and which contain only non-speech audio (music, ambient sound, or silence). Only clips with detected speech are sent for transcription.

Why It Runs Locally

On-device speech detection provides three key benefits:

  • Speed — local analysis takes just a few seconds, even for multiple clips
  • Privacy — audio from clips without speech never leaves your device
  • Quota savings — clips without speech are skipped entirely, so they don't count toward your AI minutes quota

How It Works

1. Audio analysis

When you add clips using Smart Add with AI or Clips Enhancement, Bitcut analyzes the audio waveform of each clip on your device.
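If you're curious what an on-device waveform pass can look like, here is a minimal sketch of a short-time energy analysis. The frame size and function name are illustrative assumptions, not Bitcut's actual implementation, which isn't published.

```ts
// Illustrative sketch only: a short-time RMS energy pass over decoded mono PCM.
// Real detectors typically use a trained voice-activity model, but the point
// is the same: the waveform is inspected locally, without uploading anything.
function frameEnergies(samples: Float32Array, sampleRate: number): number[] {
  const frameSize = Math.round(sampleRate * 0.03); // ~30 ms analysis frames
  const energies: number[] = [];
  for (let start = 0; start + frameSize <= samples.length; start += frameSize) {
    let sum = 0;
    for (let i = start; i < start + frameSize; i++) {
      sum += samples[i] * samples[i]; // accumulate squared amplitude
    }
    energies.push(Math.sqrt(sum / frameSize)); // RMS energy of this frame
  }
  return energies;
}
```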

2. Speech vs. non-speech classification

Each clip is classified as containing speech or not. Clips are processed in parallel, so even a batch of 10 clips takes only a few seconds.
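A simplified view of that parallel classification is sketched below. The `Clip` type and `detectSpeech` stub are hypothetical stand-ins, not Bitcut's real API; the point is that clips are analyzed concurrently rather than one after another.

```ts
interface Clip {
  id: string;
  durationSeconds: number;
}

interface AnalyzedClip extends Clip {
  hasSpeech: boolean;
}

// Placeholder for the real on-device detector (e.g. a voice-activity model);
// this stub exists only so the sketch type-checks and runs.
async function detectSpeech(clip: Clip): Promise<boolean> {
  return clip.durationSeconds > 0; // stand-in logic, not real detection
}

// Classify every clip concurrently: a batch of 10 clips finishes in roughly
// the time of the slowest single clip, not 10x the time of one clip.
async function classifyClips(clips: Clip[]): Promise<AnalyzedClip[]> {
  return Promise.all(
    clips.map(async (clip) => ({ ...clip, hasSpeech: await detectSpeech(clip) })),
  );
}
```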

3. Routing

Clips with speech proceed to server-side AI transcription. Clips without speech are handled differently — they're processed with smart trimming based on visual content instead.
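In sketch form, the routing decision is a simple branch on the classification result. The function names below are placeholders for the two processing paths described above, not Bitcut's actual code.

```ts
interface AnalyzedClip {
  id: string;
  hasSpeech: boolean;
}

// Placeholder processing paths; the real ones belong to Bitcut's pipeline.
async function transcribeOnServer(clip: AnalyzedClip): Promise<void> {
  console.log(`${clip.id}: sent for AI transcription (uses AI minutes)`);
}

async function trimByVisualContent(clip: AnalyzedClip): Promise<void> {
  console.log(`${clip.id}: trimmed locally from visual content (no quota used)`);
}

async function routeClip(clip: AnalyzedClip): Promise<void> {
  if (clip.hasSpeech) {
    await transcribeOnServer(clip); // speech -> server-side transcription
  } else {
    await trimByVisualContent(clip); // no speech -> local smart trimming
  }
}
```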

Tip: If you have a mix of talking-head clips and B-roll, speech detection ensures only the talking-head clips use your AI quota. B-roll clips are trimmed using visual analysis instead.

What You See

During speech detection, clips on the timeline show a magnifying glass icon with the "Analyzing" state. This phase is fast — usually 1-2 seconds per clip. Once analysis is complete, clips either move to the transcription phase or are immediately trimmed and marked as complete.

Duration filter: Clips under 3 minutes are eligible for AI processing. Longer clips should be processed using Generate Shorts instead.
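As a sketch, the duration rule amounts to a simple eligibility check before any of the steps above run. The 3-minute cutoff comes from this page; the constant and function names are illustrative.

```ts
// Illustrative only: the 3-minute cutoff is documented above; names are made up.
const MAX_AI_CLIP_SECONDS = 3 * 60;

function isEligibleForAiProcessing(durationSeconds: number): boolean {
  return durationSeconds < MAX_AI_CLIP_SECONDS;
}
```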