I think a lot of people have heard of OpenAI’s local-friendly Whisper model, but I don’t see enough self-hosters talking about WhisperX, so I’ll hop on the soapbox:
Whisper is extremely good when you have lots of audio with one person talking, but fails hard in a conversational setting with people talking over each other. It’s also hard to sync up transcripts with the original audio.
Enter WhisperX: an improved Whisper implementation that automatically tags who is talking and tags each line of speech with a timestamp.
I’ve found it great for DMing TTRPGs — simply record your session with a conference mic, run a transcript with WhisperX, and pass the output to a long-context LLM for easy session summaries. It’s a great way to avoid slowing down the game by taking notes on minor events and NPCs.
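For anyone who wants to try the summary step, here is a rough sketch of flattening a diarized transcript into plain text before pasting it into an LLM. It assumes each segment is a dict with "start", "end", "text" and an optional "speaker" key, which is roughly what I'd expect from WhisperX's JSON output with diarization on, but check your own files. The function names fmt_ts and format_transcript are made up for this example.

```python
def fmt_ts(seconds):
    """Turn a float number of seconds into HH:MM:SS."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_transcript(segments):
    """Render segments as '[HH:MM:SS] SPEAKER: text' lines.

    Assumes each segment is a dict with 'start' and 'text' keys and an
    optional 'speaker' key (an assumption about the transcript format).
    """
    lines = []
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip empty segments
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(f"[{fmt_ts(seg['start'])}] {speaker}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hand-written sample data, not real WhisperX output.
    sample = [
        {"start": 0.0, "end": 4.2, "text": " You enter the tavern.", "speaker": "SPEAKER_00"},
        {"start": 4.5, "end": 7.1, "text": " I check for traps.", "speaker": "SPEAKER_01"},
    ]
    print(format_transcript(sample))
```

Keeping the timestamps in the text makes it easy to ask the LLM where in the recording something happened.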
I’ve also used it in a hacky script pipeline to bulk download podcast episodes with yt-dlp, create searchable transcripts, and scrub ads by having an LLM sniff out timestamps to cut with ffmpeg.
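The ad-cutting step could look something like the sketch below. This is not my actual script, just a minimal illustration: it assumes the LLM has already given you a list of (start, end) ad ranges in seconds and that you know the episode length. It inverts those into the ranges to keep, then builds an ffmpeg command using the atrim, asetpts and concat filters. The helper names keep_ranges and build_ffmpeg_command are hypothetical.

```python
def keep_ranges(ad_ranges, duration):
    """Invert a list of (start, end) ad ranges into the ranges to keep."""
    keeps = []
    cursor = 0.0
    for start, end in sorted(ad_ranges):
        # Clamp to the episode bounds in case the LLM overshoots.
        start = max(0.0, min(start, duration))
        end = max(0.0, min(end, duration))
        if start > cursor:
            keeps.append((cursor, start))
        cursor = max(cursor, end)  # also handles overlapping ad ranges
    if cursor < duration:
        keeps.append((cursor, duration))
    return keeps


def build_ffmpeg_command(src, dst, keeps):
    """Build an ffmpeg argument list that concatenates the kept audio ranges."""
    if not keeps:
        raise ValueError("nothing left to keep")
    parts = []
    labels = []
    for i, (start, end) in enumerate(keeps):
        # Trim one kept range and reset its timestamps so concat lines up.
        parts.append(f"[0:a]atrim=start={start}:end={end},asetpts=PTS-STARTPTS[a{i}]")
        labels.append(f"[a{i}]")
    parts.append(f"{''.join(labels)}concat=n={len(keeps)}:v=0:a=1[out]")
    return ["ffmpeg", "-i", src, "-filter_complex", ";".join(parts), "-map", "[out]", dst]


if __name__ == "__main__":
    # Made-up ad ranges for a 100-second file.
    keeps = keep_ranges([(10, 20), (50, 60)], 100)
    print(" ".join(build_ffmpeg_command("episode.mp3", "episode_noads.mp3", keeps)))
```

The resulting list can be passed to subprocess.run. Filtering like this re-encodes the audio, which is slower than a stream copy but avoids messing with cut points landing between frames.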
Privacy-friendly, modest hardware requirements, and good at what it does. WhisperX, apply directly to the forehead.
Hmm… Would be interesting to find out what kind of effect that has on the average marriage or relationship 😅
“You love the robot more than me!” 💔️
“WELL AT LEAST THE ROBOT LISTENS TO ME”
I mean, I’d imagine probably not a good one :) Somehow I imagine asking the AI to record a conversation is an instant argument escalator… as is asking it to read the facts back, and usually the topic would get switched rather than one side admitting fault in the conversation.
Actually I think there’s a Black Mirror episode on roughly that (not a device for recording audio when asked, but everyone having a chip in their head that automatically records their memories, and a huge fight when a husband discovers his wife deleted a few hours of recordings).
That was a great episode!