Speech-to-text models are the magic behind converting spoken language into written text. From dictating documents and emails to transcribing meetings, their applications are vast. Personally, I rely on this feature to send hands-free messages and control my smart home devices.
While those everyday uses are great, I recently discovered an even more powerful application for speech-to-text. When you combine speech recognition with the power of LLMs, things get really interesting. We’re talking Q&A documents, meeting summaries, and even enhancing RAG models.
Inventing a problem
If you’ve been following my articles, you know I’m a learn-by-doing kind of person. To really grasp something, I need a real-world problem to solve. Well, I found one that’s perfect for exploring speech recognition models.
We frequently conduct demos for our sales teams, customers, and partners. We’ve discovered that sharing a summary or Q&A document after these meetings, highlighting key points, adds tremendous value. However, doing this manually is time-consuming. Even if the demo itself is standard, the discussions and questions that arise are unique to each meeting. A document tailored to the specific conversation is far more valuable than a generic FAQ or summary.
Most of our meetings are done on Zoom. While Zoom’s transcription feature seems convenient, it often falls short when it comes to accuracy. You can tell as much when their own “Best practices for audio transcription” document includes tips like these:
- Keep the background noise to the minimum.
- Ask the participants to speak clearly into the microphone, and refrain from shuffling papers, typing loudly, or talking among themselves.
- Place the microphone near active speakers, such as those actively participating in a meeting where multiple people share a dial-in.
- Choose an external microphone over a built-in one for better sound quality.
On top of everything else, English isn’t my first language. I have a strong Indian accent, and I might even put the emphasis on the wrong syllable sometimes! This definitely makes things tricky for those speech recognition models.
The setting
We had just finished a sales enablement session and I had to create a recap document. This document needed to summarize the key takeaways from the call and provide actionable steps for closing deals. I’m not gonna lie, I’m a bit lazy. So, I started brainstorming ways to get a speech model and an LLM to do the heavy lifting for me.
I figured if I could get an accurate transcript of the meeting, I could then feed that into Google Gemini and let it generate a draft document. Even if it wasn’t perfect, it would at least give me a solid starting point.
Getting started
Since I use Gemini Pro often, Google’s Speech-to-text API seemed like the ideal solution, so I decided to give it a shot.
Google’s speech-to-text API uses Chirp. Google Cloud’s documentation is straightforward, and you can set everything up right in their console. After checking the pricing, I did a quick estimate and found that the Speech-to-Text V2 API would only cost about a dollar per hour. I won’t need this running all the time; at most I’ll have 2–3 hours of audio to transcribe.
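That quick estimate can be sketched in a few lines of Python. The per-minute rate below is an assumption based on the Speech-to-Text V2 list price I saw at the time (roughly $0.016 per minute); check the current pricing page before trusting it.

```python
# Back-of-the-envelope cost check. The rate is an assumption taken from
# the Speech-to-Text V2 list pricing at the time of writing.
PRICE_PER_MINUTE_USD = 0.016

def transcription_cost(hours: float) -> float:
    """Estimated transcription cost in USD for `hours` of audio."""
    return round(hours * 60 * PRICE_PER_MINUTE_USD, 2)

print(transcription_cost(1))  # about a dollar for a one-hour recording
print(transcription_cost(3))  # worst case for my backlog of demos
```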
The audio file format - m4a
Zoom records audio in the m4a format. Why m4a? Audiobook and podcast files, which also contain metadata including chapter markers, images, and hyperlinks, can use the .m4a extension, but more commonly use the .m4b extension (source: Wikipedia).
Google’s service couldn’t understand the m4a format, so I had to convert the m4a files to mp3 using ffmpeg. A quick search gave me the command:

ffmpeg -i input.m4a -c:v copy -c:a libmp3lame -q:a 4 output.mp3
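Since I had more than one recording to deal with, the same command can be wrapped in a small Python helper. This is just a sketch: it assumes ffmpeg is on your PATH, and the folder layout (all the `*.m4a` files in one directory) is my own.

```python
import subprocess
from pathlib import Path

def mp3_cmd(src: Path) -> list[str]:
    """Build the ffmpeg command that converts one .m4a file to .mp3."""
    dst = src.with_suffix(".mp3")
    return [
        "ffmpeg", "-i", str(src),
        "-c:v", "copy",        # copy any embedded artwork stream as-is
        "-c:a", "libmp3lame",  # encode the audio with the LAME MP3 encoder
        "-q:a", "4",           # VBR quality 4, plenty for speech
        str(dst),
    ]

def convert_all(folder: str) -> None:
    """Convert every .m4a recording in a folder (assumes ffmpeg on PATH)."""
    for src in Path(folder).glob("*.m4a"):
        subprocess.run(mp3_cmd(src), check=True)
```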
The mp3 file is what I used for transcribing.
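For reference, here is roughly what the transcription call looks like in Python. This is a hedged sketch, not my exact script: it assumes `pip install google-cloud-speech`, default application credentials, and the `us-central1` regional endpoint that serves Chirp. Note that the synchronous `recognize` call only accepts short clips (around a minute of audio); an hour-long meeting actually goes through the batch recognition path with the file staged in Cloud Storage.

```python
def transcribe_chirp(audio_path: str, project_id: str) -> str:
    """Transcribe a short clip with Google Speech-to-Text V2 (Chirp).

    Assumes `pip install google-cloud-speech` and default credentials.
    The synchronous `recognize` call is limited to short audio; longer
    recordings go through `batch_recognize` with a GCS URI instead.
    """
    from google.api_core.client_options import ClientOptions
    from google.cloud.speech_v2 import SpeechClient
    from google.cloud.speech_v2.types import cloud_speech

    # Chirp is served from regional endpoints such as us-central1.
    client = SpeechClient(
        client_options=ClientOptions(
            api_endpoint="us-central1-speech.googleapis.com"
        )
    )
    with open(audio_path, "rb") as f:
        content = f.read()

    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=["en-US"],
        model="chirp",
    )
    response = client.recognize(
        request=cloud_speech.RecognizeRequest(
            recognizer=f"projects/{project_id}/locations/us-central1/recognizers/_",
            config=config,
            content=content,
        )
    )
    # Stitch the per-result alternatives into one transcript string.
    return " ".join(r.alternatives[0].transcript for r in response.results)
```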
Google Model’s Output
I figured the transcription would only take a few minutes, tops. Boy, was I wrong! A one-hour MP3 audio file took a full hour to transcribe using Google’s speech-to-text service.
I couldn’t wait to see the results, but when I opened the file… let’s just say I wasn’t impressed.
and ife any soft dis is deyar shud go off D for over dis is D setup date doing Nau so hot ise 100 balance ine D distract account
What the heck is that?
Salvage efforts
The quality of the transcribed output was severely lacking. In an attempt to salvage the situation, I provided Gemini with the transcript and the original questions, hoping it could generate a usable document. This was my prompt for Gemini:
Understand this transcribed file,
Once done, generate a general FAQ document based on the concepts discussed on the call followed by a QnA section based on the questions asked.
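Wiring that prompt up programmatically might look like the sketch below. The `google-generativeai` package and the `gemini-1.5-pro` model name are assumptions on my part (I ran the prompt in the Gemini chat UI; pick whichever model you have access to).

```python
def build_prompt(transcript: str) -> str:
    """Assemble the two-part instruction I gave Gemini."""
    return (
        "Understand this transcribed file. Once done, generate a general "
        "FAQ document based on the concepts discussed on the call, "
        "followed by a QnA section based on the questions asked.\n\n"
        f"Transcript:\n{transcript}"
    )

def draft_recap(transcript: str, api_key: str) -> str:
    """Send the prompt to Gemini.

    Assumes `pip install google-generativeai`; the model name below is
    a hypothetical choice, not necessarily what I used.
    """
    import google.generativeai as genai
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-pro")
    return model.generate_content(build_prompt(transcript)).text
```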
Success!
Gemini’s ability to make sense of the garbled transcript was truly impressive. It didn’t just parse the text; it grasped the underlying concepts well enough to answer follow-up questions accurately.
Rediscovery
I bounced this off some friends, and they felt that a dollar an hour was a considerable expense. They asked me to check out Whisper.
Whisper
OpenAI’s Whisper is truly impressive. Not only is it open-source, it also runs locally on my M2 MacBook Pro. I transcribed the hour-long audio file in less than 7 minutes, a significant improvement over Google’s turnaround. While Whisper can be used from the command line, I personally prefer a dedicated application for ease of use.
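Running Whisper from Python takes only a couple of calls. A minimal sketch, assuming `pip install openai-whisper` (which also needs ffmpeg installed); the timestamp helper is my own addition for readable output, not part of Whisper.

```python
def transcribe_local(audio_path: str, model_size: str = "base") -> dict:
    """Run Whisper on-device. Larger sizes ("small", "medium")
    trade speed for accuracy. Assumes `pip install openai-whisper`."""
    import whisper
    model = whisper.load_model(model_size)
    return model.transcribe(audio_path)

def format_segments(segments) -> str:
    """Turn Whisper's segment list into a [mm:ss]-stamped transcript."""
    lines = []
    for seg in segments:
        start = int(seg["start"])
        lines.append(f"[{start // 60:02d}:{start % 60:02d}] {seg['text'].strip()}")
    return "\n".join(lines)
```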
Here’s a sample from Whisper’s transcribed output:
The primary purpose here is to automate all the manual processes around it because...
Comparison
I conducted an experiment with Gemini, initiating two separate chats with identical prompts but using different transcriptions. While both generated documents were reasonably accurate, the one based on the Whisper-generated transcript was notably more comprehensive. Let’s compare the opening paragraphs from each to illustrate this difference.
| Google Chirp | Whisper 🏆 |
| --- | --- |
| The transcribed file is a discussion about a financial product that automates escrow and payout operations. The product is designed to handle complex financial transactions involving multiple parties, such as project financing, trust and retention accounts, and e-commerce payments. The system allows for the configuration of various aspects of the deal, including payment structures, approvals, checklists, and notifications. | What is FinHub? FinHub is a product that helps banks automate their escrow and payout operations. It’s not limited to just buyer-seller transactions; it can handle various use cases like monitoring accounts, trust and retention accounts, project financing, escrow payments, and payouts. The goal is to automate manual processes, making it easier for banks to serve their customers and streamline operations. |
The verdict is clear: Whisper is the superior choice for transcription. Its accuracy surpasses Google’s model, and its ability to process audio files in a fraction of the time is remarkable.
And not to forget the option to run it directly on my laptop. 😀