
Whisper and Pyannote: The Ultimate Solution for Speech Transcription
In the digital age, voice remains an omnipresent medium of information, from business meetings to multimedia content and interactions with artificial intelligence. However, effectively leveraging this vocal data remains a challenge. This is where automatic speech recognition (ASR) and diarization technologies come into play, transforming speech into a usable, analyzable, and structured format.
This article explores two tools in these fields that you can use locally: Whisper, a speech recognition model developed by OpenAI, and Pyannote, a diarization solution that identifies speakers in an audio file. We will see how these technologies can be combined to achieve accurate transcriptions segmented by speaker, as well as their main applications and challenges. Furthermore, using them locally ensures the security of your sensitive data and privacy.
The examples in this article are taken from the audio transcription application available at:
- 1. Understanding Automatic Speech Recognition (ASR)
- 2. Diarization: Identifying Speakers
- 3. Aligning Transcriptions with Audio Segments
- 4. Potential Uses of Transcriptions
- 5. Challenges and Future Perspectives
- 6. Conclusion
1. Understanding Automatic Speech Recognition (ASR)
Automatic Speech Recognition (ASR) is a technology that converts human speech into text. It relies on models capable of analyzing an audio signal and extracting spoken words. ASR has applications in various fields, including accessibility for the hearing impaired, transcription of meetings or interviews, interaction with voice assistants, and indexing of audio and video content. Its effectiveness depends on several factors, such as sound quality, accent diversity, and background noise.
Basic Functioning
ASR operates through several key steps:
- Preprocessing: The audio signal is cleaned to reduce background noise and improve recording quality, then converted into a format usable by a model, often in the form of a spectrogram.
- Acoustic modeling: Acoustic features such as frequency and amplitude are extracted from the audio signal to identify corresponding phonemes, the basic units of spoken language.
- Language modeling: An algorithm predicts the most probable words and phrases based on the detected phonemes and contextual information.
- Decoding and correction: The model adjusts the final transcription by accounting for potential errors and linguistic context.
Modern ASR systems utilize deep neural networks, which enhance accuracy and accommodate greater linguistic diversity.
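To make the preprocessing and acoustic feature extraction steps more concrete, here is a minimal sketch using torchaudio that turns an audio file into a log-mel spectrogram, the typical input representation for modern ASR models. The file name audio.wav and the 16 kHz target rate are assumptions chosen for illustration, not part of any specific ASR system.

import torch
import torchaudio

# Load an audio file (the path is an assumption for illustration)
waveform, sample_rate = torchaudio.load("audio.wav")

# Resample to 16 kHz, the rate expected by most ASR models, including Whisper
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# Convert the waveform into a log-mel spectrogram (80 mel bands is a common choice)
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)(waveform)
log_mel = torch.log(mel + 1e-6)
print(log_mel.shape)  # (channels, n_mels, frames)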
The Whisper Model
Developed by OpenAI, Whisper is an advanced ASR model known for its robustness and versatility. It has been trained on a large volume of multilingual data, enabling it to perform effectively across various languages, accents, and contexts.
Advantages and Features:
- High accuracy: Whisper handles diverse accents and noisy environments well, improving transcription quality.
- Multilingual support: It supports multiple languages and can automatically translate transcriptions.
- Subtitle generation: It enables the creation of synchronized subtitles for audio and video content.
- Noise resistance: Its robustness against background noise makes it highly effective for real-world audio transcription.
Unlike traditional ASR models that require specific training on restricted datasets, Whisper is based on large-scale learning, allowing for better generalization. This makes it a powerful tool for any application requiring precise and high-quality transcription.
Implementing Whisper in Python
Integrating Whisper into a Python program is straightforward using the Hugging Face Transformers library. A practical implementation involves using a speech recognition pipeline optimized for different hardware configurations.
Here’s an approach based on the Whisper Large-v3 Turbo model (a lightweight version of Whisper Large v3):
- Model loading: The `WhisperAudioTranscriber` is initialized, automatically configuring the computing device (GPU, MPS, or CPU) based on hardware availability.
- Model and processor preparation: The Whisper model and its associated processor are loaded using `AutoModelForSpeechSeq2Seq` and `AutoProcessor`.
- Pipeline configuration: A pipeline is set up with specific parameters, such as segmenting the audio into 5-second chunks with a 1-second overlap and enabling timestamp return (`return_timestamps=True`).
- Audio file transcription: Calling the `transcribe(audio_path)` method executes the transcription and returns the text along with associated timestamps.
Here is a code snippet demonstrating this approach:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline


class WhisperAudioTranscriber:
    def __init__(self, model_name="openai/whisper-large-v3-turbo"):
        # Configure the device for computation
        if torch.cuda.is_available():
            self.device = "cuda:0"
            self.torch_dtype = torch.float16
        elif torch.backends.mps.is_available():
            self.device = "mps"
            self.torch_dtype = torch.float16
        else:
            self.device = "cpu"
            self.torch_dtype = torch.float32

        # Load the model and processor
        try:
            self.model = AutoModelForSpeechSeq2Seq.from_pretrained(
                model_name,
                torch_dtype=self.torch_dtype,
                low_cpu_mem_usage=True,
                use_safetensors=True,
            )
            self.model.to(self.device)
            self.processor = AutoProcessor.from_pretrained(model_name)

            # Configure the pipeline for automatic speech recognition
            self.pipe = pipeline(
                "automatic-speech-recognition",
                model=self.model,
                tokenizer=self.processor.tokenizer,
                feature_extractor=self.processor.feature_extractor,
                torch_dtype=self.torch_dtype,
                device=self.device,
                return_timestamps=True,
                generate_kwargs={"max_new_tokens": 400},
                chunk_length_s=5,
                stride_length_s=(1, 1),
            )
        except Exception:
            # Propagate loading errors to the caller
            raise

    def transcribe(self, audio_path: str) -> tuple:
        try:
            # Perform transcription with timestamps
            result = self.pipe(audio_path)
            transcription = result["text"]
            timestamps = result["chunks"]
            return transcription, timestamps
        except Exception:
            # Signal failure without raising, so callers can handle it gracefully
            return None, None
This implementation allows for the automatic transcription of an audio file passed as a parameter. Whisper’s native support for timestamps facilitates the alignment of audio segments with other tools (which we will explore in the diarization section), making it particularly useful for applications requiring temporal tracking of dialogues.
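As a usage sketch, the class above could be called as follows; the file name meeting.wav is an assumption for illustration:

# Usage sketch: transcribe a local audio file (the path is an assumption for illustration)
transcriber = WhisperAudioTranscriber()
transcription, timestamps = transcriber.transcribe("meeting.wav")
if transcription is not None:
    print(transcription)
    for chunk in timestamps:
        start, end = chunk["timestamp"]
        print(f"[{start} - {end}] {chunk['text']}")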
2. Diarization: Identifying Speakers
Diarization is the process of segmenting an audio recording to identify the different speakers. In other words, it structures conversations and answers the question: “Who is speaking, and when?”
Speaker identification is particularly useful in various contexts such as professional meetings, interviews, televised debates, or phone calls. It enhances the readability of transcriptions and facilitates the further analysis of audio data by associating each speech segment with its respective speaker.
Diarization Techniques
Diarization involves several key steps:
- Segmentation: The recording is divided into smaller segments based on pauses and speaker changes.
- Feature Extraction: Segments are analyzed to extract unique vocal fingerprints based on acoustic parameters (tone, frequency, intensity).
- Clustering: Similar-sounding audio segments are grouped together to assign each segment to a unique speaker.
- Identification: In some applications, diarization can be combined with speech recognition to attribute a name or role to each speaker.
Common approaches include neural network-based models, Gaussian Mixture Models (GMMs), and unsupervised clustering methods such as Spectral Clustering.
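As a toy illustration of the clustering step (not Pyannote's internal implementation), segment-level speaker embeddings can be grouped with an off-the-shelf algorithm such as Spectral Clustering; the embeddings below are random placeholders standing in for real vocal fingerprints:

import numpy as np
from sklearn.cluster import SpectralClustering

# Each row stands in for the embedding of one audio segment (random placeholders)
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 16)),  # segments from a first "speaker"
    rng.normal(loc=1.0, scale=0.1, size=(5, 16)),  # segments from a second "speaker"
])

# Group segments whose embeddings are close together; each cluster corresponds to one speaker
labels = SpectralClustering(n_clusters=2, random_state=0).fit_predict(embeddings)
print(labels)  # e.g. [0 0 0 0 0 1 1 1 1 1]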
The Pyannote Model
Pyannote is an advanced diarization solution that leverages deep neural network models to identify speakers with high accuracy. Designed for seamless integration into audio processing pipelines, it can perform voice segmentation and identification in complex recordings, including those with background noise and overlapping speech.
Capabilities and Advantages of Pyannote:
- High Accuracy: Leveraging deep learning, Pyannote enhances speaker detection and differentiation, even in challenging acoustic environments.
- Adaptability: It can be used across various types of recordings, such as meetings, podcasts, or phone calls.
- Modularity: Pyannote integrates seamlessly with automatic transcription tools like Whisper, enabling enriched transcriptions that include speaker identification.
- Compatibility: It supports various audio formats and can be integrated into existing applications via APIs and Python scripts.
Combining Pyannote with Whisper allows for detailed transcriptions that specify not only the text but also the identity of the speakers.
Implementing Pyannote in Python
Pyannote diarization can be integrated into a Python program using the pyannote.audio library, which allows you to load a pre-trained model and perform speaker segmentation in just a few steps.
Here’s an approach based on the Pyannote Speaker Diarization 3.1 model:
- Model Loading: An instance of the `PyannoteDiarizer` class initializes and loads the diarization model from Hugging Face.
- Diarization Execution: The `diarize(audio_path)` method analyzes the audio file, identifying different speakers and segmenting their speech contributions.
Here is a code snippet to implement this approach:
from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook
import torch


class PyannoteDiarizer:
    def __init__(self, hf_token: str):
        try:
            self.pipeline = Pipeline.from_pretrained(
                "pyannote/speaker-diarization-3.1",
                use_auth_token=hf_token
            )
            # Pick the best available device: CUDA GPU, Apple MPS, or CPU
            self.device = torch.device(
                "cuda" if torch.cuda.is_available()
                else "mps" if torch.backends.mps.is_available()
                else "cpu"
            )
            self.pipeline.to(self.device)
        except Exception:
            # If loading fails (e.g. missing token), mark the pipeline as unavailable
            self.pipeline = None

    def diarize(self, audio_path: str):
        if self.pipeline is None:
            return None
        try:
            # Run diarization with a progress indicator in the terminal
            with ProgressHook() as hook:
                diarization = self.pipeline(audio_path, hook=hook)
            return diarization
        except Exception:
            return None
This implementation provides precise segmentation of speakers in an audio recording, making it easier to align Whisper-generated transcriptions with the corresponding segments.
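A usage sketch for this class might look as follows; the environment variable name and the audio file are assumptions, and the pyannote/speaker-diarization-3.1 model requires accepting its terms of use on Hugging Face:

import os

# Usage sketch: the token source and audio path are assumptions for illustration
diarizer = PyannoteDiarizer(hf_token=os.environ["HF_TOKEN"])
diarization = diarizer.diarize("meeting.wav")
if diarization is not None:
    # Iterate over the detected speech turns and their speaker labels
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{speaker}: from {turn.start:.1f}s to {turn.end:.1f}s")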
3. Aligning Transcriptions with Audio Segments
Aligning transcriptions with audio segments involves matching each transcribed word or phrase to the correct speaker. In a conversation, overlaps, interruptions, or monologues complicate this task, requiring precise synchronization between transcription data and speaker identification.
For example, Whisper generates a transcription with timestamps:
- (from 0s to 3s) Hello, how are you?
- (from 3s to 5s) Hello, I’m fine, thank you.
Meanwhile, Pyannote identifies speakers and segments the audio as follows:
- SPEAKER_00: (from 0s to 3s)
- SPEAKER_01: (from 3s to 5s)
The challenge of alignment is to merge these pieces of information to associate each transcribed sentence with the correct speaker, ensuring an accurate representation of the conversation.
This step is essential for several reasons:
- Enhanced comprehension: A properly aligned transcription clearly identifies who is speaking and when, making the text more understandable, especially in multi-speaker conversations.
- Synchronized subtitle generation: In videos and podcasts, precise alignment is crucial for creating coherent subtitles, improving accessibility and user experience.
- Easier indexing and searching: Well-segmented text allows targeted searches for a specific speaker or a particular part of a conversation, which is useful for media analysis and journalism.
- Conversational analysis and interaction tracking: Alignment helps better understand the dynamics of a discussion by identifying speech turns, interruptions, and exchanges between participants.
- Optimized automated meeting summaries: In professional settings such as meetings or conferences, accurately linking transcriptions to audio segments facilitates the generation of structured and useful reports.
As you can see, aligning transcriptions with audio segments goes beyond simple time matching—it plays a crucial role in the effective utilization and enhancement of audio data.
Alignment Algorithms
Although it may seem simple, aligning transcriptions with audio segments is often complex and relies on merging data from automatic speech recognition (Whisper) and speaker diarization (Pyannote). The goal is to assign each transcribed text segment to the corresponding speaker while maintaining accurate timestamps.
Alignment Methodology Used
Alignment in this context is based on a temporal intersection approach between the segments provided by Whisper and those detected by Pyannote. The algorithm follows several steps:
- Extracting Transcription and Diarization Segments
  - Whisper generates a transcription divided into segments with associated timestamps.
  - Pyannote segments the audio, assigning a speaker identifier to each time interval.
- Matching Transcription Segments to Speakers
  - For each segment produced by Whisper, the best corresponding segment is identified from Pyannote’s output.
  - This match is based on the greatest temporal overlap between the transcribed segment and the detected speaker segments.
- Handling Special Cases
  - If a transcription segment extends beyond the ranges defined by diarization, the algorithm adjusts the alignment by considering the end of the last detected segment.
  - If multiple speakers overlap, the algorithm selects the one with the longest intersection duration with the transcription segment.
- Merging Consecutive Segments from the Same Speaker
  - Consecutive segments belonging to the same speaker are merged to avoid excessive text fragmentation.
Implementation in Python
Here is an example of the full alignment process implementation:
- The `align()` method associates each transcribed text segment with diarization data and returns a list where each element contains the speaker, start and end timestamps, and the corresponding text.
- The `find_best_match()` method searches for the diarization segment with the greatest temporal overlap with a given transcription segment.
- The `merge_consecutive_segments()` method merges adjacent segments belonging to the same speaker to improve the coherence of the aligned transcription.
class SpeakerAligner:
    def align(self, transcription, timestamps, diarization):
        speaker_transcriptions = []

        # Find the end time of the last segment in diarization (None if diarization is empty)
        last_segment = self.get_last_segment(diarization)
        last_diarization_end = last_segment.end if last_segment is not None else None

        for chunk in timestamps:
            chunk_start = chunk['timestamp'][0]
            chunk_end = chunk['timestamp'][1]
            segment_text = chunk['text']

            # Handle the case where chunk_end is None
            if chunk_end is None:
                # Use the end of the last diarization segment as the default end time
                chunk_end = last_diarization_end if last_diarization_end is not None else chunk_start

            # Find the best matching speaker segment
            best_match = self.find_best_match(diarization, chunk_start, chunk_end)
            if best_match:
                speaker = best_match[2]  # Extract the speaker label
                speaker_transcriptions.append((speaker, chunk_start, chunk_end, segment_text))

        # Merge consecutive segments of the same speaker
        speaker_transcriptions = self.merge_consecutive_segments(speaker_transcriptions)
        return speaker_transcriptions

    def find_best_match(self, diarization, start_time, end_time):
        best_match = None
        max_intersection = 0

        for turn, _, speaker in diarization.itertracks(yield_label=True):
            turn_start = turn.start
            turn_end = turn.end

            # Calculate the temporal intersection between the transcription chunk and the speaker turn
            intersection_start = max(start_time, turn_start)
            intersection_end = min(end_time, turn_end)

            if intersection_start < intersection_end:
                intersection_length = intersection_end - intersection_start
                if intersection_length > max_intersection:
                    max_intersection = intersection_length
                    best_match = (turn_start, turn_end, speaker)

        return best_match

    def merge_consecutive_segments(self, segments):
        merged_segments = []
        previous_segment = None

        for segment in segments:
            if previous_segment is None:
                previous_segment = segment
            elif segment[0] == previous_segment[0]:
                # Merge consecutive segments of the same speaker: keep the first start time,
                # take the latest end time, and concatenate the text
                previous_segment = (
                    previous_segment[0],
                    previous_segment[1],
                    segment[2],
                    previous_segment[3] + segment[3]
                )
            else:
                merged_segments.append(previous_segment)
                previous_segment = segment

        if previous_segment:
            merged_segments.append(previous_segment)
        return merged_segments

    def get_last_segment(self, annotation):
        # Iterate through all diarization segments and return the last one (None if there are none)
        last_segment = None
        for segment in annotation.itersegments():
            last_segment = segment
        return last_segment
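Putting the three classes together, an end-to-end sketch could look like this; the audio file and the token environment variable are assumptions for illustration:

import os

# End-to-end sketch combining the three classes defined above
audio_path = "meeting.wav"  # assumption for illustration

transcriber = WhisperAudioTranscriber()
transcription, timestamps = transcriber.transcribe(audio_path)

diarizer = PyannoteDiarizer(hf_token=os.environ["HF_TOKEN"])
diarization = diarizer.diarize(audio_path)

if timestamps is not None and diarization is not None:
    aligner = SpeakerAligner()
    for speaker, start, end, text in aligner.align(transcription, timestamps, diarization):
        print(f"{speaker} ({start:.1f}s - {end:.1f}s): {text}")

On the two-sentence example from the previous section, this would print one line per speaker turn, such as SPEAKER_00 (0.0s - 3.0s): Hello, how are you?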
4. Potential Uses of Transcriptions
Automatic transcriptions, enhanced by diarization and audio segment alignment, offer numerous practical applications across various fields. They unlock the value of audio and video content by facilitating its analysis, indexing, and structuring into a format that large language models (LLMs) can readily consume.
Broadcast Summarization
Transcriptions are particularly useful for generating summaries of television or radio broadcasts. Once the audio is transcribed and structured by speaker, an algorithm can extract the most relevant passages and condense the key information.
Example Application:
A 30-minute news podcast can be transcribed and analyzed to extract:
- The main topics covered (politics, economy, culture).
- Key quotes from guests.
- An automatic summary of the discussions, enabling quick content review.
This type of solution is commonly used by media outlets and streaming platforms to provide automatically generated summaries, improving content visibility, accessibility, and navigation.
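As a sketch of this idea, a speaker-aligned transcript can be flattened into plain text and passed to a summarization model; the aligned segments and the model name below are illustrative assumptions, and the summarization step could equally be a prompt sent to an LLM:

from transformers import pipeline

# Illustrative aligned segments, as produced by SpeakerAligner (placeholder content)
aligned_segments = [
    ("SPEAKER_00", 0.0, 3.0, "Hello, how are you?"),
    ("SPEAKER_01", 3.0, 5.0, "Hello, I'm fine, thank you."),
]

# Flatten the transcript into speaker-labelled text
transcript_text = "\n".join(f"{speaker}: {text}" for speaker, _, _, text in aligned_segments)

# The model choice is only an example; any summarization model or LLM prompt would work
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(transcript_text, max_length=60, min_length=10)[0]["summary_text"]
print(summary)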
Meeting Minutes Generation
In the professional world, transcriptions help automate the creation of meeting minutes, eliminating the need for participants to take manual notes.
Example Application:
A company holds a weekly meeting where multiple collaborators discuss various topics. By using an ASR + diarization pipeline:
- The audio is automatically transcribed.
- Each speech segment is assigned to the correct speaker.
- Post-processing structures the final document with clear sections (decisions made, tasks to complete, pending issues).
At the end of the meeting, a formatted report is generated and automatically sent to participants, ensuring better traceability of discussions and decision-making.
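A minimal sketch of the formatting step, assuming a speaker-aligned transcript like the one produced by SpeakerAligner; the section headings are illustrative and would typically be filled in by a post-processing step or an LLM:

def format_minutes(aligned_segments, title="Weekly meeting"):
    # Turn speaker-aligned segments into a simple Markdown minutes document
    lines = [f"# {title}", "", "## Transcript"]
    for speaker, start, end, text in aligned_segments:
        lines.append(f"- {speaker} ({start:.0f}s to {end:.0f}s): {text.strip()}")
    # Placeholder sections to be completed by a human or an LLM post-processing step
    lines += ["", "## Decisions", "", "## Action items"]
    return "\n".join(lines)

print(format_minutes([
    ("SPEAKER_00", 0.0, 3.0, "Hello, how are you?"),
    ("SPEAKER_01", 3.0, 5.0, "Hello, I'm fine, thank you."),
]))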
Sentiment Analysis
Enhanced transcriptions can also be used to perform sentiment analysis in conversations, speeches, or debates. Using natural language processing (NLP) algorithms, it is possible to detect emotional tone and identify trends in dialogues.
Example Application:
A company wants to analyze customer service phone calls to detect user satisfaction levels. By applying sentiment analysis to transcriptions:
- Keywords and positive or negative expressions are identified.
- A satisfaction score is assigned to each call.
- Calls requiring special attention are automatically flagged for managers.
This type of analysis is particularly useful in call centers, market research, or evaluating customer feedback on social media.
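A minimal sketch using a generic sentiment model from the transformers library; the example sentences, the default English model, and the flagging rule are illustrative assumptions:

from transformers import pipeline

# Score each transcribed segment and flag the negative ones
sentiment = pipeline("sentiment-analysis")  # downloads a default English model; pick one suited to your language

segments = [
    "Thank you, the issue was solved very quickly.",
    "I have been waiting for two weeks and nobody called me back.",
]

for text in segments:
    result = sentiment(text)[0]  # e.g. {'label': 'NEGATIVE', 'score': 0.99}
    flag = "  <-- needs attention" if result["label"] == "NEGATIVE" else ""
    print(f"{result['label']} ({result['score']:.2f}): {text}{flag}")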
Other Applications
Transcriptions can be leveraged across various other domains, depending on specific needs and objectives:
- Accessibility for the hearing impaired: Automatic subtitles improve comprehension of audio content.
- Content indexing and search: Enables quick retrieval of specific passages from large volumes of recorded audio (court archives, online courses, conferences, radio programs).
- Automatic translation: Once transcribed, conversations can be translated into multiple languages, facilitating the global distribution of content.
- Improving voice assistants: ASR models enhance AI systems’ ability to understand and respond to users more effectively.
With the rise of large language models (LLMs), converting spoken information into text format has become essential for processing, analyzing, and utilizing it at scale. Transcription and diarization technologies serve as crucial tools for extracting structured information from audio content and making it accessible to AI tools, thereby facilitating their integration into various use cases.
5. Challenges and Future Perspectives
Challenges Faced
Despite advances in speech recognition and diarization, several challenges persist and can impact the quality of transcriptions and audio segmentations.
Some of the main difficulties encountered include:
- Background noise and sound disturbances: In noisy environments (street, open-space offices, conferences), the vocal signal can be distorted, reducing transcription accuracy.
- Accents and linguistic variations: ASR models must be able to understand a wide range of accents and dialects, which can affect transcription reliability.
- Overlapping speech: When multiple speakers talk simultaneously, diarization struggles to distinguish vocal segments, sometimes attributing parts of a sentence to the wrong speaker.
- Speaker recognition in long conversations: In extended discussions, assigning segments to the same speaker can lose consistency, especially when intonation changes or context shifts.
Improving these models and algorithms remains crucial to ensuring better-quality transcriptions and segmentations.
The Challenge of Real-Time Processing
Real-time processing is another major challenge. For applications such as live subtitles, voice assistants, or meeting assistance, systems must analyze and transcribe audio instantly without significant performance degradation.
Key constraints related to real-time processing include:
- Processing latency: ASR and diarization models require computation time, which can slow down real-time execution.
- Hardware resource optimization: Fast processing often requires GPUs or TPUs, which can be costly in terms of resources and energy consumption.
- Accuracy vs. speed: A balance must be found between instant transcription and a slower but more reliable output, directly impacting the practicality and usability of these tools.
Potential solutions to enhance performance include neural architecture optimization, model latency reduction through techniques like quantization, and adapting models to continuous data streams.
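As an illustration of one of these techniques, here is a generic PyTorch sketch of dynamic int8 quantization applied to a Whisper checkpoint for CPU inference; it shows the general idea rather than a tuned recipe, and the actual latency and accuracy trade-off would need to be measured:

import torch
from transformers import AutoModelForSpeechSeq2Seq

# Load the float32 model on CPU, then quantize its Linear layers to int8
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3-turbo")
quantized_model = torch.ao.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # only the Linear layers are quantized
    dtype=torch.qint8,
)
# The quantized model can then replace the original one in a CPU-bound pipeline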
Future Perspectives
The next advancements in ASR and diarization technologies are expected to improve their efficiency and accuracy, particularly through the integration of advanced language models and real-time performance optimization.
Key areas of improvement include:
- More robust and multilingual models: Training on more diverse and richer datasets will enhance transcription accuracy across different languages and contexts.
- Better handling of accents and background noise: The use of speech enhancement techniques and intelligent filtering could mitigate the impact of background noise and adapt better to pronunciation variations.
- ASR + LLM fusion for better contextual understanding: Integrating LLMs with ASR can help correct transcription errors by considering the overall context of the speech.
- Optimization for embedded and mobile systems: Reducing the computational complexity of models will allow their deployment on low-power devices (smartphones, voice assistants, embedded applications).
- Real-time interaction and AI participation in conversations: Improved real-time analysis and extraction of audio data will pave the way for AI tools capable of understanding and responding instantly to discussions. This evolution will enable voice assistants and conversational agents to actively participate in conversations by providing contextualized responses or assisting with note-taking and real-time summarization.
The future of transcription and diarization technologies lies in a hybrid approach, combining advanced artificial intelligence, hardware optimization, and improved acoustic models. These innovations will lead to the development of more intuitive and responsive applications, seamlessly integrating into everyday interactions. They will enhance communication with AI, making voice assistants more fluid, meetings more productive with instant feedback, and multimodal exchanges more immersive through better contextual and intent understanding.
6. Conclusion
Automatic transcription and diarization technologies enable more advanced utilization of audio data. Whisper and Pyannote, when combined, provide accurate speaker-segmented transcriptions, making them useful in various contexts. Their local deployment or use in private infrastructures ensures better data confidentiality.
Their applications are vast: broadcast summaries, meeting minutes, sentiment analysis, accessibility. However, challenges remain, particularly in handling noise, accents, and real-time processing. The future of these technologies depends on more efficient models and deeper integration with conversational AI.
Beyond transcription and conversation analysis, speech recognition paves the way for more natural interactions with artificial intelligence. By making exchanges smoother and more context-aware, these advancements will enable new use cases, such as more responsive voice assistants, conversational agents that actively participate in discussions, and real-time annotation and analysis systems.