Chapter 8: OpenAI Whisper
In this chapter, I will show you how to use OpenAI Whisper for audio transcription and diarization. Whisper is a versatile tool that converts audio recordings into text, lending itself well to tasks such as transcribing interviews, radio shows, or any other type of recorded speech. Additionally, we will use speaker diarization to identify the different speakers in the audio.
Throughout this chapter, we will use reticulate for integrating Python code into our R workflow, pydub for audio manipulation, openai-whisper for audio transcription, torch for running deep learning models, numpy for numerical operations, and pyannote.audio for speaker diarization.
Install Python using reticulate and miniconda
As in the chapter on selenium, we first create a conda environment with all the required packages.
needs(reticulate)
#reticulate::conda_create(envname = "pyenv/whisper_env") # create empty environment
reticulate::conda_install(envname = "pyenv/whisper_env",
                          packages = c("pydub", "openai-whisper", "torch", "numpy"),
                          pip = TRUE) # install packages into a certain environment
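Note that pyannote.audio does not appear in the list above even though we import it later. Assuming it is not already available in your environment, it can be installed the same way (a minimal sketch; the PyPI package name is pyannote.audio):
reticulate::conda_install(envname = "pyenv/whisper_env",
                          packages = "pyannote.audio",
                          pip = TRUE) # install the diarization library via pip as well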
Thereafter, we need to make sure that we activate our environment. Moreover, for pyannote, you will need an access token from Hugging Face (get yourself a "read" key here: https://huggingface.co/settings/tokens); the pyannote models are gated, so you may also have to accept their user conditions on the respective model pages first. I stored my token in my R environment and pass it on to Python (see the snippet after the imports below), so that it is readily accessible without others seeing it.
needs(reticulate)
use_condaenv(condaenv = "pyenv/whisper_env")
Then we can load the required Python packages.
import torch
import whisper
from pyannote.audio import Pipeline
import wave
import os
from pydub import AudioSegment
from pyannote.core import Segment
import numpy as np
import pandas as pd
from scipy.io import wavfile
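The diarization pipeline below expects the Hugging Face token in a Python variable named hf_token. A minimal way to get it there – assuming you stored it in your .Renviron under a name like HF_TOKEN (that name is my choice, not a convention the packages require) – is to read it from the process environment, which the Python session embedded by reticulate shares with R:
import os
hf_token = os.environ.get("HF_TOKEN") # assumes the token is stored as HF_TOKEN in .Renviron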
Also, if we have a GPU available – as you probably should if you’re running these operations on a server – we need to tell our packages that they can use the GPU instead of the CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu") # I'm using MPS here because I'm on a Mac
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # if you're on a server with a suitable GPU

whisper_model = whisper.load_model("base").to("cpu")
# whisper_model = whisper.load_model("base").to(device) # if you're on a server with a suitable GPU

# Load diarization pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token = hf_token)
pipeline.to(device)
Since this script is first and foremost for students of Leipzig University, I had to make some changes so that it runs nicely on the university's server. Installing ffmpeg there – which whisper requires by default to read audio – is quite a chore. Here, we skip this step, which means the audio-loading step has to be rewritten from scratch. The shortcoming is that the script then only accepts .wav files. This can be circumvented by using the audioread library (find more info here); a rough sketch follows after the transcription function below.
def load_audio_manually(file_path, target_sr=16000):
    sr, audio = wavfile.read(file_path)
    if audio.ndim > 1: # downmix stereo to mono
        audio = np.mean(audio, axis=1)
    if sr != target_sr: # resample to the 16 kHz whisper expects
        import librosa
        audio = librosa.resample(audio.astype(np.float32), orig_sr=sr, target_sr=target_sr)
    return audio
def transcribe_audio(file_path): # transcribe it
    audio = load_audio_manually(file_path)
    audio = audio.astype(np.float32) / 32768.0 # Normalize if original was int16
    result = whisper_model.transcribe(audio)
    return result
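If you do need to read formats other than .wav, a loader built on audioread could look roughly like the sketch below. This assumes audioread (plus one of the decoding backends it wraps) is installed in the environment; the function name is mine:
import audioread

def load_audio_with_audioread(file_path, target_sr=16000):
    with audioread.audio_open(file_path) as f:
        sr, channels = f.samplerate, f.channels
        raw = b"".join(bytes(buf) for buf in f) # audioread yields buffers of 16-bit signed PCM
    audio = np.frombuffer(raw, dtype=np.int16).astype(np.float32)
    if channels > 1: # downmix to mono
        audio = audio.reshape(-1, channels).mean(axis=1)
    if sr != target_sr: # resample to the rate whisper expects
        import librosa
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    return audio / 32768.0 # scale to [-1, 1]; this can go straight into whisper_model.transcribe()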
Now we can make our first transcription:
ihaveadream_transcript = transcribe_audio("mlk_ihaveadream.wav")
ihaveadream_transcript["text"]
ihaveadream_transcript["segments"][0]
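Besides the full text, the result dictionary also contains per-segment timestamps under "segments". As a minimal sketch, these can be tabulated right away (keeping only the start, end, and text fields):
segments_df = pd.DataFrame(
    [{"start": s["start"], "end": s["end"], "text": s["text"]}
     for s in ihaveadream_transcript["segments"]]
)
segments_df.head()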
Diarization
We can also use speaker diarization to split audio files by speaker and transcribe each segment separately. This is particularly useful if the recording contains multiple speakers. Here, we use the AudioSegment class from pydub to load the audio file. The pipeline object is used to iterate over the speaker turns, and we save each speaker segment to a separate audio file.
audio = AudioSegment.from_wav("temp/thisisamerica_200_snippet.wav")

for turn, _, speaker in pipeline("temp/thisisamerica_200_snippet.wav").itertracks(yield_label=True):
    start_time = turn.start * 1000 # pydub slices audio in milliseconds
    end_time = turn.end * 1000
    segment_audio = audio[start_time:end_time]
    segment_file = f"temp/carrboro_market/{speaker}-{int(turn.start)}-{int(turn.end)}.wav"
    segment_audio.export(segment_file, format = "wav")
    #print(f"Saved segment: {segment_file}")
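If you already know how many speakers are in the recording, you can pass that information to the pipeline call; pyannote also accepts lower and upper bounds. A minimal sketch (the file path and counts are just placeholders):
diarization = pipeline("temp/thisisamerica_200_snippet.wav", num_speakers=2) # fix the number of speakers
# diarization = pipeline("temp/thisisamerica_200_snippet.wav", min_speakers=2, max_speakers=4) # or give a range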
Next, we want to transcribe each of the diarized segments and save the results to a CSV file.
import pandas as pd
import glob
# Initialize lists to store each attribute separately
speakers = []
start_times = []
end_times = []
texts = []

def transcribe_and_collect(file_path, speaker, start_time, end_time):
    # Perform transcription
    result = transcribe_audio(file_path) # Assuming transcribe_audio function exists
    # Append each attribute to its respective list
    speakers.append(speaker)
    start_times.append(start_time)
    end_times.append(end_time)
    texts.append(result['text'])

# Iterate over diarized segments (assuming you have diarization data)
for segment_file in glob.glob("temp/carrboro_market/SPEAKER_*.wav"):
    # file names look like SPEAKER_00-14-17.wav: speaker label, start, and end
    parts = segment_file.split('-')
    speaker = parts[0].split("/")[2]
    start_time = float(parts[1])
    end_time = float(parts[2].split('.')[0])

    transcribe_and_collect(segment_file, speaker, start_time, end_time)

# Collect the lists into a DataFrame
transcriptions_df = pd.DataFrame({
    "speaker": speakers,
    "start": start_times,
    "end": end_times,
    "text": texts
})
transcriptions_df

# Write the DataFrame to a CSV file
transcriptions_df.to_csv("temp/transcriptions.csv", index=False)
Finally, we can read the results back in and wrangle them in R.
transcriptions_df <- readr::read_csv("temp/transcriptions.csv") |>
  dplyr::arrange(start)
Rows: 33 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): speaker, text
dbl (2): start, end
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
transcriptions_df
# A tibble: 33 × 4
speaker start end text
<chr> <dbl> <dbl> <chr>
1 SPEAKER_01 0 14 community ties that were made tend to bring folks tha…
2 SPEAKER_00 14 17 any other big highlights over the years. Yeah.
3 SPEAKER_01 17 26 So because of its horizontal structure, there is no o…
4 SPEAKER_01 27 40 either provide a service of cooking food on the spot.…
5 SPEAKER_01 41 48 When they're in season, there's always that one perso…
6 SPEAKER_01 49 60 Sometimes we'll have musicians, we've had people come…
7 SPEAKER_01 61 71 We've had folks come around just to help other folks …
8 SPEAKER_01 72 74 I do know that um...
9 SPEAKER_01 74 94 There are occasionally collaborations with other town…
10 SPEAKER_01 95 105 We'll have like student groups who will use the reall…
# ℹ 23 more rows
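From here on, the usual dplyr wrangling applies. For instance, a minimal sketch that collapses consecutive rows from the same speaker into single turns (assuming the tibble from above):
transcriptions_df |>
  dplyr::mutate(turn = cumsum(speaker != dplyr::lag(speaker, default = dplyr::first(speaker)))) |>
  dplyr::group_by(turn, speaker) |>
  dplyr::summarize(start = min(start),
                   end = max(end),
                   text = paste(text, collapse = " "),
                   .groups = "drop")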
So here we are: a quick introduction to transcription and diarization in Python.
Further links
whisper GitHub repository
pyannote.audio documentation