Timestamp Generation

Creating Audiobooks by Aligning Audio with Text

One fun thing I did for my master’s dissertation was to align any Quranic recitation with the Quranic text. One way to generate an audiobook is through forced alignment using the Needleman-Wunsch algorithm.

Forced alignment is a technique for matching the spoken words in an audio recording to the written text of a book. This makes it possible to create a synchronized audio version of the book, in which each spoken word is linked to its place in the written text.

The Needleman-Wunsch algorithm is a dynamic programming algorithm commonly used for global sequence alignment in bioinformatics. To use it for forced alignment, the audio is first transcribed into a sequence of discrete units, such as words or phonemes, and the reference text is split into the same kind of units. The algorithm then finds the best overall alignment of the two sequences by scoring matches, mismatches, and gaps (units that appear in only one of the sequences).
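To make the idea concrete, here is a minimal pure-Python sketch of Needleman-Wunsch over word lists. It is an illustration only, not the code from my dissertation: the function name needleman_wunsch, the scoring values (match=1, mismatch=-1, gap=-1), and the output format (pairs of indices, with None marking a gap) are all assumptions chosen for readability. It runs in quadratic time and memory, so it is far too slow for long texts.

```python
def needleman_wunsch(ref, hyp, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists.

    Returns a list of (ref_index, hyp_index) pairs. A None on either
    side means that token has no counterpart in the other sequence.
    The scoring values are illustrative assumptions.
    """
    n, m = len(ref), len(hyp)

    # score[i][j] holds the best score for aligning ref[:i] with hyp[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == hyp[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair ref[i-1] with hyp[j-1]
                score[i - 1][j] + gap,      # ref[i-1] has no counterpart
                score[i][j - 1] + gap,      # hyp[j-1] has no counterpart
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if ref[i - 1] == hyp[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                pairs.append((i - 1, j - 1))
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


# Toy example: the transcription is missing the word "quick".
reference = ["the", "quick", "brown", "fox"]
transcript = ["the", "brown", "fox"]
print(needleman_wunsch(reference, transcript))
# → [(0, 0), (1, None), (2, 1), (3, 2)]
```

Working on whole words keeps the example short. Aligning characters or phonemes instead would tolerate small recognition errors better, at the cost of much longer sequences.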

Here are the steps to do it:

Generate transcriptions using OpenAI Whisper or a Hugging Face wav2vec 2.0 model. In my case, I fine-tuned a wav2vec 2.0 model using datasets from everyayah.com.
Align the transcription with the original text using the Needleman-Wunsch algorithm. Aligning long texts takes a lot of time, and the most performant library I found was biopython, which is actually built for bioinformatics.
Generate a subtitle/timing CSV that can then be used by audiobook readers. I am currently using it to generate word-level and ayah-level timestamps for the Qari Ayman Suwaid.
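As a rough sketch of the last step, the snippet below copies timestamps from the transcribed words onto the reference words and writes them to a CSV. Everything specific here is an assumption for illustration: the function name write_word_timings, the column names, the (start, end) times in seconds, and the alignment format (pairs of reference index and transcript index, with None marking a gap). The actual format an audiobook reader expects may differ.

```python
import csv
import io


def write_word_timings(ref_words, hyp_times, pairs, out):
    """Write one CSV row per reference word with its start/end time.

    ref_words: tokens of the original text
    hyp_times: (start, end) in seconds for each transcribed token
               (assumed to come from the speech recognizer)
    pairs:     alignment as (ref_index or None, hyp_index or None)
    out:       any writable text file object
    """
    writer = csv.writer(out)
    writer.writerow(["index", "word", "start", "end"])
    for ref_idx, hyp_idx in pairs:
        if ref_idx is None:
            # Extra transcribed word with no counterpart in the text: skip it.
            continue
        if hyp_idx is None:
            # Word in the text that the recognizer missed: leave times blank.
            start = end = ""
        else:
            start, end = hyp_times[hyp_idx]
        writer.writerow([ref_idx, ref_words[ref_idx], start, end])


# Toy example with made-up times; the middle word was not recognized.
buffer = io.StringIO()
write_word_timings(
    ref_words=["a", "b", "c"],
    hyp_times=[(0.0, 0.5), (0.9, 1.4)],
    pairs=[(0, 0), (1, None), (2, 1)],
    out=buffer,
)
print(buffer.getvalue())
```

Blank times for missed words could later be filled in by interpolating between neighbouring words, and ayah-level timestamps could be derived by taking the first start and last end of the words in each ayah. Both are design choices this sketch leaves open.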

I want to keep this short, but if anybody wants to learn more, do hit me up. #quran #wav2vec2 #forcealignment

Here is a demonstration: https://www.youtube.com/watch?v=3m-Us0Iaod8