Building a Custom Auto-Subtitle Generator using Whisper.cpp and OpenCV: A Beginner’s Guide

Introduction

In recent years, the need for automation in various aspects of life has increased significantly. One area where automation can make a real impact is video subtitling: creating subtitles by hand is time-consuming, especially when multiple languages are involved. In this blog post, we will explore how to build a custom auto-subtitle generator using Whisper.cpp for speech recognition and OpenCV for video handling, with a little help from FFmpeg for audio extraction.

What is Whisper.cpp?

Whisper.cpp is an open-source, MIT-licensed port of OpenAI's Whisper speech recognition model to plain C/C++, written by Georgi Gerganov. It provides a small C API for transcribing audio and runs the deep learning model locally on the CPU using quantized ggml model files, which makes it both easy to embed and highly accurate.

What is OpenCV?

OpenCV (Open Source Computer Vision Library) is a computer vision library that provides a wide range of functions for image and video processing, including object detection, feature extraction, and video analysis. One thing OpenCV does not handle is audio: its VideoCapture API decodes frames only. In this project, we will therefore use OpenCV to open the video and read its metadata, use FFmpeg to extract the audio track, and pass that audio through Whisper.cpp for speech recognition.

Setting up the Environment

Before we begin building our auto-subtitle generator, we need to set up the environment. Since we are writing C++, we build Whisper.cpp from source and install OpenCV's development packages rather than any Python bindings; we also need FFmpeg. Here's how you can do it on a Debian/Ubuntu system (adjust the package commands for your platform):

  • For Whisper.cpp, clone the repository, build it, and download a ggml model (the build also produces a bundled main example you can use to verify the setup):
    git clone https://github.com/ggerganov/whisper.cpp
    cd whisper.cpp && make
    ./models/download-ggml-model.sh base.en
  • For OpenCV, install the C++ development packages:
    sudo apt install libopencv-dev
  • For FFmpeg, which we will use to extract the audio track:
    sudo apt install ffmpeg

Building the Auto-Subtitle Generator

Now that we have set up the environment, let's build our auto-subtitle generator. The first step is to get the audio out of the video. Since OpenCV cannot do this itself, the sketch below uses OpenCV to sanity-check the video and read its timing metadata, and shells out to FFmpeg for the actual extraction (it assumes ffmpeg is on your PATH):

#include <opencv2/opencv.hpp>

#include <cstdlib>
#include <iostream>

int main() {
    cv::VideoCapture capture("input.mp4");
    if (!capture.isOpened()) {
        std::cerr << "Failed to open the video file." << std::endl;
        return 1;
    }

    // OpenCV gives us the timing metadata we will later use to
    // sanity-check the subtitles against the video.
    double fps    = capture.get(cv::CAP_PROP_FPS);
    double frames = capture.get(cv::CAP_PROP_FRAME_COUNT);
    std::cout << "Duration: " << frames / fps << " seconds" << std::endl;
    capture.release();

    // OpenCV has no audio API, so we shell out to FFmpeg to extract the
    // track as 16 kHz mono 16-bit PCM, the format whisper.cpp expects.
    if (std::system("ffmpeg -y -i input.mp4 -ar 16000 -ac 1 "
                    "-c:a pcm_s16le audio.wav") != 0) {
        std::cerr << "FFmpeg audio extraction failed." << std::endl;
        return 1;
    }

    return 0;
}

In this code, we use OpenCV’s VideoCapture class to open the video file and read its frame rate and frame count, which give us the duration, and then invoke FFmpeg to extract the audio track as a 16 kHz mono 16-bit PCM WAV file, the input format whisper.cpp expects. Shelling out keeps the example short; linking directly against FFmpeg’s libav* libraries would avoid the external process at the cost of much more code.

Next, we pass the extracted audio through Whisper.cpp. The sketch below uses whisper.cpp’s C API and makes two simplifying assumptions: the WAV file has a canonical 44-byte header (true for the output of the FFmpeg command above), and the ggml-base.en.bin model downloaded during setup sits in the working directory:

#include <whisper.h>

#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    // Read 16-bit PCM samples, skipping the canonical 44-byte WAV header
    // (a simplification; real-world WAV files can carry extra chunks).
    std::ifstream wav("audio.wav", std::ios::binary);
    wav.seekg(44);
    std::vector<float> pcmf32;
    int16_t s;
    while (wav.read(reinterpret_cast<char *>(&s), sizeof(s)))
        pcmf32.push_back(s / 32768.0f); // scale to float in [-1, 1)

    // Create a context from the ggml model downloaded during setup
    whisper_context * ctx = whisper_init_from_file("ggml-base.en.bin");
    if (!ctx) {
        std::cerr << "Failed to load the model." << std::endl;
        return 1;
    }

    // Run the full transcription pipeline with default greedy decoding
    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    if (whisper_full(ctx, params, pcmf32.data(), (int) pcmf32.size()) != 0) {
        std::cerr << "Transcription failed." << std::endl;
        return 1;
    }

    for (int i = 0; i < whisper_full_n_segments(ctx); ++i)
        std::cout << whisper_full_get_segment_text(ctx, i) << std::endl;
    whisper_free(ctx);
    return 0;
}

In this code, we load the raw samples into a float buffer, create a whisper context from the model file, run whisper_full over the samples, and print the text of each recognized segment.
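
Subtitles need timing as well as text. whisper.cpp exposes per-segment start and end timestamps through whisper_full_get_segment_t0 and whisper_full_get_segment_t1, reported in units of 10 ms. A minimal sketch, assuming ctx holds a completed transcription as above:

#include <whisper.h>

#include <cstdint>
#include <cstdio>

// Print each segment with its start/end time. whisper.cpp reports
// timestamps in units of 10 ms, so divide by 100 to get seconds.
void print_segments(whisper_context * ctx) {
    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        const int64_t t0 = whisper_full_get_segment_t0(ctx, i);
        const int64_t t1 = whisper_full_get_segment_t1(ctx, i);
        std::printf("[%6.2fs -> %6.2fs] %s\n", t0 / 100.0, t1 / 100.0,
                    whisper_full_get_segment_text(ctx, i));
    }
}

These are exactly the two calls we will use to emit SRT timestamps in the combined program below.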

Putting it all Together

Now that we have both halves, let’s put them together into a single program that extracts the audio, transcribes it, and writes a standard .srt subtitle file. This is again a sketch under the same assumptions (ffmpeg on the PATH, canonical WAV header, ggml-base.en.bin in the working directory):

#include <opencv2/opencv.hpp>
#include <whisper.h>

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Format a whisper timestamp (10 ms units) as an SRT time string.
static std::string srt_time(int64_t t) {
    const long long ms = t * 10;
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%02lld:%02lld:%02lld,%03lld",
                  ms / 3600000, (ms / 60000) % 60, (ms / 1000) % 60, ms % 1000);
    return buf;
}

int main() {
    // Sanity-check the video with OpenCV before doing any heavy work.
    cv::VideoCapture capture("input.mp4");
    if (!capture.isOpened()) {
        std::cerr << "Failed to open the video file." << std::endl;
        return 1;
    }
    capture.release();

    // Extract 16 kHz mono 16-bit PCM audio with FFmpeg.
    if (std::system("ffmpeg -y -i input.mp4 -ar 16000 -ac 1 "
                    "-c:a pcm_s16le audio.wav") != 0) {
        std::cerr << "FFmpeg audio extraction failed." << std::endl;
        return 1;
    }

    // Load the samples, skipping the canonical 44-byte WAV header.
    std::ifstream wav("audio.wav", std::ios::binary);
    wav.seekg(44);
    std::vector<float> pcmf32;
    int16_t s;
    while (wav.read(reinterpret_cast<char *>(&s), sizeof(s)))
        pcmf32.push_back(s / 32768.0f);

    // Transcribe with whisper.cpp.
    whisper_context * ctx = whisper_init_from_file("ggml-base.en.bin");
    if (!ctx) {
        std::cerr << "Failed to load the model." << std::endl;
        return 1;
    }
    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    if (whisper_full(ctx, params, pcmf32.data(), (int) pcmf32.size()) != 0) {
        std::cerr << "Transcription failed." << std::endl;
        return 1;
    }

    // Write one SRT entry per recognized segment.
    std::ofstream srt("output.srt");
    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        srt << i + 1 << "\n"
            << srt_time(whisper_full_get_segment_t0(ctx, i)) << " --> "
            << srt_time(whisper_full_get_segment_t1(ctx, i)) << "\n"
            << whisper_full_get_segment_text(ctx, i) << "\n\n";
    }

    whisper_free(ctx);
    std::cout << "Wrote output.srt" << std::endl;
    return 0;
}

In this code, we combine the previous steps: OpenCV validates the input video, FFmpeg extracts the audio, whisper.cpp transcribes it, and each segment’s timestamps are formatted into output.srt, which any video player can load alongside input.mp4. To build the program, compile with g++ and link against both libwhisper and OpenCV (for example via pkg-config --cflags --libs opencv4), adjusting include and library paths to wherever you built whisper.cpp.
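
As an optional final touch, OpenCV can render the subtitles directly onto the frames. Below is a sketch of a hypothetical burn_subtitles helper; the Segment struct is an assumption of this sketch, filled from the whisper.cpp segments above (timestamps converted to seconds):

#include <opencv2/opencv.hpp>

#include <string>
#include <vector>

struct Segment { double t0, t1; std::string text; }; // times in seconds

// Draw the active subtitle onto each frame of input.mp4 and write the
// result to subtitled.mp4.
void burn_subtitles(const std::vector<Segment> & segs) {
    cv::VideoCapture in("input.mp4");
    const double fps = in.get(cv::CAP_PROP_FPS);
    cv::Size size((int) in.get(cv::CAP_PROP_FRAME_WIDTH),
                  (int) in.get(cv::CAP_PROP_FRAME_HEIGHT));
    cv::VideoWriter out("subtitled.mp4",
                        cv::VideoWriter::fourcc('m', 'p', '4', 'v'), fps, size);

    cv::Mat frame;
    for (int n = 0; in.read(frame); ++n) {
        const double t = n / fps; // timestamp of this frame in seconds
        for (const Segment & seg : segs)
            if (seg.t0 <= t && t < seg.t1)
                cv::putText(frame, seg.text, {40, size.height - 40},
                            cv::FONT_HERSHEY_SIMPLEX, 1.0, {255, 255, 255}, 2);
        out.write(frame);
    }
}

Note that cv::VideoWriter writes video only; you would remux the original audio back into subtitled.mp4 with FFmpeg afterwards.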

Conclusion

In this blog post, we have explored how to build a custom auto-subtitle generator using Whisper.cpp and OpenCV. We set up the environment, extracted the audio from the video with FFmpeg, passed it through Whisper.cpp for speech recognition, and finally combined all the steps into one program that emits an SRT file. The project demonstrates that a deep learning speech recognition pipeline can run entirely locally with open-source tools.

Future Work

There are several ways to improve this project further:

  • Improve Speech Recognition Accuracy: accuracy scales with model size. Swapping ggml-base.en.bin for a larger model such as ggml-medium.en.bin improves transcription quality at the cost of speed and memory.
  • Support Multiple Languages: the English-only *.en models are only one option. Whisper’s multilingual models support dozens of languages; use one (e.g., ggml-base.bin) and set the language decoding parameter, as sketched after this list.
  • Handle Noisy Audio: transcription quality degrades on noisy recordings. Applying noise reduction (for example, FFmpeg audio filters) before passing the audio to Whisper.cpp would help.
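
Here is a minimal sketch of the multilingual configuration; it assumes a multilingual ggml model has been downloaded, and uses whisper.cpp’s whisper_full_params fields:

#include <whisper.h>

// Build decoding parameters for non-English audio. This assumes a
// multilingual model (e.g., ggml-base.bin, not ggml-base.en.bin).
whisper_full_params multilingual_params(const char * lang) {
    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.language  = lang;  // e.g., "de"; "auto" lets Whisper detect the language
    params.translate = false; // set to true to translate the output into English
    return params;
}

Passing the result to whisper_full in place of the default parameters is all the combined program above needs to subtitle non-English videos.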

In conclusion, building a custom auto-subtitle generator using Whisper.cpp and OpenCV is a nontrivial task that requires some familiarity with C++, audio formats, and machine learning tooling. However, with the right tools and techniques, it can be achieved.