Python Conference APAC 2024

Speeding Up Transcribing with GPUs and TPUs
2024-10-26, CLASS #6
Language: English

Automatic Speech Recognition (ASR), also known as speech-to-text, is a valuable technology, with models like Whisper powered by Python. Whisper is available via APIs; however, it takes a long time to process voice data. I would like to introduce several ways to speed up the Whisper model using local/remote GPUs and TPUs in Google Cloud.


Automatic Speech Recognition (ASR) is a technology that automatically transcribes voice into text. This technology is useful not only for creating subtitles but also for writing summaries. ASR can be used via APIs provided by tech giants, but it can be time-consuming, especially for long meetings. For example, OpenAI provides the Whisper model, but the latest model is still unavailable via its API. In this presentation, I will introduce several tips for speeding up ASR using the latest Whisper large-v3 model.
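
For reference, the baseline we start from fits in a few lines. This is a minimal sketch using the official openai-whisper package; the file name "meeting.mp3" is a placeholder for illustration:

    # Baseline: the official openai-whisper package (pip install openai-whisper)
    import whisper

    model = whisper.load_model("large-v3")    # downloads the weights on first use
    result = model.transcribe("meeting.mp3")  # "meeting.mp3" is a placeholder file
    print(result["text"])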

Unfortunately, the official local model from OpenAI does not support using three or more GPUs in one process. As an alternative, I will introduce "faster-whisper", a reimplementation of the official model. It is built on CTranslate2 and is almost twice as fast as the original. Based on this, I will also provide an advanced example of accelerating transcription of multiple voice files across multiple GPUs with the standard-library concurrent.futures module, as sketched below.
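
The sketch below illustrates the multi-GPU idea; the file names, GPU count, and round-robin split are illustrative assumptions, not the exact code from the talk:

    # Fan out files across GPUs: one faster-whisper model per GPU,
    # one worker thread per GPU (device_index selects the card).
    from concurrent.futures import ThreadPoolExecutor
    from faster_whisper import WhisperModel

    files = ["meeting1.mp3", "meeting2.mp3", "meeting3.mp3", "meeting4.mp3"]
    num_gpus = 2

    def transcribe_on_gpu(gpu_id, paths):
        model = WhisperModel("large-v3", device="cuda",
                             device_index=gpu_id, compute_type="float16")
        texts = []
        for path in paths:
            segments, _ = model.transcribe(path)  # segments is a lazy generator
            texts.append("".join(s.text for s in segments))
        return texts

    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        chunks = [files[i::num_gpus] for i in range(num_gpus)]  # round-robin split
        results = list(pool.map(transcribe_on_gpu, range(num_gpus), chunks))

Because the heavy lifting happens in CTranslate2's C++ code, which releases the GIL during inference, threads are typically enough to keep all GPUs busy without resorting to multiprocessing.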

Additionally, I will introduce "whisper-jax", which can run on TPUs. Its backend is JAX, which adopts a functional programming style. I won't go into the details of that paradigm in this session, but the tool can reduce processing time from tens of minutes to about a minute.
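
A minimal whisper-jax sketch on a TPU VM might look like the following; the checkpoint name and batch size are assumptions, and note that the pipeline class is spelled "FlaxWhisperPipline" upstream:

    # whisper-jax on a TPU: half precision plus batching keeps the TPU fed
    import jax.numpy as jnp
    from whisper_jax import FlaxWhisperPipline  # class name as spelled upstream

    pipeline = FlaxWhisperPipline("openai/whisper-large-v3",
                                  dtype=jnp.bfloat16, batch_size=16)

    outputs = pipeline("meeting.mp3", task="transcribe")  # first call JIT-compiles
    print(outputs["text"])                                # later calls reuse the compiled graph

The first call is slow because JAX traces and compiles the model; the headline speedups apply to subsequent calls.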

Finally, I will compare the processing times of all these methods and highlight important points to consider when using them.

Most code examples are within 10 lines, making them accessible and useful for beginners to ASR, a well-known machine learning application powered by Python.

Agenda

  1. Introduction to ASR
    - How does ASR automatically transcribe speech?
    - What exactly are the applications of ASR?
  2. Limitations of ASR using the Whisper API / official local model
    - How long does it take?
  3. Implementation of faster-whisper with GPUs
    - How does the number of GPUs affect processing multiple voice files?
  4. Acceleration using whisper-jax with TPUs
    - What is the fastest processing time?
  5. Comparison of the different methods
    - What should we note when using these methods?

Each topic takes five minutes, and we plan to hold a Q&A session in the last five minutes.

Yuta is a machine learning engineer at Churadata Inc. in Okinawa, Japan, where he has been building innovative AI solutions since Spring 2024. He is passionate about leveraging Python to explore the fascinating world of NLP and voice processing.
He is currently pursuing his PhD at the University of Electro-Communications; his research focuses on the intersection of technology and society, specifically the automatic detection of disinformation on social media.
In his spare time, Yuta enjoys exploring the beautiful island of Okinawa and indulging in his hobby of cosplaying as characters from his favorite Japanese video games.