Introduction to Audio & Speech Recognition
2022-08-29, HS 120

The audio (& speech) domain is going through a massive shift in terms of end-user performance. It is at the same tipping point that NLP was at in 2017, just before the Transformers revolution took over. We’ve gone from needing copious amounts of data to create Spoken Language Understanding systems to needing just a 10-minute snippet.

This tutorial will help you build strong code-first & scientific foundations for dealing with Audio data, and build real-world applications like Automatic Speech Recognition (ASR), Audio Classification, and Speaker Verification using backbone models like Wav2Vec2.0, HuBERT, etc.


Unlike general Machine Learning problems, where we either classify (i.e. assign a data point to a pre-defined class) or regress on a continuous variable, audio-related problems can be slightly more complex. Here, we either go from an audio representation to a text representation (ASR), or split a recording into segments by speaker (Diarization), and so on. This tutorial will not only help you build applications like these but also unpack the science behind them using a code-first approach.

Every step of the way, we’ll first write and run some code, then take a step back and unpack it all until it makes sense. We’ll make science fun again :)

The tutorial will be divided into 3 key sections:

  1. Read, Manipulate & Visualize Audio data
  2. Build your very own ASR system (using pre-trained models like Wav2Vec2.0) & deploy it
  3. Create an Audio Classification pipeline & run inference with the model on other downstream audio tasks
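
To give a flavour of the first section (read, manipulate, inspect audio), here is a minimal sketch that uses only the Python standard library rather than Librosa or the tutorial's own notebooks. The helper names, the 16 kHz sample rate and the 440 Hz test tone are illustrative assumptions, not taken from the supporting material.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16_000  # assumed sample rate; chosen for illustration only
TONE_HZ = 440.0       # assumed test tone
DURATION_S = 1.0


def synth_tone(freq_hz, duration_s, sample_rate):
    """Generate a sine wave as floats in [-1, 1]."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def write_wav(path, samples, sample_rate):
    """Write mono 16-bit PCM audio to a WAV file."""
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


def read_wav(path):
    """Read a mono 16-bit WAV file back into floats in [-1, 1]."""
    with wave.open(path, "rb") as f:
        sample_rate = f.getframerate()
        n = f.getnframes()
        raw = f.readframes(n)
    pcm = struct.unpack(f"<{n}h", raw)
    return [p / 32767 for p in pcm], sample_rate


def rms(samples):
    """Root-mean-square amplitude, a rough loudness measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_freq(samples, sample_rate):
    """Crude pitch estimate: a pure tone crosses zero twice per cycle."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings * sample_rate / (2 * len(samples))


def round_trip_demo():
    """Synthesize a tone, save it, load it again and describe it."""
    path = os.path.join(tempfile.mkdtemp(), "tone.wav")
    write_wav(path, synth_tone(TONE_HZ, DURATION_S, SAMPLE_RATE), SAMPLE_RATE)
    samples, sample_rate = read_wav(path)
    return sample_rate, rms(samples), zero_crossing_freq(samples, sample_rate)


if __name__ == "__main__":
    sr, level, freq = round_trip_demo()
    print(f"sample rate: {sr} Hz, RMS: {level:.3f}, estimated pitch: {freq:.1f} Hz")
```

The zero-crossing estimate only works for clean, single-pitch signals; real speech needs spectral features, which is where libraries such as Librosa come in.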

By the end of the tutorial, you’ll have developed a strong intuition for Audio data and learned how to leverage large pre-trained backbone models for downstream tasks. You’ll also learn how to create quick demos to test and share your models.

Libraries: HuggingFace, SpeechBrain, PyTorch & Librosa


Public link to supporting material

https://github.com/Vaibhavs10/ml-with-audio

Abstract as a tweet

Learn how to build, demystify and deploy state-of-the-art audio models, all in under 3 hours!

Project Homepage / Git

https://github.com/Vaibhavs10/ml-with-audio

Domains

Machine Learning, Open Source Library

Expected audience expertise: Domain

some

Expected audience expertise: Python

some

I am a Data Scientist and a Master's candidate in Computational Linguistics at Universität Stuttgart. I am currently researching Speech, Language and Vision methods for extracting value from unstructured data.

In my previous stint with Deloitte Consulting LLP, I worked with Fortune Technology 10 clients to help them make data-driven (profitable) decisions. In my spare time, I served as a Subject Matter Expert on Google Cloud Platform, helping to build scalable, resilient and fault-tolerant cloud workflows.

Before this, I worked with startups across India to build Social Media Analytics Dashboards, Chatbots, Recommendation Engines, and Forecasting Models.

My core interests lie in Natural Language Processing, Machine Learning/Statistics, and cloud-based product development.

Apart from work and studies, I love travelling and delivering workshops and talks at conferences and events across APAC and the EU, including DevConf.CZ, Berlin Buzzwords, DeveloperDays Poland, PyCon APAC (Philippines), Korea, Malaysia, Singapore, India, WWCode Asia Connect, Google DevFest, and Google Cloud Summit.