01 January 2023

Automatically detecting audio-video async and correcting it

It's a bit disorienting when you see a person's lip movements out of sync with the audio. There are various reasons this can happen. On a live stream, it could be because of network congestion or because the program doing the video encoding didn't get enough CPU cycles. It can also happen when re-encoding or splitting a video.

Whether you want to measure how much async there is or fix it outright, machine learning has thankfully come to our aid. I created a simplistic open source program that finds the moments when a person begins to speak after a silence, matches them against the audio to measure the async, and then uses FFmpeg to correct it.

Code: https://github.com/nav9/audio_video_synchronizer
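
For the correction step, FFmpeg can shift the audio relative to the video once the offset is known. Here's a minimal sketch of that idea (the 0.3 second offset and file names are hypothetical, and this is not necessarily the exact command the repo builds):

import subprocess

def shift_audio(input_file, output_file, offset_seconds):
    """Re-mux the file with the audio delayed (positive offset) or
    advanced (negative offset) relative to the video, without re-encoding."""
    command = [
        "ffmpeg", "-y",
        "-i", input_file,                   # input 0: source of the video stream
        "-itsoffset", str(offset_seconds),  # shift timestamps of the next input
        "-i", input_file,                   # input 1: same file, used only for its audio
        "-map", "0:v", "-map", "1:a",       # video from input 0, shifted audio from input 1
        "-c", "copy",                       # copy streams, no re-encoding
        output_file,
    ]
    subprocess.run(command, check=True)

# Example: delay the audio by 0.3 seconds (hypothetical value).
shift_audio("input.mp4", "output_synced.mp4", 0.3)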

MediaPipe was used to identify the lip movements. I was surprised at how robust it is. Even if the person's face is turned sideways, it still tracks and estimates the face landmark positions as x,y,z points. 468 points are tracked.
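
A rough sketch of how MediaPipe's Face Mesh can be queried for lip landmarks (the video file name is hypothetical, and the two landmark indices shown are just one simple way to gauge mouth opening; the repo may use a different set of points):

import cv2
import mediapipe as mp

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)
capture = cv2.VideoCapture("input.mp4")  # hypothetical file name

while True:
    ok, frame = capture.read()
    if not ok:
        break
    # MediaPipe expects RGB; OpenCV reads BGR.
    results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_face_landmarks:
        landmarks = results.multi_face_landmarks[0].landmark  # 468 normalized x,y,z points
        upper_lip, lower_lip = landmarks[13], landmarks[14]   # points on the inner lips
        mouth_opening = abs(upper_lip.y - lower_lip.y)        # larger when the mouth is open
        # A rising mouth_opening after a quiet stretch marks a likely speech onset.

capture.release()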

To identify human speech in the audio, I used a voice activity detection algorithm.
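
The post doesn't pin down which VAD; as one illustration, the webrtcvad package offers a simple frame-by-frame interface over 16-bit mono PCM in 10/20/30 ms frames (file name and sample rate below are assumptions):

import wave
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least strict) to 3 (most strict)

with wave.open("audio_16k_mono.wav", "rb") as wav:  # hypothetical 16 kHz mono file
    sample_rate = wav.getframerate()
    samples_per_frame = int(sample_rate * 0.03)      # 30 ms frames
    frame_bytes = samples_per_frame * 2              # 16-bit samples = 2 bytes each
    speech_times = []
    t = 0.0
    while True:
        frame = wav.readframes(samples_per_frame)
        if len(frame) < frame_bytes:
            break
        if vad.is_speech(frame, sample_rate):
            speech_times.append(t)  # start time of a voiced 30 ms frame
        t += 0.03

These voiced-frame timestamps give the audio-side speech onsets that get compared against the lip-movement onsets.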

There's a lot more work to be done to make this program generic. Ideally, a Recurrent Neural Network (perhaps an LSTM) could analyze the lip movements temporally and figure out which lip movements actually correspond to speech and which don't. A probability score could then be generated by matching the detected sounds against the lip movements. There are pre-trained models like Vosk, which can estimate offline which words are spoken. Python's speech_recognition package comes in handy here.
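
To illustrate the offline transcription idea, here is a sketch using Vosk's own Python API rather than the speech_recognition wrapper, just to keep it short (model path and audio file are assumptions; the audio must be 16-bit mono PCM):

import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("model")  # path to a downloaded Vosk model (assumption)
with wave.open("audio_16k_mono.wav", "rb") as wav:
    recognizer = KaldiRecognizer(model, wav.getframerate())
    recognizer.SetWords(True)  # request per-word timestamps
    while True:
        data = wav.readframes(4000)
        if not data:
            break
        recognizer.AcceptWaveform(data)
    result = json.loads(recognizer.FinalResult())
    # result.get("result", []) holds words with "start" and "end" times that
    # could be matched against the lip-movement timeline to score the sync.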

