15 January 2023

Comparing video quality by viewing videos side by side

When you convert videos to a different quality, you'll want to compare them. While there are objective metrics for comparing quality, subjective evaluation by a human is generally more accurate. When looking for solutions, I found a brilliant one named Vivict. It not only lets you compare streaming videos, it also lets you compare local videos, via the little white icon near the address bar. The best part of Vivict is that it shows half of one video and half of the other, and lets you move the mouse to reveal more or less of each video.

A limitation of Vivict: it does not support all video file formats. When I tried to play an mkv file, it didn't work.

Another solution is to simply use ffplay like this:

ffplay -f lavfi "movie=leftVideo.mp4,scale=iw/2:ih[v0];movie=rightVideo.mp4,scale=iw/2:ih[v1];[v0][v1]hstack"

Comparing videos using ffplay
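The command above stacks both full frames next to each other. To get a Vivict-style split instead, where the left half comes from one video and the right half from the other, the same idea works with crop filters. Here's a minimal sketch with placeholder file names; I'm building the filtergraph in Python only to keep the quoting readable:

import subprocess

# Vivict-style split with ffplay: keep the left half of the first video and the
# right half of the second, then stitch them at the middle of the frame.
# Assumes both videos have the same resolution; file names are placeholders.
left, right = "leftVideo.mp4", "rightVideo.mp4"
filtergraph = (
    f"movie={left},crop=iw/2:ih:0:0[v0];"      # left half of video 1
    f"movie={right},crop=iw/2:ih:iw/2:0[v1];"  # right half of video 2
    "[v0][v1]hstack"                           # join the halves side by side
)
subprocess.run(["ffplay", "-f", "lavfi", filtergraph], check=True)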

A simpler solution:

I created a program similar to Vivict, which shows both videos in the same kind of split view.

Feel free to fork and modify the repository, or contact me on the repository's discussion tab to collaborate.



01 January 2023

Automatically detecting audio-video async and correcting it

It's a bit disorienting when you see a person's lip movements out of sync with the audio. There are various reasons this can happen. On a live stream, it could be because of network congestion, or because the program doing the video encoding didn't get enough CPU cycles. It can also happen when re-encoding or splitting a video.

Whether you want to measure how much async there is or fix it, machine learning has thankfully come to our aid. I created a simple open source program that finds the moments when a person begins to speak after a silence, matches them against the audio to measure the async, and then uses FFmpeg to fix it.

Code: https://github.com/nav9/audio_video_synchronizer
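Once the offset has been measured, applying it is the easy part. Here's a minimal sketch of the FFmpeg step, assuming a hypothetical measured offset of 0.3 seconds; the function and file names are placeholders, not the repository's actual code:

import subprocess

def shift_audio(input_file, output_file, offset_seconds):
    # Re-mux the same file twice: video from input 0, audio from input 1, with
    # input 1's timestamps shifted by offset_seconds. A positive offset delays
    # the audio, a negative one advances it; no re-encoding is done.
    subprocess.run([
        "ffmpeg", "-y",
        "-i", input_file,
        "-itsoffset", str(offset_seconds),
        "-i", input_file,
        "-map", "0:v:0", "-map", "1:a:0",
        "-c", "copy",
        output_file,
    ], check=True)

shift_audio("input.mp4", "fixed.mp4", 0.3)  # hypothetical 0.3 second correction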

MediaPipe was used to identify the lip movements. I was surprised at how robust it is. Even if the person's face is turned sideways, it still tracks and estimates the facial landmark positions as x, y, z points. Its face mesh tracks 468 points.
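To give an idea of how little code the tracking needs, here's a rough sketch (not the repository's exact code) that reads a video and extracts a crude mouth-opening signal per frame with MediaPipe's face mesh; the file name is a placeholder:

import cv2
import mediapipe as mp

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)
cap = cv2.VideoCapture("video.mp4")  # placeholder file name
mouth_gaps = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB; OpenCV reads frames as BGR
    results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_face_landmarks:
        points = results.multi_face_landmarks[0].landmark  # 468 normalized x, y, z points
        # Distance between the inner upper lip (index 13) and inner lower lip
        # (index 14) gives a crude "how open is the mouth" signal for this frame
        mouth_gaps.append(abs(points[13].y - points[14].y))
    else:
        mouth_gaps.append(0.0)
cap.release()
face_mesh.close()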

To identify human speech in the audio, I used a voice activity detection algorithm.
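As an illustration (not necessarily the exact library my program uses), the webrtcvad package classifies 10, 20 or 30 ms frames of 16-bit mono PCM as speech or silence, which is enough to find the moments where speech starts after silence:

import webrtcvad

vad = webrtcvad.Vad(2)   # aggressiveness from 0 (lenient) to 3 (strict)
sample_rate = 16000      # webrtcvad accepts 8, 16, 32 or 48 kHz mono 16-bit PCM
frame_ms = 30
frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample

def speech_onsets(pcm):
    # Yield the times (in seconds) at which speech begins after silence
    was_speech = False
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        is_speech = vad.is_speech(pcm[offset:offset + frame_bytes], sample_rate)
        if is_speech and not was_speech:
            yield offset / 2 / sample_rate
        was_speech = is_speech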

There's a lot more work to be done to make this program generic. Ideally, a Recurrent Neural Network (perhaps an LSTM) could be used to analyze lip movements temporally and figure out which lip movements are actually speech and which aren't. A probability score would then need to be generated from the detected sounds and matched with the lip movements. There are pre-trained models like Vosk which can be used offline to estimate which words are spoken; Python's speech_recognition package comes in handy here.
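For reference, offline recognition with Vosk looks roughly like this; the model name and audio file are placeholders, and SetWords(True) asks for per-word timestamps that could later be lined up against the detected lip movements:

import json
import wave
from vosk import Model, KaldiRecognizer

wav = wave.open("speech_16k_mono.wav", "rb")  # placeholder: a 16 kHz mono WAV file
recognizer = KaldiRecognizer(Model("vosk-model-small-en-us-0.15"), wav.getframerate())
recognizer.SetWords(True)  # include start/end timestamps for every recognized word

while True:
    data = wav.readframes(4000)
    if not data:
        break
    if recognizer.AcceptWaveform(data):         # True at the end of an utterance
        print(json.loads(recognizer.Result()))  # "result" holds word, start, end, conf entries
print(json.loads(recognizer.FinalResult()))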