Trying out OpenAI Whisper

I was quite fascinated when OpenAI released Whisper.

Audio-to-text has been possible in the cloud for quite some time. What interested me is that there is now a fully offline solution with decent quality.

Previously

The first solution I tried was the built-in Windows voice-to-text feature with a virtual mic. The caveat was that it wasn't stable enough and took a lot of time.

Second attempt

I looked through the discussion forum of the Whisper repo and found a solution (Podalize) that was even cooler. It promised to split any audio into tracks and transcribe it.

The difficult part was getting it to run. It was reportedly tested only on Linux, and my goal was to run it on Windows.

Using WSL

That was my first idea. I quickly installed it, but then realized that I needed UI support, and WSL is not that great with it yet. So I decided to install Anaconda on Windows.

Anaconda

I spent 2 hours on this attempt. Downloading and installing Anaconda was easy. However, following the commands from the repo gave me a bunch of errors. I tried googling them, asking ChatGPT about them, and installing Python packages manually. Nothing seemed to work. I eventually gave up, thinking that the setup was very Linux-specific and that it would take me a few more hours of installing package after package without much hope of success.

[Screenshot: one of the multiple issues I had with Anaconda on Windows]

Azure Linux

Since I have free credits on Azure, I went and spun up a new VM. I tried to get spot pricing for it, but one provisioning attempt failed, saying that there were no slots available. Then I googled which region has cheaper VMs, without much luck, so I went with the default option.

After provisioning the VM, installing updates, and setting up RDP, I connected and installed Anaconda again, this time on Ubuntu. Luck was on my side and I managed to launch Podalize.

However, when I tried to actually use it, it didn't produce any output. Neither the file option nor the YouTube option worked for me. After a few relaunches, I gave up on this option too.

Using Whisper on Windows

In parallel with trying the Linux VM, I decided to run Whisper directly on Windows. This seemed to be an easier task. Having Anaconda already installed helped, and after 6 minutes of waiting, it printed text to the console!
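
For reference, here is a minimal sketch of what such a run can look like from Python. It is not the exact set of commands used above. It assumes the openai-whisper package is installed (pip install openai-whisper) and ffmpeg is on the PATH. The file name recording.wav and the helper names are placeholders made up for illustration.

```python
import shutil
import subprocess


def build_whisper_command(audio_path, model="small", language="en"):
    """Build the argument list for the `whisper` command-line tool."""
    return [
        "whisper", audio_path,
        "--model", model,        # e.g. tiny, base, small, medium, large
        "--language", language,  # skip language auto-detection
        "--fp16", "False",       # CPU-only machine: decode in FP32
    ]


def transcribe_with_cli(audio_path, model="small"):
    """Run the CLI; assumes the whisper executable is installed."""
    if shutil.which("whisper") is None:
        raise RuntimeError("whisper CLI not found; install openai-whisper first")
    subprocess.run(build_whisper_command(audio_path, model), check=True)


def transcribe_with_api(audio_path, model="small"):
    """Same idea through the Python API; returns the transcript text."""
    import whisper  # imported lazily so the helpers above work without it

    loaded = whisper.load_model(model)
    result = loaded.transcribe(audio_path, language="en", fp16=False)
    return result["text"]
```

Calling transcribe_with_api("recording.wav") would return the transcript as a string, while transcribe_with_cli("recording.wav") prints it to the console and writes output files next to the working directory. On an older CPU, starting with a smaller model is the cheaper experiment.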

Doing things in parallel

While processing the WAV file, Whisper put some pressure on my old laptop.

Short Summary

I managed to run a few transcriptions on English files. For a raw phone recording, it produced good results only with the "small" model, and a 3-minute WAV file took 15 minutes to transcribe. However, that is possibly due to my relatively old CPU.
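
To put that number in perspective, 15 minutes of processing for 3 minutes of audio is about 5 times slower than real time. Below is a small sketch for measuring this on other files using only the Python standard library. The function names are my own and not part of Whisper.

```python
import time
import wave


def wav_duration_seconds(path):
    """Length of a WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(processing_seconds, audio_seconds):
    """How many seconds of compute each second of audio costs."""
    return processing_seconds / audio_seconds


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# The run described above: 15 minutes of work for a 3-minute file.
factor = real_time_factor(15 * 60, 3 * 60)  # 5.0
```

Wrapping a transcription call in timed(...) and dividing the elapsed time by wav_duration_seconds(...) gives the same factor for any other file, which makes it easier to compare models or machines.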