How to enhance 22.05 kHz speech audio material for listening purposes to 44.1 kHz?

I have found a really interesting radio novel on the web, and I would like to bring it to the attention of one of my acquaintances. Unfortunately, the audio material is of poor quality: only 22.05 kHz and one channel (mono). However, it doesn't contain music, only speech. Generally speaking, it sounds like an old radio or an old telephone. I would like to enhance it a bit, if possible, before sending it to my friend. What software should I use, and what operations should I carry out on the audio file to make it sound a bit better?

Konstantin

Posted 2018-07-30T16:52:43.607

Reputation: 515

Can you share a sample of the audio? – Attie – 2018-07-30T18:20:33.323

Yes, of course: https://drive.google.com/open?id=1Sz8YF-fbDI5MoCnXuVNYyPq6-7O_rAD8

– Konstantin – 2018-07-30T18:32:26.653

Maybe you can run it through a super sophisticated speech reconstruction model, as described here. I’m not at all familiar with the requirements though.

– Daniel B – 2018-07-31T08:29:31.883

Answers

If the voice was recorded at a sample rate of 22 kHz, you can't enhance it just by converting it to 44 kHz. You can compare it to a bitmap image: you won't get more detail by making "the pixels bigger". The same goes for mono and stereo: a mono recording cannot be turned into a true stereo recording. It only works the other way around, e.g. mixing stereo down to mono.

However, if there are other "problems", e.g. certain parts of the recording being too quiet, you might be able to correct them, smooth out abrupt changes, etc. But this depends on the type of problem; there is no general solution. You should get familiar with the topic so that you can identify the actual "technical problem", and then look for a solution to it. If you have trouble applying that solution to your specific acoustic problem, that would be a good point to ask again about that specific topic.
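To illustrate the sample-rate point above, here is a minimal Python sketch. It assumes the simplest possible upsampler (linear interpolation, with a made-up helper name `upsample_2x_linear`); real resamplers use better band-limited interpolation, but the principle is the same: every "new" sample is calculated from the old ones, so nothing is added that wasn't already in the recording.

```python
def upsample_2x_linear(samples):
    """Double the sample rate by inserting the midpoint between neighbours.

    Hypothetical helper for illustration only; real resamplers use
    band-limited (sinc-style) interpolation instead of straight lines.
    """
    out = []
    for i, x in enumerate(samples):
        out.append(x)  # the original sample is kept as-is
        nxt = samples[i + 1] if i + 1 < len(samples) else x
        out.append((x + nxt) / 2)  # the "new" sample is derived from old ones
    return out


original = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5]  # pretend this is 22.05 kHz audio
upsampled = upsample_2x_linear(original)     # nominally 44.1 kHz now

# Throwing away every second sample gives back exactly what we started with.
print(upsampled[::2] == original)
```

The file gets twice as long in samples, but it carries the same information as before.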

Albin

Reputation: 3 983

I see, but when I enlarge digital images, they are also resampled in a certain sense; we can say they are rescaled. And there are bad, good, and even better rescaling algorithms for images (nearest neighbour, bilinear, bicubic, Lanczos, etc.) to interpolate the missing pixels. I thought there must be a similar approach for audio files too. – Konstantin – 2018-07-30T18:22:07.127

@Konstantin yes, there are several "filters" or other manipulations you can use on audio, analogous to the way you enhance images. But unfortunately there is no general method to make images "better". You can try a few algorithms without really knowing what you are doing, and see if you like the image better. If that doesn't work, you need more know-how so you can analyse your specific problem. The same goes for audio. – Albin – 2018-07-30T18:25:23.970

22.05 kHz isn't "poor quality" as far as spoken word goes... most of the Audible library has a sample rate of 22.05 kHz - even for the "high quality" files.

If the recording "sounds bad", then it's probably due to something else:

bit-depth (8-bit vs 16-bit)
compression (low bit-rate MP3 vs AAC or OGG)
microphone (cheap vs not quite so cheap)
positioning of microphone vs reader
original medium (analog vs digital / cassette tape vs MiniDisc or PC)
a previous up-sample from a far lower sample rate (which is what you're trying to do now).

Either way, the information is now lost, and it will be hard to get back. The best you can probably do without spending a lot of time on it is to tweak an EQ to make it sound more acceptable.

The sample you provided doesn't sound too bad to me at all (though I don't speak the language, so may be missing some nuances...).

I'd look to tweak the EQ slightly and "normalize" the audio to bring the level up - you may find that what you think is a poor recording is actually the noise in your system becoming more apparent from turning the volume up high.
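For reference, "normalizing" in the peak sense is nothing more than one gain applied to the whole file. A minimal Python sketch, assuming samples are floats in the range -1.0 to 1.0 and a target peak of -1 dBFS (both the function name and the target value are my own choices for illustration, not Audacity's internals):

```python
def normalize_peak(samples, target_dbfs=-1.0):
    """Scale the whole signal so its loudest sample hits target_dbfs.

    Assumes float samples in [-1.0, 1.0]; target_dbfs is an illustrative default.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = 10 ** (target_dbfs / 20) / peak
    return [s * gain for s in samples]


quiet = [0.0, 0.05, -0.1, 0.02]
louder = normalize_peak(quiet)
print(max(abs(s) for s in louder))  # roughly 0.891, i.e. -1 dBFS
```

Because it is a single gain, any noise already in the file is raised by exactly the same amount as the voice.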

The waveform changes as shown below (using Audacity), before (top) and after (bottom):

There is a bit of reverberation in the recording (which will likely have come from the room, and possibly from the reader being a little bit too far from the microphone). However, there is minimal background noise (hence the narrow sections of the waveform), no distortion, and only a single pop in the whole file (not shown above).

Attie

Reputation: 14 841

As already mentioned, recording at 22.05kHz for spoken word isn't in itself 'bad'; but neither can it really be 'fixed' because there's no information in the recording to emphasise. You can only work with what's there already.

Some explanation... The human voice is really at its most distinct at around 2 - 6 kHz. That's where all the consonants are & what really helps the listener to decide what's actually being said; it's also why putting your fingers in your ears reduces intelligibility: it mainly blocks these higher frequencies.
There is information in speech above 6 kHz, but it trails away a lot above that & by 11 kHz there's really very little useful information left.

So - for spoken word, 22.05 kHz is commonly used as the sample frequency.
There's a fairly involved piece of theory called the Nyquist-Shannon sampling theorem, often just referred to as the Nyquist limit, which basically boils down to
"The highest audio frequency that can be recorded in an audio file is half the sampling frequency."
That equates to about 11 kHz (11.025 kHz, to be exact) on a 22.05 kHz recording.
That's plenty for a human voice.

It also means there is no longer any information above that to work with, even if you do change the sampling frequency up to 44.1kHz [CD audio quality].
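If you want to see the limit in numbers, here is a small Python sketch (the helper names are made up for illustration). It computes the limit for a 22.05 kHz file and then shows what happens to a tone above it: a 15 kHz sine sampled at 22.05 kHz produces exactly the same sample values, apart from a flipped sign, as a 7.05 kHz sine. The file simply cannot tell the two apart, which is why nothing above 11.025 kHz can be stored.

```python
import math


def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def sample_tone(freq_hz, sample_rate_hz, count):
    """Return `count` samples of a unit sine at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]


fs = 22050
print(nyquist_hz(fs))  # 11025.0

above_limit = sample_tone(15000, fs, 64)    # tone above the limit
folded = sample_tone(fs - 15000, fs, 64)    # 7050 Hz, below the limit

# Sample for sample, the 15 kHz tone is just the 7.05 kHz tone inverted.
print(all(abs(a + b) < 1e-9 for a, b in zip(above_limit, folded)))
```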

On to your audio book.
The problem, as I hear it, is that the reader was a bit close to the mic. This emphasises lower frequencies, due to something called the proximity effect. No need to go into that in full here, but overall it's made the recording a bit bassy.
It's also been somewhat compressed - it's had the dynamic range reduced so the quiet bits are louder & the loud bits are quieter. This ought to help intelligibility, but it wasn't done quite as well as it could have been, & tends to emphasise the bass even more. The only reason I can think of for doing this is that it makes the reader sound "more manly, more authoritative"... but it doesn't actually help intelligibility in the slightest :/

What we need to do then is reduce the bass, emphasise the highs & try to de-emphasise some of the heavy compression.
Most of this could be done in Audacity, to greater or lesser degree, but I'm more comfortable in Cubase, so let me show you in there...

Most people would tell you to Normalise the file first.
Don't do this first - you will kill your potential headroom.
If you need to do it at all, do it last.

Also note you cannot "undo" the compression that has already been applied - that would be the equivalent of getting the eggs & flour back from a baked cake - instead you can only try to mitigate it in the most heavily-affected areas.

If all you have to work with is Equalisation, then you could try reducing levels below 250Hz, gently rolling off below that. You can then try to gain some consonants back by adding in an opposite slope above maybe 2 or 3 kHz.
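As a rough idea of what that EQ move does, here is a Python sketch using the simplest filter there is, a first-order (6 dB/octave) high-pass. It is only an approximation of what a proper EQ plugin would do, and the function names, the 2.5 kHz corner and the boost amount are my assumptions for the example, not settings taken from the recording.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz):
    """First-order RC-style high-pass: gently rolls off below cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    a = rc / (rc + dt)
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in samples:
        y = a * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


def speech_eq(samples, sample_rate, bass_cut_hz=250.0,
              treble_hz=2500.0, treble_boost=1.0):
    """Roll off the bass, then add the isolated highs back on top."""
    trimmed = high_pass(samples, sample_rate, bass_cut_hz)
    highs = high_pass(trimmed, sample_rate, treble_hz)
    return [t + treble_boost * h for t, h in zip(trimmed, highs)]


def band_gain(freq_hz, sample_rate=22050):
    """Measure how much speech_eq changes the level of a test tone."""
    n = sample_rate // 2
    tone = [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
    out = speech_eq(tone, sample_rate)
    skip = n // 2  # ignore the filter's start-up transient
    rms = lambda s: math.sqrt(sum(v * v for v in s) / len(s))
    return rms(out[skip:]) / rms(tone[skip:])


for f in (50, 1000, 5000):
    print(f, "Hz gain:", round(band_gain(f), 2))
```

The 50 Hz tone comes out much quieter, 1 kHz is left roughly alone, and 5 kHz is lifted, which is the general shape described above.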

I spotted an irritating click, or lip-smack at about 3:40, which I simply selected & turned down to zero - you could get all clever with a de-clicker, but it wasn't worth the effort.
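In code terms, "selected & turned down to zero" is about as simple as an edit gets. A minimal Python sketch, assuming the audio is a plain list of samples (the function name and the times are placeholders, not the exact selection I made):

```python
def mute_region(samples, sample_rate, start_s, end_s):
    """Return a copy of samples with start_s..end_s (seconds) set to silence."""
    start = max(0, int(start_s * sample_rate))
    end = min(len(samples), int(end_s * sample_rate))
    return samples[:start] + [0.0] * max(0, end - start) + samples[end:]


# Toy example: 5 seconds of "audio" at 10 samples per second.
audio = [1.0] * 50
fixed = mute_region(audio, 10, 2.0, 2.5)
print(fixed[18:27])  # ones, then five zeros, then ones again
```

A hard cut to zero like this can itself tick if it lands mid-word; on a short click between words it is usually inaudible.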

My weapon of choice for any rescue operation like this is a multi-band compressor.
I found a free multi band comp for Audacity, though I haven't tried it myself, so YMMV - https://www.gvst.co.uk/gmulti.htm

I use the considerably more expensive Waves LinMB but the general idea is the same. This is how I have it set up...

From the image, you can see I'm hitting the low end really hard, to try to remove that excessive boom. The middle I'm pretty much leaving untouched. For the highs, I've increased the output level whilst also applying a slight compression, just so some of the heavier S's etc. don't get too punchy. Also, at this point I haven't increased the overall volume at all - we still have plenty of headroom to play with, & when you switch your effect in & out for comparison it's best that you're not just fooling yourself with the volume change.
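For anyone curious what a multi-band compressor does under the hood, here is a heavily simplified Python sketch of the idea described above: split into low, mid & high, squash the lows hard, leave the mids alone, lift & lightly compress the highs, then add the bands back together. To be clear, this is not how LinMB works internally. The crossover filters are crude one-pole filters, and every name, crossover frequency, threshold & ratio here is an assumption for illustration rather than my actual settings.

```python
import math


def one_pole_lowpass(samples, sample_rate, cutoff_hz):
    """Very gentle (6 dB/octave) low-pass filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out


def split_three_bands(samples, sample_rate, low_hz=250.0, high_hz=3000.0):
    """Split so that low + mid + high adds back up to the input."""
    low = one_pole_lowpass(samples, sample_rate, low_hz)
    low_mid = one_pole_lowpass(samples, sample_rate, high_hz)
    mid = [lm - l for lm, l in zip(low_mid, low)]
    high = [x - lm for x, lm in zip(samples, low_mid)]
    return low, mid, high


def compress(band, sample_rate, threshold_db, ratio,
             makeup_db=0.0, release_ms=50.0):
    """Downward compressor: instant attack, exponential release."""
    release = math.exp(-1.0 / (sample_rate * release_ms / 1000.0))
    makeup = 10 ** (makeup_db / 20)
    env, out = 0.0, []
    for x in band:
        env = max(abs(x), env * release)           # envelope follower
        level_db = 20 * math.log10(max(env, 1e-9))
        over_db = max(0.0, level_db - threshold_db)
        gain_db = -over_db * (1.0 - 1.0 / ratio)   # reduce only above threshold
        out.append(x * 10 ** (gain_db / 20) * makeup)
    return out


def multiband_compress(samples, sample_rate):
    low, mid, high = split_three_bands(samples, sample_rate)
    low = compress(low, sample_rate, threshold_db=-30.0, ratio=4.0)   # tame boom
    high = compress(high, sample_rate, threshold_db=-20.0, ratio=2.0,
                    makeup_db=4.0)                                    # lift highs
    return [l + m + h for l, m, h in zip(low, mid, high)]             # mid untouched
```

Feed it a loud 60 Hz tone & the output level drops a lot; feed it mid-range content & it passes more or less unchanged.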

Quick examples -
before...

https://soundcloud.com/graham-lee-15/antal-vegh-orig?in=graham-lee-15/sets/intelligibility-fix

after...

https://soundcloud.com/graham-lee-15/antal-vegh-linmb?in=graham-lee-15/sets/intelligibility-fix

At this point, once you're happy with how it sounds, now you can normalise.

Note: my examples are at a higher sample rate purely because I cannot export directly at 22.05 kHz. This does not materially affect the result in any way.

Tetsujin

Reputation: 22 456

One trick from working with images is to increase the bit depth when working with gradients and then dither back down to 8-bit. This reduces or even eliminates visual banding. I am wondering if such a technique is useful in this context (increase bit depth, apply filters etc., then dither back down). – Yorik – 2018-07-31T17:28:40.103

Potentially. tbh, I lifted this to 16-bit 44.1 to work on, but I'm not sure how something like Audacity would deal with it. In & of itself, it *shouldn't* make any difference unless you're synthesising higher harmonics, which I thought would be a bridge too far for what would appear to be an entry-level query. Also, for solo spoken word, you really can get away with a 6kHz cutoff & still preserve full intelligibility, even if not 'nice hi-fi'. Think of what phones do to an audio signal :/ – Tetsujin – 2018-07-31T17:31:57.403
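For readers who haven't met dithering in audio, here is a minimal Python sketch of the "dither back down" step discussed in the two comments above. It assumes float samples in the range -1.0 to 1.0 and uses triangular (TPDF) dither of plus or minus one step, which is a common choice but still an assumption on my part; the function name is made up.

```python
import random


def quantize(samples, bits, dither=True, seed=0):
    """Convert float samples in [-1.0, 1.0) to signed integers of `bits` depth.

    With dither=True, triangular (TPDF) noise of +/-1 step is added first.
    """
    rng = random.Random(seed)
    scale = 2 ** (bits - 1)
    lo, hi = -scale, scale - 1
    out = []
    for x in samples:
        v = x * scale
        if dither:
            v += rng.random() - rng.random()  # triangular noise in (-1, 1)
        out.append(max(lo, min(hi, round(v))))
    return out


# A constant level of 0.3 of one 8-bit step: too small to survive rounding.
quiet = [0.3 / 128] * 20000
plain = quantize(quiet, 8, dither=False)
dithered = quantize(quiet, 8, dither=True)

print(set(plain))                      # only zeros: the detail is gone
print(sum(dithered) / len(dithered))   # averages out close to 0.3
```

Without dither the sub-step detail is rounded away completely; with dither it survives on average, at the cost of a little added noise.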


Use Audacity, which is open-source software. Here is the link: https://www.audacityteam.org/

Check the following link to see if you can do something to improve your specific audio: https://www.wikihow.com/Get-Higher-Audio-Quality-when-Using-Audacity

Saurav Kumar Sahu

Reputation: 19

Please quote the essential parts of the answer from the reference link(s), as the answer can become invalid if the linked page(s) change. – DavidPostill – 2018-07-30T20:36:56.967