85

In every TV program featuring a person who wants to remain anonymous, their voice is changed in a way that, to me, sounds like a simple increase or decrease in pitch (frequencies). What I'm wondering is:

  • is the usual anonymizing method actually a simple change in pitch, or do most TV networks / media outlets use a more complex transformation?
  • is a simple change in pitch enough to make it impossible, or at least very hard, to recover the original voice? I would think that if a voice has been shifted to a higher pitch, I could try to recover the original by lowering the pitch again, but I'm not sure how hard or reliable that would be.

Note that I'm only talking about the voice quality, not about other features that could of course immediately deanonymize a person (like accent, dialect, personal vocabulary and slang, etc.).

reed
    Then again, the callers are usually not identified by their voices but by the fact that the distinctive whistle blow of a certain train and the sound of an extremely rare species of woodpecker are heard in the background ... – Hagen von Eitzen Mar 13 '20 at 07:00

5 Answers

95

A simple pitch change is insufficient to mask a voice, as an adversary could simply pitch-shift the audio back to recover a close approximation of the original.
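For illustration, here's a minimal sketch of that attack in Python using librosa; the filenames and the 4-semitone shift are arbitrary assumptions:

```python
# A minimal sketch: naively "anonymize" by pitch-shifting, then undo it.
# Assumes librosa and soundfile are installed; filenames are placeholders.
import librosa
import soundfile as sf

y, sr = librosa.load("original.wav", sr=None)  # load at native sample rate

# "Anonymize": shift the voice up by 4 semitones.
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)
sf.write("anonymized.wav", shifted, sr)

# Attack: simply shift back down, sweeping n_steps until the result
# sounds natural. Here we assume the adversary guessed -4 correctly.
recovered = librosa.effects.pitch_shift(shifted, sr=sr, n_steps=-4)
sf.write("recovered.wav", recovered, sr)
```

The round trip isn't bit-perfect, since pitch shifting is itself lossy, but the speaker's timbre survives well enough to be recognizable, which is the problem.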

Most voice modulators use a vocoder, not a simple pitch change. The term "vocoder" is unfortunately rather heavily overloaded these days, so to clarify: I mean the type most commonly used in music, rather than a phase vocoder, pitch remapper, or voice codec.

The way this works is as follows:

  1. The voice input audio (called the modulation signal) is split into time slices, and its spectral content is analysed. In DSP this is usually implemented using an FFT, which effectively translates a signal from the time domain - a sequence of amplitudes over time - into the frequency domain - a collection of signals of increasing frequency that, if combined, represent the signal. In practice, implementations output a magnitude and phase value for each of a fixed number of "buckets", where each bucket represents a frequency. If you were to generate a sine wave for each bucket, at the amplitude and phase offset output by the FFT, then add all of those sine waves together, you'd get a very close approximation of the original signal.
  2. A carrier signal is generated. This is whatever synthesised sound you want to have your voice modulator sound like, but a general rule of thumb is that it should be fairly wideband. A common approach is to use synth types with lots of harmonics (e.g. sawtooth or square waves) and add noise and distortion.
  3. The carrier signal is passed through a bank of filters whose center frequencies match those of the FFT buckets. Each filter's parameters are driven by its associated bucket's value. For example, one might apply a notch filter with a high Q factor and modulate the filter's gain with the FFT output.
  4. The resulting modulated signal is the output.
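As a rough illustration, here's a minimal STFT-based sketch of steps 1-4 in Python. The 110 Hz sawtooth carrier, the noise level, and the FFT parameters are arbitrary assumptions, not a production design:

```python
# A minimal channel-vocoder sketch following steps 1-4 above.
# Assumes numpy, scipy, and librosa; all parameter choices are arbitrary.
import numpy as np
import scipy.signal
import librosa

y, sr = librosa.load("voice.wav", sr=None)  # modulation signal: the voice

n_fft, hop = 1024, 256

# Step 1: slice the voice into frames and analyse its spectral content.
mod_mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

# Step 2: generate a wideband carrier - a sawtooth plus a little noise.
t = np.arange(len(y)) / sr
carrier = scipy.signal.sawtooth(2 * np.pi * 110.0 * t)  # 110 Hz sawtooth
carrier += 0.05 * np.random.randn(len(y))               # added noise

# Step 3: impose the voice's per-bucket envelope onto the carrier.
# Scaling each carrier STFT bin by the voice's magnitude in that bin is
# the FFT equivalent of the filter bank + VCA chain described below.
car_stft = librosa.stft(carrier, n_fft=n_fft, hop_length=hop)
out_stft = car_stft * (mod_mag / (mod_mag.max() + 1e-9))

# Step 4: the resulting modulated signal is the output.
out = librosa.istft(out_stft, hop_length=hop, length=len(y))
```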

A rather crude diagram of an analog approach is as follows:

[Diagram: channel vocoder signal chain]

The audio input is split into a number of frequency bands using band-pass filters, each of which passes only a narrow frequency range. The "process" blocks take the results and perform some sort of amplitude detection, which then becomes a control signal for the voltage-controlled amplifiers (VCAs). The path at the top generates the carrier waveform, usually by performing envelope detection on the input and using it to drive a voltage-controlled oscillator (VCO). The carrier is then filtered into individual frequency bands by the band-pass filters on the right, which are driven through the VCAs and combined into the output signal. The whole approach is very similar to the DSP approach described above.
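To make that signal chain concrete, here's a hedged digital sketch of the same architecture: band-pass filters split both signals, envelope followers stand in for the "process" blocks, and multiplication stands in for the VCAs. The band edges and filter orders are arbitrary assumptions:

```python
# A sketch of the analog channel-vocoder chain above, done digitally:
# band-pass filters -> envelope detection -> "VCA" multiplication -> sum.
# Assumes numpy and scipy; band edges and filter orders are arbitrary.
import numpy as np
import scipy.signal

def channel_vocoder(voice, carrier, sr, bands=16, lo=80.0, hi=4000.0):
    edges = np.geomspace(lo, hi, bands + 1)  # log-spaced band edges
    out = np.zeros_like(voice)
    for i in range(bands):
        # Band-pass both signals into the same narrow frequency range.
        sos = scipy.signal.butter(4, [edges[i], edges[i + 1]],
                                  btype="bandpass", fs=sr, output="sos")
        v_band = scipy.signal.sosfilt(sos, voice)
        c_band = scipy.signal.sosfilt(sos, carrier)
        # "Process" block: envelope detection (rectify, then low-pass).
        env_sos = scipy.signal.butter(2, 50.0, btype="low", fs=sr,
                                      output="sos")
        envelope = scipy.signal.sosfilt(env_sos, np.abs(v_band))
        # "VCA": the envelope drives the carrier band's amplitude.
        out += envelope * c_band
    return out
```

The carrier here could be the sawtooth-plus-noise signal from the earlier sketch.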

Additional effects may be applied as well, such as pre- and post-filtering, noise and distortion, LFO, etc., in order to get the desired effect.

The reason this is difficult to invert is that the original audio is never actually passed through to the output. Instead, information is extracted from the original audio, then used to generate a new signal. The process is inherently lossy enough to make it fairly prohibitive to reverse.

Polynomial
    Polynomial is correct. Just to expand, though: 'pitch' altering is merely shifting the entire voice signal up or down in the frequency domain. It's not altering the signal in any other way, so to recover the original you merely shift it back. Granted, you have to guess where the center frequency may have been originally, but human speech doesn't vary much in that respect, and one can simply guess what sounds right to their ears. Thus, a pitch change is definitely not enough. – Jarrod Christman Mar 12 '20 at 14:06
    Are there mathematical theorems establishing that the output is difficult to invert (however that is formalized)? – ComFreek Mar 12 '20 at 16:19
    @comfreek 1) The process is lossy: information is being discarded (*"... only a narrow frequency range..."*). This is like the old telephones that clipped frequencies above and below certain limits. 2) There's random noise injected; good luck removing that. 3) Some harmonics are lost, and those are key to building the voice timbre. https://en.wikipedia.org/wiki/Human_voice – Mindwin Mar 12 '20 at 17:10
    @Mindwin These are all good pragmatic arguments. I was looking more for a rigorously proven theorem -- in the same spirit as cryptography ensures eavesdropper/CPA/CCA security for certain symmetric-key encryption algorithms. Perhaps human voice recognizability is just too complex to be easily modelled. – ComFreek Mar 12 '20 at 19:13
    @ComFreek "information is lost" is enough to rigorously prove that the original voice cannot be *perfectly* reconstructed. So now it's just a matter of how close you want the "closest possible reconstruction" to be to the original voice, and that depends on your threat model. Has the adversary already narrowed down the possibilities to two very different-sounding people? Or is he trying to identify a completely unknown voice from among all humans? Very different levels of imperfection are needed between those cases. – JounceCracklePop Mar 12 '20 at 19:28
    @ComFreek We don't have CCA/CPA proofs for cryptography, with the exception of Vernam's cipher, as far as I know. We have all kinds of equivalence proofs, but nothing like AES or RSA has ever been proved to be safe. Sometimes the things being proved equivalent are powerful, and we tend to believe the precursors are hard, of course, but every proof of CPA/CCA security - and pretty much any other crypto result - starts with "assuming A, we can show that B has some property". – DRF Mar 12 '20 at 19:40
  • I would think that you would want to alter the cadence of the speech as well. Phrasing, pauses, etc. can be distinctive. – shoover Mar 13 '20 at 05:55
    If a change in pitch is trivial to reverse but a voice modulator is not, why is speech anonymized using the latter so often played at a pitch so much lower than a normal person's speech? – Will Mar 13 '20 at 08:49
    @Will The vocoder effect itself has no specific effect on pitch. That pitch-down has become somewhat of a trope from movies, probably because an unnaturally low pitch sounds menacing. It possibly also has some minor benefits in terms of audibility over a phone line, as the upper frequency cutoff in most telephony systems is around 3.4 kHz, although that's more of an educated guess than something I can concretely back up. – Polynomial Mar 13 '20 at 15:12
    @shoover Characteristic phrasings and mispronunciations / misspellings are very often used as circumstantial evidence in cases where an investigator is trying to attribute communications (voice or text) to a specific person. Those types of traits certainly persist through voice modification systems, and it's practically impossible for a person to hide them effectively. Text-to-speech reduces the chance of leaking an identifiable trait, but is still susceptible to revealing information about you via particular turns of phrase or use of cultural idioms. – Polynomial Mar 13 '20 at 15:22
    Of course, for best effect, the process can be slightly randomized over time to make demodulation harder. – Mast Mar 13 '20 at 16:42
  • @comfreek your request is sound and valid. However, it is beyond the scope of this particular Q&A. You should open a new question asking exactly that. I'd upvote it. – Mindwin Mar 13 '20 at 17:30
  • Look up SIGSALY for the application of this to voice encryption back in WWII. Full-on PCM was impractical back then, and analog techniques always left something intelligible behind (for basically the same reason you can still see Tux in that ECB-encrypted bitmap: redundancy) so they used a vocoder, ran it at a low enough sample rate that the output could be MFSK-encoded and survive transmission over telephone, added encryption to that, and re-synthesized the speech on the receiver end. – hobbs Mar 14 '20 at 01:09
6

tl;dr It's not generally reversible, but it might still be reversed in practice.


Analogy: Reversibility of reducing a name to its length.

Consider a reduction method that takes in a person's first name and gives the number of letters in it. For example, "Alice" is transformed into 5.

This is a lossy process, so it can't be reversed in general. That is, we can't say that 5 necessarily maps back to "Alice", as it might also map to, e.g., "David".

That said, the transformed value 5 still carries a lot of information, in that we can exclude any name that doesn't transform into 5. For example, it's obviously not "Christina".

So now say that you're a police detective trying to solve a case. You've narrowed down the suspects to Alice and Bob, and you know that the culprit's anonymized name was 5. Sure, you can't reverse 5 in general, but does that theoretical point really help Alice in this case?
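In code, the detective's reasoning is just a filter over the suspect list (a trivial Python sketch using the names from the analogy):

```python
# The transform can't be inverted, but it still excludes every suspect
# whose name doesn't map to the observed value.
suspects = ["Alice", "Bob"]
anonymized = 5  # the culprit's "anonymized" name

candidates = [name for name in suspects if len(name) == anonymized]
print(candidates)  # ['Alice'] - the lossy transform still singled her out
```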


Point: Lossy voice transforms aren't generally reversible, but they still leak information.

In the good ol' days, before computers and such, a lossy transform of one's voice may have been enough. If a third party wanted to recover the original speaker's voice, they couldn't, and back then that would probably have been the end of it.

Today, we can use computers to:

  1. Establish the ensemble of possibilities, each tagged with its prior probability.

  2. Run the voice-anonymization software symbolically to generate a probabilistic ensemble of voices.

  3. Take the inner product of that ensemble with, say, a set of suspects to generate an informed set of probabilities.
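A toy Python sketch of those three steps, where a made-up one-dimensional "voice feature" and an assumed Gaussian noise model stand in for the symbolic run of the anonymizer:

```python
# Steps 1-3 as a Bayesian update over a set of suspects. The feature
# values and the noise model are made-up stand-ins for illustration.
import numpy as np

# Step 1: the ensemble of possibilities with prior probabilities.
priors = {"Alice": 0.5, "Bob": 0.5}
true_feature = {"Alice": 1.2, "Bob": 3.1}  # e.g. mean pitch, arbitrary units

# Step 2: model the anonymizer as adding Gaussian noise (sigma assumed),
# giving a probabilistic ensemble of anonymized voices per suspect.
sigma = 0.8
def likelihood(observed, speaker):
    d = observed - true_feature[speaker]
    return float(np.exp(-d * d / (2 * sigma ** 2)))

# Step 3: combine the observation with the ensemble (Bayes' rule plays
# the role of the "inner product") to get informed probabilities.
observed = 1.5  # feature measured from the anonymized recording
posterior = {s: p * likelihood(observed, s) for s, p in priors.items()}
total = sum(posterior.values())
posterior = {s: p / total for s, p in posterior.items()}
print(posterior)  # Alice comes out far more probable than Bob
```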

This method generalizes to any transform that isn't completely lossy. However, the usefulness of the resulting information varies with how lossy the anonymization method was; a mildly lossy transform may still be largely reversible in practice despite not being reversible in general, while a heavily lossy transform may yield so little helpful information that it's practically irreversible.

Nat
2

No, it's certainly not secure.

If I were to do it, I would use speech-to-text, then have the transcript spoken in a common synthetic voice like Stephen Hawking's. That completely eliminates any actual voice information.
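A hedged sketch of that pipeline, using the speech_recognition and pyttsx3 libraries purely as examples (the library choices, the Google recognizer, and the filenames are my assumptions):

```python
# STT -> TTS: transcribe the recording, then re-speak the text with a
# synthetic voice so no acoustic trace of the original speaker survives.
import speech_recognition as sr
import pyttsx3

recognizer = sr.Recognizer()
with sr.AudioFile("original.wav") as source:
    audio = recognizer.record(source)
text = recognizer.recognize_google(audio)  # speech to text

engine = pyttsx3.init()                    # text to (synthetic) speech
engine.save_to_file(text, "anonymized.wav")
engine.runAndWait()
```

Note that a cloud recognizer like recognize_google ships the raw audio to a third party, which may itself defeat the anonymity; an offline recognizer would be preferable.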

The only thing left would be to anonymise your dialect and style by formalising/normalising your vocabulary and sentences.

Honestly, that latter stage is extremely difficult; normalising the way a thought is expressed is extremely complex. Without it, though, you would still be divulging personally identifiable information.

Shiv
0

As with all things InfoSec, it depends on your threat model and the resources of your adversaries.

If you're trying to play a practical joke on your older brother, a fake accent is sufficient. If you're trying to fool your wife, it's harder.

If you're trying to carry on a complex conversation against an adversary with sufficient technical resources then, depending on the context, it's going to be almost impossible without significant assistance, unless you're OK with them knowing you're hiding your voice.

The problem isn't pitch; it's all sorts of things that you do unconsciously. You have "catch phrases", things you say. You have a cadence to your speech, word usage, and, more importantly, specific words you consistently misuse. You will have a few words you pronounce differently from most people, or a regional accent, etc. It's almost like a fingerprint.

You can train yourself out of some of this where you catch it, but then THAT becomes your fingerprint.

You could possibly (if you're a good actor) "adopt the role", deliberately changing many of these things just for the role and then dropping them when done. That will fool many types of analysis, but it's a LOT of work and you have to be "on" every time.

Petro
-3

We are now in the age of Machine Learning.

Any obfuscation achieved through a transformation of the information should not be considered secure. Not now, and certainly not against future technology: ML is able to learn the reverse transform.

You can think of this in terms of manifold topology. Suppose a kitten pic is distorted by being projected onto some manifold, say a cylinder. Just as a human brain can perceive the manifold and unwrap the image, so can ML.

To achieve true obfuscation, the information CONTENT must be separated from the information STYLE. This can also be achieved via ML.

You can look at the images from https://towardsdatascience.com/a-neural-algorithm-of-artistic-style-a-modern-form-of-creation-d39a6ac7e715 to get a visual sense of this.

An old-school voice anonymizer might break the incoming audio into MFCC (mel-frequency cepstral coefficient) feature vectors and then reconstruct audio from those vectors.

If it's a little more advanced, it might further break those MFCCs into timed phonemes, and then reconstruct the audio from those.
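For a sense of how lossy the MFCC path is, librosa can both extract MFCCs and approximately invert them; the coefficient count here is an arbitrary assumption:

```python
# MFCC round trip: extract coefficients, then resynthesize audio.
# The result is intelligible but degraded, showing how much speaker
# detail the MFCC step throws away. n_mfcc=13 is an arbitrary choice.
import librosa
import soundfile as sf

y, sr = librosa.load("voice.wav", sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # analysis
approx = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)  # resynthesis
sf.write("mfcc_roundtrip.wav", approx, sr)
```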

The most secure approach would be to use existing STT->TTS (speech-to-text, then text-to-speech) technologies.

But simple pitch shifting is no better than wrapping a kitten pic around a cylinder: you can still make out whether it's your kitten or not.

P i
    1. "Because magic (ML) might defeat it" is not an answer. 2. You have equated the kitten pic to voice anonymization but not proved or explained how they are in fact relatable. Can you explain or expand what you mean by "MFCC feature vectors" and "STT->TTS technologies"? These appear to be the actual valid points to your answer but you don't provide enough to be able to make sense of them. – schroeder Mar 13 '20 at 07:41
    I would be curious about this as well. Machine-learning techniques like neural networks are composite functions trained over a data set. If you use obfuscation techniques that actually remove data, and potentially add some randomness, the result can still be understood by people as a voice. However, if you were to try to train a NN to reconstruct the original, it would have to guess and interpolate to recover the lost and randomly shifted data... this likely can never be proved to be a 100% reversed version of the obfuscation. – Jarrod Christman Mar 13 '20 at 16:28
  • @JarrodChristman the goal is not to perfectly reverse the transformation, but to recover the voice "fingerprint" to enough fidelity that the speaker could be identified. This answer could be improved but makes a very good point that modern machine learning techniques (e.g. autoencoders) can very effectively reverse information-preserving transformations. The goal of an anonymizer, therefore, is to destroy the speaker-identifying information without destroying the ability of the listener to discern the linguistic content. – reo katoa Mar 13 '20 at 20:56
    @reo It depends on what threat you're concerned about. If it's a legal threat, I think you'd have a very good basis for discounting the reconstructed voice as evidence. – Jarrod Christman Mar 14 '20 at 00:58
    Weird that this is downvoted so heavily. It's 100% correct. It's easy to artificially synthesize a hundred hours of pitch-shifted audio and feed that to an ML model; it will learn how to un-shift the audio. MFCCs are indeed a very common audio transform used in STT, exactly because the transform removes speaker variation. That makes STT easier, and for the same reason makes it a reasonable choice in voice anonymization. – MSalters Apr 10 '20 at 15:48