Nothing, except some loss of fidelity from the recording and playback, provided the system is susceptible to a replay attack.
But if you capture and play back at a higher fidelity than the voice recognition system was built for, the latter won't have a clue.
It might be possible to analyze echoes and harmonics: a human phonatory system does not produce sound from a single point in space, while a loudspeaker does. This would require several sensitive microphones placed in different positions, so that time-of-flight could be calculated for different phonemes.
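A minimal sketch of the underlying measurement, assuming just two microphones and a known speed of sound (all signals and values here are illustrative, not from any real product): the delay between the two acquisitions of the same phoneme gives the extra path length to the farther microphone, and a point source such as a loudspeaker would yield consistent delays across all phonemes, while a human vocal tract would not.

```python
# Sketch: estimate the time-difference-of-arrival (TDOA) between two
# microphones by brute-force cross-correlation, then convert the lag
# into an extra path length in metres.

SPEED_OF_SOUND = 343.0  # m/s at room temperature
SAMPLE_RATE = 48_000    # Hz

def tdoa_samples(sig_a, sig_b):
    """Lag (in samples) at which sig_b best overlaps sig_a."""
    n = len(sig_a)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-n + 1, n):
        score = sum(sig_a[i] * sig_b[i + lag]
                    for i in range(n) if 0 <= i + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def extra_path_metres(lag_samples):
    return lag_samples / SAMPLE_RATE * SPEED_OF_SOUND

# A pulse reaching microphone B three samples after microphone A:
a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
b = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
lag = tdoa_samples(a, b)
print(lag)                       # 3
print(extra_path_metres(lag))    # about 2.1 cm of extra path
```

A real system would use an FFT-based correlation over many more samples, but the comparison it relies on is just this lag.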
Challenge/Response
Another possibility applies if the attacker only has access to a fixed recording and we can also do speech recognition.
I think I saw this in some 007 film, with the guy approaching a voice-activated door and fiddling with his watch, from which the 'Nice party. I recommend you the shrimp salad...' captured the evening before in the villain's voice issues forth, unlocking the door.
But what if the door had asked, 'Repeat after me: horse battery staple correct'? The shrimp salad wouldn't have cut it.
So:
- the voice is enrolled
- the user is asked to pronounce a certain sequence, different every time
- the sequence and the voiceprint must match.
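The flow above can be sketched as follows; `recognize_speech` and `match_voiceprint` are made-up placeholder names standing in for real speech-recognition and speaker-verification backends:

```python
import secrets

def make_challenge(n_digits=6):
    """A fresh random digit sequence, different every time."""
    return "".join(secrets.choice("0123456789") for _ in range(n_digits))

def authenticate(audio, enrolled_voiceprint, challenge,
                 recognize_speech, match_voiceprint):
    """Accept only if BOTH the spoken content and the speaker match."""
    said = recognize_speech(audio)
    if said != challenge:
        return False  # wrong sequence: possibly a stale recording
    return match_voiceprint(audio, enrolled_voiceprint)

# Toy demonstration with stub backends: "audio" is just a tuple of
# (speaker_id, spoken_text).
recognize = lambda audio: audio[1]
match = lambda audio, enrolled: audio[0] == enrolled

challenge = "297779"
print(authenticate(("alice", "297779"), "alice", challenge, recognize, match))    # True
print(authenticate(("alice", "577892"), "alice", challenge, recognize, match))    # False: replayed old recording
print(authenticate(("mallory", "297779"), "alice", challenge, recognize, match))  # False: right digits, wrong voice
```

The key design point is the conjunction: either check alone (content only, or voiceprint only) leaves the door open to a straight replay.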
This reduces the chances of a replay attack because even if someone recorded my voice saying '577892', they wouldn't be able to say '297779' in my voice. Or would they? With a large enough sample and voice-synthesis technology along the lines of Loquendo TTS, a computer can be made to say anything in my voice. And when the challenge is only a few words or digits, the attacker doesn't even need that much technology.
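A crude illustration of that last point: once the attacker has one clean recording of each digit, answering any numeric challenge is mere concatenation, no synthesis required. The strings below are placeholders standing in for audio clips; real splicing would need some care at the joins, but the principle is this simple.

```python
# Hypothetical attack sketch: stitch recorded digit clips together to
# answer an arbitrary numeric challenge.
recorded_digits = {d: f"<clip:{d}>" for d in "0123456789"}

def splice_response(challenge):
    """Build a 'spoken' response from previously captured digit clips."""
    return "".join(recorded_digits[d] for d in challenge)

print(splice_response("297779"))
# <clip:2><clip:9><clip:7><clip:7><clip:7><clip:9>
```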
The need to avoid both false negatives and false positives, on top of background noise, requires threading a very difficult needle. You could reject exactly identical phonemes as being recordings, but background noise (either real or faked) makes this very hard: two playbacks of the same recording will be acquired as different signals, while the same person's voice will naturally produce almost identical phonemes.
Over a telephone, any attempt to distinguish between "real" and "artificial" phonation will fail, because the phonation will always be artificially flattened by the sender's microphone.
I am in no way an expert in artificial voice fakery, but I'm quite confident that a very reasonable budget for acquiring voice samples, recording equipment and a voice-synthesis framework will let you bypass any such voice authentication over the phone. Against an unprepared opponent armed with just a tape recorder, voiceprint plus challenge/response will probably always win.