What you're asking is a simple question with a very complex answer.
Each tone you hear when an old-school modem dials up represents a piece of the data being transmitted. Pitch is measured in hertz (Hz). Your average adult can hear from around 20Hz to 20,000Hz (though we can't always tell the difference between 20Hz and 21Hz).
So say, for example, a 20Hz pitch means 0 and a 21Hz pitch means 1. To transmit
00000110
you would transmit 20Hz 20Hz 20Hz 20Hz 20Hz 21Hz 21Hz 20Hz.
At something ridiculously slow like 1 baud (one tone per second), that would take 8 seconds to transmit that data.
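Here's a rough sketch of that one-bit-per-tone scheme in Python. The 20Hz/21Hz values are just the made-up numbers from the example, and the function names are hypothetical; no real modem works exactly like this.

```python
# Toy scheme from the example: one tone per bit (not a real modem standard).
TONE_FOR_BIT = {"0": 20, "1": 21}  # Hz, illustrative values only


def bits_to_tones(bits):
    """Return the tone frequencies (Hz) to play, one per bit."""
    return [TONE_FOR_BIT[b] for b in bits]


def seconds_to_send(bits, baud=1):
    """Time needed when every tone carries exactly one bit."""
    return len(bits) / baud


print(bits_to_tones("00000110"))
print(seconds_to_send("00000110"))
```

Eight bits means eight tones, so at one tone per second the byte takes eight seconds.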
But why bother, when you can say that:
30Hz = 00000000
31Hz = 00000001
32Hz = 00000010
33Hz = 00000011
34Hz = 00000100
35Hz = 00000101
36Hz = 00000110
37Hz = 00000111
and so on. So the same data (00000110) can be represented as 36Hz, and you've transmitted 8 bits in 1 second rather than 8 seconds. Congratulations, you've sent 8 bits of information in a single tone.
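The table above boils down to "tone = 30Hz + the value of the byte". A minimal sketch, assuming that made-up mapping (`BASE_HZ` and both function names are mine, not part of any standard):

```python
BASE_HZ = 30  # tone assigned to 00000000 in the toy table above


def byte_to_tone(bits):
    """Map an 8-bit string such as '00000110' to a single tone in Hz."""
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 8 characters of 0/1")
    return BASE_HZ + int(bits, 2)


def tone_to_byte(hz):
    """Reverse the mapping: recover the 8-bit string from a tone."""
    return format(hz - BASE_HZ, "08b")


print(byte_to_tone("00000110"))
print(tone_to_byte(36))
```

The catch is that the receiver now has to tell 256 different tones apart (30Hz up to 285Hz here) instead of just two.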
Now, a modem over a crappy telephone line can only distinguish so many different frequencies, and there needs to be error checking etc. in place, but to transmit:
000000110000011100000100
you're going to need a lot of different audio pitches played after each other super fast. A 56K modem moves up to 56,000 bits per second, and because each tone carries several bits, it needs far fewer than 56,000 tones per second to do that. When you play thousands of different tones within the space of a second, that dialup sound is what you get.
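To make that concrete, here's a sketch that chops the 24-bit stream into bytes and turns each one into a tone, reusing the toy "30Hz + byte value" mapping from earlier. The names are hypothetical and the numbers are illustrative only; the one thing that does carry over to real modems is that bits per second equals tones per second times bits per tone.

```python
BASE_HZ = 30       # toy mapping: tone = 30Hz + byte value
BITS_PER_TONE = 8  # each tone carries one byte in this toy scheme


def stream_to_tones(bits):
    """Split a bit string into 8-bit chunks and map each chunk to a tone."""
    if len(bits) % BITS_PER_TONE:
        raise ValueError("bit stream length must be a multiple of 8")
    return [
        BASE_HZ + int(bits[i:i + BITS_PER_TONE], 2)
        for i in range(0, len(bits), BITS_PER_TONE)
    ]


def bits_per_second(tones_per_second, bits_per_tone=BITS_PER_TONE):
    """Throughput = how many tones per second x how many bits each carries."""
    return tones_per_second * bits_per_tone


print(stream_to_tones("000000110000011100000100"))
print(bits_per_second(1))
```

So the 24 bits go out as just three tones (33Hz, 37Hz, 34Hz), and playing more tones per second or packing more bits into each tone is how the speed goes up.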