To complete @Terry's answer: a GPU has a lot of cores (hundreds). Each core is basically able to compute one 32-bit arithmetic operation per clock cycle -- as a pipeline. Indeed, GPUs work best with extreme parallelism: when there are many identical work units to perform, many more than there are cores ("identical" meaning "same instructions", but not "same data").
Some details, for a somewhat old NVidia card (a 9800 GTX+, from early 2009): there are 128 cores, split into 16 "multicore units" (what NVidia calls streaming multiprocessors). Each multicore can initiate 8 operations per cycle (hence the figure of 128 cores: that's 16 times 8). The multicore handles work units ("threads") in groups of 32, so that when a multicore has an instruction to run, it actually issues that instruction to its 8 cores over 4 clock cycles. This is only operation initiation: each individual operation takes up to 22 clock cycles to complete. You can imagine the instruction and its operands walking into the circuit as an advancing front line, like a wave in a pool: a given wave will take some time to reach the other end of the pool, but you can send several waves sequentially.
So you can maintain the rhythm of "128 32-bit operations per cycle" only as long as you have at least 22 times as many "threads" as cores to run (i.e. a minimum of 22·128 = 2816), such that threads can be grouped into packs of 32 "identical" threads which execute the same instructions at the same time, like hip-hop dancers. In practice, there are some internal thresholds and constraints which require more threads to achieve the optimal bandwidth, up to about 4096.
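For the curious, here is a minimal back-of-the-envelope sketch, in plain C, of the occupancy arithmetic above. The constants are simply the figures quoted in this answer (22-cycle latency, 128 cores, groups of 32), not values pulled from any vendor documentation, and the program itself is purely illustrative.

```c
#include <stdio.h>

int main(void) {
    /* Figures quoted in this answer for the 9800 GTX+; these are the
       answer's numbers, not datasheet values. */
    const int cores            = 128;  /* 16 multiprocessors x 8 cores    */
    const int pipeline_latency = 22;   /* cycles before a result is ready */
    const int group_size       = 32;   /* threads issued as one group     */

    /* To keep every pipeline stage busy, you need roughly
       latency x cores threads in flight at once. */
    int min_threads = pipeline_latency * cores;           /* 22 * 128 = 2816 */

    /* Threads are scheduled in groups of 32, so round up. */
    int min_groups = (min_threads + group_size - 1) / group_size;

    printf("minimum threads : %d\n", min_threads);        /* 2816 */
    printf("minimum groups  : %d\n", min_groups);         /* 88   */
    printf("practical target: ~4096 threads\n");          /* per the text above */
    return 0;
}
```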
I could achieve close to 99% of the optimal bandwidth with a SHA-1 implementation. SHA-1 uses a bit more than 1100 32-bit operations (it would be around 900 on a CPU, but the 9800 GTX+ has no rotation opcode, so each rotation must be split into two shifts and a logical OR), and the GPU ran at 1450 MHz, for a grand total of about 160 million SHA-1 computations per second. This can be achieved only as long as you have millions of SHA-1 instances to compute in parallel, as is the case for password cracking (at any time, you need 4096 parallel SHA-1 instances to feed the GPU cores, but you also have to deal with I/O costs for the input of potential passwords, and these costs will dominate if you do not have a lot of SHA-1 instances to process).
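To illustrate the rotation issue: without a rotate opcode, a 32-bit rotation must be emulated with two shifts and a logical OR, which is roughly where the difference between ~900 and ~1100 operations comes from. The little C sketch below (illustrative only, not my actual cracking code) shows the emulation and redoes the throughput estimate from the figures above.

```c
#include <stdio.h>
#include <stdint.h>

/* On hardware without a rotate opcode, a 32-bit left rotation costs
   three operations: two shifts and an OR. (Valid for 0 < n < 32.) */
static uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}

int main(void) {
    /* SHA-1's round function rotates by 5 and 30 (and by 1 in the
       message schedule), so this extra cost adds up. */
    printf("rotl32(0x80000000, 1) = 0x%08x\n", rotl32(0x80000000u, 1));

    /* Back-of-the-envelope throughput from the figures quoted above
       (the answer's numbers, not datasheet values). */
    double ops_per_cycle = 128.0;     /* one 32-bit op per core per cycle */
    double clock_hz      = 1450e6;    /* clock quoted in the text         */
    double ops_per_sha1  = 1100.0;    /* with emulated rotations          */

    /* Prints ~169 million at full occupancy; the answer reports about
       160 million achieved in practice (~99% of optimal, minus overhead). */
    double sha1_per_sec = ops_per_cycle * clock_hz / ops_per_sha1;
    printf("~%.0f million SHA-1/s at full occupancy\n", sha1_per_sec / 1e6);
    return 0;
}
```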
The host PC, on its CPU (a quad-core 2.4 GHz Intel Core2), could achieve about 48 million SHA-1 per second, and that was with thoroughly optimized SSE2 code. A single SHA-1 takes about 500 clock cycles on such a CPU (the CPU can compute several instructions in a single cycle, provided they don't compete for resources and don't depend on each other), but, for password cracking, it is worthwhile to use SSE2, whose 128-bit registers let a single instruction operate on four 32-bit values at once. Under SSE2 constraints, it takes about 800 clock cycles to run four parallel SHA-1 instances, i.e. 200 clock cycles per SHA-1. There are four cores in that CPU and the whole thing runs at 2400 MHz, hence 48 million per second.
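To give a rough idea of what the 4-way SSE2 trick looks like, here is a minimal sketch (not my actual code; rotl4 is just an illustrative helper name): SSE2 has no vector rotate either, so each rotation is again two shifts and an OR, but a single instruction now advances four independent SHA-1 states at once.

```c
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Rotate four 32-bit lanes left by n bits: two shifts and an OR,
   exactly as in scalar code, but applied to four SHA-1 instances
   in parallel. Illustrative sketch only. */
static __m128i rotl4(__m128i x, int n) {
    return _mm_or_si128(_mm_slli_epi32(x, n), _mm_srli_epi32(x, 32 - n));
}

int main(void) {
    /* Four independent 32-bit values, one per password candidate. */
    __m128i a = _mm_set_epi32((int)0x80000000, 0x00000001,
                              0x12345678, (int)0xDEADBEEF);
    __m128i r = rotl4(a, 5);   /* SHA-1 rotates by 5 and 30 */

    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, r);
    for (int i = 0; i < 4; i++)
        printf("lane %d: 0x%08x\n", i, out[i]);
    return 0;
}
```

The same lane-wise pattern applies to SHA-1's additions and boolean functions, which is how four instances fit in about 800 cycles instead of four times 500.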
More recent hardware will be faster, but GPUs more so. A GTX 680 sports a whopping 1536 cores, and there are two such GPUs in a GTX 690. We are talking billions of SHA-1 instances per second here.
(For comparison, I also did an implementation of SHA-1 on the Cell processor, i.e. the CPU in a PS3 console, with its 8 "SPU" coprocessors. One SPU was not available. With the 7 others, I reached about 100 million SHA-1 per second, i.e. better than a contemporary big PC CPU, but not as good as a good GPU of the same era.)
Summary: GPUs achieve great performance through heavy parallelism, with hundreds (if not thousands) of cores. This is made possible by pipelining (each individual operation takes many cycles to complete, but successive operations can be launched like trucks on a highway) and by sharing instruction decoding (since many cores will run the same instructions at the same time).