I know this was already answered, but a lot of important details were left out.
How onion routing works
Onion routing is an anonymity technique where a path is chosen randomly through a cluster of servers, such that each connection goes a different route. The specific relays for the guard, middle, and exit are chosen randomly by the Tor client. The path from guard to exit is called the circuit, and the Tor client remembers this. The guard is chosen once and remains the same for a long time (as explained below), while the middle and exit change at periodic intervals (either once every ten minutes, or when a new connection is established). The unpredictable path and large number of relays to choose from improves anonymity significantly.
(source: torproject.org)
When you send data over Tor, the data is encrypted with three keys. Each layer specifies the subsequent relay to be used (as chosen by your client at random):
- An application, like Tor Browser, requests a webpage through Tor, and tells that to the client. This request is done on your local network using the SOCKS5 protocol.
- The Tor client encrypts data with three keys, and shares each key with a different, random relay. Encrypted in each layer is also the address of the next relay. This is then sent to the guard.
- The Guard receives data and strips the third layer, using its key. It forwards the data to the relay specified in the third layer, the middle relay.
- The middle relay receives data and strips the second layer, using its key. It forwards the data to the relay specified in the second layer, the exit.
- The exit receives data and strips the final (first) layer, using its key. It checks the destination site and forwards the now fully decrypted data to it.
- The destination site receives the data and sends a response to the originating IP, the exit.
Now the traffic has been successfully sent to its destination, but it has to get back. Tor relays keep which relay is communicating with it in memory, so when it gets a response from that relay, it knows where to send it. This way, the middle relay knows the guard asked it to send data to the exit, and it remembers this so when that same exit gives it data back, it can forward it to the guard:
- The exit receives the response, adds the destination of the previous relay (the middle), encrypts it with its key, and sends it off to the middle relay.
- The middle relay receives this, adds the destination of the previous relay (the guard), and adds another layer of encryption using its key before sending it to the guard.
- The guard receives this and adds a third layer of encryption with its key before giving the data to you, the Tor client.
- The Tor client receives this and strips all layers of encryption. It then gives the response to whatever application requested it (such as Tor Browser).
This is the original concept behind onion routing. All this happens in a second or two.
Who can see what?
I actually wrote this answer because no one linked to the obligatory EFF diagram on Tor. This depicts each point of interest and what a given adversary can observe:
From the perspective of the relays, three things are true:
- The guard knows who you are (your IP), but not what you are doing (your destination).
- The exit knows what you are doing, but not who you are.
- The middle node knows nothing about you.
The anonymity stems from the fact that no one entity can know who you are and what you are doing.
Traffic analysis attacks against Tor
In order to be deanonymized using Tor, assuming no attacks against you directly (software exploits, backdoored hardware, OPSEC failures), an entity that knows who you are, and an entity that knows what you are doing must collude. In the diagram, that adversary is labeled as the NSA. The black dotted line shows data sharing, which means that the precise timing information can be used to correlate you. This is called a traffic analysis attack, and is a risk when your adversary monitors both ends of the connection. Tor has only a limited ability to protect against that, but thankfully it is often enough, since there is so much traffic to blend in with. Consider the following timeline of events:
- ISP1 sees
203.0.113.42
send 512 encrypted bytes (253 unencrypted) of data at t+0.
- ISP2 sees
example.com
receive a 253 byte request for /foo.html
at t+4.
- ISP2 sees
example.com
send a 90146 byte reply at t+5.
- ISP1 sees
203.0.113.42
receive a 90424 encrypted byte reply (90146 unencrypted) at t+9.
ISP1 is any ISP between you and the guard, and ISP2 is any ISP between the exit and the destination. If all this can be monitored, and ISP1 and ISP2 collude, then with sufficient computation, one can conclude that IP address 203.0.113.42 accessed example.com/foo.html
. Tor makes this harder in a few ways. First, persistent guards reduce the chance that an adversary will be able to observe steps 1 and 4 by adding a large number of malicious guards to the network. Second, Tor sends traffic in cells of 512 bytes each (or at least used to be. It's 514 bytes now), so step one would involve sending 512 bytes, yet step 2 would still show 253 bytes received. Third, the number of hops Tor goes through increases jitter in latency. Because of this, each subsequent timestamp will differ by a small but random time. This makes it hard to distinguish other connections which transfer a similar amount of data at a similar time from your connection.
There have been many academic attacks against Tor which rely on traffic analysis, but they always assume a small world where latencies are all fixed and deterministic. These are the attacks that tend to be reported on in the media, despite not applying to the actual Tor network in a world where every network is full of noise.
Traffic analysis attacks against a proxy
This kind of attack is difficult to pull off against Tor because an adversary may not have access to both ISP1 and ISP2. Even if they do, the infrastructure of one of them may not be sufficient to record high-resolution timestamps (for example, due to reduced granularity NetFlow records), and their internal clocks may differ slightly. With a proxy, however, this attack is far easier to pull off. This is an issue even if you completely trust the proxy provider. Consider this alternative timeline of events, where ISP1 represents the ISP of the proxy service itself:
- ISP1 sees
203.0.113.42
send 253 bytes of data at t+0.
- ISP1 sees the proxy server send a 253 byte request to
example.com
for /foo.html
at t+1.
- ISP1 sees
example.com
send a 90146 byte reply at t+2.
- ISP1 sees
203.0.113.42
receive a 90146 byte reply at t+3.
With this information all in the hands of ISP1, it becomes quite easy to conclude that 203.0.113.42 requested example.com/foo.html
. There is no padding, and virtually no jitter (since the delay is only as long as it takes the proxy service to forward the request internally). Because of this, this single ISP knows both who you are and what you are doing, and all it has to do is connect the fact that they come from the same person. Simple. This is the main technical downside to proxies, even when their often sketchy nature and history of poor honesty is ignored.