
https://people.cs.umass.edu/~amir/papers/CCS18-DeepCorr.pdf

https://www.youtube.com/watch?v=_OKLtKgEn4k

I have some questions about "DeepCorr". Does it really work that well? The authors say that "DeepCorr’s performance does not degrade with the number of test flows", but more and more people are using Tor, and many of them browse websites (often simple ones) of similar size at around the same time. How can the attack tell the source of the traffic from size and timing alone when other flows have similar features?

They say they used 1,000 circuits to browse 50,000 sites (the top sites on Alexa), with each circuit browsing 50 sites, and that they used a regular Firefox browser instead of Tor Browser. Could that be why it worked so well for them? Maybe Firefox generated some extra unique traffic that Tor Browser wouldn't generate, because of things like ads and cookies?

Can this attack work against hidden services (version 3) as well?

Anders
Eleanor

2 Answers


Can DeepCorr's correlation technique de-anonymize all Tor users?

No, it cannot de-anonymize ALL Tor users.

(However, it drastically narrows down the set of candidates, which makes a subsequent de-anonymization much more likely to succeed.)

Why, and what is a flow correlation attack

A flow correlation attack is one where an adversary intercepts network flows at several network locations and "correlates" them using statistical or machine learning methods (e.g. neural networks).

DeepCorr's setting consists of a network "with M ingress flows and N egress flows": DeepCorr observes the ingress flows, close to the group of users at one end, and tries to identify the moment when the same traffic leaves the circuit at the other end. A match means "gotcha"!
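
The "M ingress flows and N egress flows" setting can be pictured as scoring every ingress/egress pair and looking for the strongest match. The following is only a minimal sketch under stated assumptions: it uses a plain cosine similarity over binned packet counts as a stand-in for DeepCorr's neural network, and the helper names (`binned_counts`, `score_matrix`), the window and bin width, and the toy data are all hypothetical and not taken from the paper.

```python
import numpy as np

def binned_counts(timestamps, window=10.0, bin_width=0.5):
    """Count packets per time bin; a crude summary of a flow's timing."""
    n_bins = int(round(window / bin_width))
    edges = np.linspace(0.0, window, n_bins + 1)
    counts, _ = np.histogram(timestamps, bins=edges)
    return counts.astype(float)

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def score_matrix(ingress_flows, egress_flows):
    """Score every (ingress, egress) pair; the result has shape (M, N)."""
    ing = [binned_counts(f) for f in ingress_flows]
    egr = [binned_counts(f) for f in egress_flows]
    return np.array([[cosine(a, b) for b in egr] for a in ing])

# Toy data (assumption): each egress flow is an ingress flow plus small jitter.
rng = np.random.default_rng(0)
ingress = [np.sort(rng.uniform(0, 10, size=200)) for _ in range(5)]
egress = [np.clip(f + rng.normal(0.05, 0.01, size=f.size), 0, 10) for f in ingress]

scores = score_matrix(ingress, egress)
best_match = scores.argmax(axis=1)  # highest-scoring egress flow per ingress flow
```

A real attacker would replace `cosine` with a trained model and would only declare a match when the score passes some threshold, rather than always taking the best candidate.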

Website != flow

more and more people browse websites(many of those are simple websites) with similar sizes around the same time how can they tell the source of the traffic using size and timing alone when other flows have similar features?

DeepCorr does not do website fingerprinting (which is another class of attacks, as mentioned in the article); it just correlates a "flow A" with a "flow B" observed at two different points of the network.

Website similarities don't matter for a successful correlation: DeepCorr operates on features of short packet sequences, such as sizes, timings and flow direction (in/out).
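
To make "features of packet sequences" concrete, here is a rough sketch of how one candidate pair of flows could be turned into a fixed-size array of inter-packet delays and packet sizes, split by direction. The row layout, the zero-padding and the names `flow_rows` and `pair_features` are assumptions made for illustration; only the flow length of 300 packets comes from the paper quote later in this answer.

```python
import numpy as np

FLOW_LEN = 300  # packets per flow, the figure quoted from the paper

def pad(values, length):
    """Truncate or zero-pad a sequence to exactly `length` entries."""
    out = np.zeros(length)
    v = np.asarray(values, dtype=float)[:length]
    out[:v.size] = v
    return out

def flow_rows(packets, length=FLOW_LEN):
    """packets: list of (timestamp, size, direction); direction is +1 (up) or -1 (down).

    Returns 4 rows: up delays, up sizes, down delays, down sizes.
    """
    rows = []
    for d in (+1, -1):
        times = [t for t, _, dd in packets if dd == d]
        sizes = [s for _, s, dd in packets if dd == d]
        delays = np.diff(times) if len(times) > 1 else []
        rows.append(pad(delays, length))
        rows.append(pad(sizes, length))
    return np.stack(rows)

def pair_features(ingress_packets, egress_packets):
    """Stack both flows into one (8, FLOW_LEN) array for a classifier."""
    return np.concatenate([flow_rows(ingress_packets), flow_rows(egress_packets)])

# Hypothetical three-packet flow, used for both ends just to show the shapes.
example = [(0.0, 100, +1), (0.1, 1500, -1), (0.3, 120, +1)]
features = pair_features(example, example)
```

Nothing in this array identifies a website: it only records when packets were seen and how large they were, which is why two visits to the same site by different users still produce different inputs.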

Still...

Correlation != de-anonymization

From the article:

To be able to perform flow correlation, an adversary needs to observe (i.e., intercept) some fraction of flows entering and exiting the Tor network. The adversary can then de-anonymize a specific Tor connection...

I would say "but may not de-anonymize"... I mean that a successful flow correlation attack doesn't automatically mean a successful de-anonymization. Correlation means "this group of users visited that group of sites" (but it drastically narrows down the set of users and increases the probability of de-anonymization).

Does Firefox generate additional patterns?

Could it be the reason for why it worked so good for them? may be firefox generated some extra unique traffic that Tor browser wouldn't generate because of things like Ads and cookies?.

In my opinion there is not much difference between Tor Browser and Firefox traffic flows.

Example: google.com

Firefox:
25 requests
1.31 MB / 677.67 KB transferred

Tor:
19 requests
1.39 MB / 498.30 KB transferred

Intuitively, I would say that both browsers generate some unique flow patterns, and don't forget that website != flow.

It also seems that DeepCorr doesn't need much traffic to work with:

"the correlated flows are 300 packets long for all the systems"...

Tor's hidden services

Can this attack work against hidden services(version 3) as well?

I would say "why not": DeepCorr works on traffic flows and doesn't care whether a flow is "hidden"; a hidden service is just another traffic flow. DeepCorr will correlate ingress and egress flows, which is all it does.


P.S.: a few words about a possible countermeasure.

Countermeasure

As the authors state:

"Our results suggest that (public) Tor relays should deploy a traffic obfuscation mechanism like obfs4 with IAT=1 to resist advanced flow correlation techniques like DeepCorr."

(IAT=0 doesn't help)

"However, this is not a trivial solution due to the increased cost, increased overhead (bandwidth and CPU), and reduced QoS imposed by such obfuscation mechanisms... designing an obfuscation mechanism tailored to Tor that makes the right balance between performance, cost, and anonymity remains a challenging problem for future work."

Alexander Fadeev
  • But the traffic flow depends on the website the user is browsing, so why can't several users have similar traffic patterns? Tor also has some padding between the user and the entry guard https://gitweb.torproject.org/torspec.git/tree/padding-spec.txt but I don't know if it was implemented before or after the research. – Eleanor Apr 15 '20 at 14:16
  • @Eleanor The user clicks on different URLs with different delays, generating unique traffic sequences; that's why I wrote that "website != flow". Also, you often say "similar patterns", but similar != same, particularly for an ML model. `...[pattern].... -> (ʘ ͟ʖ ʘ) -> .... gotcha!` – Alexander Fadeev Apr 15 '20 at 14:41
  • So they used network jitter to correlate the traffic coming in at the entry guard with the traffic going out of the exit node? I remember that they said they disabled some Tor features and used the Firefox browser in order to avoid entry guard reuse and to enforce circuit selection. But if many users use the same entry guard as you, and/or another Tor relay in your circuit, wouldn't that reduce the difference between you and them when it comes to network jitter? Could that be the reason why they used different entry guards and enforced circuit selection in their experiment? – Eleanor Apr 15 '20 at 14:58
  • @Eleanor Sorry for "jitters": I removed previous comment because I'm not sure about this and seems like I missed the point, basically it's not easy to interpret this academic language... – Alexander Fadeev Apr 15 '20 at 15:03
  • @Eleanor Basically, the research starts by acknowledging that Tor has guard relays and that there are attacks to overcome these guards: "The Tor project adopted “guard” relays to limit adversary’s chances of placing herself on the two ends of a target Tor connection. Borisov et al. [8] demonstrated an active DoS attack that increases chances of observing the two ends of Tor connections (who then performs flow correlation). Alternatively, various routing attacks have been presented on Tor [20,38,70,72] that aim at increasing odds of intercepting the flows to be correlated" (abridged) – Alexander Fadeev Apr 15 '20 at 15:10
  • @Eleanor And maybe I missed your question again, you mentioned "if many users use the same entry guard as you and/or other Tor relay you use in your circuit wouldn't it reduce the difference between you and them": yes, it reduces the difference, that's why I wrote that correlation != de-anonymization. One more step is needed for successful de-anonymization. – Alexander Fadeev Apr 15 '20 at 15:13
  • So could that be the reason why they disabled "vanilla Tor" (I don't know exactly what it is) to avoid entry guard reuse and used Firefox in order to enforce circuit selection? Did they use it to give their test flows more unique flow patterns? – Eleanor Apr 15 '20 at 15:26
  • @Eleanor No no, they just emulated the situation when attacker captured a victim (it's still possible even with entry guards), because it is a precondition to carry out a flow correlation attack. They did "pinning" of circuit to victim. (Researchers had to do their experiments with their ML model somehow, right?) So it should not have any impact on flow patterns. – Alexander Fadeev Apr 15 '20 at 16:39
  • @Eleanor See, they concentrated on testing their ML model... But in real conditions the attacker DOESN'T know the victim: HOWEVER, the attacker applies DeepCorr at TWO points of the Tor network and WAITS until a correlation happens. Once a correlation happens: Gotcha! Inputs correlated to outputs! Let's apply additional de-anonymization techniques! – Alexander Fadeev Apr 15 '20 at 17:22
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/106758/discussion-between-alexander-fadeev-and-eleanor). – Alexander Fadeev Apr 15 '20 at 18:21

I don't think their method works for identifying the user. I will focus on the flow correlation technique and explain why it is impossible to make it work (of course I could tune things and build a use case where it does work, but not on the open internet).

In general, almost all communication today uses TLS for encryption (Tor does the same) together with HTTP 1.1. In HTTP 1.1, several requests and responses travel over the same flow, which means you need to correlate the number of upstream PDUs (also checking the TCP PSH flag) with the downstream ones. For example, if I write a Python script that accesses two URLs and downloads two images, a system could generate a vector of flow characteristics like:

[{"upstream_bytes": 500, "downstream_bytes": 5000},
 {"upstream_bytes": 400, "downstream_bytes": 4000}]

The first request generates 500 bytes of encrypted data upstream and receives 5000 bytes of encrypted data downstream; the second request generates 400 bytes up and 4000 bytes down.

Taking into account this minimal scenario, and the fact that browsers generate different request sizes, most browsers will probably produce a similar pattern for the first request (index.html), with some variance in the byte counts.

So multiple users accessing the same services will have a vector like

[{"upstream_bytes": (500, 600), "downstream_bytes": (5000, 5500)},
 {"upstream_bytes": (390, 420), "downstream_bytes": (4000, 4300)}]

The upstream and downstream sizes will vary depending on factors like the browser and the TLS encryption.
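
The point about overlapping ranges can be sketched as a check of whether a flow's byte counts fall inside such a range vector. This is only an illustration of the size-only argument made here: the `matches_profile` helper and the three sample flows are hypothetical, and the check deliberately ignores packet timing.

```python
def within(value, bounds):
    """True if value lies inside the inclusive (low, high) range."""
    low, high = bounds
    return low <= value <= high

def matches_profile(flow, profile):
    """True if every request's byte counts fall inside the profile's ranges."""
    if len(flow) != len(profile):
        return False
    return all(
        within(req["upstream_bytes"], rng["upstream_bytes"])
        and within(req["downstream_bytes"], rng["downstream_bytes"])
        for req, rng in zip(flow, profile)
    )

# Range vector from the text above.
profile = [
    {"upstream_bytes": (500, 600), "downstream_bytes": (5000, 5500)},
    {"upstream_bytes": (390, 420), "downstream_bytes": (4000, 4300)},
]

# Hypothetical users: two visit the same service, one visits something else.
user_a = [{"upstream_bytes": 500, "downstream_bytes": 5000},
          {"upstream_bytes": 400, "downstream_bytes": 4000}]
user_b = [{"upstream_bytes": 560, "downstream_bytes": 5300},
          {"upstream_bytes": 410, "downstream_bytes": 4200}]
user_c = [{"upstream_bytes": 900, "downstream_bytes": 20000},
          {"upstream_bytes": 300, "downstream_bytes": 1000}]

same_service = [matches_profile(u, profile) for u in (user_a, user_b, user_c)]
```

Here `user_a` and `user_b` both match the profile while `user_c` does not, so byte counts alone cannot tell the first two apart.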

So if the destination site also supports PHP and another group of users accesses index.php, the probability that the vectors are the same is high. This means that even with machine learning or some other technology, it is impossible to guess what is inside the flow, which makes the correlation even harder. The only flow correlation that can be done is to compare the vectors (of upstream and downstream sizes) with the vectors of other flows statistically. In a test scenario where you control the networks (user and server), you can probably guess easily, because there is no other traffic generating noise for the detection.

Consider also a destination server that only serves PDF documents that all have the same size: every user of that service will show the same traffic distribution, which makes it impossible to know the content.

On the other hand, browsers nowadays open multiple network flows to the same site, which makes this even more difficult.

In general the paper is nice and has some interesting ideas, but a lot of research papers, especially the ones about detection, tune their results to get published. I'm not saying that is the case here, but it looks a bit suspicious that the results are this good and that the authors don't publish the data set so that other researchers can validate or improve the techniques they describe.

camp0
  • Granted, I'm still making sense of all of this myself, but I'm not convinced that this answer really changes much. For instance, you talk about a PDF hosting service and how everyone will get the same size requests, therefore making it impossible to know the content. However, how is any of that relevant? The goal of correlation techniques isn't to figure out what is being exchanged, but to figure out who is talking with whom. If the PDF hosting site has fixed-size upstream/downstream requests then that actually makes it *easier* to figure out who is accessing it. – Conor Mancone Apr 20 '20 at 16:48
  • Maybe I'm just misunderstanding what you are saying, because I genuinely don't understand how any of this shows that correlation attacks are impossible. – Conor Mancone Apr 20 '20 at 16:49