2

In London, there are trash bins that track a phone's MAC address and monitor the user's movement from one location to another. I want to know whether there is a best practice for collecting private information like this that allows for analysis while preserving anonymity.

Imagine a scenario where MAC address, location, and date/time data are stored in a back-end database. However, rather than storing the data in raw form, only summary or trend data is stored.

  • Is there any mathematical (cryptographic) or logical process that can be followed to decouple the MAC address from the location data and "dilute" it so that privacy is maintained?

Some ideas that crossed my mind included one-way hashes and homomorphic encryption, combined with some statistical indicators.

I'm asking here in case someone much smarter than me has come up with an approach that solves the problem of gathering PII, anonymizing it, and still allowing for trending and market research.

makerofthings7
  • Unlikely, MACs are pretty short. – CodesInChaos Aug 13 '13 at 14:50
  • @CodesInChaos not sure what point you're making. *Longer* identifiers are worse for privacy. Saying MAC addresses are short implies they cannot be used to ID people uniquely - imagine a 2 bit MAC. Also 48 bits is a long ID when you're talking about things like human beings. – lynks Aug 13 '13 at 15:35
  • @lynks You can't securely hash low entropy values since the hash is easily inverted by trying all possible inputs. – CodesInChaos Aug 13 '13 at 15:37
  • @CodesInChaos gotcha, somehow I got lost in the double/triple negatives and thought you were arguing the opposite. – lynks Aug 13 '13 at 15:38
  • @CodesInChaos So there may be two approaches: Use 1/2 of the MAC address (or less than that) for tracking, and expect conflicts. Or encrypt data on the back end if enough entropy is found. Perhaps I can concatenate other data that is from the same device such as Bluetooth ID, IP address, Cellular IMEI, etc. I'm not sure how many more bits are needed for sufficient entropy for a backend store, but it's a start. – makerofthings7 Aug 13 '13 at 15:45
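
The point made in the comments above about hashing low-entropy values can be illustrated with a minimal sketch. It assumes an unsalted SHA-256 over the raw 6-byte MAC; the function names, the example address, and the choice to search only 16 unknown bits (so the demo finishes quickly) are all hypothetical. The full 2^48 space is larger, but it is still finite and enumerable, and it shrinks considerably once the public vendor prefix is known.

```python
import hashlib


def hash_mac(mac: int) -> str:
    """Unsalted SHA-256 of a 48-bit MAC address given as an integer."""
    return hashlib.sha256(mac.to_bytes(6, "big")).hexdigest()


def invert_hash(target: str, prefix: int, unknown_bits: int):
    """Try every value of the low `unknown_bits` bits under a known prefix."""
    for suffix in range(1 << unknown_bits):
        candidate = prefix | suffix
        if hash_mac(candidate) == target:
            return candidate
    return None  # not found in the searched range


# Hypothetical device address; the top 32 bits are assumed known to the
# attacker so that only 16 bits (65,536 candidates) need to be tried.
secret_mac = 0x001A2B3C4D5E
prefix = secret_mac & ~0xFFFF
recovered = invert_hash(hash_mac(secret_mac), prefix, 16)
print(f"{recovered:012x}")
```

The stored hash gives the original address straight back, which is why a plain one-way hash does not anonymize a MAC on its own.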

2 Answers

2

Cryptography is rarely the right tool for privacy issues. Here, the trash bins are listening for the broadcast of MAC addresses by WiFi enabled devices, in particular phones. The tracking part is made possible due to the combination of several parameters:

  • WiFi-enabled devices broadcast data regularly. This is rather unavoidable: for WiFi to actually occur, either the device or the access point must talk first. Since there are "hidden" access points that do not talk until specifically addressed, the devices must blabber constantly.

  • MAC addresses are fixed. An important point of MAC addresses is their uniqueness: things must be so that no two "unaltered" devices may use the same MAC address simultaneously, on the same local network. To ensure this uniqueness, a global allocation scheme has been designed, with hardware vendors being allocated address ranges. It is possible to force a MAC address change on most hardware, but this is "frowned upon".

  • People don't switch off the WiFi when not at home. They should (in particular, not using WiFi extends battery life), but they do not.

In the example, the trash bins just listen to all the broadcasting, and correlate data between each other, thus "tracking" the whereabouts of phones (and thus, presumably, of phone owners). A lone bin would not get much interesting data, but a lot of bins can come up, together, with a rather thorough map of movement behaviour of people. Note that the MAC address cannot be traced back to an owner identity, but it can, at least, uncover the hardware vendor name, because of the global MAC allocation system, which is public.

What could be done, assuming that we are free to define new protocols, is to replace the fixed 48-bit MAC addresses with random 128-bit addresses (regenerated frequently, e.g. every minute while not actually connected to an access point). Random addresses of 128 bits ensure uniqueness with sufficiently high probability, even if lots of devices happen to be in the same location. For instance, if you have a stadium full of 60000 people, each with a phone, and they all try to do WiFi, 128-bit random MAC addresses would allow a collision to occur with probability about 2^-97, i.e. "won't happen". But here we are talking about defining a new WiFi protocol and hoping for all devices and access points to simply switch to it, forfeiting any attempt at compatibility with existing WiFi access points. This kind of change is unlikely to occur within the next few years.
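
The stadium figure can be checked with the usual birthday-bound approximation, p ≈ n(n-1)/2 / 2^bits. The sketch below is only a sanity check of that arithmetic; the function name is made up for illustration, and the 48-bit line is an added comparison that assumes addresses of today's length were drawn fully at random.

```python
import math


def log2_collision_probability(devices: int, address_bits: int) -> float:
    """log2 of the birthday-bound approximation n(n-1)/2 / 2^address_bits."""
    pairs = devices * (devices - 1) / 2
    return math.log2(pairs) - address_bits


# 60000 devices with random 128-bit addresses: exponent close to -97.
print(round(log2_collision_probability(60_000, 128), 1))

# Same crowd with random 48-bit addresses, for comparison only.
print(round(log2_collision_probability(60_000, 48), 1))
```

The first value comes out at roughly -97.3, matching the 2^-97 estimate; with only 48 random bits the exponent is 80 higher, which is part of why the longer address is suggested.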

In the meantime, if you value your privacy, then simply shut down the WiFi!

Tom Leek
  • Wouldn't more entropy guarantee more uniqueness and guarantee correlation (and therefore less privacy)? What if I were to track only 16 bits (or fewer)? I'd have achieved my business need of "tracking" and my privacy need of "dilution". – makerofthings7 Aug 13 '13 at 15:08
  • I might have been a bit unclear, so I edited my message: I am toying with the idea of randomly generated MAC addresses that are regenerated very frequently; each phone would create a new MAC every minute, keeping its MAC unchanged only while actively connected to an access point. – Tom Leek Aug 13 '13 at 15:15
  • I don't see how more entropy equals less privacy. The advantage of a 128-bit MAC is that the user (ie. the phone) can rotate MACs according to some schedule. It doesn't defeat correlation but makes it harder. In any case, if a MAC remains constant, correlations are easily done with any number of bits @makerofthings7 – rath Aug 13 '13 at 15:16
  • My phone keeps track of *where* your chosen WiFi networks are using GPS, then only turns on the WiFi when the GPS indicates that you are nearing a saved network. This I think is a great feature. – lynks Aug 13 '13 at 15:40
  • I am not trying to avoid correlation, but to find a way to allow for correlation without getting too granular when doing so. I want to be secure when collecting it, secure when analyzing it. One approach is to lower the uniqueness threshold: given 4 bits of entropy, I only have 2^4 = 16 possible outcomes. When someone shows up with bits 0101 set, I can statistically infer many things, but not their identity. I can only infer things about group 0101. This is only one approach, – makerofthings7 Aug 13 '13 at 15:41
  • [continued] but a better approach would be where an encrypted datastore would act as an all knowing Oracle that somehow balances individuality while preserving privacy. This is where I was thinking homomorphic encryption could play a part. – makerofthings7 Aug 13 '13 at 15:41
  • "In the meantime, if you value your privacy, then simply shut down the WiFi !" -- While you're standing in view of one of the thousands of London CCTV cameras? If you value your privacy, vote out the bureaucrats that violate it. There will always be methods or technologies they can use to track or monitor people. Making it illegal is the right first step. – u2702 Aug 13 '13 at 17:05
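
The "lower the uniqueness threshold" idea raised in the comments above can be sketched as follows. This is a hypothetical illustration, not a vetted anonymization scheme: the `bucket` helper, the choice of the low 4 bits, and the sample addresses are all assumptions made for the example.

```python
from collections import Counter


def bucket(mac: int, bits: int = 4) -> int:
    """Keep only the low `bits` bits of the MAC; the rest is discarded."""
    return mac & ((1 << bits) - 1)


# Hypothetical sightings of three different devices.
sightings = [0x001A2B3C4D5E, 0x001A2B3C4D6E, 0xF0DE12345675]

# Only the bucket number is ever stored, never the full address.
counts = Counter(bucket(mac) for mac in sightings)
print(counts)
```

Here the first two devices fall into the same bucket and can no longer be told apart. The protection depends on many devices sharing each bucket: in a sparsely visited location, a bucket combined with time and place may still single out one device.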
1

One core problem is that you can often use social network analysis to reverse a network of people's activities, regardless of how you anonymize the identities. Coupled with external data that you don't control, such information can reveal identities.

When you collect such data, if you want it to remain anonymous, you need to remove the relationships present in it. If you record that user 123 takes train A at 8:20, then train B at 8:50, and then train B at 17:05 and train A at 17:30, you're building a map of that user. Check your goals. If you are trying to determine train ridership, you don't need to know that the same user was taking each of those legs, only that train A had a +1 at 8:20 and another +1 at 17:30.
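
A minimal sketch of that "+1 per train" approach, assuming sightings arrive as (device, train, time) tuples; the tuple layout and the second device are invented for the example. The device identifier is dropped at the moment of counting, so nothing that links one leg to another is ever stored.

```python
from collections import Counter

# Hypothetical raw sightings: (device id, train, departure time).
raw = [
    ("123", "A", "08:20"),
    ("123", "B", "08:50"),
    ("123", "B", "17:05"),
    ("123", "A", "17:30"),
    ("456", "A", "08:20"),
]

# Count riders per (train, time) and discard the device id immediately.
ridership = Counter((train, time) for _device, train, time in raw)

for (train, time), riders in sorted(ridership.items()):
    print(train, time, riders)
```

The resulting table answers the ridership question, while the fact that one user rode all four legs is no longer recoverable from what was kept.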

John Deters