
I'm writing a keystroke dynamics identification system as an assignment and I'm having trouble with storing data. The only instruction I was given was "don't keep raw data", i.e. the data should be transformed first. I wasn't told what form the data should take so that it is no longer considered raw, or how to transform it. Unfortunately, I couldn't find any articles or publications that would tell me either.

I am recording raw data as follows:

key | Pressed/Released | Time since last event
Z   | P                | 64
Z   | R                | 96
A   | P                | 88
P   | P                | 72
A   | R                | 9
P   | R                | 64
O   | P                | 88
O   | R                | 81

Then I process it into separate metrics: UpToUp, dwell, flight, interval and latency. Let's look at the UpToUp data (all the metrics output frequencies in the same format). The timings are the elapsed time between consecutive key-release events, calculated as follows (a code sketch of this computation follows the list):

Z -> A: 169 (88+72+9)
A -> P: 64
P -> O: 169 (88+81)
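
For illustration, here is a minimal Python sketch of this step (the tuple representation and the helper name are my own, and I'm assuming the third column is the delta since the previous event):

```python
def up_to_up(events):
    """Return (prev_key, key, elapsed) for each pair of consecutive
    key-release ("R") events, summing the deltas in between."""
    timings = []
    last_released = None
    elapsed = 0
    for key, kind, delta in events:
        elapsed += delta
        if kind == "R":
            if last_released is not None:
                timings.append((last_released, key, elapsed))
            last_released = key
            elapsed = 0
    return timings

events = [
    ("Z", "P", 64), ("Z", "R", 96),
    ("A", "P", 88), ("P", "P", 72),
    ("A", "R", 9),  ("P", "R", 64),
    ("O", "P", 88), ("O", "R", 81),
]
print(up_to_up(events))  # [('Z', 'A', 169), ('A', 'P', 64), ('P', 'O', 169)]
```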

Then grouped and counted:

[(64,1), (169,2)], or more generally [("60-70",1), ("160-170",2)]
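
The grouping into bins can be sketched the same way (the fixed 10-unit bin width is just an example choice):

```python
from collections import Counter

def bin_timings(timings, bin_width=10):
    """Count how many timings fall into each fixed-width bucket."""
    counts = Counter()
    for _, _, t in timings:
        lo = (t // bin_width) * bin_width
        counts[f"{lo}-{lo + bin_width}"] += 1
    return sorted(counts.items())

print(bin_timings([("Z", "A", 169), ("A", "P", 64), ("P", "O", 169)]))
# [('160-170', 2), ('60-70', 1)]
```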

Keystroke recordings can be of any length. If a recording were longer, we'd get more diverse frequencies that would be normally distributed.

I'm looking for confirmation, or a recommendation, on whether these frequency counts can be transformed further so that the result is more distant from the raw data. As I said, I couldn't find any publication that would answer this.

With voice biometrics I came across the FFT, but I couldn't find it used in a keystroke context.

wojteo
  • How would such data leak exactly? Are you trying to prevent webpages from identifying the person behind the keyboard or is the keystroke data gathered in another way? With what you describe, the timing of the key-ups might still be unique, with keys only registering when you let them go. – J.A.K. Dec 30 '18 at 13:58
  • The leak is just an example; my main concern is good practices. – wojteo Dec 30 '18 at 14:08
  • If the timing of both up and down were to leak directly (e.g. from a wireless keyboard), there is little you can do to prevent that. Only sending key-ups would seriously impact user experience: keys would only register when you let them go, creating a sluggish feel. It would also make holding a key down impossible. If you're ever storing keystroke data, don't store any timings. – J.A.K. Dec 31 '18 at 04:33
  • I use 5 different metrics such as dwell time (PR) and intervals (RP). – wojteo Jan 01 '19 at 11:15
  • A wireless leak is not the issue; let's say we're considering something like a database leak. If I shouldn't store timings, then what should I store? – wojteo Jan 01 '19 at 11:19
  • Dwell time and flight time are often both required for biometric purposes. However, if you need to store data relevant for biometric purposes, then it absolutely will be sensitive data. You can reduce the granularity of the timing information, but how far can you reduce it before the biometrics stop being useful to you? – forest Jan 02 '19 at 04:52
  • Since the number of events added together is not recorded, there is no method by which the original timing can be recovered from your final listed transformation. So, my guess is that any leak of the final transformed data, by itself, should be of minimal impact... – RubberStamp Jan 03 '19 at 02:00
  • You have asked a question without proper definition. "How long is a piece of string?" What are the threats? In what form does the remaining data need to take? What transformations need to happen? "Enough" for what? I think that you need to define those things before anyone can come up with an answer. "all good practices should be employed" appears to be part of the assignment. Who deems that you have been successful in doing that? "Not keeping all the raw details" -- again, how do you know you have been successful? – schroeder Jan 03 '19 at 10:37
  • Perhaps related... [external research and analysis source code](https://userinterfaces.aalto.fi/136Mkeystrokes/) for keystroke behavior study. – RubberStamp Jan 03 '19 at 13:02
  • Your edit is a little better, but you do not explain how much of the original data needs to exist to process the data at a later time. Basically, you still have an undefined question. I would seriously ask your prof what they are talking about and what their criteria are. Always go back to stakeholders to verify the reqs and specs. – schroeder Jan 03 '19 at 13:53
  • Fourier Transform translates time domain series into frequency domain series... your algorithm already produces frequency domain series from the time series. Your UpToUp is a sinusoidal representation of the data... and you are counting the relative frequencies within the data. – RubberStamp Jan 03 '19 at 14:05
  • I clarified it with my professor: if I assume the distribution is normal, then I should, as @goncalopp said, take the average and the variance. I can even calculate them not only for each metric, but for every key pair and metric. I could also look for some way of obfuscating these values with scaling or some other function. – wojteo Jan 08 '19 at 20:08
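
As an illustration of the granularity reduction forest suggests in the comments, a minimal sketch (the 25-unit step is an arbitrary choice, not something from the thread):

```python
def coarsen(timing, step=25):
    """Round a timing to the nearest multiple of `step`, trading
    biometric precision for reduced identifiability."""
    return round(timing / step) * step

print([coarsen(t) for t in (169, 64, 169)])  # [175, 75, 175]
```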

1 Answer


As you mention, if you gather enough data, each key-feature pair will be well approximated by a Gaussian distribution. In that case, instead of storing the timing counts ([(64,1), (169,2)]) or even the histograms ([("60-70",1), ("160-170",2)]), you might want to consider storing the parameters of the inferred distribution: the average and the variance. Of course, that only works if your model can still operate correctly on the inferred distribution.
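
A minimal sketch of that idea (the sample timings below are invented; a real profile would aggregate many recordings):

```python
import statistics

# Made-up per-key-pair timing samples, purely for illustration.
samples = {("Z", "A"): [169, 158, 172, 161], ("A", "P"): [64, 71, 66]}

# Keep only the distribution parameters plus the sample count,
# instead of the raw timings or histograms.
profile = {
    pair: {"mean": statistics.mean(ts),
           "variance": statistics.variance(ts),
           "n": len(ts)}
    for pair, ts in samples.items()
}
print(profile[("Z", "A")])  # {'mean': 165, 'variance': 43.33..., 'n': 4}
```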

Another thing to keep in mind is to avoid storing short runs. Even if you store the data in aggregated form, if you capture the moment when a user is typing a password and the only keys that have valid distributions are 1, 5, 3, 4 and 2, that's not really ideal!
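
One way to guard against that, building on the `profile` sketch above (`MIN_SAMPLES` is an arbitrary threshold, not a recommended value):

```python
MIN_SAMPLES = 30  # arbitrary cut-off for this sketch

def prune_short_runs(profile):
    """Drop key pairs whose distribution rests on too few samples."""
    return {pair: stats for pair, stats in profile.items()
            if stats["n"] >= MIN_SAMPLES}
```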

loopbackbee