Your idea of fingerprinting is very similar to wireless signals intelligence in WWII. Both sides had whole departments whose role was to learn the keying style, or "fist", of the opposing side's wireless operators. By tracking these profiles and using radio direction finding they gained a surprising amount of information about troop and vessel movements, staff assignments, etc.
You're thinking of doing the same thing: learning the nuances of how particular crackers operate. Things like typing cadence, frequently repeated typing mistakes, etc. could be used to learn a particular cracker's "fist". I think this is a good idea in some ways, but maybe not good enough to pursue:
- Most attacks are scripted. Even when a top cracker is doing the hacking, the steps before and after a successful exploit are usually scripted, so you'll have to wade through hundreds of attacks to find one fingerprinting opportunity.
- Data sources: you'll have a hard time gathering enough data to do any actual fingerprinting. The number of data sources you would need goes well beyond a simple research project. You'd need a dozen honeypots at least, a very large database of information to work from, and some complex modeling to interpret the data.
- Network latency and jitter are common on the internet, especially when traffic comes from areas with poor connectivity. Those areas also happen to be the source of many attacks, so your results could end up significantly skewed. Is that pause followed by a flurry of typing the hacker's style, or simply network lag?
- Verifiability of results: How can you prove your fingerprinting methods are in any way successful? How will you show that the patterns you find actually demonstrate a single attacker? They aren't going to come out and say, "yeah, that was me!"
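To make the latency and jitter point concrete, here is a minimal sketch in Python. The send times and delay values are made up for illustration, and it assumes each keystroke travels in its own packet. It shows that a fixed delay leaves the gaps between keystrokes untouched, while a variable delay changes every gap the observer measures.

```python
import random

def arrival_intervals(send_times, delay_fn):
    """Gaps between keystroke arrivals, given a per-packet delay function."""
    arrivals = [t + delay_fn() for t in send_times]
    return [b - a for a, b in zip(arrivals, arrivals[1:])]

# Hypothetical keystroke send times in seconds: some typing, a pause, then a burst.
send_times = [0.00, 0.15, 0.32, 1.40, 1.48, 1.55]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Constant 200 ms latency: every packet shifts by the same amount.
constant = arrival_intervals(send_times, lambda: 0.200)

# Same base latency plus up to 50 ms of random jitter per packet.
jittery = arrival_intervals(send_times, lambda: 0.200 + rng.uniform(0, 0.050))

print("constant latency:", [round(x, 3) for x in constant])
print("with jitter:     ", [round(x, 3) for x in jittery])
```

Constant latency cancels out when you subtract consecutive arrival times, so it is the jitter, not the distance, that blurs the cadence.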
My suggestion is to try this on a small scale where you can control some of the factors. Get many volunteers (plus some scripts) to follow set scripts of commands in a terminal window and see if you can write algorithms that can reliably determine the typist. Then introduce packet latency and jitter to see if your algorithms can cope, and work on that. Once you have that working you could then go out to the internet and see if they work in the wild; otherwise you won't have any idea as to reliability.
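A rough sketch of what that controlled experiment might look like, again in Python. Everything here is an assumption for illustration: the "volunteers" are simulated with Gaussian inter-key gaps, the feature set is just median and standard deviation of the gaps, and the classifier is a plain nearest-profile match. Real typing data and a real model would be considerably messier.

```python
import math
import random
import statistics

def simulate_session(mean_gap, spread, n_keys, jitter, rng):
    """Observed inter-key intervals (seconds) for one simulated session.

    mean_gap/spread describe a hypothetical typist; jitter is the maximum
    extra network delay added to each keystroke.
    """
    t, arrivals = 0.0, []
    for _ in range(n_keys):
        t += max(0.02, rng.gauss(mean_gap, spread))  # clamp to a 20 ms floor
        arrivals.append(t + rng.uniform(0, jitter))
    arrivals.sort()  # crude stand-in for in-order delivery
    return [b - a for a, b in zip(arrivals, arrivals[1:])]

def features(intervals):
    """Two simple cadence features: median gap and spread of gaps."""
    return (statistics.median(intervals), statistics.pstdev(intervals))

def nearest_profile(feat, profiles):
    """Name of the enrolled profile closest to the observed features."""
    return min(profiles, key=lambda name: math.dist(feat, profiles[name]))

def accuracy(typists, jitter, trials=200, seed=1):
    rng = random.Random(seed)
    # Enroll each typist from a long session recorded without network noise.
    profiles = {name: features(simulate_session(m, s, 500, 0.0, rng))
                for name, (m, s) in typists.items()}
    correct = 0
    for _ in range(trials):
        name = rng.choice(sorted(typists))
        m, s = typists[name]
        observed = features(simulate_session(m, s, 60, jitter, rng))
        correct += nearest_profile(observed, profiles) == name
    return correct / trials

# Hypothetical subjects: (mean gap, spread) in seconds; "script" is automated input.
TYPISTS = {"alice": (0.12, 0.03), "bob": (0.20, 0.05),
           "carol": (0.16, 0.08), "script": (0.05, 0.001)}

clean = accuracy(TYPISTS, jitter=0.0)
noisy = accuracy(TYPISTS, jitter=0.30)
print(f"accuracy without jitter: {clean:.2f}")
print(f"accuracy with 300 ms jitter: {noisy:.2f}")
```

The useful part is the shape of the experiment rather than the numbers: you know who typed each session, so you can measure how often the algorithm is right, and then watch how that figure changes as you turn up the jitter.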