
I am always intrigued by reports in the news of big network sites getting hacked, where the report confidently states a statistic such as "only 10,000 users were affected" or "Microsoft confirms 40,000 accounts compromised".

Here are just a few examples:

My question is: when a site has hundreds of millions of users, how can they be so sure that only a fraction were compromised?

I can think of 7 possible reasons (listed here in descending order of how plausible I find them):

1) More than that were hacked, but the hacker only revealed the first batch, so the report can only confirm that many were hacked, even though it was probably more. When a hacker downloads millions of users' details, you'd probably want to chunk the files up for convenience, possibly alphabetically by first initial. Maybe you'd only paste the first couple of lists (A & B) online to prove the validity of your hack, and the compromised company would then only "confirm" that those users were compromised, because only their details were made public. Maybe a deal was made with the hacker which ensured that the remaining lists would be safely destroyed. Seeing as this scenario is potentially financially beneficial for all parties involved, I'm putting it as most likely.

2) The hack actually happened on a different site from the one run by the company brave enough to report it. For example, a gaming club site gets hacked and 14,000,000 user credentials are stolen, but only 30,000 of those users were naive enough to use the same username and password for their Google Mail account. So even if the gaming club does not report the hack, Google might be able to detect the millions of login requests from the subsequent credential-stuffing attack, see that only 30,000 were successful, and do the right thing by informing the public, even if the site that actually got compromised has chosen to remain silent. (A sketch of that detection follows after this list.)

3) They are choosing to define "compromised", "affected" or "targeted" in a way that is favourable to the company's reputation. For example, if a hacker downloaded a dump of the whole users table with 10 million rows, but only 10,000 of those rows contained credit card details (because a daily task cleans up data older than 30 days, or something similar), do they define only those as "compromised" and decide that everyone else - who only had their name, email and date of birth stolen - is "uncompromised", "unaffected" or "not targeted"?

4) The site introduced better security practices at a certain point in time, so only users who have not reset their password since then are considered "affected". For example, the company launched a shoddy first version of the codebase that stored passwords as unsalted MD5 hashes (weak), and after a couple of years they started to get popular, so they beefed up the password hashing to use bcrypt with a salt and a high work factor. But since a hash cannot be reversed, they couldn't migrate the existing users in bulk, so they decided to just re-hash each password as and when its owner next logged in or changed it. Hence, users with a null value in the salt column are considered "compromised", and everyone with a salt is deemed highly unlikely to be cracked because their developer wrote particularly convoluted hashing logic that the hacker will never guess (sarcasm). (A sketch of this classification follows after this list.)

5) The hack was on the consuming end of an API that has built-in limiters (i.e. they never actually hacked into the database server itself; they just stole a privileged user's credentials). For example, if you structure your architecture with several additional layers of abstraction and audit all activity, then when an authorised client starts to perform suspicious requests, such as downloading "all" of something, the server automatically puts a limit on that client/account with a cool-off period and alerts a human to investigate whether the requests are legitimate. So the story would be "Frank in operations got his laptop stolen in the pub and they used his key to start downloading batches of user records, but we detected this and froze the account before he'd finished his cider and realised his bag was missing". The API audit trail would also enable them to know exactly what the hacker downloaded before the account was frozen (see the sketch after this list). It sounds plausible, but I'm putting it low on the list because you'd have to be seriously paranoid and well-resourced to engineer this into your API.

6) The hacked site's data storage is sharded and the hacker only got access to one of the shards. Every system that I - or any of my colleagues - have ever maintained stores all the user credentials in a single database, so if a hacker did manage to bypass security, they could quietly take a copy of everything or nothing; there's no middle ground. Even with a sharded database, there still exists one set of credentials that can be used to query the whole dataset. Hence, I have put this as unlikely.

7) The site uses some kind of sophisticated customised database storage that detects "SELECT ALL"-type operations and sets off an alarm that halts them. None of the systems I maintain (MySQL/PostgreSQL/MongoDB) have such alerts configurable out of the box, and if I want to run a big, slow dump operation, they will let me as long as I'm logged in as admin. In fact, this kind of operation is occasionally required for backups or planned maintenance, and it would be highly inconvenient for it not to be possible. But several companies report that they stopped the hack while it was in progress, so I'm including it, but as least likely. (A rough sketch of what such a watcher might look like follows below.)
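
For point 2, here's a minimal sketch of how a provider on the receiving end of a credential-stuffing wave might summarise its authentication logs. The event format, thresholds and numbers are all invented for illustration; a real detector would use many more signals (IP reputation, device fingerprints, etc.):

```python
from collections import Counter

def summarise_login_wave(auth_events):
    """auth_events: iterable of (username, source_ip, succeeded) tuples."""
    attempts = 0
    compromised = Counter()
    for username, _source_ip, succeeded in auth_events:
        attempts += 1
        if succeeded:
            compromised[username] += 1

    # A huge spike of attempts with a tiny success rate is the classic
    # credential-stuffing signature: the attacker replays a leaked list,
    # and only the users who reused their password "succeed".
    success_rate = len(compromised) / attempts if attempts else 0.0
    if attempts > 1_000_000 and success_rate < 0.01:
        print(f"Possible credential stuffing: {attempts:,} attempts, "
              f"{len(compromised):,} accounts compromised")
    return set(compromised)   # the accounts to force-reset and report

# Example: 2,000,000 replayed credentials, of which 1 in 500 still works.
hit_accounts = summarise_login_wave(
    (f"user{i}", "203.0.113.7", i % 500 == 0) for i in range(2_000_000)
)
```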
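
For point 4, a toy sketch of how that classification might be run over a stolen dump; the column names (salt, algorithm) and the sample rows are made up for this example:

```python
def affected_users(rows):
    """rows: dicts shaped like one row of the stolen users table."""
    affected = []
    for row in rows:
        # A NULL salt marks an account created before the bcrypt
        # migration; its unsalted MD5 hash is cheap to crack, so the
        # user counts as "affected".
        if row["salt"] is None or row["algorithm"] == "md5":
            affected.append(row["user"])
    return affected

legacy = affected_users([
    {"user": "alice", "salt": None,   "algorithm": "md5"},
    {"user": "bob",   "salt": "x9@k", "algorithm": "bcrypt"},
])
print(f"{len(legacy)} of 2 accounts considered compromised")   # -> 1 of 2
```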
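
For point 5, here's a rough sketch of the sort of gateway behaviour described: every fetch is audited, and a client that suddenly starts bulk-downloading gets frozen. The threshold, window and names are all invented; a real system would persist the audit trail and page a human rather than just raise an exception:

```python
import time
from collections import defaultdict, deque

BULK_THRESHOLD = 500    # max records per client per window (invented figure)
WINDOW_SECONDS = 60

audit_log = []                       # (timestamp, client_id, record_id)
recent_fetches = defaultdict(deque)  # client_id -> timestamps of fetches
frozen_clients = set()

def fetch_record(client_id, record_id):
    if client_id in frozen_clients:
        raise PermissionError(f"{client_id} is frozen pending investigation")

    now = time.time()
    window = recent_fetches[client_id]
    window.append(now)
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()

    # Every fetch is audited, so after an incident the log shows exactly
    # which records the attacker obtained before the freeze kicked in.
    audit_log.append((now, client_id, record_id))

    if len(window) > BULK_THRESHOLD:
        frozen_clients.add(client_id)
        raise PermissionError(f"{client_id} frozen: bulk download detected")

    return {"id": record_id}    # stand-in for the real data access
```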
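
For point 7, while none of those databases ship such an alarm out of the box, an external watcher tailing the statement log could approximate one. The log line format and the pattern list below are deliberately simplified:

```python
import re

# Patterns that suggest someone is dumping the whole users table.
# Purely illustrative; a real list would be tuned to the schema and tools.
SUSPICIOUS = [
    re.compile(r"\bSELECT\s+\*\s+FROM\s+users\b(?!.*\bWHERE\b)", re.IGNORECASE),
    re.compile(r"\bCOPY\s+users\b", re.IGNORECASE),   # Postgres bulk export
    re.compile(r"\bmysqldump\b", re.IGNORECASE),
]

def watch(log_lines, on_alarm):
    for line in log_lines:
        if any(p.search(line) for p in SUSPICIOUS):
            # A real system would page someone and perhaps kill the
            # offending session; here we just invoke a callback.
            on_alarm(line)

watch(
    ["2020-06-02 SELECT * FROM users",
     "2020-06-02 SELECT * FROM users WHERE id = 42"],
    on_alarm=lambda line: print("ALARM:", line.strip()),
)   # only the first line trips the alarm
```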

Obviously there is likely another scenario which I have failed to think of. I appreciate that this question involves a lot of conjecture, but I am having real trouble finding articles that dig deep into this, and I would really appreciate some expert steer. Have I missed something obvious? Is my ordering off?

I ask this question not as a security expert but as an experienced web developer with a keen interest in application security. Whilst writing this question I found this paper very helpful: https://ssl.engineering.nyu.edu/papers/tr-cse-2013-02.pdf

  • It looks like you are asking for a single answer to what you outline as a complex and multi-faceted problem. Or you are asking for a list, which is not a good fit on StackExchange. – schroeder Jun 02 '20 at 16:03
  • Test systems, backups, 3rd party systems, regional systems/databases, developer local databases, etc. etc. etc. – schroeder Jun 02 '20 at 16:05
  • @schroeder Good point. All that lot fits into a "copy of data going walkabout" category, in which case I can imagine it would be even harder to follow an audit trail. If a hacker copies a database dump file from a developer's laptop (physically or remotely) there is hardly any footprint left behind. Scary. Thanks. – Martin Joiner Jun 03 '20 at 08:37

1 Answer


Usually it is specialized professionals, not system maintainers, who determine the impact of an incident, including the number of affected users/accounts.

They do it by forensically analyzing the available artifacts, especially logs (web requests, application, database, etc.). Robust applications have both audit logs and network constraints enforced by independent systems, so attackers can usually only exfiltrate data through the same channel they used to compromise the application, leaving an audit trail behind.
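
To illustrate, here's a toy version of that reconstruction: replay the application's access log over the incident window and enumerate exactly which records the attacker's session touched. The log format and the way the attacker's session is identified are invented for this sketch:

```python
import csv, io

# A miniature application access log covering the incident window.
ACCESS_LOG = """\
timestamp,session,endpoint,user_id
2020-06-01T10:00:00,s-legit,/profile,1001
2020-06-01T10:00:05,s-evil,/export,1001
2020-06-01T10:00:06,s-evil,/export,1002
2020-06-01T10:00:07,s-evil,/export,1003
"""

# Session(s) attributed to the attacker earlier in the investigation.
ATTACKER_SESSIONS = {"s-evil"}

exfiltrated = {
    row["user_id"]
    for row in csv.DictReader(io.StringIO(ACCESS_LOG))
    if row["session"] in ATTACKER_SESSIONS
}
print(f"{len(exfiltrated)} accounts confirmed affected: {sorted(exfiltrated)}")
# -> 3 accounts confirmed affected: ['1001', '1002', '1003']
```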

In the case of serious vulnerabilities, all users are considered to be affected unless proved otherwise, but some owners, afraid of the reputational impact, "forget" this best practice and only report the number of users who are confirmed to be affected. You can spot them by reading between the lines. Regulations such as the GDPR should prevent this - should.

Also, some of the points you mention do sometimes apply, but in my experience with small and medium systems they were the exception.

Enos D'Andrea
  • Thank you Enos, it's good to get a second opinion. You've confirmed that my understanding is pretty much an accurate assessment of the reality. It's sad to think that so many systems I know of simply do not record robust trails, so in the event of a hack the chances of even forensic experts garnering any understanding are minimal. – Martin Joiner Jun 03 '20 at 08:27