42

How a ticket system works

A ticket system - one you see at festivals - works like this: when a user pays for their ticket, a row is added to the database with a column named is_scanned, whose default value is set to false.

As soon as a guard at the festival scans the barcode (containing an ID and a unique hash) with their device, a request is sent to the database to check whether:

  1. the user matching the ID and hash has paid, and
  2. the value of the column is_scanned is still set to false.

If both conditions are satisfied, the system sets is_scanned to true, to prevent anyone else with a copy of the ticket/barcode from getting in.

The vulnerability problem

The problem here is the time between the request being sent by the scanning device, and the value is_scanned being toggled from false to true.

Consider this scenario: Alice has a valid ticket which she paid for, but she lets Eve copy her barcode, and Eve changes the visible name on the forged ticket from Alice to Eve. Now there are two tickets, one valid and one fraudulent, both with the same barcode; the only difference is the name.

What if the tickets from Alice and Eve get scanned at exactly the same time as they enter the festival? The ticket system wouldn't toggle is_scanned to true in time to stop Eve from entering with the same barcode as Alice. This results in both tickets (the valid and the fraudulent one) being shown as "valid" to the guards.

Possible solutions

Of course, this kind of exploit depends on a lot of luck, and while it's possible in theory... in a real scenario, it would probably fail.

Still, how can we defeat this kind of exploit, even in theory?

Identification

I've already taken this exploit into account with the following method: when a barcode is scanned, I display not only whether the ticket is valid (satisfies the conditions stated earlier), but also the name in the database. If the name doesn't match the one on the ticket, we know the ticket has been manipulated in some way. And if the name that comes up on the scanning device doesn't match the name on the person's ID (which everyone needs to show anyway to prove their age), entry is also denied.

The only way to bypass this solution is identity fraud, and that is of course beyond the responsibility of the ticket system to check.

Delay

Another way to solve this, in theory, is to add a random delay to each request made to the database/validation API. This way, no two people could get their tickets validated at the same time, because each validation is delayed by a random number of milliseconds.

I'm not a fan of this, because it:

  1. makes everything slower at the entrance
  2. isn't effective unless the delay is large enough. If it takes 50ms for the database to update is_scanned from false to true, the delay interval would need to be at least 50ms each time.

Other solutions?

What other solutions can you think of to defeat this exploit?

O'Niel
  • 2,740
  • 3
  • 17
  • 28
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackexchange.com/rooms/96932/discussion-on-question-by-oniel-exploiting-the-delay-when-a-festival-ticket-is). – Rory Alsop Aug 01 '19 at 19:00

8 Answers

154

The vulnerability you're describing is a race condition.

There are several ways to deal with it, but I would go with a SELECT ... FOR UPDATE SQL query, which puts a lock on the selected rows to prevent new writes until the current transaction is committed.

Be sure to consult your RDBMS documentation for how to implement it correctly.
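As a hedged illustration (table and column names are made up): `SELECT ... FOR UPDATE` is PostgreSQL/MySQL syntax. SQLite, used below only so the sketch is self-contained and runnable, has no `FOR UPDATE`, but `BEGIN IMMEDIATE` takes the write lock at the start of the transaction, which serializes competing scans in the same way:

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so we control
# transactions explicitly with BEGIN/COMMIT/ROLLBACK.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE tickets (id TEXT PRIMARY KEY, is_scanned INTEGER DEFAULT 0)")
conn.execute("INSERT INTO tickets (id) VALUES ('T1')")

def scan_with_lock(conn, ticket_id):
    # PostgreSQL/MySQL would run:  SELECT ... FOR UPDATE  inside a transaction,
    # locking the selected row. SQLite lacks FOR UPDATE; BEGIN IMMEDIATE takes
    # the write lock up front, so a competing scan must wait until we finish.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT is_scanned FROM tickets WHERE id = ?", (ticket_id,)).fetchone()
        if row is None or row[0]:
            conn.execute("ROLLBACK")
            return False  # unknown ticket, or already scanned
        conn.execute("UPDATE tickets SET is_scanned = 1 WHERE id = ?", (ticket_id,))
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise

print(scan_with_lock(conn, "T1"))  # True  -- first scan wins the lock
print(scan_with_lock(conn, "T1"))  # False -- second scan sees is_scanned = 1
```

The key point is that the read and the write happen under one lock, so the race window from the question is closed.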

Benoit Esnard
  • 13,942
  • 7
  • 65
  • 65
  • Thanks a lot! I was just about to ask if there was a way to queue requests made to the same row when in the same timespan. But this is even better. ;) – O'Niel Jul 29 '19 at 12:10
  • 1
    Actually how does this solve the issue? Because you should actually prevent read instead of write. Because when Foo's ticket is scanned row X is locked for write... But when Bar's ticket gets scanned he can still read row X and it will still say valid because is_scanned is false? Doesn't matter if it can update or not? – O'Niel Jul 29 '19 at 12:19
  • Wouldn't it be better to do a manual lock like this? https://stackoverflow.com/a/16119066/6533037 ? Than I could also return false when the column quantity is less than it's default value. Or does SELECT ... FOR UPDATE exactly the same? – O'Niel Jul 29 '19 at 12:23
  • 28
    @O'Niel: a `SELECT ... FOR UPDATE` query will also block other `SELECT ... FOR UPDATE` queries! – Benoit Esnard Jul 29 '19 at 12:25
  • 2
    Locking is definitely the easiest way to prevent this issue. – Tom Jul 29 '19 at 15:00
  • You can also use optimistic locking, q.v. – David Conrad Jul 29 '19 at 22:14
  • @DavidConrad I just looked up the difference between optimistic and pessimistic concurrency control. However, which one do you think is the best suitable in this case? – O'Niel Jul 29 '19 at 22:45
  • 3
    @O'Niel In this case I would think optimistic more suitable, since the chance of actual contention is very, very low. – David Conrad Jul 29 '19 at 23:00
  • The programming language you use for your API also _likely_ offers you the possibility to create locks (Lock, Read, Update, Unlock). The pros being you keeping your locking mechanism even if you changed your data layer. – Thibault D. Jul 30 '19 at 07:52
  • 10
    Good answer for the immediate problem, but there are a lot of failure scenarios following. What if the wireless-scanner loses the connection half way through? What if a person scans the ticket two times by accident? What if a person is scanned, but cannot enter the grounds immediately because of some hold-up ? The actual problem is: **You have two transactions: One in the database and the other the real world state if the person has entered through the gate, which need to be synchronized ideally atomically** which is just not easily possible. So you will have to include a lot of error handling – Falco Jul 30 '19 at 13:00
  • Adding some JWT-like HMAC asymmetric key verification to the QR code can aid in preventing "photoshopped"/forged QR codes by signing the actual payload. A distributed Redis (or similar)-backed lock could prevent double-spending-like behaviors without slowing down your database. – Machinarius Jul 30 '19 at 15:29
  • 2
    @Tom: Even for a festival with 250,000 people and 1,000 gate scanners, I would think a single relatively modest desktop machine would be able to hold the entire data set in RAM, and process lookups, individually, in less than 0.1milliseconds each. What disadvantage would there be to simply serializing everything? – supercat Jul 30 '19 at 16:21
  • 5
    @Falco Your comment is excellent. Indeed, human factors can mitigate these events. I.e. eventually **the guard may just let you in if THEY have a fault** – usr-local-ΕΨΗΕΛΩΝ Jul 30 '19 at 21:31
  • While `SELECT ... FOR UPDATE` will certainly mitigate the race-condition, it comes at the cost of more resources (within the database). It's no big deal if you have a separate database for every event. But imagine you have hundreds or thousands of events occurring simultaneously and thousands of attendees at each event all trying to scan their tickets in a short amount of time. You need to make sure you have the resources to support your use-case. For most RDBMSs, additional locks require more memory. Maybe not much, but multiplied by hundreds of thousands of simultaneous transactions adds up. – Christopher Schultz Jul 31 '19 at 04:33
  • @ChristopherSchultz this highly depends on the database implementation. If your DB supports true row-level locking, there should be no problem in scaling at all. Many DBs support only block-level locking, in this case it can be solved with intelligent bucketing or partitioning of the relevant table. And I can't stress this enough: DBs are made for millions of rows and thousands of concurrent transactions, as long as your events don't include a relevant fraction of all humans on earth you should be fine ;-) – Falco Jul 31 '19 at 07:40
  • 1
    @ChristopherSchultz if someone "have hundreds or thousands of events occurring simultaneously" then for sure he has resources and experts to deal with it without even asking here, right? By the time company grew this large, it got all the time it needed to learn scaling. – Mołot Jul 31 '19 at 14:01
  • @Mołot Retrospective capacity planning usually does not yield satisfactory results. – Christopher Schultz Aug 02 '19 at 13:00
  • @ChristopherSchultz why retrospective? I looked for biggest ticket sellers in USA. My search returned Ticketmaster. They sell ticket to 1214 events this weekend. If they would somehow get money to expand hundreds times more, why couldn't they plan ahead? And if the growth was slow, why couldn't they switch platforms in the meantime? They would have couple hundreds more money to do this than biggest ticket sellers have now, right? And that's assuming they would, at any point, decide not to separate servers per event, per city or per other manageable chunk of workload. – Mołot Aug 02 '19 at 13:10
87

The other answer here is absolutely right, and locking makes sense for larger systems where things aren't as simple.

With data as simple as yours, you could go for a non-blocking option:

UPDATE [FESTIVAL_TICKET] 
  SET IS_SCANNED = TRUE
WHERE TICKET_ID = @ScannedKey 
  AND IS_SCANNED = FALSE

Now, this is an atomic operation. No two users of the database can both issue this and have it update the row. The client that gets "1 row affected" back (there is, of course, a way to read the affected-row count in code; do not parse the message text for this) can let the person in. Everybody else will get zero rows affected by the statement. If you want to be user-friendly, you can then check why the row wasn't matched: wrong ID, or already scanned.

But the important detail is that the statement is atomic. Only one scan will win, no matter how close to zero the time difference is, because you no longer have a read followed by a write; you have the read and the write in one atomic operation.
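Under illustrative names, the pattern can be exercised end to end; the sketch below uses SQLite via Python only so it is self-contained and runnable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE festival_ticket (ticket_id TEXT PRIMARY KEY, is_scanned INTEGER DEFAULT 0)")
conn.execute("INSERT INTO festival_ticket (ticket_id) VALUES ('T1')")
conn.commit()

def scan(conn, ticket_id):
    # Single atomic statement: the WHERE clause does the check and the SET
    # does the write, so there is no window between read and write.
    cur = conn.execute(
        "UPDATE festival_ticket SET is_scanned = 1 "
        "WHERE ticket_id = ? AND is_scanned = 0", (ticket_id,))
    conn.commit()
    return cur.rowcount == 1  # exactly one row affected => this scan won

print(scan(conn, "T1"))  # True  -- first scan of the barcode
print(scan(conn, "T1"))  # False -- duplicate barcode, zero rows affected
```

Note the affected-row count is read programmatically (`cur.rowcount`), not parsed from any message text.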

shellster
  • 568
  • 3
  • 5
nvoigt
  • 1,092
  • 4
  • 10
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackexchange.com/rooms/96933/discussion-on-answer-by-nvoigt-exploiting-the-delay-when-a-festival-ticket-is-sc). – Rory Alsop Aug 01 '19 at 19:01
15

The downside to this seems to be that a person might get in for free (with a copied ticket).

For small events that is probably correct.
However, if you add too much delay, for any reason, you risk more than one person getting in.
The ticket scanners will just let a few extra people through if their devices jam or are too slow, because, heck, most of them probably have valid tickets, right?

I watched this happen at a major event this calendar year attended by thousands of fans of musicians that many people have heard of.
The ticket company was a major one (maybe the one you work for?) and it was at a site that was custom built for ticket taking.
My party was one of the ones let through without a scan (and yes... I had a valid/legal ticket).
I had to stand there and watch the ticket takers a few minutes before I could figure out why it had happened.

TL;DR:
Don't code for every eventuality when people are involved.
Shoot for two or three nines (99%-99.9%) and call it a day.
Save the nitpicking for when only machines are involved; that's when you can get a bunch of 9's.

  • 2
    Worst case scenario isn't even that, it's that your security guards do act robotically and tens or hundreds of people with legit tickets get locked out. Your story would be a lot different if that had happened, not just "a funny thing happened to me once". – user3067860 Jul 31 '19 at 15:44
12

This question already has a great and cheap answer; however, I would add my own from both a software engineering and a security point of view. It may prove helpful for future similar questions about unlikely exploits.

and while it's possible in theory... In a real scenario, this would probably fail.

And then, what is the potential damage compared to the cost? I am going to argue that spending effort and headaches on additional security is not worth it here.

Now, the solution already proposed and accepted, properly handling the race condition with SQL transactions and shifting the responsibility/cost to the database, is the best, industry-standard and cheapest option. That could be the end of the case, as the answer was indeed accepted.

As pointed out already, the odds that both attendees are scanned at the very same moment and trigger the race condition could be estimated at one in millions, if not billions. For a qualitative idea of such odds, look at lottery statistics: winning the SuperEnalotto might be an easy game compared to scanning two tickets simultaneously, and its reward is definitely more substantial. The odds represent the exploitability of a vulnerability, which is normally qualified into discrete levels ([very] unlikely, [very] likely). I always compare non-deterministic security-related events to lotteries to provide a more familiar frame of reference.

To make additional clarification, the odds are influenced by:

  • The ability of the two to synchronize their movement across the queues and hand their tickets to the guards at the same time. This implies the two are in constant communication and have trained themselves, not to mention the luck (odds!) that their lines move at predictable speeds
  • The physical movements of the guards. Not all guards take exactly the same number of milliseconds to scan a ticket; they move their arms at different speeds. One ticket may fall from a guard's hands, one guard may hold a ticket upside down, one may turn around to check that the line isn't jammed behind them. In other words, there is too much entropy to plan an attack
    • Unmanned ticket-checking machines might not be affected by this factor
  • The time it takes the computer system to scan a ticket: the two scans must fall in the same time slot for the vulnerability to be exploited.

So here is the software engineering consideration.

A ticket has a price, so it is worth $X; I estimate X to be on the order of 50-100. For each person exploiting the vulnerability and entering the facility fraudulently, a loss of $X applies.

Implementing more complex checks (e.g. passport name control) is expensive both in terms of the code required during development and the time it takes for people to enter the facility. Security guards are paid hourly. Implementing Ben-Gurion-style security scrutiny* is much more expensive and painful for the honest.

Now, you want to sleep better, assured that nobody can exploit your system. How much does that cost? After paying an exorbitant amount of money to secure your system, you might discover that your competitor, running an "unprotected" system, is tolerating an $80 loss at odds of one in millions. It's hard to quantify that probability. Since you have better odds of winning the hardest lottery in the world, you might as well quit your job and play!

Conclusion: in our profession, winning odds are our best sleeping partner!

Conclusion 2: race condition attacks can likely be exploited on automated network systems, where attackers can obviously synchronize themselves to the microsecond! That can also multiply the damage, so the best security measures are welcome there!

Conclusion 3: if the system is already running, the effort of patching it with the accepted answer (design, development, test, UAT, rollout, PMO...) is more expensive than the potential damage. Please comment below.

*I cited that as an example because airport security lines in Israel are legendarily long and thorough

usr-local-ΕΨΗΕΛΩΝ
  • 5,310
  • 2
  • 17
  • 35
  • 2
    This reads like a lawyer turned Project Manager justifying not doing the job to 100% ... As you're using the example in the post, I'll too use the example and give you the cost of fixing the issue is less than a penny or however long it takes for your developer to type @nvoigt's code. If a bad actor discovers this vulnerability exists they WILL find a way to exploit it.Let the person behind them go if their queue is going faster, shuffle around in the bag to find the ticket while the other person gets in position to sync, and also loved "...might not be affected" that was the cherry inthispost – Иво Недев Jul 31 '19 at 08:16
  • I truly expected a comment like yours, sir, and I am happy to expand. Rather than suggesting **not** to do the job, I normally suggest to put that job into the backlog, but obviously still do it. That makes a lot of difference. Unless you have a perfect CD (Continuous Delivery) and cloud-self-update, there is a cost for doing UAT and prod rollout. Especially for ticket machines, rolling the software out probably means updating firmware by the wire. If you push a lot of change, the cost is blended among all the changes. – usr-local-ΕΨΗΕΛΩΝ Jul 31 '19 at 09:10
  • 1
    Also, one thing that project managers (vs coders) normally see is that while it may seem that the cost of typing some code is trivial, there are always costs and risks associated to software changes. It would take me very long to explain, but consider the following true story happened to me: updating Hibernate 3.5.x to 3.5.y after a Black Duck scan reported a vulnerability (eventually unexploitable) broke a production application that was not fully re-tested. It took me 3 seconds to change the Hibernate versions and days to the team to handle the disaster – usr-local-ΕΨΗΕΛΩΝ Jul 31 '19 at 09:14
  • Your comments assume the only update would be the vulnerability handling, while OPs question reads as if he's still in the middle of development. ANd in the last comment and I correct to understand that you changed the software version in prod and only changed the software version, and that broke prod ? – Иво Недев Jul 31 '19 at 09:21
  • It was not clear to me that the software is still under development. Good time to push the change to backlog. Yes, I am saying that I only changed the Hibernate version in a framework project, re-tested the framework, had other developers pull the new framework version and its Hibernate into their project, not do complete re-test, and discover that an HQL was using an illegal syntax **which was tolerated** by the earlier version. This happened many years ago so I don't have full details handy (and I don't know if I can share if I had them) – usr-local-ΕΨΗΕΛΩΝ Jul 31 '19 at 09:26
  • Conclusion 3 still applies IMO. **If** your system is running and you have found a vulnerability, you shall **triage** it first and not **rush** to fix it if you determine that the odds and the damage are ridiculously low. High impact/exploitability vulnerabilities should *still* be patched ASAP, that's out of discussion – usr-local-ΕΨΗΕΛΩΝ Jul 31 '19 at 09:28
  • The race condition bug might be more severe if combined with a different bug. E.g. if there is a bug that severely reduces response time of the system, than the completely unlikely event of a race condition occurring might be manipulated into being very, very likely. According to James Reason's Swiss cheese model: Don't wait until the slices of cheese align and prove your "the effort of patching it [..] is more expensive that the potential damage"-assumption wrong: Fix the bug! – yankee Jul 31 '19 at 18:59
  • 1
    Without endorsing the main point, your post would likely better illustrate your point with a comparison to physical security, such as the possibility of someone hopping a fence, climbing in through a window, or letting their buddy in through some side door. Or maybe with a comparison to other risks, like the guards/venue losing connection to a central database. Fixing bugs and reducing vulnerabilities is good, but I'd rather see time spent on e.g. contingency planning for loss of network connection to a central database, if that hasn't been done already. – WBT Aug 06 '19 at 13:38
4

There are already good answers here that cover much of the database side of the exploit. But I wanted to add my real-life experience, having worked in the event (open-air festival) field designing ticket validation systems and applications.

One of the big challenges is network stability, because the assumption that all scanning devices have network connectivity all the time is quite wrong. There can be delays, interruptions or outages at any time during the scanning process, and that should not delay customer entry to the event (at least from our point of view; other events may require stricter validation).

In our application, tickets were validated using a signature, but they were synced and committed to the database only when the network was up. The application stored the validated/to-be-committed tickets in a bucket and tried to commit as many tickets as possible once the network was available. This also avoided doing one INSERT per ticket.

At the farthest end of the event entry, the wifi didn't reach at all. To save the cost of another router for just 10 more meters of coverage, the scanning devices could communicate among themselves and share their connections. This meant that if only one device had access to the wifi, the others could theoretically send their buckets to it, or to the closest device in range, which would forward them.

Real life showed us that most of the scanning devices lost their connection at least once a minute.

So, theoretically, one ticket could get scanned by as many disconnected devices as you like, but only once per device. This is a race condition too, but far more trivial to exploit than what the other answers mention.


A word on the "How to prevent it?" question:

You can prevent a race condition by removing the parallelism, which is what a lock does. It will introduce latency, since you are basically reducing your database's capacity to accept concurrent writes.

The question then becomes: is it worth it? Can we accept more delay before validation in exchange for the assurance of correct authentication? Will this prevent people from finding an open door or jumping a fence?
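The bucket-and-sync approach described above can be sketched roughly like this (class, column and signature names are my own illustration, not the real system):

```python
import sqlite3

class OfflineScanner:
    """Sketch of the bucket-and-sync idea: validate locally by signature,
    buffer scans while offline, commit in one batch when the network is up."""
    def __init__(self, valid_signatures):
        self.valid = valid_signatures  # e.g. shipped to the device beforehand
        self.bucket = []               # validated scans awaiting commit
        self.seen = set()              # enforces "only once per device"

    def scan(self, ticket_id, signature):
        if signature not in self.valid or ticket_id in self.seen:
            return False
        self.seen.add(ticket_id)
        self.bucket.append(ticket_id)
        return True

    def sync(self, conn):
        # One batched INSERT instead of one INSERT per ticket.
        conn.executemany("INSERT INTO scans (ticket_id) VALUES (?)",
                         [(t,) for t in self.bucket])
        conn.commit()
        self.bucket.clear()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (ticket_id TEXT)")

dev = OfflineScanner(valid_signatures={"sig-abc"})
print(dev.scan("T1", "sig-abc"))  # True  -- accepted while offline
print(dev.scan("T1", "sig-abc"))  # False -- only once per device
dev.sync(conn)                    # network came back: batch commit
```

The race window is exactly what the text describes: until `sync` runs, another disconnected device has no way to know "T1" was already used.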

Cyrbil
  • 173
  • 7
2

A good solution could be using a message queue instead of adding delays. A scanned ticket doesn't get processed immediately; it waits until all tickets sent before it are processed, and the system doesn't return a response until the ticket has left the queue and been processed correctly. One argument against this is that it could be slow, because everyone has to wait for everyone else to finish, but you can logically shard the ticket IDs: for example, two queues, one for odd-numbered tickets and one for even-numbered ones, or simply put group numbers in the ID itself.
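A minimal sketch of the sharded-queue idea (names and the shard count are illustrative; each shard is drained by a single worker thread, so two scans of the same ticket always land in the same queue and are processed one after the other):

```python
import queue
import threading

NUM_SHARDS = 2  # e.g. odd- and even-numbered tickets
shards = [queue.Queue() for _ in range(NUM_SHARDS)]
scanned = [set() for _ in range(NUM_SHARDS)]  # per-shard state, single writer each

def worker(shard_id):
    while True:
        item = shards[shard_id].get()
        if item is None:  # shutdown sentinel
            break
        ticket_id, reply = item
        # No lock needed: only this thread ever touches scanned[shard_id].
        ok = ticket_id not in scanned[shard_id]
        scanned[shard_id].add(ticket_id)
        reply.put(ok)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_SHARDS)]
for t in threads:
    t.start()

def scan(ticket_id):
    reply = queue.Queue()
    shards[ticket_id % NUM_SHARDS].put((ticket_id, reply))
    return reply.get()  # block until the shard worker has processed the scan

print(scan(41))  # True  -- first scan of ticket 41 (odd shard)
print(scan(41))  # False -- duplicate, serialized behind the first
print(scan(42))  # True  -- different ticket, even shard

for q in shards:
    q.put(None)
for t in threads:
    t.join()
```

Because same-ticket scans can never be processed concurrently, the race condition disappears without a database lock; the trade-off is the queueing latency discussed above.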

paolord
  • 121
  • 2
  • For an event with 1,000,000 tickets, even a modest desktop PC could run a program that built a list of scanned tickets in memory and checked for duplicates, at a rate of over a million tickets per second. Even a fairly simplistic Javascript program in a web browser can come close to that speed (I just generated a million random tickets in my browser and counted up 999,870 unique ones in under 1.2 seconds). Even if there were ten thousand gates with people scanning tickets, the worst-case delay would be a fraction of a second. – supercat Jul 31 '19 at 06:21
  • @supercat The gates need to communicate with each other. Tickets aren't usually reserved for a given gate. Your simplistic javascript program would allow ten thousand copies of the same ticket to be accepted. As usual, it's not the CPU that's the limiting factor - it's the IO. Mind, the ticketing systems I worked on had to be tolerant to connection failures, so they always had to err on the side of "rather let people without tickets in than keep people with tickets out". – Luaan Jul 31 '19 at 10:32
  • @Luaan: No, I'm saying all gates funnel send requests through one PC. If a PC takes less than two microseconds to process each request (a level of performance easily achievable in straightforward Javascript in a browser), or even if it takes ten times that, then even if requests were received from all 10,000 gates at once, servicing all all of those requests would take less than a quarter second. – supercat Jul 31 '19 at 14:43
-1

This is a typical problem caused by a mismatch between the design of your accounting system ("accounting" here being tracking the use of any asset or item, not just money) and the real world information you want to track.

You have explained that in the real world it's clearly possible for a code to be scanned twice, and you wish to track this situation, yet your database stores only "it has been scanned." This isn't incorrect as far as it goes: SET is_scanned = TRUE WHERE ticket_id = ? will indeed be correct (if in the real world the state was "ticket_id ? has been scanned" before, that is still true now), but it's not actually what you want to record.

Instead of keeping a state, keep a ledger: a table scan with heading ticket_id, scanned_timestamp, scanner_id, with the key being all three of these fields.¹ You are thus recording not just that a ticket has been scanned, but each scan of the ticket. If there are no rows containing ticket_id = ?, you know that ticket has never been scanned, otherwise you have a list of all of the times it has been scanned (and where).

Now that your database is capable of better modeling the real-world information, you can decide what you want to do about this. Your particular DBMS may offer some way around this race condition (INSERT ticket_id, scanner_id INTO scans WHERE ticket_id NOT IN scans or whatever²), in which case you definitely prevent two entries for the same ticket_id. Alternatively, you could simply let both people in and later analyze your records to figure out the reason you appeared to have two tickets with the same ticket_id.
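A minimal sketch of the ledger idea (schema names are illustrative, and the timestamp is passed in explicitly so the example is deterministic); note there is deliberately no unique key on ticket_id, so every scan attempt is recorded:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Ledger: one row per scan event, intentionally WITHOUT a unique constraint
# on ticket_id, so duplicates are captured rather than rejected.
conn.execute("""CREATE TABLE scans (
    ticket_id TEXT,
    scanned_at TEXT,
    scanner_id TEXT)""")

def record_scan(conn, ticket_id, scanned_at, scanner_id):
    # Append-only: record the event, then report how many scans exist now.
    conn.execute("INSERT INTO scans VALUES (?, ?, ?)",
                 (ticket_id, scanned_at, scanner_id))
    conn.commit()
    n = conn.execute("SELECT COUNT(*) FROM scans WHERE ticket_id = ?",
                     (ticket_id,)).fetchone()[0]
    return n  # 1 => first scan; >1 => duplicate worth investigating

print(record_scan(conn, "T1", "19:00:01", "gate-3"))  # 1
print(record_scan(conn, "T1", "19:00:02", "gate-7"))  # 2 -- same barcode again
```

Whether a count greater than one means "deny entry" or "let them in and investigate later" is then the business decision described below, not something the schema forces on you.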

Which way you want to go is actually a business decision, not a technical one. There are plenty of reasons that you might incorrectly issue two valid tickets with the same ticket_id (say, a bug related to attempted returns) or a ticket needs to be used twice by the same person (forgetting to scan at an exit). If the number of people attempting to cheat the system is lower than a certain level, it may be better for your business to eat the cost of a few cheaters than pay the cost (in reputation, lawsuits or otherwise) of denying a valid ticket holder.

In short: match your database design to the actual real world situation, using ledgers rather than states where that applies, and let the business make the business decisions about how to deal with these real-world situations.


¹I'm arm-waving over a lot of the details here to keep the answer short and express just the general idea.
²I assume here an appropriate consistency strategy that has the same effect as serializing these INSERT requests to this table.

cjs
  • 339
  • 1
  • 6
  • 1
    The OP is not asking to "track" this problem, but to prevent it. You have "arm waved" over the mechanism to prevent the race condition, which would be the answer. Your new table idea has the same race condition problem. – schroeder Jul 31 '19 at 07:46
  • This `INSERT`strategy is also prone to race condition, adding a unique key on `scans(ticket_id)` should be enough to fix it, though. – Benoit Esnard Jul 31 '19 at 07:47
  • @BenoitEsnard You must _absolutely not_ add a unique key on `scans(ticket_id)`; that is the whole point of my answer. And no, there is no race condition if you use the right consistency model. I've updated the answer to try to explain both of these. – cjs Jul 31 '19 at 08:12
  • @schroeder My point is that the question is only valid for certain business conditions, and the poster should make it clear that he's considered and explicitly made that business decision before addressing this problem, otherwise it doesn't exist. Typically most people I find asking this question have not done so. – cjs Jul 31 '19 at 08:14
  • 1
    I agree with other commenters that this strategy still has a race condition; there is no "right consistency model" discussed in the answer that prevents this, other than the single sentence beginning "Your particular DBMS may...". If you are both preventing _and_ logging, you can use one of the other answers, and create an "attempted scans" log which can be analysed later to see how many people were turned away. So the technical detail in this answer is a bit of a distraction: it's actually a _frame challenge_ of the question, asking if _any_ solution is actually necessary. – IMSoP Aug 01 '19 at 12:27
-4

In Clojure there is a feature called atoms that takes care of such issues.

Two threads come in. They both read the value is_scanned = false and proceed. After processing, they both try to set is_scanned = true to update the atom. One thread, however, will arrive a moment sooner.

The thread that comes in a tiny bit later will find the atom changed, so its function will not be able to update the atom. Instead it will be told to rerun itself with the now-updated value is_scanned = true, and the result will be that the other ticket is reported as no longer valid.
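Outside Clojure, the same compare-and-swap retry semantics can be sketched as follows; this is only an illustration (a lock stands in for the hardware CAS instruction, and Python's GIL alone would not make the pattern safe):

```python
import threading

class Atom:
    """Minimal atom: compare_and_set succeeds only if the value is still the
    one the caller read, mimicking compare-and-swap semantics."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # models the atomic CAS instruction

    def deref(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def swap(self, fn):
        # Retry loop, like Clojure's swap!: rerun fn on the fresh value
        # whenever someone else changed the atom underneath us.
        while True:
            old = self.deref()
            new = fn(old)
            if self.compare_and_set(old, new):
                return new

is_scanned = Atom(False)

def try_scan(atom):
    # Only the caller whose CAS succeeds on False -> True admits its ticket.
    return atom.compare_and_set(False, True)

print(try_scan(is_scanned))  # True  -- first scan flips the atom
print(try_scan(is_scanned))  # False -- atom already True, the rerun sees it
```

As the comments under this answer point out, an in-memory construct like this only works on a single node; a distributed deployment still needs the database to arbitrate.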

Here's a video explaining this feature of clojure.

Džuris
  • 269
  • 1
  • 7
  • 9
    I'm not really sure how this helps the person asking this question specifically, as they strongly implied they're using a relational database and want to put their solution there... Nor do I see how this helps developers in general to make a system more secure, as race conditions aren't exclusive to Clojure (or any one language), and Clojure doesn't hold a monopoly on how to handle race conditions. – Ghedipunk Jul 29 '19 at 21:19
  • 11
    Also, even if your software is written in Clojure, what if you have so many requests that you decide to scale your application horizontally, running multiple nodes? No solution in memory is going to work; you need to solve it in the database. – David Conrad Jul 29 '19 at 22:16