Fraud detection to avoid fake users

Question

I know there is no 100% solution for fraud detection, but at least I want to set some level of confidence in this use case.

Suppose that I have a system where:

A user needs to register
A user can only post one review per company product

I use four steps to prevent fraud (e.g. multiple posts from fake accounts):

Step 1 - Each account is allowed only one vote per product.

A malicious user needs to create multiple accounts to fake reviews.

He needs a new email address to create each new account.

He has to complete the Google reCAPTCHA.

Step 2 - The review is invalid if this condition is true:

the user IP + user browser fingerprint combination already exists for the chosen company product (the IP is included to avoid browser fingerprint collisions)

Step 3 - The review is invalid if this condition is true:

the same user IP was used for the chosen company product in the last two days

Step 4 - If none of the checks above trigger, the user can post a review:

In the moderation area, if the same product has a duplicated fingerprint, the review is marked as potentially fake and needs to be approved manually
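For illustration, steps 2 to 4 could be sketched roughly as below. This is only a minimal sketch, not the actual implementation: the `Review` record, the `check_review` function, the in-memory list and the verdict strings are hypothetical names, and a real system would query a database instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

IP_COOLDOWN = timedelta(days=2)  # step 3 window, taken from the question


@dataclass(frozen=True)
class Review:
    product_id: str
    ip: str
    fingerprint: str
    posted_at: datetime


def check_review(new: Review, existing: list[Review]) -> str:
    """Return 'reject', 'moderate' or 'accept' (hypothetical verdicts)."""
    same_product = [r for r in existing if r.product_id == new.product_id]

    # Step 2: this IP + fingerprint pair already reviewed the product.
    if any(r.ip == new.ip and r.fingerprint == new.fingerprint
           for r in same_product):
        return "reject"

    # Step 3: the same IP reviewed the product within the last two days.
    if any(r.ip == new.ip and new.posted_at - r.posted_at < IP_COOLDOWN
           for r in same_product):
        return "reject"

    # Step 4: fingerprint already seen for this product (from another IP),
    # so the review is held for moderation instead of being rejected.
    if any(r.fingerprint == new.fingerprint for r in same_product):
        return "moderate"

    return "accept"
```

Note that in this sketch step 2 never expires, while step 3 only blocks a shared IP for two days.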

Is there a simple way to improve this mechanism of fraud detection?

Any advice about fingerprint collisions?

So what happens if you block the IP of an entire organization or company with tens of thousands of employees? — Mark Buffalo, Feb 11 '16 at 11:40
@MarkBuffalo It would be a big coincidence for multiple users at the same workplace to post a review of the same company product within two days of each other (step 3). By the way, I will not block the IP; I only allow the review after two days. — user455318, Feb 11 '16 at 11:44
If someone were to truly have a vendetta against your company's products, what would keep them from getting on Tor or a VPN and making reviews? — Lutefisk, Feb 11 '16 at 12:30
@Lutefisk If you want to go to extremes, you can block all Tor endpoints and known VPNs. — BlueWizard, Feb 11 '16 at 12:35
@Lutefisk Blocking Tor nodes isn't a big deal: https://github.com/shemminga/small-hacks/blob/master/blocktor/blocktor . By the way, Tor or a VPN will not change the browser fingerprint. — user455318, Feb 11 '16 at 13:00
Wouldn't it though? I mean it could possibly change your user agent, plugin details, time zone... — Lutefisk, Feb 11 '16 at 13:19
@Lutefisk Agreed. However, only technical users know what a user agent is or would change the fingerprint intentionally. The idea here is just basic fraud detection, not a bulletproof solution. — user455318, Feb 11 '16 at 14:32
@JonasDralle Setting up a VPN is so easy. No need to use a public VPN. — ferit, Mar 23 '16 at 21:12
Being anonymous on the internet is also easy. You just need to go to an internet cafe or another public place where the IP address is not yours. — BlueWizard, Mar 24 '16 at 05:03

AdHominem · Answer 1 · 2016-02-11T14:44:15.000

2

Edit: If your website is about crowdsourcing, you probably should have mentioned that in the question right away because that's a very specific topic and different from online vendor scenarios.

Yet the solution remains the same. The simplest way would be to hold any verified activities for each user in a database and thus verify if a user really has accomplished something. That's how shops like Amazon and crowdsourcing projects like Wikipedia do it and it's not really hard to implement either, especially when all users need to register anyways.

You will always have data such as member ID or project ID which can be mapped 1:1 to an existing person. I can't even imagine a company where the owner doesn't even keep track of the participants of his business.

edited Feb 11 '16 at 14:44

answered Feb 11 '16 at 11:43

AdHominem

No way, I can't do that. I can't track this type of information. But yes, it's the ideal scenario. – user455318 Feb 11 '16 at 11:45
Pushes fake reviews to a higher cost level, though. – Deer Hunter Feb 11 '16 at 11:48
The project is not a store and doesn't even sell anything. The project is about crowdsourcing feedback. – user455318 Feb 11 '16 at 14:36

score 1 · Answer 2 · answered Feb 12 '16 at 03:21

I would suggest making users go through "something I have" two-factor authentication. One example is to send a text message to verify the user upon registration or posting.

Or limit a customer's ability to post while the account is new by imposing a "wait time".

You could go further and require geo-location on a mobile app that checks if the person is within the area that the review is being placed.

The second step would be to limit the number of reviews per hour/minute for each company/product.

Again, none of this is particularly easy to implement, and it puts more of a burden on the customer.
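As a rough sketch of the "wait time" and the per-product rate limit, something like the following could work. The thresholds (24 hours, 5 reviews per hour), the `may_post` function and the in-memory store are all assumptions made for illustration; real values would depend on the site's traffic.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

MIN_ACCOUNT_AGE = timedelta(hours=24)  # assumed "wait time" for new accounts
MAX_REVIEWS_PER_WINDOW = 5             # assumed cap per product
WINDOW = timedelta(hours=1)            # assumed sliding-window length

# product_id -> timestamps of recently accepted reviews (oldest first)
_recent: dict[str, deque] = defaultdict(deque)


def may_post(account_created: datetime, product_id: str, now: datetime) -> bool:
    """Return True if a review may be posted now under both limits."""
    # Wait time: accounts younger than MIN_ACCOUNT_AGE cannot post yet.
    if now - account_created < MIN_ACCOUNT_AGE:
        return False

    # Sliding window: drop timestamps that have aged out of the window.
    stamps = _recent[product_id]
    while stamps and now - stamps[0] >= WINDOW:
        stamps.popleft()

    # Rate limit: too many accepted reviews for this product recently.
    if len(stamps) >= MAX_REVIEWS_PER_WINDOW:
        return False

    stamps.append(now)
    return True
```

The sketch assumes calls arrive in chronological order; a multi-process deployment would need a shared store instead of a module-level dictionary.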

"Wait time" seems interesting! It's not too invasive for a normal user and probably tedious for a malicious user who will create multiple accounts. — user455318, Feb 12 '16 at 09:42

score 0 · Answer 3 · answered Feb 11 '16 at 12:57

Kudos for trying to present honest reviews - a somewhat unusual take when most business models involving user-submitted reviews tend not to favour such an approach.

Regarding what you are doing currently....

You've only told us part of the process here. One common feature of many online sites is that they insist on establishing trust up front before allowing a user to do anything. I find this particularly annoying. I don't want to provide my postal address, my age and shoe size before I'm allowed to share my experience of a product, good or bad. Indeed I'm only likely to jump through such hoops if I really want to post a vitriolic review. Hence I would suggest that any part of the process directly involving the user take place after the review is logged (but before it is displayed). Sending an email with a confirmation link is an easy, minimally obtrusive way to do that.
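One way to build such a confirmation link is to sign the review ID and email address with a server-side secret, so the link cannot be forged. The sketch below is only an assumption about how this might be done: `SECRET_KEY`, `make_token` and `confirm` are hypothetical names, and a real deployment would also want an expiry time on the token.

```python
import hashlib
import hmac

# Placeholder value; a real secret must be random and kept on the server.
SECRET_KEY = b"replace-with-a-random-server-side-secret"


def make_token(review_id: int, email: str) -> str:
    """Sign the review ID and email; the hex digest goes into the emailed link."""
    message = f"{review_id}:{email}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def confirm(review_id: int, email: str, token: str) -> bool:
    """Return True if the token matches; only then is the review displayed."""
    expected = make_token(review_id, email)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, token)
```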

Using browser fingerprinting both on submission of the review and on validation will give a good indication of whether the same browser submitted the review and confirmed it, but bear in mind that people can have multiple devices.

user IP + user browser fingerprint combination already exist

IP addresses for clients are rarely static. Even using a subnet is not all that effective. The ASN (or its associated ORG record) will give you less granular, but much more consistent and accurate, results. Using the ASN also simplifies the process of identifying the locality of the client address; you may wish to restrict the reviews to the countries where the product is available. There are organizations (in my experience there is an unusual density in the Philippines and Eastern Europe) who will carry out black-ops marketing.
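To make the ASN idea concrete, here is a minimal sketch that groups client addresses by ASN using a longest-prefix match. The prefix table is entirely made up (it uses address ranges and AS numbers reserved for documentation); in practice the mapping would come from a routing table dump or a commercial IP database, and `asn_for` and `same_origin` are hypothetical helper names.

```python
import ipaddress
from typing import Optional

# Hypothetical prefix -> ASN table built from documentation-only values.
PREFIX_TO_ASN = {
    ipaddress.ip_network("192.0.2.0/24"): 64496,
    ipaddress.ip_network("198.51.100.0/24"): 64497,
    ipaddress.ip_network("203.0.113.0/24"): 64496,
}


def asn_for(ip: str) -> Optional[int]:
    """Return the ASN of the longest matching prefix, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in PREFIX_TO_ASN if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return PREFIX_TO_ASN[best]


def same_origin(ip_a: str, ip_b: str) -> bool:
    """Treat two addresses as the same origin when they share a known ASN."""
    asn_a, asn_b = asn_for(ip_a), asn_for(ip_b)
    return asn_a is not None and asn_a == asn_b
```

A linear scan over the table is fine for a sketch; a real table with hundreds of thousands of prefixes would need a trie or a prebuilt lookup library.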

Any advice about fingerprint collisions?

Try to avoid them?

For some reason, companies offering these services are somewhat reticent about publishing stats on the uniqueness of their solutions, and there seems to be very little literature (beyond the original Panopticlick study) comparing methodologies.

You might find some useful pointers in the blog post here. I'd previously reported 1056 unique hashes from 1160 different devices using some of those methods. I've since updated my methodology to include canvas fingerprinting, which ramps up the uniqueness a lot.

Sooner or later you'll find that you have lots of data for each review: multiple IP addresses, cookies, fingerprints (and potentially vocabularies, writing style and others). You may find it's more appropriate to associate a weighting with each of the different flags detected, in a similar way to how SpamAssassin works.
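A weighted-flag approach in that spirit might look like the sketch below. The flag names, weights and threshold are invented for illustration only and would need tuning against real, labelled reviews; `suspicion_score` and `needs_moderation` are hypothetical names.

```python
# Hypothetical flags and weights; real values need tuning on labelled data.
WEIGHTS = {
    "duplicate_fingerprint": 2.5,
    "same_ip_recent": 2.0,
    "same_asn_burst": 1.0,
    "new_account": 0.5,
    "email_confirmed": -1.5,  # negative weight: evidence of a genuine user
}
THRESHOLD = 3.0  # assumed cut-off at which a review is sent to moderation


def suspicion_score(flags: set[str]) -> float:
    """Sum the weights of all flags raised for a review."""
    return sum(WEIGHTS[flag] for flag in flags)


def needs_moderation(flags: set[str]) -> bool:
    """Return True when the combined score reaches the threshold."""
    return suspicion_score(flags) >= THRESHOLD
```

The advantage over hard rules is that no single weak signal rejects a review on its own, and positive signals can offset negative ones.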

symcbean

Using ASN to "identify" users is like using nationality to identify someone: it's not going to be very selective. Browser fingerprinting is useless for preventing people from re-registering: either private browsing makes it easy to evade, or you're going to lock out a whole class of users. Overall, your suggestions are going to create an awful lot of false positives while adding a lot of complexity: not really a good tradeoff. – Stephane Feb 11 '16 at 16:06
Stephane, you really should do some research before making comments like this. Go read the Panopticlick paper (they're not using ASN numbers, but they do use timezone which is even less effective). While you're at it, you might try visiting the site a few times with "private browsing" enabled - the whole point of browser fingerprinting is that it bypasses the mechanisms in private browsing to get a consistent identifier for your device. – symcbean Feb 11 '16 at 17:31
However, fingerprinting will not work across multiple browsers, only with evercookie. @symcbean augur.io will work too, but that's incredibly expensive: $500 a month. – user455318 Feb 11 '16 at 17:42
I wouldn't describe $500/month as incredibly expensive, though it may be in the context of your project. It was your suggestion to use fingerprinting. Evercookies are not completely foolproof and pose legal issues in Europe; on the other hand, it's relatively easy to detect and block access for someone going out of their way to defeat active fingerprinting. BTW, Augur is using fairly ordinary active fingerprinting; there are a few open source projects offering comparable functionality without the cloud database cross-reference. – symcbean Feb 11 '16 at 23:50
@symcbean yes, like fingerprintjs2, but the appeal of Augur is that the tool will generate the same hash across multiple browsers. fingerprint js will fail completely at this step. – user455318 Feb 12 '16 at 09:38
This is pointless: the proposed mechanism (beyond being incredibly intrusive) only works with honest users who keep using the same browser. It does NOT fit the context of the original request. In fact, only the advertisement industry could be (seriously) interested in such a system. – Stephane Feb 12 '16 at 10:05
@Stephane: clearly you know more about this than myself, 41st Parameter (now Experian), Threatmetrix, Blue Cava, EFF and Augur. Perhaps you would like to volunteer a solution to the problem? – symcbean Feb 12 '16 at 13:34
@user455138: Augur (and 41st Parameter, and Threatmetrix) only do 2 things that fingerprint.js doesn't. They have a database of identities linked to accounts at different providers and they also use clock skew to differentiate between appliance type devices. Device identification is all about trading off specificity against volatility - since clock skew is highly volatile, it needs to be rounded off massively to give consistent results within a fingerprint (or kept separate from the fingerprint and fuzzy matching used). – symcbean Feb 12 '16 at 13:39
@symcbean thx for the explanation. When you said: database of identities linked to accounts at different providers, What identities and providers are you talking about? just out of curiosity, nothing more. – user455318 Feb 12 '16 at 17:16
That's really a whole new question. – symcbean Feb 13 '16 at 23:46

score 0 · Answer 4 · answered Mar 23 '16 at 20:38

The malicious user can just change his browser fingerprint. At the very least, he can have two or more different browsers installed on his machine. In addition, what about multiple users who share the same IP address (e.g. behind NAT) or a shared home computer?

I would recommend that you look at Opinion Spam Detection: Detecting Fake Reviews and Reviewers

I think this will help you to understand how spam reviewers think. In general, I would try to calculate a score for each review.

Fake reviews created by the same spam reviewer will share some patterns, e.g. the time of day the review was submitted, typing speed (how long he took to type the review), the reviewer's geolocation, language style and typos, etc.
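As one very simple stand-in for the "language style" signal, the text of two reviews can be compared with Jaccard similarity over word bigrams. This is a swapped-in illustrative technique, not something taken from the referenced work, and the function names and the bigram size are assumptions.

```python
import re


def shingles(text: str, size: int = 2) -> set[tuple[str, ...]]:
    """Return the set of lowercased word n-grams found in a review."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two reviews' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 0.0  # nothing to compare, so treat as dissimilar
    return len(sa & sb) / len(sa | sb)
```

A high similarity between reviews from supposedly different accounts could then feed into the per-review score as one more signal, alongside timing and geolocation.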

Ubaidah

There are infinite ways, most of them inconclusive or really complex, like precise geolocation, typos, and so on. If we implement everything we read about, we will have tons of validations and problems with the algorithm's evaluation speed, and after all that, the algorithm will still fail or produce false positives. I am not sure about the complexity vs. improvement curve. By the way: lexical features, content and style similarity, or semantic inconsistency. How do you implement those? Is there any library or API? – user455318 Mar 24 '16 at 10:35
You are absolutely right, we cannot just implement every idea. But innovative and creative solutions are mostly achieved by building on unusual ideas. Yes, of course, there are many NLP libraries and web services for text analysis, semantic analysis and similarity measures, for instance https://www.meaningcloud.com/. From my experience, the IP address and the browser fingerprint are easy to fake to avoid tracking. A malicious reviewer who gets paid to post a review will take the time and effort to bypass your model. – Ubaidah Mar 24 '16 at 18:18
