Is it possible to protect my site from HTTrack Website Copier or any similar program?
Without setting a maximum number of HTTP requests per user.
-
Easy solution: Just take the website down. – CodesInChaos Jul 05 '13 at 09:57
-
The web application firewall from Imperva claims to be able to detect and block such activity in a number of ways. – NULLZ Jul 05 '13 at 11:28
-
HTTrack respects exclusions in robots.txt, but "any similar program" might not. – TRiG Jul 05 '13 at 16:28
-
But wait, aren't you grabbing other people's websites? http://stackoverflow.com/questions/17440236/getting-java-to-save-a-html-file – Gaff Jul 05 '13 at 19:50
-
@ROFLwTIME No, that was just me playing with Java, but that program is what triggered this question. I wrote it hoping to find a way to prevent this from happening; the program was built successfully, but the prevention failed. lol – h4ck3r Jul 06 '13 at 04:59
-
This kind of question is so common, yet so absurd. If you put a website online, when people "view" it, they are downloading it to their computer, you are essentially giving it to them. Then the question becomes "I gave my website to someone, how can I not give it to them?". – laurent Jul 06 '13 at 06:21
-
But why do you need this? Isn't it a meaningless requirement? If your inbox were shown publicly you could try to prevent that, but a website exists just to give out information; if, after creating a website, you don't want to give it out, then what is the point of your website? – Java D Jul 06 '13 at 07:57
-
It was not meant to be a requirement; I just really wanted to know whether it can be done. Just my curiosity. – h4ck3r Jul 06 '13 at 08:01
-
You make the entire site out of some form of server-side code (that way the only things the user is able to download are what is sent to them). – PixelArtDragon Jul 07 '13 at 00:46
-
This would only really be useful for the simplest of sites – NimChimpsky Jul 08 '13 at 10:50
11 Answers
No, there's no way to do it. Without setting connection parameter limits, there's not even a way to make it relatively difficult. If a legitimate user can access your website, they can copy its contents, and if they can do it normally with a browser, then they can script it.
You might set up User-Agent restrictions, cookie validation, maximum connections, and many other techniques, but none of them will stop somebody determined to copy your website.
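To illustrate how little these techniques buy you, here is a minimal sketch (assuming a Python/Flask application; the blocked-agent list and thresholds are made-up examples) of a User-Agent filter plus a crude per-IP request cap. A determined copier defeats both by spoofing a browser User-Agent and pacing its requests.

```python
# A minimal sketch, assuming Flask, of the User-Agent and connection-rate
# checks mentioned above; the agent list and limits are arbitrary, and a
# determined copier bypasses both trivially.
import time
from collections import defaultdict

from flask import Flask, abort, request

app = Flask(__name__)
BLOCKED_AGENTS = ("httrack", "wget", "curl")   # trivially spoofed
request_log = defaultdict(list)                # ip -> recent timestamps

@app.before_request
def naive_anti_scraping():
    ua = (request.headers.get("User-Agent") or "").lower()
    if any(marker in ua for marker in BLOCKED_AGENTS):
        abort(403)                             # stops only default tool configs
    now = time.time()
    recent = [t for t in request_log[request.remote_addr] if now - t < 10]
    recent.append(now)
    request_log[request.remote_addr] = recent
    if len(recent) > 50:                       # crude per-IP request cap
        abort(429)
```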
-
You could also watermark all of your images so you could later prove they were originally yours. However, that won't prevent the copying; it will only help after the fact. – Four_0h_Three Jul 05 '13 at 14:17
-
And don't forget that any such measure is more likely to annoy legitimate visitors than to hinder someone dedicated to actively circumventing it. – Tobias Kienzler Jul 06 '13 at 14:06
-
@Four_0h_Three Watermarks are largely pointless. ANY reasonable watermark can be removed in Photoshop in 30 seconds (or less). If your watermark can't be removed that easily, then it's probably in an obnoxious location and is completely ruining any enjoyment your users would have gotten from the content. Fact: there is no way to keep viewable hosted content from being ripped off by someone who really wants it. Accept that fact, and continue on with your life. – Mike Mar 24 '16 at 15:19
-
Basically, if there's any way to prevent it, there's a way around it. – Fiasco Labs Jun 27 '16 at 17:00
Protect the part of the site you want to protect with a username and password. Then only assign a username and password to people who sign an NDA (or similar) that says they won't extract or copy information from your site.
Another trick is to make all your content load via AJAX... and have the AJAX data URLs come from paths that change (such as ~/todays-date), keeping that in sync with JavaScript. Then even if someone were to download your content, the data would be out of date within 24 hours.
Even then, nothing will prevent a determined, skilled attacker from getting an offline copy; you can just make it harder so it's not worthwhile.
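A rough sketch of the rotating data-URL idea, assuming a Python/Flask backend; the route names, the token derivation, and the secret are all hypothetical:

```python
# Hypothetical sketch: the AJAX data path is derived from the current date,
# so a copy of the page taken yesterday points at URLs that no longer serve.
import datetime
import hashlib

from flask import Flask, abort, jsonify

app = Flask(__name__)
SECRET = b"rotate-me-daily"        # hypothetical server-side secret

def todays_token() -> str:
    day = datetime.date.today().isoformat()
    return hashlib.sha256(SECRET + day.encode()).hexdigest()[:16]

@app.route("/")
def index():
    # the real template would embed this path into its AJAX calls
    return f'<div id="content" data-feed="/feed/{todays_token()}"></div>'

@app.route("/feed/<token>")
def feed(token):
    if token != todays_token():
        abort(404)                 # stale or guessed data path
    return jsonify({"articles": ["..."]})
```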
-
To be fair, it need not be AJAX. Any dynamically generated content, regardless of the underlying technology, will work in the same way - the attacker can easily copy a snapshot of its output only, while the backend (application, database, ...) involved in generating that content is not supposed to be accessible to an unauthorized actor by any other means. – TildalWave Jul 05 '13 at 11:44
-
2"NDA, (or similar) that says they won't extract or copy information from your site." - This is laughable. How are you going to enforce such a contract? Will your users tolerate it? – Freiheit Jul 05 '13 at 18:11
-
@Freiheit The OP doesn't say if he has a public site or a small B2B site for professionals. If the audience is small, he can target them with identification, etc. How do you enforce it? There are paid services that scan for copyright theft on other sites. Also steganography can be used to track violators on a per-username or IP basis. – makerofthings7 Jul 05 '13 at 18:56
As @Adnan has already pointed out in his answer, there is really no way of stopping a determined person from copying snapshots of your website. I used the word snapshots here because that's what such content scrapers (or harvesters) are really copying. They don't (or at least shouldn't) have access to your backend, where your website's contents are actually generated, so the best they can do is copy its output, output that you can generate in such a way that it changes over time or adjusts to its intended recipient (DRM schemes, watermarking, ...), as @makerofthings7 pointed out in his answer.
So much for what's already been answered. But there is one thing about this threat that I feel hasn't yet been well covered in the answers mentioned. Namely, most such content scraping is done by opportunistic, automated web crawlers; targeted attacks are a lot rarer. Well, at least in numbers, so bear with me.
These automated crawlers can actually be blacklisted quite effectively through the use of various WAFs (some might even use honeypots to determine the threats in heuristic ways) that keep an updated database of blacklisted sources (CBLs or Community Ban Lists, DBLs or Domain Block Lists, DNSBLs or DNS-based Blackhole Lists, ...) from which such automated content scrapers operate. These WAFs will deny or grant access to your content-serving web application based on three main approaches:
Deterministic blacklisting: These are detections based on characteristics of the web requests that content scrapers make. Some of them are: the originating IP address, the reverse-DNS-resolved remote hostname, a forward-confirmed reverse DNS lookup (see the explanation in one of my questions here), the user agent string, the request URL (your web application could, for example, hide a honeytrap URL in one of its responses that a content scraper might follow, after determining the request didn't come from a whitelisted address such as a legitimate search engine crawler/spider), ... and other fingerprint information associated with automated web requests.
Heuristic blacklisting: This is a way to determine a threat either by weighting the parameters of a single web request described in the deterministic approach (anti-spam filters use a similar approach based on calculating Bayesian probability), or by analyzing multiple web requests, such as: request rate, request order, number of illegal requests, ... that might help determine whether the requests come from a real, intended user or from some automated crawler (a minimal scoring sketch follows this list).
External DNSBL/CBL/DBLs: I've already mentioned relying on external DNSBL/CBL/DBLs (e.g. Project Honey Pot, Spamhaus, UCEPROTECT, ...), most of which are actually a lot more useful than merely keeping track of spammers and spambot-infected hosts, and record the type of offense (e.g. forum spammer, crawl rate abuse) alongside the IP addresses, hostnames, CIDR ranges, ... in the blacklists they publish. Some WAFs can connect to these databases, saving you the trouble of being targeted by the same actor that has already been blacklisted for the same detected activity on another web server.
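As a rough illustration of how deterministic and heuristic signals can be combined, here is a minimal scoring sketch in Python; the weights, thresholds, honeytrap path, and blacklist store are all hypothetical, and a real WAF is far more sophisticated.

```python
# Hypothetical weighted scoring over the signals described above; not a real
# WAF, just the shape of the idea.
import time
from collections import defaultdict

SUSPICIOUS_AGENTS = ("httrack", "wget", "python-requests")  # deterministic
HONEYTRAP_PATH = "/nothing-to-see-here"   # hidden link only crawlers follow
recent_hits = defaultdict(list)           # ip -> request timestamps
blacklist = set()

def score_request(ip: str, user_agent: str, path: str) -> bool:
    """Return True if the request should be blocked."""
    now = time.time()
    recent_hits[ip] = [t for t in recent_hits[ip] if now - t < 60] + [now]
    score = 0
    if any(a in user_agent.lower() for a in SUSPICIOUS_AGENTS):
        score += 3                        # deterministic: known tool signature
    if path == HONEYTRAP_PATH:
        score += 5                        # deterministic: honeytrap followed
    if len(recent_hits[ip]) > 120:        # heuristic: >2 req/s for a minute
        score += 2
    if score >= 5:
        blacklist.add(ip)
    return ip in blacklist
```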
Now, one thing needs to be said quite clearly - none of these methods can be considered bulletproof! They will remove the majority of offending web requests, which is valuable on its own and will let you focus better on those harder-to-detect offenders that somehow bypassed your protections.
There are of course countless techniques for detecting automated crawlers / content scrapers (and their own countermeasures: detection-avoidance techniques) that I won't describe here, nor will I list all possible WAFs and their capabilities, not wanting to test your patience or exceed the purpose of this Q&A. If you'd like to read more about the techniques that can be employed to thwart such unwanted visitors, then I recommend reading through the documentation of the OWASP Stinger and OWASP AppSensor projects.
Edit to add: Suggestions from the HTTrack authors can be read in the HTTrack Website Copier FAQ: How to limit network abuse - Abuse FAQ for webmasters. The reasons why a single deterministic method of detection won't work (short of blacklisting offending IP addresses after the fact, or through the experience of other honeynets) become rather apparent from a glimpse through the HTTrack Users Guide, if the adversary sets out to obfuscate the spider's user agent string (by setting it to any of the many user agent strings of real, legitimate web browsers) and to disregard robots.txt directives. To save you the bother of reading it: HTTrack includes simple configuration options and command-line flags that make it work in stealth mode and appear just as benign as any other legitimate user to simpler detection techniques.
-
If blacklisting were really so useful, Google would use it instead of simply limiting access to their Google Maps. No matter how well you construct your blacklisting, you eventually end up aggravating and even alienating legitimate users. BTW: making a buck was not Google's main objective (not that they don't like money), limiting abuse (automated data harvesting) was. And even usage limits are easy to circumvent; different user IPs, for one. So @CodesInChaos' comment stands. – Jeffz Jul 06 '13 at 15:26
-
@Jeffz - YMMV and is of course application specific. That said, I don't see the relevance of your comment. I (and many others) have already mentioned rate limiting or other time / client based quotas as a possible way to defend against content theft. And of course blacklist sensitivity CAN be dynamic, and entries can be automated, based on approaches I described. I disagree - blacklisting is useful, but of course to a limited extent. Please read at least the bolded parts of my answer, you might notice I've already mentioned it's hardly considered bulletproof. But it will help! ;) – TildalWave Jul 06 '13 at 15:34
Everything the human user sees, he can record. As @Adnan points out, this is rather easy, and can be automated.
However, some sites still have some relative success at deterring mass slurping. Consider, for instance, Google Maps. Many people have, occasionally, tried to recover high-definition maps of large areas through scripting. Some have succeeded, but most were caught by Google's defences. It so happens that it is difficult to make an automatic downloader which acts, from the point of view of the server, as if it was under human control. Humans have all sorts of latencies and usage patterns which a shrewd sysadmin can notice and check on.
Similar tricks are done on, for instance, Stack Exchange. If you try to automate access to the site, you will soon be redirected to a page with a CAPTCHA.
Ultimately, this kind of security is not very satisfying because the defender and the attacker are on equal grounds: it is cunning against cunning. So, this is expensive: it requires thinking and maintenance. However, some sites do it nonetheless.
A generic way for attackers to defeat anti-automation safety measures is to "automate" the slurping with actual humans. Very cheap human workers can be hired in some countries.
-
@Sklivvz What kind of moron tries to scrape SE if they could just download the dump made available anyway? – Tobias Kienzler Jul 06 '13 at 14:08
-
@TobiasKienzler the kind of moron that finds it easier to simply reskin a copy of the site instead of writing all the presentation layer... :-) – Sklivvz Jul 06 '13 at 22:43
I'd qualify what @Adnan says by adding that, while there's no way in general to prevent site leeching over time, a specific tool may exhibit behaviour that can be detected with some certainty once some number of requests have been made. The order in which URLs are accessed may be deterministic: depth first, breadth first, ascending or descending alphabetically, the order in which they appeared in the DOM, and so on. The interval between requests may be a clue, as may be whether the agent successfully executed some JavaScript code (NoScript and similar aside), client support for the browser performance API, the time spent between requests relative to page complexity, and whether or not there is a logical flow between requests. Unless a website leecher takes this into account, you may be in with a chance. User agent checking won't be effective on its own, because a good leecher would pretend to be a known bot; so unless you want to exclude Google and other search engines too, knowledge of the IPs that search engines use would be needed.
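As one concrete example of the timing clue, here is a minimal sketch (thresholds are arbitrary assumptions) that flags clients whose inter-request intervals are suspiciously regular:

```python
# Hypothetical regularity check: humans produce noisy gaps between requests,
# while a naive crawler tends to space them almost evenly.
import statistics

def looks_automated(timestamps: list[float]) -> bool:
    if len(timestamps) < 10:
        return False                      # not enough data to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(gaps)
    spread = statistics.pstdev(gaps)
    # a coefficient of variation near zero means near-constant spacing
    return mean > 0 and (spread / mean) < 0.1
```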
First of all, the only way you can prevent your site from being copied is to never make it public to anyone but yourself.
One way you could try to dissuade people from doing it is with legal means. I'm not a lawyer, so I don't know what steps you should take; if your content is original, you could assert copyright or something similar.
I think that if you fear your site may get copied, it has to be a really, really, really great website.
Short answer: no. If the user loads a page, then the user can copy the HTML by viewing the source.
If the website copier has a particular user agent, you can block that. See Stack Exchange for details.
Another solution might be to make a Flash webpage; those are hard to copy by hand anyway.
Otherwise, I'd put everything into a directory with restricted access that only server-side PHP scripts can read. Then, if the page is built from many includes (one for a nav bar, one for the header, one for JavaScript, one for the footer, one for the body content), make another directory of PHP files that read the protected directory via includes, and use AJAX to dynamically load those PHP files. That would make it hard for anything to copy the page without rendering the JavaScript (though I don't know whether that would stop the software, or an individual with a live code inspection tool).
Or you can put some kind of human-verification on your site so a protected PHP directory include isn't called unless the user specifically clicks a non-link DOM object (like a line that says "enter here") that triggers the content to load.
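The same click-to-unlock idea, sketched in Python/Flask rather than PHP purely for illustration; the routes and the session flag are hypothetical:

```python
# Hypothetical sketch: protected fragments are only served after the visitor
# deliberately clicks the "enter here" element, which POSTs to /unlock.
from flask import Flask, abort, jsonify, session

app = Flask(__name__)
app.secret_key = "change-me"              # needed for session cookies

@app.route("/unlock", methods=["POST"])
def unlock():
    session["human"] = True               # set by the click handler's AJAX call
    return "", 204

@app.route("/protected-content")
def protected_content():
    if not session.get("human"):
        abort(403)                        # a dumb mirroring tool never gets here
    return jsonify({"body": "the real page fragments go here"})
```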
Disclaimer: this is an evil answer. I do not condone any of the following.
Modern browsers are capable of generic (Turing-complete) computation, through Javascript and possibly other means. Even their basic HTML + CSS rendering engines are incredibly elaborate pieces of software, capable of displaying (or hiding) content in a variety of ways. If that wasn't enough, all modern browsers make graphic primitives available, for example through SVG and Canvas, and allow downloading custom fonts to render text with.
If you put all this together, and some more, you will find there are a number of layers of execution between the site's source code and the pixels that make up the letters and words that the user can read.
All these layers of execution can be obfuscated and/or exploited.
For example, you can generate markup that has little or no resemblance to the graphic output, to make looking at the HTML source of your website an exercise in futility. You could use one HTML tag per letter, reordering them with creative use of `float:` and `position:`, hiding some of them with complex, generated CSS rules, and adding some more that weren't there at all with CSS-generated content.
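A toy sketch of the one-tag-per-letter trick (using the flexbox `order` property instead of `float`/`position`, purely for brevity): the letters are emitted in shuffled DOM order and put back into visual order by CSS, so the copied markup reads as garbage.

```python
# Hypothetical obfuscation sketch: shuffle the DOM order of per-letter spans
# and restore the visual order with the CSS flexbox `order` property.
import random
from html import escape

def obfuscate(text: str) -> str:
    indexed = list(enumerate(text))
    random.shuffle(indexed)                     # scramble the source order
    spans = "".join(
        # non-breaking space so whitespace-only flex items keep their width
        f'<span style="order:{i}">{escape(ch) if ch != " " else "&nbsp;"}</span>'
        for i, ch in indexed
    )
    return f'<p style="display:flex">{spans}</p>'

print(obfuscate("Copy me if you can"))
```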
You can create a font that uses a custom mapping between character codes and glyphs, so that copy and pasting your content would yield utter garbage, or even swear words! You can split letters in two or more pieces and use Unicode combining characters to put them back together. You can do all this with a dynamic generator, creating a new random masterpiece of obfuscation for each HTTP request.
You can write a program that creates complex JavaScript algorithms that, when run on the client, fill in some required pieces of the puzzle, so that without JavaScript support and a decent amount of client CPU time the markup alone would be useless. 50 ms of modern CPU time goes unnoticed by most people and is enough to run some pretty wicked algorithms.
Bonus points if you try to scrape your own obfuscated website using a headless browser, in order to have a full CSS and Javascript stack. Then try to find ways (or heuristics) to tell a real browser from the headless one. Then put some traps into the generated Javascript code, so that when it falls into the headless browser case, the algorithm goes into an infinite loop, or crashes the browser, or generates profanity and seizure-inducing flashes on the screen.
These are off the top of my head; there are (countably) infinitely many other ways to f*** with people's computers.
Now be a good boy/girl and take your blue pill :-)
-
What did I just read? Actually obfuscation, use of JavaScript and any other _wicked algo_ has really little benefit to hiding and/or otherwise messing with the output that website harvesters will be able to interpret. These are nowadays incredibly advanced and not any less competent than the best of browsers out there. Take for example the Chromium project, a full-blown browser component as competent as Chrome itself (which it actually is, minus the eye candy) that can be easily integrated into any web scraping application. So the snapshot will be taken on _DOM ready_, no biggie. – TildalWave Jul 06 '13 at 01:47
-
You could take a look at David Madore's Book of Infinity, it's a small CGI program that generates an infinite number of pages to punish mass downloaders who don't respect robots.txt – loreb Jul 06 '13 at 12:06
-
@loreb - I honestly don't get it. So your website is being scraped by some cloud hosted crawler that disrespects your `robots.txt` and you do a self-inflicting DDoS on your website as punishment for that crawler? How is that going to work? You realize that you'd only needlessly add to server load and exhaust its resources (CPU, memory, bandwidth,...), if the crawler is distributed, has seemingly limitless bandwidth and doesn't care about its crawl rate? You should drop its requests ASAP, not give it more work to do. – TildalWave Jul 06 '13 at 14:21
-
@TidalWave sure, the Book of Infinity is a joke program, both in attitude (insistence on reproducing the same meaningless "book", and not just random content) and in practice, exactly as you described. That being said, if I were to take my suggestion seriously, I'd defend it stating that (1) the OP mentioned HTTrack in a way that suggests single users mass-downloading a website rather than a distributed crawler, and that (2) one could use the Book of Infinity to generate a tarpit, similar to OpenBSD's spamd. – loreb Jul 07 '13 at 15:34
First of all, as others have said, anything that you can see you can copy, using various methods. It depends on why you want to prevent your website from being copied, but the most effective method would probably be to add watermarks so that everyone knows where it came from. Perhaps even a polite notice asking people not to copy your website wouldn't go amiss.
However, going back to your original question and how to stop software from copying a website, I believe CloudFlare has a web application firewall. I certainly know that Acunetix Web Vulnerability Scanner won't scan a website that uses CloudFlare. It's a free solution and it should also help speed your website up.
There is no foolproof solution, though, and anything can be circumvented. The best thing you can do is use a combination of the answers here, depending on how badly you need/want to protect your website. The best advice, though, is that if you don't want it copied, don't let people have it.
Even AJAX with date parameters can be duplicated. I've scraped sites with heavy AJAX using GET/POST parameters. If I really need to emulate the browser, I can just use Selenium or something of that sort. I can always find a way to scrape a site if I really want to. CAPTCHA is probably the most difficult thing to deal with, and even then there's Captcha Sniper and other modules to assist in that area.
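For instance, a sketch of that approach with Selenium (the URL is a placeholder): a real browser renders the AJAX content, and the fully built DOM is simply saved afterwards.

```python
# Minimal Selenium sketch: let a real browser render the dynamic content,
# then write the final DOM to disk.
import time

from selenium import webdriver

driver = webdriver.Chrome()                 # assumes chromedriver is available
driver.get("https://example.com/ajax-heavy-page")
time.sleep(5)                               # crude wait for AJAX calls to finish
with open("snapshot.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)             # the rendered DOM, AJAX and all
driver.quit()
```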
Look at this link; you may find a solution there :)
Use robots.txt to prevent website from being ripped?
OR
The simplest way is to identify the browser ID (user agent) of whoever is browsing your page; if it is HTTrack, block it (you need to configure your server, or use your programming skills to serve a different page accordingly).
Thanks..
-
HTTrack is an open source application. You can easily modify the source and override any mechanism that respects `robots.txt`. – Adi Jul 07 '13 at 15:40
-
There isn't a single deterministic method that would identify HTTrack clients if they're set to obfuscate their signature and disrespect `robots.txt` directives. Not without resorting to a lot more advanced methods of detection. Quoting [HTTrack user guide](http://www.httrack.com/html/fcguide.html) we get these two reasons why your suggestion wouldn't work: _"The 'User Agent' field can be set to indicate whatever is desired to the server"_ for your suggestion on using UA, and _"`sN` follow `robots.txt` and meta robots tags (`0`=never,`1`=sometimes,*`2`=always)"_ for blocking in `robots.txt`. – TildalWave Jul 16 '13 at 08:25