
I have found out that McAfee SiteAdvisor has reported my website as "may be having security issues".

I care little about whatever McAfee thinks of my website (I can secure it myself and if not, McAfee definitely is not the company I'd be asking for help, thank you very much). What bothers me, though, is that they have, apparently, crawled my website without my permission.

To clarify: There's almost no content on my website yet, just some placeholder and some files for my personal usage. There are no ToS.

My question is: Does McAfee have the right to download content from / crawl my website? Can I forbid them from doing so? I have a feeling there should be some kind of "my castle, my rules" principle; however, I basically know nothing about the legal side of things.

Update: I probably should have mentioned my server provider sends me emails about SiteAdvisor's findings on a regular basis - that's how I found out about their 'rating' and that's why I'm annoyed.

kralyk
  • Would you say that humans have the right to view your website? If yes, why discriminate against humans' robot servants? If not, why is it a website in the first place? – jwodder Aug 14 '14 at 20:05
  • How did you find out that SiteAdvisor flagged your site? You didn't view *their* site did you? If so, what gave you the right? – Joe Sniderman Aug 14 '14 at 20:09
  • Incidentally, I wouldn't dismiss the SiteAdvisor report so lightly; in general, when I saw similar reports they were legitimate. The most common case is having an older/unpatched version of a popular CMS (WordPress, Joomla, Drupal, ...) exploited by some automatic script to place malicious content ("trampoline" pages used for spam/phishing, hosting of viruses linked in scam emails, browser exploits, you name it); you may be hosting bad stuff without even knowing. Also, since many users rely on such tools, you typically want to have a clean record, since such warnings can scare away users. – Matteo Italia Aug 14 '14 at 20:36
  • If you want something locked down, lock it down. You put the website up and configured the server to respond to GET requests. You've invited everyone in - literally, everyone. This isn't an "implied" right, it's how webservers work. Barring, as noted, robots.txt, or IP restrictions, or content restricted to logged-in users. – mfinni Aug 14 '14 at 21:04
  • @mfinni A robots.txt file hardly represents "locking down", since it can be ignored easily. – Casey Aug 14 '14 at 21:15
  • @emodendroket - you're correct, but it keeps well-behaving bots out. Without one (which is kralyk's case), it's open season. – mfinni Aug 14 '14 at 21:16
  • This looks like a legal question. I'm not sure whether legal questions are on-topic for this site. (Also, they tend to vary based upon the specific jurisdiction, which you don't specify, though that is probably secondary in this case.) – D.W. Aug 14 '14 at 21:43
  • How did they find the website? If they found it through a link, I'd say a link does imply you are allowed to get what that link is pointing to. If they found it because a user of their service specifically asked them, what they think about the security of your site, then I'd say that it is really what that user is permitted to do, which counts. Which tool the user uses to view the site would be their choice. – kasperd Aug 14 '14 at 22:16
  • I do not have any definite answer, but I am amazed at the replies here. So if I have a website, I expect people to get to it and use it. Undoubtedly correct. But how come that implies that any and all automated access processes are OK too? Does it follow that as it is OK to have 1 million humans trying to get to my site all at once, it is also OK to have someone's robots doing the same? So, e.g., denial-of-service attacks are perfectly OK? –  Aug 15 '14 at 08:46
  • @RolazaroAzeveires: Automated processes are okay not because allowing human visitors implies it, but because, barring attacks, they ask nicely: _"can I have these files?"_ and you've configured your webserver to respond: _"Of course! Here you go. Need anything else?"_ That's not crawling without your permission, that's crawling with your permission. – Marcks Thomas Aug 15 '14 at 10:05
  • “Does McAffee have the right to download content from / crawl my website?” — That depends what you mean by “the right”. – Paul D. Waite Aug 15 '14 at 10:49
  • @RolazaroAzeveires: a denial-of-service attack is different from crawling. I don’t think anyone’s said that *all* automated processes are okay; they’re just pointing out the illogicality of being okay with all requests caused by an individual human click of a mouse, but not being okay with any requests caused by a program written by a human. – Paul D. Waite Aug 15 '14 at 10:51
  • @JoeSniderman I didn't go to their website, they send reports to my provider who then in turn sends me mails about my website maybe having issues... – kralyk Aug 15 '14 at 11:30
  • @RolazaroAzeveires A browser is also a program which performs an automated sequence of requests to the server based on an initial action by a user. The sequence of requests performed can depend on the user's choice of browser and plugins. The sequence of requests performed by a crawler is usually distinguishable from the one triggered by a browser, but in principle there isn't as much difference as you seem to suggest. – kasperd Aug 15 '14 at 12:18
  • @RolazaroAzeveires legal systems take "intent" heavily into account. I'm not a lawyer, but key questions in a legal case are: (1) what harm was done? (2) was the harm intended? (3) what are the generally prevailing expectations? It isn't hard to legally distinguish a denial-of-service attack from other web crawling. If the internet didn't have the open approach that it has, including web crawling, there never would have been a Google. Imagine a web where you couldn't find out what was there, except via proprietary directories, like the old days of telephones and phonebooks. – ToolmakerSteve Aug 15 '14 at 21:57
  • @jwodder There is a case back in 2000, which was very public, about eBay being awarded an injunction against a company crawling their site. Clearly, the public can access, but that does not mean everything is fair game. You can read about the injunction here: [ebay v bidder's edge](http://en.wikipedia.org/wiki/EBay_v._Bidder%27s_Edge#Order). – John Aug 18 '14 at 12:19

4 Answers


Yes, they have the right to do so - you've created a public website, what makes you think they don't?

You too, of course, have the right to stop them. You can ask them not to crawl your website with robots.txt or actively prevent them from accessing it with something like fail2ban.
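For illustration, a robots.txt placed at the site root can ask a specific crawler to stay away. Note that the "SiteAdvisor" user-agent string below is an assumption for the sake of the example; check your access logs for the string McAfee's crawler actually sends before relying on it:

```
# /robots.txt - a polite request, not an enforcement mechanism.
# Well-behaved bots honor it; nothing forces them to.

# Ask the (assumed) SiteAdvisor crawler to skip the whole site
User-agent: SiteAdvisor
Disallow: /

# Allow everyone else to crawl normally
User-agent: *
Disallow:
```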

Alternatively, don't worry about it and continue on with your life. It's not hurting anything and is definitely on the benign side of Internet probing.

Dan
  • > _"Yes, they have the right to do so - you've created a public website, what makes you think they don't?"_ Well, if something is technically possible, it doesn't necessarily mean it's legal. For instance, YouTube's ToS prohibit downloading videos, so even though it's technically very easy, it's still not allowed. I wouldn't worry about SiteAdvisor if it wasn't for my provider, who sends me emails about my site "maybe having issues"... – kralyk Aug 15 '14 at 11:35
  • @kralyk - if you don't want the public (which includes McAfee) looking at it, don't put it on the web. It's that simple. YOU CONTROL YOUR WEBSITE. No one is forcing you to put it out there, and if you don't want people to look at it, then DON'T put it out there. If you ARE going to put it out there, then don't be surprised that people (including people who want to sell you stuff) are looking at it. Stop trying to turn your desires into someone else's problem. – Michael Kohne Aug 15 '14 at 13:27
  • @MichaelKohne I actually don't have a problem with that; that's why I _do_ sometimes download videos from YouTube even though it's forbidden, etcetera. What I don't like is the double standard applied by these companies - they want you to respect _their_ desires about how _their_ websites/services/products are used, but when an "ordinary Joe" puts up a ToS, suddenly it's somehow not worth respecting... – kralyk Aug 15 '14 at 13:41
  • @kralyk: seriously? You really think the issue here is a double standard? No person at McAfee knows or cares about your website. Nor should they. It would be absurd to expect anyone crawling the web to read everyone's ToS. That's why robots.txt was invented. – ToolmakerSteve Aug 15 '14 at 21:42
  • @kralyk Access to the resources in question must be gated in order for the ToS to be anywhere near meaningful. A robot crawling your unprotected pages is completely different from someone registering an account, acknowledging a ToS, and then feeding the credentials to a robot. – Andrew B Aug 15 '14 at 22:22
  • @kralyk - What sort of ***TOS*** do you have on your site that you feel McAfee is violating (not respecting)? – Kevin Fegan Aug 15 '14 at 22:26
  • I'm wondering if, eventually, this may fall under public photography. Anything you can see from the street, you can take a picture of without a license. – Rob Aug 15 '14 at 23:19
  • @KevinFegan Let's suppose I have ToS similar to [YouTube's](https://www.youtube.com/t/terms) - it prohibits both downloading and crawling. Now, would it be legal for McAfee to crawl YouTube? (Note that I'm _not_ talking about whether it's technically possible.) – kralyk Aug 16 '14 at 10:54
  • @kralyk, YouTube publishes a robots.txt which says what cannot be crawled, and it disallows almost everything. – Ben Aug 18 '14 at 10:18

There is legal precedent for this: Field v. Google Inc., 412 F. Supp. 2d 1106 (U.S. Dist. Ct. Nevada 2006). Google won a summary judgment based on several factors, most notably that the author did not use a robots.txt file (or robots metatags) on his website, either of which would have prevented Google from crawling and caching pages the website owner did not want indexed.

Ruling pdf

There is NO U.S. law specifically dealing with robots.txt files; however, another court case has set some precedent that could eventually lead to robots.txt files being treated as an intentional electronic measure taken to protect content, with ignoring them counting as circumvention. In Healthcare Advocates, Inc. v. Harding, Earley, Follmer & Frailey, et al., Healthcare Advocates argued that Harding et al. essentially hacked the capabilities of the Wayback Machine in order to gain access to cached copies of pages whose newer versions carried robots.txt files. While Healthcare Advocates lost this case, the District Court noted that the problem was not that Harding et al. "picked the lock"; rather, they gained access to the files because of a server-load problem with the Wayback Machine that granted access to the cached files when it shouldn't have, and therefore there was "no lock to pick."

Court Ruling pdf

It is only a matter of time, IMHO, until someone takes this ruling and turns it on its side: a court holding that robots.txt is a lock meant to prevent crawling, and that circumventing it is picking that lock.

Many of these lawsuits, unfortunately, aren't as simple as "I tried to tell your crawler that it is not allowed, and your crawler ignored those settings/commands." There are a host of other issues in all these cases that ultimately affect the outcome more than the core issue of whether or not a robots.txt file should be considered an electronic protection measure under US DMCA law.

That having been said, this is a US law and someone from China can do what they want--not because of the legal issue, but because China won't enforce US trademark and copyright protection, so good luck going after them.

Not a short answer, but there really isn't a short, simple answer to your question!

jcanker
  • This is a great answer, thanks. The thing I don't like about robots.txt is that it's not an actual standard (never mind a standard required by law). These companies can simply ignore it. I don't like being in the position where they tell me _"You should put up a robots.txt file and maybe we won't crawl your website, but maybe we will, we do what we like."_ It would be great if there was a standard for specifying a website's ToS in the website's metadata. – kralyk Aug 15 '14 at 13:35
  • @jcanker Those two cases are about copyright infringement claims. In the behavior of crawlers that cache content, like those operated by Google and archive.org, it makes perfect sense that copyright issues come into play. But McAfee SiteAdvisor is not actually copying and storing (much less making publicly available) content from websites it accesses, is it? Though I'm not a lawyer, I think this distinction gives us reason to *very strongly doubt* that either case is in any way applicable to the behavior of a system like SiteAdvisor, *regardless* of whether or not it respects robots.txt. – Eliah Kagan Aug 15 '14 at 14:26
  • @kralyk - re "These companies can just simply ignore it.". Well, yes. That's the way the internet works. And even if it were somehow more fundamental, it would be trivial, absolutely trivial, for a crawler to pretend it was a human being accessing your web pages. You are asking for the technically **impossible**. Indeed, if you think through what you are asking, what you seek is not logical; it has no meaning. Except in a legal distinction. Your only possible protections are (1) hiding important content behind user login authentication, and (2) legal protection, as discussed in this answer. – ToolmakerSteve Aug 15 '14 at 21:34
  • @ToolmakerSteve I know it's technically impossible to ban robots completely. This is a different situation, though - I'm not looking for a technical solution, I'm asking whether it's legal. Also note that McAfee has informed me that they crawl my website; I don't need to detect it. – kralyk Aug 16 '14 at 10:59
  • There is also legal precedent the other way: [ebay v bidder's edge](http://en.wikipedia.org/wiki/EBay_v._Bidder%27s_Edge#Order) – John Aug 18 '14 at 12:21
  • Unless this crawling is a clear abuse by its frequency, it's just like if you were building a house and asked: "do people have the right to look at my house?" Looking briefly is OK. Coming into the garden to measure the walls is maybe problematic, but how can you stop people from photographing your home with today's modern technology and mobile devices? – smonff Aug 19 '14 at 20:15
  • It seems to me that the robots.txt is kind of like a "No soliciting" sign. – Barmar Aug 19 '14 at 21:50

Whether this behaviour is ethical or not isn't perfectly clear cut.

The act of crawling a public site is, itself, not unethical (unless you've forbidden it explicitly using a robots.txt or other technological measures, and they're circumventing them).

What they are doing is the rough equivalent of cold calling you, while announcing to the world that you are possibly not secure. If that damages your reputation and is unjustified, it's unethical; if it does that and the only resolution for it involves you paying them, it's racketeering. But, I don't think this is what is going on.

The other time this becomes unethical is when someone crawls your site to appropriate your content or data and then represents it as their own. But, that too isn't what is going on.

So, I suggest that their behaviour in this case is ethical, and you can also most likely ignore it.

Their related behaviour of spamming you is unethical if you have no relationship with them and didn't request the emails, but I suspect they have a working unsubscribe.

Falcon Momot
  • I'm not sure I'd call a `Disallow` directive in a robots.txt file a "forbidding technological measure". robots.txt acts as a courtesy request, and while well-behaved bots will abide by it, there's no obligation and no real security involved. In fact, badly behaved bots might well take an entry in robots.txt as an invitation to crawl that specific path... – user Aug 15 '14 at 08:21
  • @MichaelKjörling, Only half agree. There is no real security, but there is an obligation. It is a keep-out sign, and your obligation is to keep out since you don't have permission to enter. – Ben Aug 18 '14 at 10:22
  • It's a "keep out" sign, without a lock. Try that on your home and see how much sympathy you get after the thieves come calling! (Actually, it's a "keep out" sign that explicitly lists the unlocked doors and windows that you want people to stay out of.) – Randy Orrison Aug 19 '14 at 10:14

Technical approach to blocking certain people or companies from accessing your web site:

You can block specific IP addresses, or ranges of addresses, from accessing the pages of your site. This is done in the .htaccess file (if your site is running on the Apache web server).

http://www.htaccess-guide.com/deny-visitors-by-ip-address/
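A minimal sketch of such a rule (Apache 2.2-style syntax; the addresses are documentation placeholders, not McAfee's actual netblock):

```apache
# .htaccess - deny a single address and a whole range
# (203.0.113.x is a placeholder range; substitute the addresses you find)
Order Allow,Deny
Allow from all
Deny from 203.0.113.5
Deny from 203.0.113.0/24
```

On Apache 2.4+, the equivalent uses the newer authorization syntax, e.g. `Require all granted` combined with `Require not ip 203.0.113.0/24` inside a `<RequireAll>` block.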

Have your web server log the IP addresses it is accessed from, and look those addresses up to find the ones associated with McAfee. That is probably easy to tell now, while you don't have many regular visitors.

Of course, they might change IP addresses in the future. Still, if you look up the IP addresses you find, to see who owns them, you might be able to learn about a whole block of addresses owned by McAfee, and block them all.
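The log analysis above can be sketched with standard command-line tools (the log path is an assumption; adjust it for your setup, and note that 203.0.113.5 is a documentation placeholder, not a real McAfee address):

```shell
# List the most frequent client IPs in an Apache access log;
# the first field of each combined-log line is the client address.
awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head

# Then look up who owns a suspicious address to find the owning netblock:
whois 203.0.113.5
```

The `whois` output typically includes the organization name and the CIDR range it was allocated, which is what you would feed into the `Deny from` rules.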


For a legal basis for doing so:

"Website owners can legally block some users, court rules"

http://www.computerworld.com/s/article/9241730/Website_owners_can_legally_block_some_users_court_rules

(If your website is a personal one, no one would contest your right to block some users. But if it is a website for a business, there are legal and moral arguments on both sides of that discussion. The smaller your business, the easier it is to be legally protected -- and the less anyone else would care enough to complain anyway.)


You might also be interested in "Deny visitors by referrer".

" If you've ever looked at your logs and noticed a surprising increase in traffic, yet no increases in actual file requests it's probably someone pinching content (such as CSS files) or someone attempting to hack your web site (this may simply mean trying to find non public content)."

http://www.htaccess-guide.com/deny-visitors-by-referrer/
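A short sketch of a referrer block using mod_rewrite (`badsite\.example` is a placeholder domain, not a real referrer you need to block):

```apache
# .htaccess - refuse requests whose Referer header matches a given domain
RewriteEngine On
RewriteCond %{HTTP_REFERER} badsite\.example [NC]
RewriteRule .* - [F]
```

The `[F]` flag returns 403 Forbidden; `[NC]` makes the match case-insensitive. Keep in mind the Referer header is client-supplied and trivially spoofed, so like robots.txt this only deters the lazy, not the determined.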