
I searched Google with the query `inurl:ftp -inurl:(http|https)` and found many FTP hosts; on some of them I can even add or remove files.
How does Google get information about FTP servers, and how can one keep one's FTP server(s) from being indexed by Google?

– open source guy (question edited by S.L. Barth)

3 Answers


Google apparently scans new domain names and infers from a name like www.example.com or ftp.example.com that an HTTP or FTP server may be responding there, making it worth indexing. Google also follows links discovered in other Web pages; the domain-based indexing is what lets it explore and reference sites which have not been linked from other sites (yet).

To prevent indexing of your FTP server, you can:

  • Put a robots.txt file on your server. See this page for details. Most Web crawlers will honour such a file on an HTTP server; Google also looks for it on FTP servers (but Google notes that such support is "Google-specific"). A sketch of such a file follows this list.
  • Disable anonymous login. Instead, enforce use of a specific login+password pair; you can then publish that login and password on an explanatory Web page. Google's robot will not be able to "understand" that Web page and will not try anything beyond anonymous login. (A configuration sketch also follows below.)
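
As a concrete sketch of the first option (example.com is a placeholder; the layout follows what TildalWave describes in the comment below):

```
# robots.txt served from the Web (HTTP) root of www.example.com --
# discourages crawlers from following ftp:// links found on your pages
User-agent: *
Disallow: ftp://*
```

```
# robots.txt placed at the root of ftp.example.com --
# tells Googlebot not to crawl the FTP server itself
User-agent: *
Disallow: /
```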
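
And a sketch of the second option, assuming the vsftpd daemon (other FTP servers have equivalent settings; the option names below are vsftpd's):

```
# /etc/vsftpd.conf (excerpt)
anonymous_enable=NO   # refuse anonymous logins; Google's robot only tries anonymous
local_enable=YES      # allow password-protected local accounts instead
```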
– Tom Leek
    I can confirm independently, from my own server logs, that this is indeed correct. Also worth pointing out (in case it wasn't clear from Tom's answer): Google does not resort to URL/port probing, as [Bing was recently discovered to do, excusing it as 'beta'](http://webmasters.stackexchange.com/questions/44257/strange-request-from-bingbot-for-trafficbasedsspsitemap-xml). A `robots.txt` with `Disallow: ftp://*` following the `User-agent: *` in the HTTP root, and `Disallow: /` in the FTP root, will stop Google from crawling your FTP server. It should stop other crawlers that respect `robots.txt`, too. – TildalWave Mar 06 '13 at 08:08

Google indexes FTP servers in exactly the same way that it indexes web servers. For details, try something like http://lmgtfy.com/?q=How+do+search+engines+work%3F&l=1#

Basically, they start with a bunch of popular web pages, follow all the links in them (which will include links to FTP servers), then follow all the links in those, and so on.
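
As a rough illustration (not Google's actual implementation, and the seed page is a placeholder), that process boils down to a breadth-first traversal of links:

```python
import re
from collections import deque
from urllib.request import urlopen

def crawl(seed, limit=50):
    """Breadth-first crawl: fetch a page, queue every link found in it."""
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) < limit:
        url = queue.popleft()
        if url.startswith("ftp://"):
            # A real indexer would log in anonymously and list the
            # directory here; we just note that the server was found.
            print("found an FTP server to index:", url)
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page: skip it
        for link in re.findall(r'href="((?:https?|ftp)://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)

crawl("https://example.com/")  # hypothetical seed page
```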

There is a standard way to request that search engines and such do not index your site, using a file called robots.txt. A good source of information about this mechanism is http://www.robotstxt.org.
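
For completeness, here is how a well-behaved crawler might consult that file, using Python's standard `urllib.robotparser` module (the host is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical host
rp.read()  # fetch and parse the robots.txt file

# A polite crawler checks every URL before fetching it:
print(rp.can_fetch("*", "https://example.com/private/"))  # False if disallowed
```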

– Graham Hill

AFAIK, Google reaches servers, whether HTTP or FTP, by using a crawler. So if a website links to an FTP server, its content will be indexed.

– elsadek (edited by S.L. Barth)