
I searched Google with the query `inurl:ftp -inurl:(http|https)` and found many FTP hosts; on some of them I can even add or remove files.
How does Google get information about FTP servers, and how can one keep one's FTP server(s) from being indexed by Google?

– open source guy (question edited by S.L. Barth)

3 Answers


Google apparently scans new domain names and infers from a name like www.example.com or ftp.example.com that an HTTP or FTP server may be responding there, making it worth indexing. Google also follows links discovered in other Web pages; the domain-based indexing is what lets it explore and reference sites which have not been linked from other sites (yet).

To prevent indexing of your FTP server, you can:

  • Put a robots.txt file on your server. See this page for details. Most Web crawlers will honour such a file on an HTTP server; Google also looks for it on FTP servers (but Google notes that such support is "Google-specific"). A sketch of such a file follows this list.
  • Disable anonymous login. Instead, enforce use of a specific login+password pair; you can then publish that login and password on an explanatory Web page. Google's robot will not be able to "understand" that Web page and will not try anything beyond anonymous login. (A configuration sketch also follows below.)
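
As a concrete sketch of the first option (example.com is a placeholder; the layout follows what TildalWave describes in the comment below):

```
# robots.txt served from the Web (HTTP) root of www.example.com --
# discourages crawlers from following ftp:// links found on your pages
User-agent: *
Disallow: ftp://*
```

```
# robots.txt placed at the root of ftp.example.com --
# tells Googlebot not to crawl the FTP server itself
User-agent: *
Disallow: /
```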
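
And a sketch of the second option, assuming the vsftpd daemon (other FTP servers have equivalent settings; the option names below are vsftpd's):

```
# /etc/vsftpd.conf (excerpt)
anonymous_enable=NO   # refuse anonymous logins; Google's robot only tries anonymous
local_enable=YES      # allow password-protected local accounts instead
```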
– Tom Leek
    I can confirm independently, from my own server logs, that this is indeed correct. Also worth pointing out (in case it wasn't clear from Tom's answer): Google does not resort to URL/port probing, as [Bing was recently discovered to do, excusing it as 'beta'](http://webmasters.stackexchange.com/questions/44257/strange-request-from-bingbot-for-trafficbasedsspsitemap-xml). A `robots.txt` with `Disallow: ftp://*` following the `User-agent: *` in the HTTP root, and `Disallow: /` in the FTP root, will stop Google from crawling your FTP server. It should stop other crawlers that respect `robots.txt`, too. – TildalWave Mar 06 '13 at 08:08

Google indexes FTP servers in exactly the same way that it indexes web servers. For details, try something like http://lmgtfy.com/?q=How+do+search+engines+work%3F&l=1#

Basically, they start with a bunch of popular web pages, follow all the links in them (which will include links to FTP servers), then follow all the links in those, and so on.
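
As a rough illustration (not Google's actual implementation, and the seed page is a placeholder), that process boils down to a breadth-first traversal of links:

```python
import re
from collections import deque
from urllib.request import urlopen

def crawl(seed, limit=50):
    """Breadth-first crawl: fetch a page, queue every link found in it."""
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) < limit:
        url = queue.popleft()
        if url.startswith("ftp://"):
            # A real indexer would log in anonymously and list the
            # directory here; we just note that the server was found.
            print("found an FTP server to index:", url)
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page: skip it
        for link in re.findall(r'href="((?:https?|ftp)://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)

crawl("https://example.com/")  # hypothetical seed page
```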

There is a standard way to request that search engines and such do not index your site, using a file called robots.txt. A good source of information about this mechanism is http://www.robotstxt.org.
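
For completeness, here is how a well-behaved crawler might consult that file, using Python's standard `urllib.robotparser` module (the host is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical host
rp.read()  # fetch and parse the robots.txt file

# A polite crawler checks every URL before fetching it:
print(rp.can_fetch("*", "https://example.com/private/"))  # False if disallowed
```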

– Graham Hill

AFAIK, Google reaches servers, whether HTTP or FTP, by using a crawler. So if a website links to an FTP server, its content will be indexed.

– elsadek (edited by S.L. Barth)