
A customer found about a dozen valid URLs pointing to existing customer-related documents indexed at Yahoo. These URLs were not public and certainly not searchable on the customer's site. The documents have hard-to-guess names like https://site/dir/hardtoguessname.pdf; according to Burp's Sequencer, the entropy of hardtoguessname is estimated at more than 100 bits, which should be enough to prevent plain guessing.
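For a sense of scale, the entropy estimate is easy to sanity-check. A minimal sketch (the variable names here are illustrative, not from the original report) showing how many bits a random URL-safe name carries:

```python
import math
import secrets

# 16 random bytes carry 16 * 8 = 128 bits of entropy.
name = secrets.token_urlsafe(16)   # base64url-encodes to 22 characters
bits = 16 * 8

# Equivalently, per character: each of the 22 characters is drawn from a
# 64-symbol alphabet, contributing log2(64) = 6 bits apiece.
per_char = math.log2(64)
total = len(name) * per_char       # 22 * 6 = 132 bits
```

At well over 100 bits, brute-force guessing of such a name is infeasible, which is why an indexed URL points to a leak of the URL itself rather than to guessing.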

The whole affair is odd for two reasons. First, there are regularly hundreds or thousands of these documents there - why were just those few indexed? Second, these URLs were indexed only by Yahoo, not by Google or Bing.

I don't think those URLs were indexed by ordinary crawling. Is it possible that a user could by chance have gotten those URLs indexed, say, by using the Yahoo toolbar or Yahoo Mail?

countermode
  • Maybe someone had a Yahoo Toolbar or similar that reports visited URLs to them so they can index them, or he used a Yahoo service to send the URL (Yahoo mail?) and they automatically add any URL they see to the index? – André Borie Aug 16 '16 at 15:42
  • @drewbenn: I have no Yahoo account. However, the URLs in question (which are disabled by now) were publicly visible at Yahoo. – countermode Aug 16 '16 at 23:24
  • @André Borie: That's what I thought. I am sure that Yahoo didn't learn about those URLs by plain crawling, or they had found way more of them. – countermode Aug 16 '16 at 23:30

1 Answer


Using a hard-to-guess name is in no way a proper method to prevent search engine indexing. While it may sometimes work, it is the least reliable and effective way to do it.

You should instead use the officially supported method, which is placing a robots.txt file in your web root detailing which files and directories should not be crawled. This is supported by all major search engines including Google, Bing, Yahoo, AOL, etc.

The syntax is pretty straightforward. A simple example that keeps crawlers out of the private directory is:

User-agent: *
Disallow: /private/

See the relevant Wikipedia page on robots.txt for more details.

Note that private documents should not be directly accessible in any case. Your website should use some form of authentication to control what each user may access, and private information or documents should not be stored inside the web root at all.

Julie Pelletier
  • 1,919
  • 10
  • 18
  • The caveat here is as a pentester, the robots.txt is the first file I'm going to look at to see what you don't want me to see. Password protecting the data or folder is the better approach imo to ensure it's not indexed. – DKNUCKLES Aug 16 '16 at 15:45
  • True, but that is irrelevant to search engines. Of course, private documents should be protected and limited to logged in authorized users. It's a good point though and I'll add a note about it. – Julie Pelletier Aug 16 '16 at 15:48
  • Note that [robots.txt can prevent *crawling*, not *indexing*](http://stackoverflow.com/a/35657571/1591669). A search engine might still index a URL which it’s not allowed to crawl. – unor Aug 18 '16 at 16:34