15

I have a web site hosted from my server. Sometimes, I upload database manipulation scripts to a folder which is three levels deep in the website and run them using my web browser. These scripts should not be accessed by outside users and I remove them within hours of uploading them. Is there a risk that these scripts will be found or crawled if no other page links to them? If so, then how can they be discovered?

I also have a test sub-domain at user.mysite.com. Is it possible for outsiders who do not know the sub-domain to discover its existence?

Hoytman

4 Answers

25

Your "secret files" remain secret exactly as long as their names (with full path) remain secret. You may consider the path as a kind of password. Note that the paths will leak to various places (proxy, Web server logs, history of your browser...). If the files are important and sensitive, you should just do things properly:

  • Use SSL for upload and access to these files.
  • Setup an access password for the directory where the files are.

That way, you are back in known waters: you have a (part of a) Web site with sensitive data, protected by a password. Make the password strong, and you are all set.
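A minimal sketch of such directory protection, assuming Apache with basic authentication enabled (all paths here are illustrative):

```apacheconf
# .htaccess placed in the directory containing the scripts
AuthType Basic
AuthName "Restricted area"
# The password file should live outside the web root
AuthUserFile /home/user/.htpasswd
Require valid-user
```

The password file itself can be created with Apache's `htpasswd` utility, e.g. `htpasswd -c /home/user/.htpasswd admin`.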


In the case of the sub-domain: that "sub-domain" is advertised to the World at large through the DNS. It is possible to configure DNS servers so that outsiders cannot easily enumerate all sub-domains of a domain, but this takes some care. Moreover, whenever you access that sub-domain, your machine will issue DNS queries (for the corresponding IP address); these queries travel without any particular protection, and contain the sub-domain name. Thus, they are easy prey for a passive eavesdropper (i.e. "people connected to the same WiFi access point as you"). It would be overly optimistic to believe in the secrecy of a sub-domain.
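The point about DNS queries travelling unprotected can be made concrete by encoding a query name the way the DNS wire format does (RFC 1035): each label is length-prefixed and the name sits verbatim, unencrypted, in the packet. A small sketch (the hostname is illustrative):

```python
def encode_qname(hostname: str) -> bytes:
    """Encode a hostname as a DNS QNAME: each label is prefixed with its
    length, and the sequence ends with a zero byte. No encryption occurs."""
    out = b""
    for label in hostname.split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

# The sub-domain name is plainly visible in the query bytes:
print(encode_qname("user.mysite.com"))
# → b'\x04user\x06mysite\x03com\x00'
```

Anyone who can see the packet, such as another user of the same WiFi access point, reads the name directly.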

Thomas Pornin
  • 8
    And just like passwords, paths can be brute-forced. Crawlers use dictionary lists to try common directory names and try to map out an entire website, even if there are no links to those locations. – schroeder Jul 14 '14 at 19:59
5

I see four ways the path can leak:

1) brute force

2) malware on your host

3) accident =) you might share the path with someone, forget to delete the files, or link to them from somewhere by mistake.

4) Google Chrome =) because Google may use information from Chrome (and probably Firefox) to feed its crawler.

The same applies to DNS. Relying on the secrecy of a path is bad practice.
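Point 1 can be sketched as follows: a scanner combines a wordlist of common directory and file names into candidate URLs and simply requests each one, keeping those that do not return 404. A minimal sketch (the wordlists and site are illustrative; a real tool ships lists with thousands of entries and issues the actual HTTP requests):

```python
from itertools import product
from urllib.parse import urljoin

# Illustrative wordlists; real scanners use much larger dictionaries.
dirs = ["admin", "db", "scripts", "test"]
files = ["upgrade.php", "migrate.php", "dump.sql"]

def candidates(base: str):
    """Yield candidate URLs a dictionary scanner would probe."""
    for d, f in product(dirs, files):
        yield urljoin(base, f"{d}/{f}")

urls = list(candidates("http://example.com/"))
print(len(urls))  # 12 candidates even from this tiny wordlist
```

With realistic wordlists the candidate space is large but entirely automatable, which is why an unlinked path is only as safe as it is unguessable.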

  • 14
    *“because google use information from chrome (and probably ff) to feed crawler”* [citation needed] – Ry- Jul 14 '14 at 19:35
  • 4
    I agree with minitech... It sounds like #4 might easily be followed up with "5) you didn't strap down your tinfoil hat." Unless you can give a citation for Chrome being used on the crawler, don't suggest it is the case. – apnorton Jul 14 '14 at 20:24
  • 5
    This is common knowledge but OK, here's your citation. https://support.google.com/websearch/answer/106230 "(About autocomplete predictions) ... all of the predictions that are shown in the drop-down list have been typed before by Google users or appear on the web." So yes, things you type into the omnibox are aggregated into Google's suggestions. The autocomplete FAQ is linked to in the documentation for the Google Chrome Omnibox here https://support.google.com/chrome/answer/95656?hl=en&ref_topic=14676. But they exclude porn, so if you don't want your site indexed, put a naked lady in it. – Wug Jul 14 '14 at 22:14
  • 5
    @Wug "Google users" != "Google Chrome users", so that doesn't fully back up your claim. It is probable that they gather URLs submitted by browsers to autocomplete and anti-phishing features though. I have reason to suspect them of harvesting URLs out of messages sent from or to GMail as well, but no hard evidence or citation. – IMSoP Jul 14 '14 at 23:47
  • So, you disagree with me, but you agree with what I said? I'm confused. – Wug Jul 15 '14 at 02:50
  • At least they do harvest URLs out of Google Hangouts - each URL posted there points to google, which redirects to the real target. – Jost Jul 15 '14 at 09:45
  • Even if there is no/shaky evidence to support the idea that Google harvests and analyzes all info that they are exposed to, It is still good to note that they (and their employees) ARE exposed to a your information in a wide variety of ways. – Hoytman Jul 15 '14 at 12:26
4

There is another way of leaking information about "secret" web pages: when the page loads other material (web pages, but also JavaScript or style sheets), the Referer header points back to that page.

A common scenario is loading the popular jquery.js directly from code.google.com, leaking the page's URL to Google.

Access statistics are exposed to Google in this case, too.

Note that this kind of leakage cannot be prevented by requiring HTTPS. Password protection helps insofar as only the name and path of the secret page, but not its content, are exposed.

Keep everything local.
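Concretely, "keep everything local" means serving your own copy of a library instead of pulling it from a third-party host (the URLs below are illustrative):

```html
<!-- Leaks the current page's URL to the third party via the Referer header: -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>

<!-- Keeps the request, and thus the Referer, on your own server: -->
<script src="/js/jquery.min.js"></script>
```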

jk - Reinstate Monica
0

First question regarding the administration pages: If you delete the webpages after you are done using them, then no one will be able to find them. However, if you leave them on the server, then they can be found. There are website scanning tools that can scan your website using a dictionary to find "hidden" resources.

In order to better protect these files, I would add directory access controls. If you're using Apache, check out .htaccess. You can also use a robots.txt file to ask search engines not to crawl the areas you want to keep hidden, but keep in mind that robots.txt is itself publicly readable and only honoured by well-behaved crawlers, so it should not list secret paths individually.
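A robots.txt entry along these lines (the directory name is illustrative) keeps cooperative crawlers out of a whole directory without naming the individual files inside it:

```
User-agent: *
Disallow: /private/
```

Since anyone can fetch robots.txt, a scanner can also read it, so this is a crawler hint, not an access control.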

Second question regarding the sub-domain: It is very easy to discover sub-domains for websites. Attackers can try to get your DNS servers to give up the information, or they can simply use Google to uncover these domains. For example, say you want to find all the subdomains for yahoo.com, try the Google search "site:yahoo.com -www". All the sub-domains that Google has crawled are listed in the search results. A quick Python script to parse these out and voila!
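The "quick Python script" step can be sketched as follows: given the hostnames extracted from the search-result URLs, filter for those under the target domain, dropping the bare www host to mirror the `-www` operator (the hostname list is illustrative):

```python
# Illustrative hostnames scraped from search-result URLs.
hosts = [
    "news.yahoo.com", "mail.yahoo.com", "www.yahoo.com",
    "finance.yahoo.com", "example.org",
]

def subdomains(hosts, domain="yahoo.com"):
    """Keep hostnames under `domain`, excluding the plain www entry
    (mirroring the `site:yahoo.com -www` search)."""
    return sorted({h for h in hosts
                   if h.endswith("." + domain) and h != "www." + domain})

print(subdomains(hosts))
# → ['finance.yahoo.com', 'mail.yahoo.com', 'news.yahoo.com']
```

Fetching the result pages themselves is left out here; the point is only that deduplicating hostnames yields the crawled sub-domains.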

ap288