How does one with simple access rights find out the remote host directory structure? I see that web archives are able to not only backup content on the specified domain, but every folder within it.

Philipp
Unihedron

1 Answer

Web archiving tools are usually so-called "spiders": they start with an HTML document, follow every link they find in it, then search those documents for further links, and so on. That way they can usually find any file on the web server that is linked from somewhere on the same domain. The rel="nofollow" attribute on a hyperlink is supposed to keep spiders from following it, but keep in mind that this is only a request which a spider doesn't necessarily honor (the web spider I once wrote didn't, simply because I was too lazy to implement it).
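As a sketch of how such a spider extracts links, here is a minimal Python example using only the standard library; the page content and URLs are made up for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, skipping rel="nofollow" links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # Honoring rel="nofollow" is optional for a spider, but polite.
        rel = (attrs.get("rel") or "").split()
        if "nofollow" in rel:
            return
        href = attrs.get("href")
        if href:
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, href))

page = """
<a href="/about.html">About</a>
<a rel="nofollow" href="/private.html">Private</a>
<a href="docs/manual.pdf">Manual</a>
"""
extractor = LinkExtractor("http://example.com/index.html")
extractor.feed(page)
print(extractor.links)
```

A real spider would fetch each collected URL, repeat the extraction, and keep a set of already-visited URLs to avoid loops.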

Search engines don't just spider a single domain; they also follow links leading to other domains, so they sometimes find files on a domain that aren't linked from anywhere on that domain itself but are linked from other domains. You can prevent this with a robots.txt file that denies indexing of the directories you don't want to appear in search engines. Again, this is just a polite request to search engines, not an effective security measure, but most search engines will respect it.
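For example, a robots.txt like the one below asks crawlers to stay out of a directory, and Python's standard urllib.robotparser shows how a compliant crawler would interpret it (the /secret/ path is purely illustrative):

```python
from urllib.robotparser import RobotFileParser

# robots.txt is served from the site root, e.g. http://example.com/robots.txt
robots_txt = """\
User-agent: *
Disallow: /secret/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("*", "http://example.com/secret/report.pdf"))  # False
print(rp.can_fetch("*", "http://example.com/public/page.html"))   # True
```

Note that the file itself is public, so it also tells a human attacker exactly which directories you consider sensitive.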

Some web servers are configured so that when a visitor requests a directory, the server generates a listing of all files and subdirectories in it. However, this has become quite rare because it is usually not what the webmaster wants; nowadays most web servers are configured out of the box to return a 403 or 404 instead of a directory listing.
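For illustration, here is a crude client-side heuristic for recognizing such auto-generated listings. The strings it checks are the defaults produced by Apache's autoindex module; other servers format their listings differently, so this is a sketch, not a reliable detector:

```python
def looks_like_directory_listing(html: str) -> bool:
    """Heuristic check for an auto-generated directory index page.

    Apache's mod_autoindex, for example, titles such pages "Index of /..."
    and includes a "Parent Directory" link.
    """
    lowered = html.lower()
    return "<title>index of /" in lowered or "parent directory" in lowered

print(looks_like_directory_listing("<html><title>Index of /files</title>..."))  # True
print(looks_like_directory_listing("<html><title>Welcome</title>..."))          # False
```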

When a file isn't linked from anywhere and the web server doesn't intentionally provide a way to list directory contents, the only remaining option is to guess filenames. Some penetration-testing tools automatically guess the names of files that might be interesting to an attacker (like /wp-config.bak, in case the webmaster made a backup of their WordPress configuration and forgot to protect it from public access). But brute-forcing every possible filename over the network is far too slow, so that method won't find every single file either.
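The guessing approach can be sketched as building candidate URLs from a wordlist, which is what tools like dirb or gobuster do at scale; the names and extensions below are a tiny illustrative sample, not a real wordlist:

```python
from urllib.parse import urljoin

def candidate_urls(base_url, names, extensions):
    """Yield guessable URLs built from a wordlist of names and extensions."""
    for name in names:
        for ext in extensions:
            yield urljoin(base_url, name + ext)

# Hypothetical sample wordlist; real tools ship lists with many thousands of entries.
common_names = ["backup", "wp-config", "admin"]
common_exts = [".bak", ".old", ".zip"]

for url in candidate_urls("http://example.com/", common_names, common_exts):
    print(url)
```

A scanner would request each candidate and report those that don't return 404, which is exactly why even a small wordlist already means many requests per target.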

Philipp