Is it possible to discover all the files and sub-directories of a URL?

I wonder if there is any software I can use to discover all the files and sub-directories under a given URL?

For example, given www.some-website.com/some-directory/, I would like to find all the files in the /some-directory/ directory, as well as all of its sub-directories (and their files).

This would be for the HTTP protocol.

Mark

Posted 2011-12-10T14:34:59.463

Reputation: 33

Are you looking for physical files, or do you want to crawl virtual URLs? – karatedog – 2011-12-10T22:35:07.073

@karatedog, I'm looking for actual files. – Mark – 2011-12-11T01:24:34.310

Then it is easy: if directory listing is enabled on the target site (and you have the proper rights), you can browse it as if it were a filesystem. If directory listing is turned off, then you can't. In an Apache config, that's what the Indexes option controls. – karatedog – 2011-12-12T12:05:00.410

Answers

2

On CMS-type systems there are no directories and sub-directories, only routes that correspond to the node IDs assigned to the information you are requesting. These routes are created dynamically, depending on the categorization method used to reach that information (newest posts, categories, tags, brand lists, or any other presentational grouping the site owner uses to help you find the end node).

Therefore the information you are looking for may appear under several different URLs, depending on the route used to reach the end node (virtual page).

To keep the website owner happy by not overloading their server, do what Google does and look for the sitemap.xml file. If the site owner follows best practice, it will be a full listing of the canonical web pages available on the website, which means you only need to access each end virtual page once rather than downloading multiple copies of the same thing.
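
To illustrate the sitemap approach, here is a minimal Python 3 sketch using only the standard library. The URL is just the placeholder from the question, and it assumes a plain sitemap rather than a sitemap index (which would list further sitemap files to fetch):

    # Minimal sketch: list the canonical URLs declared in a site's sitemap.xml.
    # The URL below is a placeholder; a real site may use a sitemap index instead.
    from urllib.request import urlopen
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "http://www.some-website.com/sitemap.xml"
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    with urlopen(SITEMAP_URL) as resp:
        tree = ET.parse(resp)

    # Each <url><loc> element is one canonical page the owner wants crawled.
    for loc in tree.findall(".//sm:loc", NS):
        print(loc.text.strip())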

Fiasco Labs

Posted 2011-12-10T14:34:59.463

Reputation: 6 368

1

It depends on how the server of the site you want to crawl is set up. The URL does not always correspond to the physical directory where the files are located.

Normally, if a directory on the server has no index file (and directory listing is enabled), the server will return the directory contents. If an index file is present, it is almost impossible to fetch the directory contents directly.
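
As a rough illustration, here is a small Python 3 sketch that fetches a directory URL (the placeholder from the question) and uses the typical "Index of /" title as a heuristic to guess whether an auto-generated listing came back. It is not a reliable detector, just a quick probe:

    # Heuristic check: does this directory URL return an auto-generated listing?
    # Placeholder URL; "Index of /" is the title Apache/nginx listings usually use.
    from urllib.request import urlopen

    DIR_URL = "http://www.some-website.com/some-directory/"

    with urlopen(DIR_URL) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")

    if "Index of /" in html:
        print("Directory listing appears to be enabled.")
    else:
        print("An index page came back (or listing is off); crawl its links instead.")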

However, you can use a site crawler such as Internet Download Manager (IDM) to crawl a website by following the links in its HTML content. IDM retrieves all the HTML, image, multimedia, text, and PDF files on a website for you.

Be sure to check the site's terms of service before crawling.
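
IDM is a commercial GUI tool, but the underlying idea is generic: fetch a page, extract its links, and follow only those under the starting path. Below is a bare-bones Python 3 sketch of that loop (standard library only, placeholder URLs, no handling of non-HTML responses). It also checks robots.txt and waits between requests, which any real crawl should do:

    # Bare-bones link-following crawler limited to one path prefix.
    # Sketch only: placeholder URLs, no content-type or error handling.
    import time
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urldefrag
    from urllib.request import urlopen
    from urllib.robotparser import RobotFileParser

    START = "http://www.some-website.com/some-directory/"

    class LinkParser(HTMLParser):
        """Collects the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    robots = RobotFileParser("http://www.some-website.com/robots.txt")
    robots.read()

    seen, queue = {START}, deque([START])
    while queue:
        url = queue.popleft()
        if not robots.can_fetch("*", url):
            continue
        with urlopen(url) as resp:
            page = resp.read().decode("utf-8", errors="replace")
        print(url)
        parser = LinkParser()
        parser.feed(page)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # resolve and drop #fragment
            if link.startswith(START) and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(1)  # be polite: roughly one request per second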

afterburner

Posted 2011-12-10T14:34:59.463

Reputation: 158

You can ask the server with an ls command ... – ragnq – 2011-12-10T14:57:53.303

ls is a Linux command and is only available if you have shell access to the server – afterburner – 2011-12-10T15:00:52.837

and FTP, and it may work over SSH, Telnet, etc. on a Unix host – ragnq – 2011-12-10T15:02:39.897

@ragnq: It seems quite obvious to me that the OP doesn't own the server. – cYrus – 2011-12-10T15:28:40.620

@ragnq, cYrus and drfanai are right, I don't own the server, so I cannot execute those commands. Thanks for the input anyway. – Mark – 2011-12-10T15:53:59.280

Directory listing just needs to be enabled on that server, that's all. – karatedog – 2011-12-10T22:34:31.163

0

wget does this, if you're on *nix. It's free and open source, and it is available for Windows as well.

Of course, the limitations are the same as mentioned above. Most websites nowadays don't have URLs that map directly to directory structures, but you can effectively mirror an entire site with wget. That is, you can download every page on the site that is publicly available and hyperlinked from a page you can reach.

Many sites will block you if they detect an unauthorized crawler mirroring their site too fast, so you may need to be polite: have the crawling program pull down only a few pages per second.
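
A polite mirroring run along those lines could be kicked off like this. This is a sketch that wraps wget in Python just to keep the examples in one language; it assumes wget is installed, the URL is the placeholder from the question, and you should check the wget man page for the options available in your version:

    # Sketch: polite mirror of one directory subtree using wget.
    # Assumes wget is installed; the URL is a placeholder.
    import subprocess

    subprocess.run([
        "wget",
        "--recursive",        # follow links
        "--no-parent",        # stay under /some-directory/
        "--wait=1",           # pause between requests
        "--limit-rate=100k",  # cap download bandwidth
        "--convert-links",    # rewrite links so the mirror browses locally
        "http://www.some-website.com/some-directory/",
    ], check=True)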

Jones

Posted 2011-12-10T14:34:59.463

Reputation: 161