On CMS-type systems there are no directories and subdirectories, only routes that map to the node or ID assigned to the content you are requesting. These routes are created dynamically, depending on the categorization method used to reach that content (newest posts, categories, tags, brand lists, or any other presentational grouping the site owner uses to help you find the end node).
Therefore the same content may appear under several different URLs, depending on the route used to reach the end node (the virtual page).
To keep the website owner happy by not overloading their server, do what Google does and look for the sitemap.xml file. If the site owner follows best practice, it will be a full listing of the canonical pages available on the website, which means you only have to fetch each end virtual page once instead of downloading multiple copies of the same thing.
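As a rough illustration of the sitemap approach, here is a minimal Python sketch that reads a sitemap and collects the page URLs it lists. It assumes the site follows the sitemaps.org format (a `urlset` of `loc` entries, or a `sitemapindex` pointing at further sitemaps); the function names `parse_sitemap` and `fetch_sitemap_urls` are made up for this example, and a real crawler would also want rate limiting and a robots.txt check.

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (page_urls, child_sitemap_urls) for one sitemap document."""
    root = ET.fromstring(xml_text)
    locs = [
        el.text.strip()
        for el in root.iter(SITEMAP_NS + "loc")
        if el.text and el.text.strip()
    ]
    # A sitemap index lists other sitemaps rather than pages.
    if root.tag == SITEMAP_NS + "sitemapindex":
        return [], locs
    return locs, []


def fetch_sitemap_urls(sitemap_url):
    """Download a sitemap, follow any sitemap index, and return unique page URLs."""
    seen, pages, queue = set(), [], [sitemap_url]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        with urllib.request.urlopen(url, timeout=10) as resp:
            page_urls, children = parse_sitemap(resp.read())
        pages.extend(page_urls)
        queue.extend(children)
    # Drop duplicates while keeping the first-seen order.
    return list(dict.fromkeys(pages))
```

Calling something like `fetch_sitemap_urls("http://example.com/sitemap.xml")` (a placeholder URL) would then give you one URL per page, so each page is requested only once.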
Are you looking for physical files, or do you want to crawl virtual URLs? – karatedog – 2011-12-10T22:35:07.073
@karatedog, I'm looking for actual files. – Mark – 2011-12-11T01:24:34.310
Then it is easy. If directory listing is enabled on the target site (and you have the proper rights), you can explore it as if it were a filesystem. If directory listing is turned off, you can't. In an Apache config, that's what the Indexes option does. – karatedog – 2011-12-12T12:05:00.410
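For the directory-listing case, a minimal Python sketch of pulling entries out of an already downloaded listing page could look like the following. It assumes an Apache-style autoindex page where entries are relative links and subdirectories end with a slash; the names `listing_entries` and `_LinkCollector` are hypothetical, and the filtering of sort links and parent-directory links is a heuristic, not something every server's listing format guarantees.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def listing_entries(listing_html, base_url):
    """Split the links of a listing page into (subdirectories, files).

    base_url is the URL of the listing itself and should end with "/".
    """
    collector = _LinkCollector()
    collector.feed(listing_html)
    dirs, files = [], []
    for href in collector.hrefs:
        # Heuristic: skip column-sort links ("?C=N;O=D"), fragments,
        # absolute paths or URLs, and links to the parent directory.
        if href.startswith(("?", "#", "/", "../")) or "://" in href:
            continue
        target = urljoin(base_url, href)
        (dirs if href.endswith("/") else files).append(target)
    return dirs, files
```

Walking the site is then a matter of fetching each returned subdirectory URL and repeating the same step, which only works while the server keeps serving listings.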