wget - only getting .listing file in every sub-dir


If I use the command

wget --no-remove-listing -P ...../debugdir/gnu/<dir>/ ftp://<ftp-site>/gnu/<dir>/

I get the .listing file for that directory, but I have to step through each sub-directory in turn to retrieve the whole structure. Is there a way to get the .listing files from all (sub)directories with one command?

Also, I have noticed that the file index.html is automatically generated after every access. Is there a way to suppress this behavior?

I had always assumed my Bash processing was the slow part, but some profiling showed that the largest delay is in fetching each .listing file from the successive sub-directories.

Example: checking for specific file extensions in the GNU tree takes about 320 seconds, of which 290 seconds are spent in the wget command above.
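For illustration, the stepping I do now looks roughly like the loop below (a sketch: "subdirs" is a hypothetical list of sub-directory names, and <ftp-site> is a placeholder as above). The overhead is one wget invocation, and one fresh FTP connection, per sub-directory:

# Hypothetical per-directory loop; most of the 290 seconds is spent here,
# since every iteration starts a new wget process and FTP session.
for dir in "${subdirs[@]}"; do
    wget --no-remove-listing -P "debugdir/gnu/$dir/" "ftp://<ftp-site>/gnu/$dir/"
done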

Frans

Posted 2012-05-11T21:22:35.853

Reputation: 41

Answers


If you are looking to build an index of an FTP site, that is, to list all of the subdirectories and files on the site without actually retrieving them, you can do this:

wget -r -x --no-remove-listing --spider ftp://ftp.example.com/

where:

  • -r => recursive (i.e., visit subdirectories)
  • -x => force the mirrored subdirectories to be created on the client
  • --no-remove-listing => leave ".listing" files in each subdirectory
  • --spider => visit but do not retrieve files
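Applied to the tree in the question, with <ftp-site> still a placeholder, that single invocation replaces the per-directory stepping:

wget -r -x --no-remove-listing --spider ftp://<ftp-site>/gnu/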

This will create a sparse directory tree on the client with the same structure as the server, containing only ".listing" files that show the contents (the result of "ls -l") of each directory. If you want to digest that into a single list of path-qualified file names (like the output of "find . -type f"), run this at the root of that sparse tree:

# Strip the CRLF line endings that FTP servers put in .listing files.
find . -type f -exec dos2unix {} \;
# For each file entry (skipping directories and short header lines such as
# "total NN"), reformat the date via date(1) and print "timestamp, size,
# path". gensub() requires gawk; close(C) releases the pipe to date(1) so
# a large tree does not exhaust file descriptors.
( find . -maxdepth 999 -name .listing -exec \
awk 'NF >= 9 && $1 !~ /^d/ {C="date +\"%Y-%m-%d %H:%M:%S\" -d \"" $6 " " $7 " " $8 "\""; \
C | getline D; close(C); printf "%s\t%12d\t%s%s\n", D, $5, gensub(/[^/]*$/,"","g",FILENAME), $9}' \
{} \; 2>/dev/null ) | sort -k4
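Two notes on the pipeline: the leading dos2unix pass matters because FTP servers send .listing files with CRLF line endings, which would otherwise leave a stray carriage return glued to the last field; and gensub() is a GNU awk extension, so on systems where awk is not gawk you would invoke gawk explicitly.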

which will give you output like

2000-09-27 00:00:00       261149    ./README
2000-08-31 00:00:00       727040    ./foo.txt
2000-10-02 00:00:00      1031115    ./subdir/bar.txt
2000-11-02 00:00:00      1440830    ./anotherdir/blat.txt
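Since the original question was about checking for specific file extensions, you can save that manifest once and filter it locally instead of re-crawling the site each time. A minimal sketch, assuming you redirected the pipeline's output to a file named manifest.txt, with ".tar.gz" standing in for whatever extension you care about:

# Example only: list all gzipped tarballs recorded in the manifest.
grep '\.tar\.gz$' manifest.txt

# Or match several extensions at once.
grep -E '\.(tar\.gz|zip|sig)$' manifest.txt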

NB: the "-maxdepth 999" option is not necessary in this use case; I left it in from the invocation I was testing, which had an additional constraint: limiting the depth of the tree that was reported. For example, if you scan a site that contains full source trees for several projects, like

./foo/Makefile
./foo/src/...
./foo/test/...
./bar/Makefile
./bar/src/...
./bar/test/...

then you might only want an outline of the projects and top-level directories. In this case, you would give an option like "-maxdepth 2".
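If the per-file detail is not needed for such an outline, a rough alternative sketch is to list the mirrored directories themselves, since the sparse tree on the client reproduces the server's structure:

# Outline of projects and their top-level directories only.
find . -maxdepth 2 -type d | sort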

Codex24

Posted 2012-05-11T21:22:35.853

Reputation: 265