Making sense of wget -r output


This is the output of the tree command in a directory:

.
|-- asdf.txt
|-- asd.txt
|-- fabc
|   |-- fbca
|   `-- file1.txt
|-- fldr1
|-- fldr2
|   `-- index.html
|-- fldr3
|   |-- cap.txt
|   `-- f01
`-- out.txt

6 directories, 6 files

I start a local HTTP server in this directory. Next, I run the following command:

wget -r -nv --spider --no-parent http://localhost:3000 -o -

...and get the following output:

2017-01-02 20:07:24 URL:http://localhost:3000/ [1580] -> "localhost:3000/index.html" [1]
http://localhost:3000/robots.txt:
2017-01-02 20:07:24 ERROR 404: Not Found.
2017-01-02 20:07:24 URL:http://localhost:3000/fabc/ [897] -> "localhost:3000/fabc/index.html" [1]
2017-01-02 20:07:24 URL:http://localhost:3000/fldr1/ [536] -> "localhost:3000/fldr1/index.html" [1]
2017-01-02 20:07:24 URL:http://localhost:3000/fldr2/ [0/0] -> "localhost:3000/fldr2/index.html" [1]
2017-01-02 20:07:24 URL:http://localhost:3000/fldr3/ [896] -> "localhost:3000/fldr3/index.html" [1]
2017-01-02 20:07:24 URL: http://localhost:3000/asd.txt 200 OK
unlink: No such file or directory
2017-01-02 20:07:24 URL: http://localhost:3000/asdf.txt 200 OK
unlink: No such file or directory
2017-01-02 20:07:24 URL: http://localhost:3000/out.txt 200 OK
unlink: No such file or directory
2017-01-02 20:07:24 URL:http://localhost:3000/fabc/fbca/ [548] -> "localhost:3000/fabc/fbca/index.html" [1]
2017-01-02 20:07:24 URL: http://localhost:3000/fabc/file1.txt 200 OK
unlink: No such file or directory
2017-01-02 20:07:24 URL:http://localhost:3000/fldr3/f01/ [548] -> "localhost:3000/fldr3/f01/index.html" [1]
2017-01-02 20:07:24 URL: http://localhost:3000/fldr3/cap.txt 200 OK
unlink: No such file or directory
Found no broken links.

FINISHED --2017-01-02 20:07:24--
Total wall clock time: 0.3s
Downloaded: 7 files, 4.9K in 0s (43.4 MB/s)
  1. Is wget designed to always request index.html? Can we disable this?
  2. What are those numbers such as 1580, 536, 0/0, etc.?
  3. Why does it say unlink: No such file or directory?

deostroll

Posted 2017-01-02T14:42:17.127


Answers


  1. You can try to skip over files with the --reject option (it accepts wildcards as well):

    wget --reject index.html
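
    Applied to the command from the question, that would look something like this (a sketch; the pattern is quoted so the shell doesn't expand it):

    wget -r -nv --spider --no-parent --reject "index.html*" http://localhost:3000 -o -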

However, you probably don't want to do this. When using wget with -r, it needs some way to get a list of the files inside each directory. So wget requests index.html and parses its content in the hope of finding paths to the other files in that directory. When there is no index.html file in a folder, the web server will usually generate one for wget on the fly; this generated page contains the directory listing. Generating such listings has to be enabled on the web server, otherwise wget will receive an HTTP 404 reply and the recursive download will fail.
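
For reference, one server that generates such listings automatically is Python's built-in http.server module (just an illustration; the question doesn't say which server was actually used):

    # Serve the current directory on port 3000. Directories without an
    # index.html get an auto-generated HTML listing that wget can parse.
    python3 -m http.server 3000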

  2. This is the size of the downloaded document in bytes.
  3. This means that a file couldn't be removed, probably because it was never created in the first place. Do you have write permission on the directory into which you are downloading with wget? (A quick check is sketched below.)
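
If in doubt, here is a quick way to check, assuming your current working directory is wget's download target:

    # Show the permission bits of the current directory (the download target):
    ls -ld .
    # Or simply try to create and delete a file there:
    touch .wget-write-test && rm .wget-write-test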

Edit: After testing wget downloads with --spider and --recursive, I've reproduced your unlink error. It seems that wget uses the content type of the response to determine whether a file can contain links to other resources. If the content-type test fails and the file is not downloaded, wget will still try to remove the temporary file as if it had been downloaded. (This becomes apparent when rerunning wget with --debug: it clearly states Removing file due to --spider in recursive_retrieve().) I guess you've found a bug in wget.
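
To see this yourself, a rerun along these lines should surface the message (debug output goes to stderr, hence the 2>&1):

    wget -r --spider --no-parent --debug http://localhost:3000 2>&1 \
      | grep -i -e 'Removing' -e 'unlink'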

Marek Rost

Posted 2017-01-02T14:42:17.127


Okay, so what is 0/0 then? (In response to answer 2) – deostroll – 2017-01-02T19:27:55.170

Looks like an error when downloading the file, for example receiving HTTP 200 OK from the web server while no file content is actually provided (due to incorrect permissions, misconfiguration, etc.). Did wget download the file contents, or is the file empty? I'm afraid no one can tell you the cause from the zero file size alone. Here is someone facing a similar issue: http://unix.stackexchange.com/q/91785 (the answers suggest enabling wget's debugging option).

– Marek Rost – 2017-01-03T23:17:16.217

I ran it with the --spider option...now does that specifically mean anything? – deostroll – 2017-01-04T05:18:18.493

--spider just means "do not download files". With --recursive, this changes to "temporarily download files that can contain links to other resources". As mentioned in the updated answer, whether a file should be downloaded is determined by its content type. – Marek Rost – 2017-01-06T22:36:15.983