Make wget not download files larger than X size

11

2

Okay, I give up. How do I limit downloads by file size? Say, for example, I don't want any files bigger than 2 MB.

Nathaniel

Posted 2010-03-18T00:45:33.703

Reputation: 3 966

Answers

6

The only limiting option I know of that wget supports is the -Q switch for quota. This is not what you want, though: it stops once the combined size of all the files you've downloaded reaches the limit, rather than limiting each file individually. Passing each link to it separately with the -Q switch won't work either, because, as the man page explains, the quota never affects the download of a single file.
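Since -Q cannot enforce a per-file limit, one workaround is to ask the server for the size first and only then decide whether to download. The following is a minimal sketch, not a built-in wget feature: it assumes the server sends a Content-Length header, and the function names (content_length, within_limit, fetch_if_small) and the 2 MB default are made up for illustration.

```shell
#!/bin/sh
# Sketch: skip a download when the server-reported Content-Length exceeds a limit.
# All function names and the 2 MB default are illustrative assumptions.

# Read HTTP response headers on stdin and print the last Content-Length value, if any.
# (After redirects there may be several; the last one belongs to the final response.)
content_length() {
    tr -d '\r' | awk 'tolower($1) == "content-length:" { len = $2 }
                      END { if (len != "") print len }'
}

# Succeed when size $1 is known and no larger than limit $2 (both in bytes).
within_limit() {
    [ -n "$1" ] && [ "$1" -le "$2" ]
}

# Download URL $1 only if its reported size is at most $2 bytes (default: 2 MB).
fetch_if_small() {
    url=$1
    limit=${2:-2097152}
    # --spider: do not download the body; --server-response: print headers (on stderr).
    size=$(wget --spider --server-response "$url" 2>&1 | content_length)
    if within_limit "$size" "$limit"; then
        wget "$url"
    else
        echo "skipping $url (reported size: ${size:-unknown})" >&2
        return 1
    fi
}
```

Calling fetch_if_small "$url" 2097152 would then skip anything reported as larger than 2 MB. Files whose size the server does not report are skipped as well, which is a conservative choice you may want to change.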

I don't know what environment you're using, but the Heritrix crawler supports file size limits through its max-length-bytes setting and runs on the Java platform.

From their user manual:

  • max-length-bytes

Maximum number of bytes to download per document. Will truncate file once this limit is reached.

By default this value is set to an extremely large value (in the exabyte range) that will never be reached in practice.

John T

Posted 2010-03-18T00:45:33.703

Reputation: 149 037

Hmmm. Okay. That reiterated a lot of what I found out but good answer anyway. I didn't know that Heritrix truncated files instead of skipping them, though. – Nathaniel – 2010-03-18T22:28:47.203

3

If it's about "downloading 2 MB max" rather than "downloading only files of at most 2 MB", you could just limit the output saved to disk.

wget -O - "$url" | head -c 1024 (optionally followed by > "$SaveAsFile") saves the first KB and discards the rest.

(That's enough to see an "OK:$Message" without killing my /tmp with tons of error messages from the remote ;-))
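The same idea can be wrapped in a small helper. This is only a sketch: the function name fetch_truncated and its argument order are made up for illustration, and it assumes a head that supports -c (GNU and BSD do).

```shell
#!/bin/sh
# Sketch: save at most MAX_BYTES of a URL to a file.
# fetch_truncated is a hypothetical helper name, not part of wget.
fetch_truncated() {
    url=$1        # what to download
    max_bytes=$2  # how many bytes to keep, e.g. 2097152 for 2 MB
    out=$3        # where to save the (possibly truncated) result
    # -q: quiet; -O -: write the document to stdout so head can cut it off.
    wget -q -O - "$url" | head -c "$max_bytes" > "$out"
}
```

For example, fetch_truncated "$url" 2097152 "$SaveAsFile" keeps at most the first 2 MB. Note that this truncates large files rather than skipping them. Once head has read enough it exits, so wget should normally stop shortly afterwards when it can no longer write to the pipe.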

Tabakhase

Posted 2010-03-18T00:45:33.703

Reputation: 131

1

This is possible with the help of third-party patches: http://yurichev.com/wget.html

Dennis Yurichev

Posted 2010-03-18T00:45:33.703

Reputation: 131

@KronoS there's an "edit" button right there if you think the answer needs to be expanded. Personally it seems fine as-is, given that sentence #1 of the linked page explains the new option… – supervacuo – 2015-06-27T19:39:57.333

Review this post, you must. – James Mertz – 2013-04-03T22:18:45.377