How to stop cURL from writing over downloaded files



I'm using

$ xargs -n 1 curl -O < gwurls.txt

to grab a long list of files. Unfortunately, the site I'm grabbing from relies on the path to provide uniqueness, so -O doesn't know the difference between a/1.pdf and b/1.pdf and clobbers the file.

Is there a simple way around this?


Posted 2013-03-21T22:20:55.067




A couple of approaches:

  • Do umask 222 (or umask 277, if your umask is currently 77; i.e., add 200 to your umask).  This will cause all files that you create to be protected r--(whatever) instead of rw-(whatever), so, once you’ve created a file, you shouldn’t be able to overwrite it without chmoding it first (unless you’re running as root).  This answers the question you posed in your title, but it doesn’t really solve your problem; it just means that you’ll successfully download and retain a/1.pdf and miss out on b/1.pdf, rather than the other way around.  (If it’s any consolation, you’ll get error messages alerting you to the collisions.)
  • The problem seems to lie in your gwurls.txt file, which naïvely lists both a/1.pdf and b/1.pdf, so try to fix it there.  Mangle it with sed or something to look like
  a/1.pdf    a_1.pdf
  b/1.pdf    b_1.pdf

… and then write a script that runs curl with a URL of $1 and an output specification of $2, and run

  xargs -n 2 your_script < modified_gwurls.txt

so xargs will run

your_script  a/1.pdf  a_1.pdf
your_script  b/1.pdf  b_1.pdf

This gets messy if any of the filenames have whitespace in them, but I guess that isn’t possible for URLs, is it?
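One way to sketch the mangling step, assuming each line of gwurls.txt is a URL without whitespace. The function name make_pairs and the example.com host are made up for illustration. The sed program keeps a copy of the URL, strips scheme and host from the working copy, turns the remaining slashes into underscores, and joins the two with a space:

```shell
#!/bin/sh
# make_pairs (hypothetical name): read URLs on stdin and print
# "URL output_name" pairs, one per line.
make_pairs() {
    sed -E 'h; s|^[^:]+://[^/]+/||; s|/|_|g; H; x; s/\n/ /'
}

# Example input; example.com stands in for the real host.
printf '%s\n' 'http://example.com/a/1.pdf' 'http://example.com/b/1.pdf' | make_pairs
# prints:
#   http://example.com/a/1.pdf a_1.pdf
#   http://example.com/b/1.pdf b_1.pdf
```

With that in place, your_script probably needs only one line, curl --output "$2" --url "$1", and modified_gwurls.txt can be produced with make_pairs < gwurls.txt > modified_gwurls.txt.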


Posted 2013-03-21T22:20:55.067


I was completely focused on the curl call and avoided the obvious, which was modifying the scrape. Thanks for the perspective. – PHPeer – 2013-03-26T22:22:32.027



By far the easiest solution would be installing Wget and executing the following command:

wget --input-file=gwurls.txt

Wget automatically renames the output file if a file of the same name already exists: it appends a numeric suffix, so a second 1.pdf is saved as 1.pdf.1.


If you strip the scheme and host from the URL, you can replace all slashes with underscores (or any other character) and save the files under those names. To be on the safe side, you could replace pre-existing underscores with double underscores.

With bash, this should work:

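while read -r URL; do
    OUTPUT=${URL#*://*/}       # strip scheme and host
    OUTPUT=${OUTPUT//_/__}     # double pre-existing underscores
    OUTPUT=${OUTPUT//\//_}     # replace slashes with underscores
    curl --output "$OUTPUT" --url "$URL"
done < gwurls.txt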

How it works:

  • while read -r URL; do ... done < gwurls.txt reads the contents of gwurls.txt line by line and stores the entire line (without leading or trailing spaces) into the variable URL and executes ....

  • The three OUTPUT=... commands perform the mentioned replacements using bash string manipulation.

  • curl --output "$OUTPUT" --url "$URL" downloads the file and stores it with the desired filename.
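To see what the three substitutions do, here is a small sketch that wraps them in a function and runs them on two sample URLs. The function name url_to_name and the example.com URLs are hypothetical. The second call shows why doubling existing underscores matters: without it, both URLs would end up as a_my_file.pdf.

```shell
#!/usr/bin/env bash
# url_to_name (hypothetical name): apply the three substitutions to one URL.
url_to_name() {
    local name=${1#*://*/}      # strip scheme and host
    name=${name//_/__}          # double existing underscores
    name=${name//\//_}          # turn slashes into underscores
    printf '%s\n' "$name"
}

url_to_name 'http://example.com/a/my_file.pdf'   # prints a_my__file.pdf
url_to_name 'http://example.com/a_my/file.pdf'   # prints a__my_file.pdf
```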

Directory structure

It's also possible to re-create the directory structure of the server using a similar approach.

With bash, this should work:

while read -r URL; do
    OUTPUT=${URL#*://*/}       # strip scheme and host, keep the path
    curl --create-dirs --output "$OUTPUT" --url "$URL"
done < gwurls.txt

Here, the --create-dirs switch makes cURL create the directory a if OUTPUT reads a/1.pdf.
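A possible variation, sketched under the assumption that gwurls.txt holds one full URL per line: it mirrors the server's directories and also refuses to touch a file that already exists locally. The names url_to_path and fetch_tree are hypothetical, and the -e check is an addition for illustration, not part of the loop above.

```shell
#!/usr/bin/env bash
# url_to_path (hypothetical name): map a URL to a relative path that
# mirrors the server, by dropping scheme and host.
url_to_path() {
    printf '%s\n' "${1#*://*/}"
}

# fetch_tree (hypothetical name): download every URL read from stdin,
# skipping any whose target file is already present.
fetch_tree() {
    local url out
    while read -r url; do
        out=$(url_to_path "$url")
        if [ -e "$out" ]; then
            echo "skipping $url: $out already exists" >&2
            continue
        fi
        curl --create-dirs --output "$out" --url "$url"
    done
}
```

It would be run as fetch_tree < gwurls.txt. Skipping instead of overwriting means a rerun after an interrupted download only fetches the files that are still missing, although a partially written file would also be skipped.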


Posted 2013-03-21T22:20:55.067


Agree, wget would've been the best approach, but needed a curl solution. Why cURL doesn't have auto-renaming built-in is curious. – PHPeer – 2013-03-26T22:31:04.470