How can I make wget rename downloaded files to not include the query string?

33

I'm downloading a site with wget and a lot of the links have queries attached to them, so when I do this:

wget -nv -c -r -H -A mp3 -nd http://url.to.old.podcasts.com/

I end up with a lot of files like this:

1.mp3?foo=bar
2.mp3?blatz=pow
3.mp3?fizz=buzz

What I'd like to end up with is:

1.mp3
2.mp3
3.mp3

This is all taking place on Ubuntu Linux, and I've got wget 1.10.2.

I know I can rename everything with a script after the download finishes. However, I'd really like a solution from within wget, so I can see the correct names as the download is happening.

Can anyone help me unravel this?

Keith Twombley

Posted 2009-10-26T19:02:57.113

Reputation: 552

Post your question at www.stackoverflow.com. – Deniz Zoeteman – 2009-10-26T19:42:24.867

@TutorialPoint why? question is looking for a within-wget-way-to-do-it, SO would just migrate it back here. – quack quixote – 2009-10-26T19:57:44.100

Well, there is no within-wget-way-to-do-it – ayrnieu – 2009-10-26T20:32:45.497

@ayrnieu: not in one command, no. and not without a helper. but you can certainly do it with as few as n+1 wget commands (if not fewer). – quack quixote – 2009-10-26T20:36:52.773

Answers

24

If the server is kind, it might be sticking a Content-Disposition header on the download advising your client of the correct filename. Telling wget to listen to that header for the final filename is as simple as:

wget --content-disposition

You'll need a newish version of wget to use this feature (the option appeared around wget 1.11, so the question's 1.10.2 is likely too old).

I have no idea how well it handles a server claiming a filename of '/etc/passwd'.
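Assuming the server actually sends the header, the flag can simply be added to the original command from the question; a sketch (the URL is the question's placeholder, and this needs a live site to run):

```shell
# Sketch only: the question's original flags plus --content-disposition.
# Requires a wget new enough to support the option (around 1.11 or later);
# the URL is the placeholder from the question.
wget -nv -c -r -H -A mp3 -nd --content-disposition http://url.to.old.podcasts.com/
```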

Filox

Posted 2009-10-26T19:02:57.113

Reputation: 241

I have no problem with this answer, as it no doubt works for some situations. Unfortunately, it didn't work for me with respect to some cloudfront-served pages with ?v=blah type versioning in them. There may be some cloudfront-specific way to request a document without these, I don't know, but I failed to find one, so something like one of the other answers may well be necessary in such a case. (If anyone knows of a way to strip - or get Cloudfront not to serve - the v= strings, I'd love to hear about it.)

– lindes – 2019-04-16T17:50:19.783

18

I realized after processing a large batch that I should have instructed wget to ignore the query strings. I did not want to do it all over again, so I made this script, which worked for me:

#!/bin/bash
find "${1:-.}" -type f | while read -r i; do
    mv "$i" "${i%%\?*}"
done

Put that in a file like rmqstr and chmod +x rmqstr. Syntax: ./rmqstr <directory> (defaults to .)

It will remove the query strings from all filenames recursively.

Gregory Wolf

Posted 2009-10-26T19:02:57.113

Reputation: 326

I would add -name "*\?*" to the find part to limit it to only the needed files :) – Arkadiusz 'flies' Rzadkowolski – 2019-03-03T17:30:45.137

4

I think that, in order to get wget to save to a filename different from the one the URL specifies, you need to use the -O filename argument. That only does what you want when you give it a single URL -- with multiple URLs, all downloaded content ends up in filename.

But that's really the answer. Instead of trying to do it all in one wget command, use multiple commands. Now your workflow becomes:

  1. Run wget to get the base HTML file(s) containing your links;
  2. Parse for URLs;
  3. For each URL ending in .mp3:
    1. process the URL to get a filename (e.g. turn http://foo/bar/baz.mp3?gargle=blaster into baz.mp3);
    2. (optional) check that the filename doesn't already exist;
    3. run wget <URL> -O <filename>.

That solves your problem, but now you need to figure out how to grab the base files to find your mp3 URLs.

Do you have a particular site/base URL in mind? Steps 1 and 3 will be easier to handle with a concrete example.
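The filename logic in step 3.1 is just a string operation, so it can live in a small helper. A sketch of the whole workflow, where url_to_name and BASE are hypothetical names (the driver loop is commented out because it needs a live site):

```shell
# url_to_name: step 3.1 -- strip the query string, keep the basename.
url_to_name() {
  basename "${1%%\?*}"
}

url_to_name "http://foo/bar/baz.mp3?gargle=blaster"   # prints: baz.mp3

# Hypothetical driver for steps 1-3; BASE is a placeholder URL.
# BASE='http://url.to.old.podcasts.com/'
# wget -q -O- "$BASE" |
#   grep -o 'http[^"<> ]*\.mp3[^"<> ]*' |
#   while read -r url; do
#     name=$(url_to_name "$url")     # step 3.1
#     [ -e "$name" ] && continue     # step 3.2
#     wget -nv "$url" -O "$name"     # step 3.3
#   done
```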

quack quixote

Posted 2009-10-26T19:02:57.113

Reputation: 37 382

1

I took a similar approach to @Gregory Wolf's, because his code always produced error messages like this:

mv: './file' and './file' are the same file

Thus I first check if there is a query string in the filename before moving the file:

find "${1:-.}" -type f | while read -r f; do
    if [ "$f" = "${f%%\?*}" ]; then continue; fi
    mv "$f" "${f%%\?*}"
done

This will recursively check every file and remove all query strings in their filenames if available.
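The skip test works because of how the ${f%%\?*} expansion behaves; a quick standalone illustration:

```shell
# ${f%%\?*} strips the longest suffix starting at the first literal '?'.
f='3.mp3?fizz=buzz'
echo "${f%%\?*}"        # prints: 3.mp3

# With no '?' in the name, the expansion leaves it unchanged,
# which is exactly what the skip check relies on.
f='plain.mp3'
[ "$f" = "${f%%\?*}" ] && echo "no query string"   # prints: no query string
```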

KittMedia

Posted 2009-10-26T19:02:57.113

Reputation: 111

1

so I can see the correct names as the download is happening.

OK. Use wget as you normally do, and run the post-wget rename script that you normally use, but process wget's output so that it's easier on the eyes:

#!/bin/sh
wget --progress=bar:force "$@" 2>&1 | \
  perl -pe 'BEGIN { $| = 1 } s,(?<=`)([^\x27?]+),\e[36;1m$1\e[0m, if /^Saving/'
cgi-cut # rename files

This will still show the ?foo=bar as you download, but will display the rest of the name in bright cyan.

ayrnieu

Posted 2009-10-26T19:02:57.113

Reputation: 279

This somewhat solves the issue of the filenames being displayed, but the OP also wants the final file name not to have the query string. – Michael Mior – 2014-08-16T11:56:14.917

0

Take a look at these two commands I created to clone a site; after the clone is done, you can execute the second command.

The second command searches the entire clone for filenames containing "?" and removes the query string from each file name.

# Clone the entire site.
wget --content-disposition --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com

# Remove query strings from the downloaded filenames.
find "${1:-.}" -type f -name '*\?*' | while read -r i; do mv "$i" "${i%%\?*}"; done

(See it in GitHub Gist.)

Vijay Padhariya

Posted 2009-10-26T19:02:57.113

Reputation: 1

-2

Even easier is this: https://unix.stackexchange.com/questions/196253/how-do-you-rename-files-specifically-in-a-list-that-wget-will-use

This suggests a method that essentially uses wget's rename capability (it can be altered to include a directory) for multiple files. See the second version proposed.

robcore

Posted 2009-10-26T19:02:57.113

Reputation: 1

Can you please quote the relevant information from the link, so we know which material you believe answers this question. – Ramhound – 2016-01-21T14:28:55.597