8

I'm trying to create a static mirror of a php application (an old php Gallery installation, specifically). The app produces URLs such as:

view_album.php?set_albumName=MyAlbum

wget downloads these directly to files named the same, complete with question marks. In order to not break inbound links, I'd like to keep those names. But how do I serve them? I'm running into two problems:

  1. Webservers (correctly) attempt to find "view_album.php" and pass it the query args, rather than finding a file with a question mark in its name. How do I tell a webserver to look for files with a question mark in them? Renaming the files isn't desirable, as it would break inbound links. I can't tell the inbound linkers to %-encode their URLs.

  2. The file names don't end in ".html", so most webservers won't send an HTML content-type header. What configuration parameters should I look for to force a 'text/html' content-type for all files in a directory or matching a certain pattern?

I'm using lighttpd ultimately, but if you know what sort of configuration might get the desired results with apache/nginx I'd love to hear that too.

masegaloeh
user67641
  • A hideously ugly solution is to set server.error-handler-404 to a script, and have the script look for the filename (in $ENV{REQUEST_URI}), read it, and return it. That's the approach I'm using for a similar "wget"'d site. –  May 24 '15 at 02:17

3 Answers

6

wget downloads these directly to files named the same, complete with question marks.

You can disable that behavior with --restrict-file-names=ascii,windows. In windows mode wget separates the query portion of the file name with @ instead of ?, so this resolves the issue in wget itself, without needing fancy server configs.
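
To keep inbound links working with this approach, the server still has to translate a request with a query string into the renamed file. A minimal sketch of that translation follows, assuming the only renaming that matters is the "@" separator wget uses in windows mode; the function name `windows_mode_name` is hypothetical, and other characters that wget escapes in this mode are not handled.

```python
from urllib.parse import urlsplit


def windows_mode_name(request_target: str) -> str:
    """Guess the local file name for a request like '/a.php?x=y'."""
    parts = urlsplit(request_target)
    name = parts.path.lstrip("/")
    # Assumption: the mirror was made with --restrict-file-names=windows,
    # so the query is joined with "@" instead of "?".
    if parts.query:
        name += "@" + parts.query
    return name


print(windows_mode_name("/view_album.php?set_albumName=MyAlbum"))
# prints: view_album.php@set_albumName=MyAlbum
```

Whatever serves the mirror (a rewrite rule, a 404 handler script, or a small application) would need to apply the same mapping before looking up the file.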

Hello World
  • As mentioned in the question and the comment on another answer, a design goal here is to not break inbound links. If you could recommend a server setup that allows an inbound link with a questionmark to be translated to the result of ``--restrict-file-names...``, then this could work. – user67641 Dec 23 '14 at 17:46
  • What is your definition of `inbound links`? – Hello World Dec 24 '14 at 08:45
  • Say another site on the web links to ``https://mysite.com/gallery.php?foo=bar``. If someone clicks that link, will they 404? That's the goal of this question: create a static mirror for an app that used query strings, but which doesn't break existing links on other sites over which I have no control. – user67641 Dec 24 '14 at 14:22
  • This worked like a charm: `wget -r -D example.com -k -np --restrict-file-names=ascii,windows http://example.com` – user9869932 Aug 16 '15 at 19:24
3

I think you can also fix this by changing the way wget downloads the php files:

wget -r --adjust-extension --convert-links 'http://example.com/index.php?foo=bar'

Option --adjust-extension makes wget save the PHP files with a .html extension, e.g. index.php?foo=bar.html

Option --convert-links makes wget convert the links in the downloaded files to the newly created .html files. Note that this conversion takes place after all files have been downloaded.

See also: http://fvue.nl/wiki/Wget_storing_files_with_question_marks

Freddy Vulto
  • Inbound links will still be broken by this approach, unless the webserver is able to rewrite inbound requests that lack ".html". – user67641 Jan 09 '13 at 23:09
0

I think you can use mod_rewrite in Apache to do this. If you tell mod_rewrite to do what looks like a useless rewrite, you may be able to trick it into serving a file whose name includes the query string. Put something like this in your server config (not, unfortunately, in a .htaccess file or a <Directory> block):

RewriteEngine on
RewriteCond %{QUERY_STRING} (.*)
RewriteRule ^(.*) /path/to/webdir/$1?%1

I don't know what this will do to URLs with multiple question-marks. I think it'll also append a question-mark to URLs with no query-string. You could change the first regexp to (.+), but then it'd strip the question-mark from URLs with an empty query-string.

If that doesn't work, you could rename the files so that they contain no question marks (e.g. replace each question mark with a percent sign or something similar) and use:

RewriteEngine on
RewriteCond %{QUERY_STRING} (.*)
RewriteRule ^(.*) /path/to/webdir/$1\%%1

I don't know how this deals with PATH_INFO. If Gallery uses it, you may need to add something like

RewriteCond %{PATH_INFO} (.*)
RewriteRule ^(.*) /path/to/webdir/$1/%1

(But then you'd have a conflict if Gallery used both "http://.../index.php" and "http://.../index.php/foobar", since you couldn't have index.php on the filesystem be both a file and a directory. You could get around that by doing some more name munging.)
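
The mapping these rewrite rules are aiming for can also be sketched outside Apache, for example in a small handler script. The following is only an illustration under stated assumptions: the function name `resolve_mirror_file` is hypothetical, the mirror lives on a POSIX filesystem (where "?" is legal in file names), percent-encoded characters in the request are not decoded, and PATH_INFO-style URLs are not handled.

```python
import os
from urllib.parse import urlsplit


def resolve_mirror_file(root: str, request_target: str):
    """Return the mirrored file for a request target, or None if absent."""
    parts = urlsplit(request_target)
    name = parts.path.lstrip("/")
    # Re-attach the query so the name matches what wget wrote to disk.
    # A bare trailing "?" (empty query) is dropped by urlsplit.
    if parts.query:
        name += "?" + parts.query
    root = os.path.realpath(root)
    candidate = os.path.realpath(os.path.join(root, name))
    # Refuse anything that escapes the web root (e.g. via "..").
    if os.path.commonpath([root, candidate]) != root:
        return None
    return candidate if os.path.isfile(candidate) else None
```

A call such as `resolve_mirror_file("/path/to/webdir", "/view_album.php?set_albumName=MyAlbum")` would return the full path of the file named `view_album.php?set_albumName=MyAlbum` if it exists under the web root, and `None` otherwise.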

While we're throwing in a bunch of mod_rewrite, might as well use it to set MIME types:

RewriteRule \.php - [T=text/html]

or

RewriteCond %{REQUEST_FILENAME} \.jpg$
RewriteRule ^ - [T=image/jpeg]

or similar stuff. (Note how the first one would break if an album or photo name contained ".php", etc.)
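
One way to avoid the ".php in an album name" problem is to look only at the part of the file name before the question mark when choosing a type. A rough sketch of that idea, not tied to any particular server: the function name `content_type_for` and the suffix tables are hypothetical, and it assumes every mirrored ".php" page is HTML.

```python
import posixpath

# Assumption: all mirrored .php pages contain HTML.
HTML_SUFFIXES = {".php", ".html", ".htm"}
OTHER_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".css": "text/css",
}


def content_type_for(file_name: str) -> str:
    """Pick a Content-Type from the name, ignoring anything after '?'."""
    path_part = file_name.partition("?")[0]
    suffix = posixpath.splitext(path_part)[1].lower()
    if suffix in HTML_SUFFIXES:
        return "text/html"
    return OTHER_TYPES.get(suffix, "application/octet-stream")
```

With this, `view_photo.php?name=x.jpg` is treated as HTML because only `view_photo.php` is inspected, while `albums/my.php.album/photo.jpg` is still treated as a JPEG.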

Let us know how it turns out!

jon