Wget with URL that contains #

I am trying to download a URL that is like http://www.somesite.com/restaurants.html#photo=22x00085.

I put it in between single quotes, but it only downloads http://www.somesite.com/restaurants.html which is not the correct page.

Is there a solution?

user1289749

Posted 2012-10-13T15:46:46.577

Reputation: 111

can't test this now, but from what I remember %20 works for space, so %23 would probably work for # (%23 is the percent-encoding for #) – lupincho – 2012-10-13T16:03:40.533

Isn't it the same HTML file? The # might just tell the web browser to jump to a particular part of the page. – barlop – 2012-10-13T17:25:36.780

Answers

wget is working fine. The URI syntax specifies that the fragment – the #foo part – is to be interpreted entirely client-side, and not used when retrieving the document itself.

For example, if it's an HTML page, the browser might scroll down to a named section, or – in your case – trigger some JavaScript code that shows a particular photo.

In other words, as far as wget is concerned, the URIs

http://www.somesite.com/restaurants.html#photo=22x00085 and
http://www.somesite.com/restaurants.html

...point to the same page /restaurants.html. It's up to your browser to do the rest. Opening restaurants.html#photo=22x00085 in the browser should work fine.
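To illustrate the point above, here is a minimal Python sketch (not what wget does internally, only an approximation of how an HTTP client splits a URL). The helper name `request_target` is made up for this example; it shows that the fragment is separated out and that both URLs lead to the same request path.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path (plus query, if any) a client would ask the server for.

    Hypothetical helper for illustration: the fragment is deliberately
    left out, because it is never part of the HTTP request.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


with_fragment = "http://www.somesite.com/restaurants.html#photo=22x00085"
without_fragment = "http://www.somesite.com/restaurants.html"

# The fragment is available to the client only.
print(urlsplit(with_fragment).fragment)

# Both URLs result in the same request target.
print(request_target(with_fragment))
print(request_target(with_fragment) == request_target(without_fragment))
```

Whatever the page does with `photo=22x00085` happens after the download, in the browser.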

user1686

Posted 2012-10-13T15:46:46.577

Reputation: 283 655

Without visiting the proper link, I can't tell which one it is, but there are only two options:

The hash actually forms part of the requested document's name. In this case, you can encode it:

http://www.somesite.com/restaurants.html%23photo=22x00085

In the other case, under normal circumstances, http://www.somesite.com/restaurants.html and http://www.somesite.com/restaurants.html#photo=22x00085 should point to the same page. The portion after the hash simply indicates the anchor the browser should scroll to after loading the page; it doesn't even get sent to the server.

However, it is possible that the hash is (ab)used to load a particular photo with JavaScript. Wget can't interpret JavaScript, so there's nothing you can do about it.
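For the first option, a small Python sketch of the encoding step may help. It only applies if the server really does serve a document whose name contains a literal `#`; the function name `encode_hash` is an assumption made for this example.

```python
from urllib.parse import quote, urlsplit


def encode_hash(url: str) -> str:
    """Percent-encode every '#' so it is treated as part of the path.

    Hypothetical helper: only useful when the '#' belongs to the
    document's name rather than marking a fragment.
    """
    return url.replace("#", quote("#", safe=""))


encoded = encode_hash("http://www.somesite.com/restaurants.html#photo=22x00085")
parts = urlsplit(encoded)

print(encoded)         # URL with %23 in place of '#'
print(parts.path)      # the encoded hash is now part of the path
print(parts.fragment)  # empty: nothing is left for the client to interpret
```

The encoded URL can then be passed to wget in quotes. If the server answers with a 404, the hash was a fragment after all.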

Dennis

Posted 2012-10-13T15:46:46.577

Reputation: 42 934

I've seen many sites that abuse the URL fragment in this way; at the top of the list is Google themselves. This violates a whole bunch of RFCs, but not that many people seem to care, since "it works"... – Michael Hampton – 2012-10-13T21:01:45.467

@MichaelHampton: Could you point out exactly which RFCs it violates? – user1686 – 2012-10-15T16:48:50.493

@grawity RFC 2396, part 2.4.3 can be read to say # is not part of any URI. This seems to be relaxed in RFC 3986, being vague enough not to define anything. – Rich Homolka – 2012-10-17T15:15:31.303

@RichHomolka: It only says that "foo#bar" is actually called a "URI-Reference", consisting of the URI (used for data retrieval) and the fragment (interpretation left to the user-agent). It would be violated only if the fragment was actually sent in an HTTP request. – user1686 – 2012-10-17T15:34:23.400

That's not the URL for the image. It's the URL for a page that uses a script or other code to fetch the image. Try loading the page with JavaScript turned off. That's what wget is fetching for you.

To find the URL for the image, try visiting the page through your browser and then right-clicking on the photo. There should be an option to view information about the image, including its URL.

If that doesn't work, it may be because the image is being loaded through Flash or some other client-side program. You can use Fiddler or Wireshark to watch what URL it's loading.
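If the image does appear as a plain img tag in the HTML that wget saved, its URL can also be pulled out with a short script. This is only a sketch: the sample HTML and the /photos/example.jpg path are invented for illustration, and it will find nothing when the image is injected by JavaScript or Flash.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute URLs of all <img> tags in a piece of HTML."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page's own URL.
                self.image_urls.append(urljoin(self.base_url, src))


# Hypothetical page content; in practice, read the file wget downloaded.
sample_html = '<html><body><img src="/photos/example.jpg" alt="x"></body></html>'

collector = ImageCollector("http://www.somesite.com/restaurants.html")
collector.feed(sample_html)
print(collector.image_urls)
```

Any URL found this way can be handed to wget directly.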

If you give us the actual URL of the site with the image, we can help you determine how the image is being loaded.

Jeremy Stein

Posted 2012-10-13T15:46:46.577

Reputation: 584