Using wget to download from a website with case-sensitive URLs to Windows

0

I used Gnuwin32 wget to download the Emacs manuals with this command (it takes about 30 minutes):

wget --mirror --page-requisites --convert-links --no-parent --accept .html,.htm,.css,.js http://www.gnu.org/software/emacs/manual/

The downloaded manuals seemed fine except for one problem: Windows does not distinguish between index.html and Index.html, so wget saves both to the same path on Windows. For example,

http://www.gnu.org/software/emacs/manual/html_node/elisp/index.html

and

http://www.gnu.org/software/emacs/manual/html_node/elisp/Index.html

are different URLs, and both download to

current_folder/www.gnu.org/software/emacs/manual/html_node/elisp/index.html
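To find every such collision in a mirror, not just this one, the URL paths can be grouped by their case-folded form. The sketch below is a minimal Python illustration and not something wget provides: the function name find_case_collisions and the third sample path are made up for this example, and str.casefold() only approximates the case rules Windows actually applies.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Return groups of paths that differ only in letter case."""
    groups = defaultdict(list)
    for path in paths:
        # casefold() is an approximation of Windows' case-insensitive matching.
        groups[path.casefold()].append(path)
    # Keep only the groups that contain more than one distinct spelling.
    return [sorted(set(g)) for g in groups.values() if len(set(g)) > 1]


if __name__ == "__main__":
    sample = [
        "www.gnu.org/software/emacs/manual/html_node/elisp/index.html",
        "www.gnu.org/software/emacs/manual/html_node/elisp/Index.html",
        # Hypothetical third path, included only to show a non-colliding entry.
        "www.gnu.org/software/emacs/manual/html_node/elisp/other.html",
    ]
    for group in find_case_collisions(sample):
        print(group)
```

The input would be a list of the URL paths the mirror is expected to contain, for example one collected on a case-sensitive system or from the site's own link structure.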

Is there a way to work around this?

Update:

Here is an alternate example that takes only about 30 seconds instead of 30 minutes:

wget -P new --mirror --page-requisites --convert-links --no-parent --accept .html,.htm,.css,.js http://www.gnu.org/software/emacs/manual/html_node/ses/index.html

And the same download with --no-clobber:

wget -P new-nc --no-clobber -r -l inf --page-requisites --convert-links --no-parent --accept .html,.htm,.css,.js http://www.gnu.org/software/emacs/manual/html_node/ses/index.html

Jisang Yoo

Posted 2012-07-10T21:43:56.983

Reputation: 427

Answers

1

The Win32 subsystem on Windows is unable to distinguish between files whose names differ only in case. Understanding the full implications of that statement requires a lot of research into the internals of Windows.

To put it briefly, every program on Windows runs under some "subsystem". A subsystem is a user space "stack" sitting on top of the kernel which has a common set of APIs and libraries.

There are only three subsystems: POSIX, Win32, and OS/2. OS/2 is deprecated and probably doesn't work. Win32 is what 99.9999% of all programs (including those that are part of Gnuwin32 and Cygwin) run under. POSIX is what Services For UNIX (SFU) runs under.

The question "How do you make Windows 7 fully case-sensitive with respect to the filesystem?" has some good answers and some bad answers. Ignore the detritus about the registry settings; that's all hogwash. The relevant comment is venimus's "update" comment.

To put it simply, the only way to run a program on Windows that can properly distinguish between files whose names differ only in case is to run it under the POSIX subsystem, that is, under SFU. Fortunately for you, wget is a very common program in that environment, so you should be able to install SFU (if you are licensed to do so) and try it there. Good luck.

allquixotic

Posted 2012-07-10T21:43:56.983

Reputation: 32 256

Getting VirtualBox and running a capable VM is what works best. The average person isn't running Server 2003 or 2008, so SFU is probably not an option. – Fiasco Labs – 2012-07-10T22:05:06.753

Ah, true.. I'm so used to running Server 2008 Standard as a workstation OS that I forget that the licensing for SUA/SFU is extremely restrictive. You can only get it on Server OSes or on Windows 7 Enterprise. D'oh. – allquixotic – 2012-07-10T22:06:55.157

Instead of using a VM, is there a way to make wget rename Index.html to index_1.html and modify appropriate A tags in other downloaded html files as well? – Jisang Yoo – 2012-07-11T09:29:54.620

wget isn't that fancy, unfortunately. You can pass the --no-clobber flag to get it to refuse to overwrite existing files, but it definitely won't update tags in the HTML. It's just a downloader; it doesn't understand HTML. – allquixotic – 2012-07-11T12:11:27.340
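Since wget will not do the renaming asked about above, a small post-processing script is one possible workaround. The following is a minimal Python sketch, not a tested tool: the names assign_safe_names and rewrite_links are invented for this example, it assumes all pages sit in one directory and have already been fetched under distinct names (for instance on a case-sensitive system, or one URL at a time with wget -O), and it only handles double-quoted href attributes.

```python
import re


def assign_safe_names(names):
    """Map each file name to one that is unique even when case is ignored."""
    used = set()
    mapping = {}
    for name in names:
        stem, dot, ext = name.rpartition(".")
        if not dot:  # no extension: treat the whole name as the stem
            stem, ext = name, ""
        candidate, n = name, 0
        # Append _1, _2, ... until the case-folded name is free.
        while candidate.casefold() in used:
            n += 1
            candidate = f"{stem}_{n}{dot}{ext}"
        used.add(candidate.casefold())
        mapping[name] = candidate
    return mapping


# Splits href="target#fragment" into prefix, target, and the remainder.
HREF = re.compile(r'(href\s*=\s*")([^"#?]*)([^"]*")', re.IGNORECASE)


def rewrite_links(html, mapping):
    """Replace the file-name part of each href according to mapping."""
    def fix(match):
        prefix, target, rest = match.groups()
        directory, slash, filename = target.rpartition("/")
        return f"{prefix}{directory}{slash}{mapping.get(filename, filename)}{rest}"
    return HREF.sub(fix, html)


if __name__ == "__main__":
    mapping = assign_safe_names(["index.html", "Index.html"])
    print(mapping)
    print(rewrite_links('<a href="Index.html#top">Index</a>', mapping))
```

The first name seen keeps its spelling and later colliding names get a numeric suffix, so the order of the input list decides which file is renamed. A real mirror would need one mapping per directory and a proper HTML parser rather than a regular expression.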