How to retrieve all *.html files from a website using Unix command-line tools and a regular expression

1

I would like to retrieve all .html files from a website whose names contain a certain piece of text:

e.g. this_is_good_site.html

So, I would like to download the .html files with the word "good" in their names. I tried wget and curl, but I could not work out how to select those files using a regular expression. Is there a Python or Perl solution if the Unix command-line tools can't do this?

jraja

Posted 2010-01-18T19:48:47.893


Answers

1

Well, if you want to do it with Python you might look into using urllib2 - you would also probably have better luck with this question on Stack Overflow.
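To flesh that out, here is a minimal Python 2 sketch of the idea: fetch one page with urllib2, scan its HTML with a regular expression for links whose names contain "good" and end in .html, and download each match. The example.com URL and the crude href regex are assumptions for illustration, not a robust HTML parser.

# Minimal Python 2 sketch (urllib2 is Python 2 only); base_url and the
# href regex below are placeholders, not a robust HTML parser.
import re
import urllib2
from urlparse import urljoin

base_url = "http://example.com/"  # hypothetical site from the question

# Fetch the index page and pull out links to .html files whose names contain "good".
page = urllib2.urlopen(base_url).read()
links = re.findall(r'href="([^"]*good[^"]*\.html)"', page)

for link in links:
    url = urljoin(base_url, link)          # resolve relative links
    filename = url.rsplit("/", 1)[-1]      # save under the file's own name
    with open(filename, "wb") as out:
        out.write(urllib2.urlopen(url).read())

For anything beyond a quick one-off, an HTML parser such as BeautifulSoup would be a sturdier way to extract the links than a regex.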

Darren Newton

Posted 2010-01-18T19:48:47.893

Reputation: 1 228

2

As you're using a Unix environment, try wget's recursive accept/reject options:

wget -r -A "*good*" <site_to_download>

This will perform a recursive (-r) download of the site and only accept (-A) files whose names match the pattern ("*good*").
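If you only want the .html pages themselves, the accept pattern can be narrowed; a possible variant, with example.com standing in for the real site:

wget -r -np -A "*good*.html" http://example.com/

Here -np (--no-parent) keeps the crawl from climbing above the starting directory.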

Toby Jackson

Posted 2010-01-18T19:48:47.893

Reputation: 121

1

Try HTTrack Website Copier or a similar program; it can be easier than the command line. Download the whole site to a directory, sort by .html, copy the matching files somewhere else, and delete the leftovers.

http://www.httrack.com/

alpha1

Posted 2010-01-18T19:48:47.893

Reputation: 1 638