Crawl website for files

-1

Hi, I'd like to download all PDFs from http://www.allitebooks.com/ and would like to use wget. My command is wget "http://www.allitebooks.com/" -P "C:\dummydir" -c -A pdf -r, but I believe it cannot follow the links to the subdomain for now. How can I fix it so that it also downloads, for example, http://file.allitebooks.com/20170105/Internet%20of%20Things%20and%20Big%20Data%20Technologies%20for%20Next%20Generation%20Healthcare.pdf?
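For reference, wget's recursion stays on the starting host by default, so it never reaches file.allitebooks.com unless host spanning is switched on. A minimal sketch of the relevant options follows (whether the site actually allows the recursion is a separate question, covered in the answer below):

# -H (--span-hosts) lets the recursion leave www.allitebooks.com;
# -D (--domains) keeps it restricted to *.allitebooks.com so it doesn't wander off-site.
wget -r -c -A pdf -P "C:\dummydir" -H -D allitebooks.com "http://www.allitebooks.com/"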

Thomas

Posted 2017-01-06T11:14:27.450

Reputation: 151

Answers

1

I was initially going to suggest wget as a solution, but upon further research I noticed a few things:

From visiting one of the eBook pages on the site, you can see the URL for the PDF download link. This can be used to download the PDF as follows:

wget http://file.allitebooks.com/20170102/Smart%20Home%20Automation%20with%20Linux%20and%20Raspberry%20Pi,%202%20edition.pdf

However, this is not recursive, and there is no way to know what is in that directory without checking every blog post and copying the download links.
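That said, if the blog pages themselves can be fetched, the manual "check every post and copy the link" step could in principle be scripted. A rough sketch, assuming GNU wget and grep are available; the depth, directory names, and link pattern are placeholders, and it has not been tested against this site:

# Fetch the site's HTML pages a couple of levels deep into a scratch directory.
wget -r -l 2 -P pages "http://www.allitebooks.com/"
# Harvest every link pointing at a PDF on the file.allitebooks.com subdomain.
grep -rhoE 'http://file\.allitebooks\.com/[^"]+\.pdf' pages | sort -u > pdf-urls.txt
# Download the collected PDFs, resuming any partial files.
wget -c -P "C:\dummydir" -i pdf-urls.txt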

stuts

Posted 2017-01-06T11:14:27.450

Reputation: 136

But is there no tool in the world that visits all links to a certain depth and downloads all files with a .pdf extension? I believe there should be one, right? – Thomas – 2017-01-06T11:41:42.563

There definitely are ways to do it. In fact, I wrote a blog post about Recursively Downloading a Website.

The problem here is not that such a tool doesn't exist, but that the website you want to download PDFs from is secure enough to prevent any sort of recursive download of the site.

– stuts – 2017-01-06T11:45:22.993

OK, I will write my own crawler then if there are no out-of-the-box tools. I'd like to fill an e-reader with those ebooks so I have something to read on the go. – Thomas – 2017-01-06T11:54:22.117

HTTrack or ScrapBook may be able to do what you're looking for (a rough HTTrack invocation is sketched after this comment), but as far as that specific site goes, you won't be able to download all the PDFs non-interactively. I would suggest that you find a few eBooks from the site that you'd like to read and just download them manually. Best of luck with your crawler program :)

If you find my answer helped provide a solution of some kind, then please remember to accept it! – stuts – 2017-01-06T12:20:31.803
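For completeness, a basic HTTrack invocation, modelled on its documented examples, would look roughly like the line below. It mirrors everything under *.allitebooks.com (HTML included) rather than just the PDFs, the output path is only an example, and, as noted above, the site may simply refuse this kind of crawl:

# Mirror the site, allowing links to the file.allitebooks.com subdomain, into C:\dummydir.
httrack "http://www.allitebooks.com/" -O "C:\dummydir" "+*.allitebooks.com/*" -v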

Yo stuts, I upvoted it, but it's not an answer that helps me achieve my goal, so no accept, man – Thomas – 2017-01-06T12:36:02.670

That's totally understandable dude. Still trying to get to grips with the answering system! – stuts – 2017-01-06T12:59:10.567