Crawl website for files

-1

Hi, I'd like to download all PDFs from http://www.allitebooks.com/ and would like to use wget. My command is wget "http://www.allitebooks.com/" -P "C:\dummydir" -c -A pdf -r, but I believe it cannot follow the links to the subdomain for now. How can I fix it so that it also downloads, for example, http://file.allitebooks.com/20170105/Internet%20of%20Things%20and%20Big%20Data%20Technologies%20for%20Next%20Generation%20Healthcare.pdf?
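For reference, wget's recursion stays on the starting host by default, so it never reaches file.allitebooks.com unless host spanning is switched on. A minimal sketch of the relevant options follows (whether the site actually allows the recursion is a separate question, covered in the answer below):

# -H (--span-hosts) lets the recursion leave www.allitebooks.com;
# -D (--domains) keeps it restricted to *.allitebooks.com so it doesn't wander off-site.
wget -r -c -A pdf -P "C:\dummydir" -H -D allitebooks.com "http://www.allitebooks.com/"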

Thomas

Posted 2017-01-06T11:14:27.450

Reputation: 151

Answers

1

I was initially going to suggest wget as a solution, but upon further research I noticed a few things:

From visiting one of the eBook pages on the site, you can see the URL for the PDF download link. This can be used to download the PDF as follows:

wget http://file.allitebooks.com/20170102/Smart%20Home%20Automation%20with%20Linux%20and%20Raspberry%20Pi,%202%20edition.pdf

However, this is not recursive, and there is no way to know what is in that directory without checking every blog post and copying the download links.
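That said, if the blog pages themselves can be fetched, the manual "check every post and copy the link" step could in principle be scripted. A rough sketch, assuming GNU wget and grep are available; the depth, directory names, and link pattern are placeholders, and it has not been tested against this site:

# Fetch the site's HTML pages a couple of levels deep into a scratch directory.
wget -r -l 2 -P pages "http://www.allitebooks.com/"
# Harvest every link pointing at a PDF on the file.allitebooks.com subdomain.
grep -rhoE 'http://file\.allitebooks\.com/[^"]+\.pdf' pages | sort -u > pdf-urls.txt
# Download the collected PDFs, resuming any partial files.
wget -c -P "C:\dummydir" -i pdf-urls.txt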

stuts

Posted 2017-01-06T11:14:27.450

Reputation: 136

But is there no tool in the world that visits all links to a certain depth and downloads all files with a .pdf extension? I believe there should be one, right? – Thomas – 2017-01-06T11:41:42.563

There definitely are ways to do it. In fact, I wrote a blog post about Recursively Downloading a Website.

The problem here is not that such a tool doesn't exist, but that the website you want to download PDFs from is secure enough to prevent any sort of recursive download of the site.

– stuts – 2017-01-06T11:45:22.993

OK, I will write my own crawler then if there are no out-of-the-box tools. I'd like to fill an e-reader with those ebooks so I have something to read on the go. – Thomas – 2017-01-06T11:54:22.117

HTTrack or ScrapBook may be able to do what you're looking for (a rough HTTrack invocation is sketched after this comment), but as far as that specific site goes, you won't be able to download all the PDFs non-interactively. I would suggest that you find a few eBooks from the site that you'd like to read and just download them manually. Best of luck with your crawler program :)

If you find my answer helped provide a solution of some kind, then please remember to accept it! – stuts – 2017-01-06T12:20:31.803
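For completeness, a basic HTTrack invocation, modelled on its documented examples, would look roughly like the line below. It mirrors everything under *.allitebooks.com (HTML included) rather than just the PDFs, the output path is only an example, and, as noted above, the site may simply refuse this kind of crawl:

# Mirror the site, allowing links to the file.allitebooks.com subdomain, into C:\dummydir.
httrack "http://www.allitebooks.com/" -O "C:\dummydir" "+*.allitebooks.com/*" -v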

Yo stuts, I upvoted it, but it's not an answer that helps me achieve my goal, so no accept, man – Thomas – 2017-01-06T12:36:02.670

That's totally understandable dude. Still trying to get to grips with the answering system! – stuts – 2017-01-06T12:59:10.567