How to search .pdf links on a given web page?

0

I have this rss page with a lot of links to .pdf files.

I need to search for certain strings inside those .pdfs without opening them one by one and repeating the search in each, because there are a lot of them!

Does anyone know a way to search inside those PDFs?

Any ideas? Any browser extension? Any RSS feed reader that allows this?

MEM

Posted 2013-05-30T12:23:21.083

Reputation: 907

Answers

1

You can always use Google.

filetype:pdf site:http://xyz.com/abc <your keyword(s) here> would do the job for you. You just need to find the URL prefix that the PDFs have in common. For example, if two PDFs on the page are located at http://xyz.com/abc/1.pdf and http://xyz.com/abc/2.pdf, then you can use site:http://xyz.com/abc. Using only site:http://xyz.com would work too, but it would return every PDF that Google has indexed on the whole website.

So you want to be specific.
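As a rough illustration of how such a query is put together, here is a minimal shell sketch. The function name build_pdf_query is hypothetical, and joining the keywords with OR is an assumption (it matches any of the keywords instead of all of them); drop the OR to require every keyword. Multi-word keywords would need their own quoting, which this sketch does not add.

```shell
#!/bin/sh
# build_pdf_query SITE KEYWORD...
# Hypothetical helper: prints a Google query restricted to PDFs under SITE.
# Keywords are joined with OR (an assumption, see the note above).
build_pdf_query() {
    site=$1
    shift
    terms=
    for kw in "$@"; do
        # Prepend " OR " only when a previous keyword is already present.
        terms="${terms:+$terms OR }$kw"
    done
    printf 'filetype:pdf site:%s %s\n' "$site" "$terms"
}

build_pdf_query http://dre.pt Guarda Braga
# prints: filetype:pdf site:http://dre.pt Guarda OR Braga
```

The printed line can be pasted into the Google search box as-is.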

Parth Kohli

Posted 2013-05-30T12:23:21.083

Reputation: 138

This will of course work only if search engines have been allowed to index those files. – Karan – 2013-05-30T23:25:11.957

@Karan I am pretty sure that is the case here. – Parth Kohli – 2013-05-31T02:02:19.133

Might be the case here (I didn't bother to verify), but my comment was meant as an addendum to your answer: people other than the OP will read it later, and in their case things might be different, so they can't always use Google. – Karan – 2013-05-31T02:06:36.277

Doesn't return any results. Here's the site in question with the exact command used: filetype:pdf site:http://dre.pt/sug/notificacoes/rss.asp?id=212 Guarda Braga (Guarda and Braga being the keywords). – MEM – 2013-05-31T16:53:39.737

OK, then either Google is not allowed to index those files, or these keywords are not in the files. Try filetype:pdf site:http://dre.pt Guarda Braga – Parth Kohli – 2013-05-31T17:01:37.840

Sir, you are a Google master. ;) Cheers. I added OR and other operators to better refine the search. 5 stars. I'm not sure, however, whether the index is up to date. Or will it only list keywords once the .pdfs have been scanned by Google's bots? – MEM – 2013-05-31T17:05:49.223

lol, there's no mastery. – Parth Kohli – 2013-05-31T17:07:08.130

0

Download the files first, then search them:

find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep -H --label="$1" "your query"' sh {} \;
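To avoid downloading everything again for each search, the PDFs could be cached locally and only new ones fetched. The following is a minimal sketch, not a tested tool: the function names, FEED_URL and CACHE_DIR are all hypothetical, it assumes curl and pdftotext are installed, and it assumes the feed contains absolute http(s) links ending in .pdf (relative links, or different PDFs that share a file name, are not handled).

```shell
#!/bin/sh
# Sketch only: function names, FEED_URL and CACHE_DIR are illustrative assumptions.

# Read feed/page source on stdin; print each absolute http(s) URL
# ending in .pdf once, one per line.
extract_pdf_links() {
    grep -Eio "https?://[^\"'<> ]+\.pdf" | sort -u
}

# Read URLs on stdin; download each into directory $1,
# skipping files that are already present there.
fetch_new_pdfs() {
    mkdir -p "$1"
    while IFS= read -r url; do
        file="$1/$(basename "$url")"
        [ -f "$file" ] || curl -fsSL -o "$file" "$url"
    done
}

# Print the names of the PDFs in directory $1 whose text contains
# the fixed string $2 (case-insensitive).
search_pdfs() {
    for pdf in "$1"/*.pdf; do
        [ -f "$pdf" ] || continue
        pdftotext "$pdf" - 2>/dev/null | grep -qiF -- "$2" && printf '%s\n' "$pdf"
    done
}

# Example usage (not run here; set FEED_URL and CACHE_DIR first):
#   curl -fsSL "$FEED_URL" | extract_pdf_links | fetch_new_pdfs "$CACHE_DIR"
#   search_pdfs "$CACHE_DIR" "Braga"
```

Only the first step touches the network, so repeated searches against the cache directory stay local.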

jet

Posted 2013-05-30T12:23:21.083

Reputation: 2 675

Thanks. I didn't intend to download them each time I do a search. But that command is indeed nice. :) – MEM – 2013-05-31T17:08:12.517