We have Windows SharePoint Services 3 installed on a Server 2003 R2 Enterprise SP2 machine. I have Adobe Reader 8 with the iFilter installed, configured and working. I kicked off a full crawl, and searches now return PDF results. This is a big change from before, when PDF content searching was nonexistent. However, a user in the business unit has noticed that for certain words, search isn't finding the appropriate PDF.

From all indications, it seems that for some PDFs, not all words are indexed. Can someone help?

vsmal
  • For some PDFs, are **no** words indexed? That would indicate the file is being seen as a picture rather than a document. In any event, this may be because your text parser isn't recognizing certain blocks of text as text, and thus isn't indexing them. – HopelessN00b Aug 20 '12 at 20:34
  • I am able to find that document by searching for certain words (I found those by guesswork), but not for others. If it is a problem with my text parser, what would be a possible solution? – vsmal Aug 20 '12 at 21:09

2 Answers

In terms of searchable text, there are two types of PDF files: ones saved from Word or similar documents, which have "always been digital", and ones that were scanned in from paper and run through OCR to guess what the words on the paper are.

iFilter does not OCR the text in your documents. If your documents were originally scanned by other software, then that software is the likely suspect. Nearly all OCR is imperfect, and some is horrible. You can open the document in Acrobat Reader on your computer and search for words in it. That should tell you how good the OCR in the document is.
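If there are many words to check, the manual search in Reader can be approximated with a small script. This is only a rough sketch, not part of the iFilter or SharePoint: it assumes you have already copied the PDF's text layer into a plain string by whatever means you have available, and the function name `missing_terms` and the sample text are made up for illustration. It only handles ASCII letters and digits.

```python
import re

def missing_terms(extracted_text, terms):
    """Return the search terms that do not appear as whole words in the text.

    extracted_text: the PDF's text layer as a plain string (assumed to be
    obtained separately; this script does not read PDF files itself).
    """
    # Split the text into lower-case runs of ASCII letters and digits.
    words = set(re.findall(r"[a-z0-9]+", extracted_text.lower()))
    missing = []
    for term in terms:
        parts = re.findall(r"[a-z0-9]+", term.lower())
        # A term counts as found only if every word in it is present.
        if not parts or not all(p in words for p in parts):
            missing.append(term)
    return missing

# Hypothetical OCR output where "report" was misread as "rep0rt".
sample_text = "Quarterly rep0rt for the finance department"
print(missing_terms(sample_text, ["report", "finance", "quarterly"]))
```

Any term reported as missing here is a word the OCR layer probably got wrong, so the indexer never had a chance to pick it up.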

Also note this post that suggests OCR'ed text may not work in iFilter 8, and you may need to install Reader 9 on the server.

Lastly, if you can search for the words inside a PDF using Acrobat Reader without problems, then I would take the document, set up SharePoint + iFilter in a lab with default settings, and see whether something is truly wrong with the iFilter.

Bret Fisher

I had followed the various KB articles from Microsoft, the best one that includes everything you need being here, and afterwards still could not search all text content in PDFs.

I had checked to make sure that searching for words within the PDF itself (in Reader) works, and it did, so it was not an OCR issue. For my problem, the following issues were discovered and had to be changed/reverted:

  • Upgrade to Reader X broke PDF content searching completely. I could still search for titles and descriptions, but contents of PDFs were not searchable. I had to reinstall Adobe Reader 8.
  • The service account that ran the search service needs to be a full administrator on the index server.

Summary: I had to add the service account as full administrator and then make sure the documented steps were followed again (confirmation in my case) and now voila, solved.

HopelessN00b
vsmal