How to convert a Persian PDF file to Microsoft Word format?

1

1

I have a PDF file in Persian script, which is written right-to-left. Because Persian text is encoded in UTF-8, I can't convert it to plain text in Microsoft Word, and copying and pasting the text produces unreadable characters. I have tried several programs, such as unipdf and e-Pdf Converter, but after conversion the characters are still not displayed properly. I even tried OCR, but the same problem appeared. The PDF doesn't have a password or any restrictions.

Does anyone have any other ideas?

Edit: I actually tried creating a file in MS Word and converting it to a PDF. The resulting PDF had the same problem, even though the encoding was known.

Mehdi

Posted 2015-05-06T13:09:58.047

Reputation: 11

3

Microsoft Word supports UTF-8 format. It also supports right-to-left languages. So why exactly can't you convert it to a Word document? – Ramhound – 2015-05-06T13:14:47.827

Hey, thanks for your consideration. The source of my file is a PDF, so I don't know exactly what happens when I copy and paste it into Microsoft Word, but it doesn't show the proper characters. The same thing happens when I try to convert it using third-party tools. – Mehdi – 2015-05-06T13:21:40.853

1

possible duplicate of Cutting & Pasting Vietnamese characters from a PDF

– RedGrittyBrick – 2015-05-06T14:27:32.590

@RedGrittyBrick I read your answer, but in my case I actually tried creating a file in MS Word and converting it to a PDF, and the resulting PDF had the same problem, even though the encoding was known. Thanks – Mehdi – 2015-05-06T14:59:51.293

How was the PDF created? Electronically or scanned and you are hoping for OCR to take over? – Austin T French – 2015-05-06T15:39:51.427

Can you create an example PDF and post it somewhere public so that people can download it from there using a URL? – RedGrittyBrick – 2015-05-06T16:04:08.173

@AthomSfere The PDF was created automatically by converting an MS Word file into a PDF. Thanks – Mehdi – 2015-05-09T12:13:47.187

@RedGrittyBrick Here is an example PDF: https://drive.google.com/open?id=0BzLHaKpzBvMNZXZrd1NURWhIS0F4OGkzVldSRm1ZYXJXbHNF&authuser=0

– Mehdi – 2015-05-09T12:14:04.700

I can cut and paste text from that using Chrome's built-in PDF viewer - there is no obvious garbling of the characters but the direction of text is mostly reversed. I don't read Persian so can't tell whether the actual characters are all OK - but they look superficially OK. With a different PDF viewer, eVince, the main problem is selecting contiguous text. Unfortunately I don't think I can help with your problem. – RedGrittyBrick – 2015-05-09T22:39:48.747

@RedGrittyBrick Thank you very much for your consideration. This problem exists with non-English PDFs and I don't know the reason. However, you have already helped me: I can copy and paste portion by portion. It's the long way, but the only way! – Mehdi – 2015-05-10T13:25:20.637

Answers

1

Very often PDF files in non-Latin scripts (especially RTL scripts such as Arabic, Hebrew and Farsi) are generated by software which sort of LTR-ifies the text at the word or sentence-fragment level, or just somehow gets the right glyphs to display while the 'logical' text you extract is gibberish. In these cases there is very little to be done except write a custom back-converter, which is effectively not an option.

However, if you can figure out how the file was created (this is often indicated in the metadata, which common PDF readers can show), there might be an option to open the file in the application that generated it, or at least you could make your question more specific.
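To give an idea of what the simplest kind of back-converter mentioned above might look like, here is a minimal Python sketch. It assumes the easy case only: each extracted line was stored in left-to-right visual order, so reversing the line restores the logical order, while runs of Latin letters and digits are flipped back so they stay readable. The function name `visual_to_logical` is made up for this illustration. Real PDFs are usually messier (fragments reordered at the word level, punctuation inside numbers, custom glyph encodings), which this sketch does not handle.

```python
import unicodedata

# Bidi classes treated as left-to-right runs (Latin letters and digits).
LTR_CLASSES = {"L", "EN", "AN"}
# Bidi classes of right-to-left letters (Hebrew, Arabic, Persian).
RTL_CLASSES = {"R", "AL"}


def visual_to_logical(line: str) -> str:
    """Turn one visually ordered RTL line back into logical order (heuristic)."""
    if not any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in line):
        return line  # no RTL letters: leave the line untouched

    out, run = [], []
    for ch in reversed(line):
        if unicodedata.bidirectional(ch) in LTR_CLASSES:
            run.append(ch)  # collect an LTR run, which is backwards right now
        else:
            out.extend(reversed(run))  # restore the run's original order
            run.clear()
            out.append(ch)
    out.extend(reversed(run))

    # Expand Arabic presentation-form glyphs into ordinary letters.
    return unicodedata.normalize("NFKC", "".join(out))


logical = "سلام دنیا"
visual = logical[::-1]  # simulate a PDF that stored the line backwards
print(visual_to_logical(visual) == logical)  # True
```

The NFKC step is applied after the reversal on purpose: a ligature glyph such as lam-alef is a single character in the visual text, and expanding it before reversing would put its two letters in the wrong order.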

einpoklum

Posted 2015-05-06T13:09:58.047

Reputation: 5 032

0

I have recently worked on converting a PDF to editable Persian text. The best solution I have found is to use Google Docs, as follows.

  1. Convert the PDF pages to images. For this you can use Adobe Acrobat (not the free Adobe Reader), or on Linux I use GIMP to open the PDF and choose to open each page as a separate image. The tool is your own choice.
  2. Upload the image files to Google Drive.
  3. Go to Google Drive, right-click each image, then click "Open with" and choose Google Docs.
  4. Wait until Google Docs opens an editable text document created from your image.
  5. Copy the text into Word.

I don't know if there is an automated method. I hope that some day I will have time to write an application that does this automatically.
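One possible way to automate a similar render-then-OCR workflow locally is sketched below. Note that it swaps Google Docs for different tools: it assumes the `pdftoppm` (Poppler) and `tesseract` command-line programs are installed, along with Tesseract's Persian language data (`fas`). The function names, the `page` file prefix and the 300 dpi setting are illustrative choices, and OCR quality for Persian may well differ from what Google Docs produces.

```python
import subprocess
from pathlib import Path


def render_command(pdf_path, work_dir, dpi=300):
    """Build the pdftoppm command that renders each PDF page to a PNG file."""
    prefix = Path(work_dir) / "page"
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def ocr_command(image_path, lang="fas"):
    """Build the tesseract command; it writes <image name without .png>.txt."""
    image = Path(image_path)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def pdf_to_text(pdf_path, work_dir):
    """Render every page, OCR it, and return the recognized text of all pages."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_command(pdf_path, work), check=True)

    texts = []
    # pdftoppm zero-pads the page numbers, so sorting keeps the page order.
    for image in sorted(work.glob("page-*.png")):
        subprocess.run(ocr_command(image), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling `pdf_to_text("input.pdf", "ocr_work")` (both names are placeholders) would return the recognized text, which can then be pasted into Word as in the last step above.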

Merlin

Posted 2015-05-06T13:09:58.047

Reputation: 111

0

I had the same problem converting PDF files to Word. After copying and pasting into Word, the formatting changed and caused trouble. I tried several online converters, but they also failed.
The only method that worked was as follows:

  1. Open the PDF file with Adobe Acrobat Reader, then choose Print from the File menu. From the list of printers, choose Adobe Acrobat. Yes, you are about to create a PDF from a PDF!
  2. Open the new PDF file with Google Chrome (drag and drop the file onto Chrome).
  3. Now simply select all the text (Ctrl+A) and copy and paste it into a blank Word file.

saeed ghasemi

Posted 2015-05-06T13:09:58.047

Reputation: 1

0

I know it's late to be answering, but for anyone with the same question, I can suggest Delix.ir, which is a Persian OCR and PDF-to-Word converter.

Disclaimer: I'm the founder of delix.ir and I hope this won't be treated as an advertisement.

Amirreza Nasiri

Posted 2015-05-06T13:09:58.047

Reputation: 2 418