How to extract the text from MS Office documents in Linux?

18

5

I need a way to extract the text from all of the MS Office document types (Word, Excel, PowerPoint) in Linux. I envision that there might be several different approaches, such as a Bash or Python script, or converting the documents to PDF and then extracting the text with a tool such as pdftotext.

This seems like it might be a commonplace requirement. Is there an established procedure or tool to accomplish this easily?

Phyo Arkar Lwin

Posted 2010-07-21T05:35:23.440

Reputation: 401

Answers

8

I finally found the perfect tool for scripting document parsing: Apache Tika. It can parse a gazillion non-text formats into text, which is very cool!

Get Apache Tika here:

http://tika.apache.org/

(Mac Homebrew users: brew install tika)

The command-line interface works like this:

tika --text something.docx > something.txt
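For scripting over many files, the one-liner above can be wrapped in a loop. This is only a rough sketch: it assumes a `tika` wrapper command is on the PATH, as in the command above, and `txt_name` and `tika_extract_all` are made-up helper names, not part of Tika.

```shell
# txt_name (hypothetical helper): swap the last extension for .txt,
# e.g. "docs/report.docx" -> "docs/report.txt"
txt_name() {
    printf '%s.txt\n' "${1%.*}"
}

# tika_extract_all (hypothetical helper): run `tika --text` on each
# argument and write the result next to the input file.
tika_extract_all() {
    for f in "$@"; do
        if ! tika --text "$f" > "$(txt_name "$f")"; then
            echo "tika failed on: $f" >&2
        fi
    done
}

# Example call: tika_extract_all *.doc *.docx *.xls *.ppt
```

A failure on one file is reported on stderr and the loop carries on, which is usually what you want when working through a large directory of mixed documents.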

Phyo Arkar Lwin

Posted 2010-07-21T05:35:23.440

Reputation: 401

+1: Apache Tika is a serious open source project. It also works on Windows, works from the command line, has a GUI with drag and drop, opens anything (Word, Excel, PowerPoint, PDF, SVG), and extracts the document's metadata as well. After trying most of the tools above, Apache Tika is what I was looking for. This should be the accepted answer (I don't know if you can accept your own answer) – user2518618 – 2015-10-20T12:04:54.147

Did, shamelessly ... :D – Phyo Arkar Lwin – 2015-10-22T15:58:29.833

16

Catdoc can convert doc, xls and ppt files to text. A second option would be wvWare.

For more utilities, check http://www.linux.com/archive/articles/52385 for Word-to-text converters.

nahar

Posted 2010-07-21T05:35:23.440

Reputation: 1 476

Great, catdoc gives me a segmentation fault – fotanus – 2014-07-30T21:11:58.873

Catdoc! That's the thing I was looking for! Will it also work for ODF? – Phyo Arkar Lwin – 2010-07-22T04:38:24.853

Just googled and got http://stosberg.net/odt2txt/. Never tried it, but it seems like it does the job. – nahar – 2010-07-22T05:27:55.000

Cool, thanks. catdoc is OK, but it can't convert xls or ppt to text; I use xls2csv and apache-tika for them. Check them out! – Phyo Arkar Lwin – 2010-10-30T21:37:38.377

@nahar, odt2txt only works on odt format, not ms doc. – Allen – 2013-01-09T06:28:40.637

7

AbiWord can convert between any file formats it knows, straight from the command line.

Convert from Word to plain text:

abiword --to=txt myfile.doc

Make a pdf from a Word file:

abiword --to=pdf myfile.doc

And so on. The results in these cases would be myfile.txt or myfile.pdf. If you want to specify the output name you can do that too:

abiword --to=txt --to-name=output.txt myfile.doc

Convert ODT to Word:

abiword --to=doc myfile.odt

Convert Word to ODT:

abiword --to=odt myfile.doc

In fairness to other answers, it should be noted that AbiWord uses wvWare to handle Word documents, but even the wvWare homepage recommends using AbiWord instead for most conversions.

I hate word processors. This is the main reason I have AbiWord installed.

You might also be interested in unoconv, which is a similar tool supporting formats OpenOffice knows (which would include spreadsheets and the like), but I have no experience with it personally.
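The `--to=txt` and `--to-name` options shown above can be combined into a small batch converter. Treat this as a sketch under assumptions: `abiword_to_txt` is a made-up helper name, the inputs are assumed to end in `.doc`, and it assumes `--to-name` accepts a path in another directory, which I have not verified on every AbiWord version.

```shell
# abiword_to_txt (hypothetical helper): convert each .doc argument to a
# .txt file inside the directory given as the first argument.
abiword_to_txt() {
    outdir=$1
    shift
    mkdir -p "$outdir"
    for f in "$@"; do
        # strip the directory and the .doc suffix to get the bare name
        name=$(basename "$f" .doc)
        abiword --to=txt --to-name="$outdir/$name.txt" "$f"
    done
}

# Example call: abiword_to_txt text-out *.doc
```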

frabjous

Posted 2010-07-21T05:35:23.440

Reputation: 9 044

It messes up .docx documents :S But it's a nice solution for .doc – Warface – 2014-04-03T13:20:09.213

3

With LibreOffice you can do:

libreoffice --invisible --convert-to pdf file1.ppt file2.ppt
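Since the question mentions going through PDF and then pdftotext, the two steps could be chained roughly like this. It is a sketch, not a tested recipe: `office_to_text` is a made-up helper name, `--outdir` is used to control where the PDF lands, and it assumes LibreOffice names the PDF after the input file with the extension replaced.

```shell
# office_to_text (hypothetical helper): convert one Office file to PDF
# with LibreOffice, then extract the text with pdftotext.
office_to_text() {
    src=$1
    dir=$(dirname "$src")
    base=$(basename "$src")
    # assumed output name: same base name, .pdf extension
    pdf="$dir/${base%.*}.pdf"
    libreoffice --invisible --convert-to pdf --outdir "$dir" "$src" &&
        pdftotext "$pdf" "${pdf%.pdf}.txt"
}

# Example call: office_to_text slides/file1.ppt
```

The `&&` means pdftotext only runs if the LibreOffice conversion reported success.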

Dorian

Posted 2010-07-21T05:35:23.440

Reputation: 1 191

You can use the Text filter to convert to txt: libreoffice --invisible --convert-to txt:Text files – Gagaro – 2014-04-23T08:57:18.537

0

You could use CUPS (a virtual printer) and print to it with lp.

ukanth

Posted 2010-07-21T05:35:23.440

Reputation: 9 930

Interesting, can that convert any printable stuff to PDF? Can you point me to an example of doing that for doc or xls? – Phyo Arkar Lwin – 2010-07-22T04:40:13.820

0

wv is one option and IIRC OpenOffice can be told from the command line to export as pdf and exit.

ptman

Posted 2010-07-21T05:35:23.440

Reputation: 226

unoconv seems to be the OpenOffice-related tool I couldn't remember. – ptman – 2010-10-23T11:21:09.853

0

If you want to use Apache Tika in a Python project, please check out this blog post.

qris

Posted 2010-07-21T05:35:23.440

Reputation: 275

0

  1. doc: catdoc or antiword. To convert, you can use the command catdoc file.doc > file or antiword file.doc > file

  2. docx: docx2txt

  3. pdf: emacs file.pdf, then ctrl-x ctrl-s file
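The choice of tool by file type in the list above could be scripted along these lines. This is just an illustrative sketch: `pick_extractor` is a made-up helper name, and it only prints which tool to reach for rather than running it, since the exact command-line syntax differs between tools.

```shell
# pick_extractor (hypothetical helper): print the name of the tool this
# answer suggests for a given file, based on its extension.
pick_extractor() {
    # lower-case the name so REPORT.DOC and report.doc match alike
    lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$lower" in
        *.doc)  echo catdoc ;;
        *.docx) echo docx2txt ;;
        *)      echo unknown; return 1 ;;
    esac
}

# Example call: pick_extractor report.docx
```

For a .doc file the printed tool can then be run as in the list, for example catdoc file.doc > file.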

Linuxant

Posted 2010-07-21T05:35:23.440

Reputation: 1

(1) Catdoc was proposed in an answer that was posted within an hour of the question, almost three years ago.  Why are you repeating it?  (2) Where can antiword be obtained?  (3) What does the bottom half of your answer mean? – Scott – 2013-04-15T17:24:24.053

-1

Docsplit is the perfect tool to extract the text from PDFs. It's a Ruby gem, so you need Ruby and RubyGems installed on your Linux system before you can use the docsplit command.

If your system doesn't have Ruby and RubyGems, follow these instructions.

You must be root to install the software (assuming you want it to be available to all users).

Install Ruby on Linux: yum install ruby
Install RubyGems: download the latest RubyGems package, then run the following:

tar xzf rubygems-x.x.xx.tgz
cd rubygems-x.x.xx
ruby setup.rb

Now that RubyGems is installed, you should have the gem command (gem is the command used to interact with the RubyGems package system). Test it out by running:

gem list

Now move on to the next step. To install the docsplit gem, go to the following site: http://documentcloud.github.com/docsplit/

Mr. Black

Posted 2010-07-21T05:35:23.440

Reputation: 337

This is the exact opposite of what the OP asked for. – CarlF – 2011-08-15T12:54:03.810