Tools to extract text from PowerPoint pptx in Linux?

7

1

Are there any Linux tools that will extract the text from a PowerPoint pptx file? I tried catppt, but it just returns "file.pptx is not OLE file or Error". abiword --to=txt file.pptx only returns an empty text file.

I can open the file in LibreOffice, but it doesn't seem to have an "export as text" option. As a guess I also tried libreoffice --headless --convert-to txt:Text file.pptx, but that doesn't even produce an empty file.

felix

Posted 2013-10-17T16:55:55.923

Reputation: 471

Tried textedit with this as well, and it had nothing in it. Claims the file is locked too. Wonder if it's some sort of proprietary thing Microsoft implemented – Simon Sheehan – 2013-10-17T16:59:07.987

Answers

12

If you can process the files in bash, this one-liner will unpack all the text:

unzip -qc "$1" ppt/slides/slide*.xml | grep -oP '(?<=\<a:t\>).*?(?=\</a:t\>)'

Just pass it the pptx file as $1, and it will print the text to standard output; redirect that to a file if you want to keep it. The slides are emitted in the order unzip finds them in the archive, which is not necessarily presentation order, and there will be no labels or anything, so you'll need a few more lines of script and a temp directory to get a more readable listing.
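For a labelled listing, the same idea can be sketched in Python using only the standard library. This is a minimal sketch, not a definitive tool: the names slide_texts and dump_text are made up for illustration, slides are sorted by the number in their file name (which usually, but not always, matches presentation order), and, like the grep version, XML entities such as &amp;amp; are left unescaped.

```python
import re
import zipfile

# Archive members that hold slide content, e.g. ppt/slides/slide12.xml
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")
# Text runs are wrapped in <a:t>...</a:t>
TEXT_RE = re.compile(r"<a:t(?:\s[^>]*)?>(.*?)</a:t>", re.DOTALL)


def slide_texts(pptx_path):
    """Return [(slide_number, [text runs])] sorted by slide file number."""
    slides = []
    with zipfile.ZipFile(pptx_path) as archive:
        for name in archive.namelist():
            match = SLIDE_RE.match(name)
            if match:
                xml = archive.read(name).decode("utf-8")
                slides.append((int(match.group(1)), TEXT_RE.findall(xml)))
    return sorted(slides)


def dump_text(pptx_path):
    """Return all slide text as one string, with a label line per slide."""
    lines = []
    for number, runs in slide_texts(pptx_path):
        lines.append(f"--- slide {number} ---")
        lines.extend(runs)
    return "\n".join(lines)
```

Calling print(dump_text("file.pptx")) would then print each slide's text under its own label, with no temp directory needed.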

Byron Hawkins

Posted 2013-10-17T16:55:55.923

Reputation: 280

If you are wondering where you get those XML files from, you just unzip the pptx file and you can find the XML file for each slide in ppt/slides/slide*.xml – arun – 2017-08-09T23:37:10.303

I've been searching for binaries and perl scripts for a while now and this answer is so simple and effective. I'm just looking to get a basic text-dump from slides, nothing fancy, and this does it perfectly. I wish I could give it more +1's! – Chris Haas – 2014-05-07T16:02:46.710

2

Since you have AbiWord installed, you can just make a PDF first:

libreoffice --headless --convert-to pdf filename.pptx

And then use AbiWord to convert the PDF to text:

abiword --to=txt filename.pdf

Warface

Posted 2013-10-17T16:55:55.923

Reputation: 232

1

If you have node.js on your machine, https://github.com/dbashford/textract can do it.

xverges

Posted 2013-10-17T16:55:55.923

Reputation: 361

1

If you add .zip at the end of the filename (i.e. Presentation1.pptx.zip), you can then unzip the document and view its individual components.

In the resulting zip file there is the following directory: \Presentation1.pptx.zip\ppt\slides. This contains .xml files named after each individual slide. If you open one of these files, you will see that any entered text is wrapped in <a:t> tags.

For example: <a:t>TEST</a:t>

This is as far as I can help you, but hopefully it's enough.
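To take it one step further, here is a rough sketch of pulling those <a:t> values out of one slide's XML with Python's standard XML parser. The function name text_runs is hypothetical; the namespace URI is the DrawingML one that the a: prefix is normally bound to in slide files.

```python
import xml.etree.ElementTree as ET

# DrawingML main namespace, which the "a:" prefix conventionally maps to.
DRAWINGML_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def text_runs(slide_xml):
    """Return the text of every <a:t> element in one slide's XML (str or bytes)."""
    root = ET.fromstring(slide_xml)
    # iter() walks the whole tree; the tag must be given in {namespace}name form.
    return [node.text or "" for node in root.iter(f"{{{DRAWINGML_NS}}}t")]
```

Unlike a plain text search, the parser also decodes entities, so &amp;amp; in the file comes back as &.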

EDIT: As a side-note, the same process works for Word Documents as well. It's quite useful if you ever need to extract images from a Word Document.
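For the image case, a small sketch along the same lines, assuming the usual layout where a .docx keeps embedded images under word/media/ (a .pptx uses ppt/media/). The function names list_media and extract_media are made up for illustration.

```python
import zipfile


def list_media(doc_path, prefix="word/media/"):
    """Return the names of archive members stored under the media folder."""
    with zipfile.ZipFile(doc_path) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith(prefix) and not name.endswith("/")
        ]


def extract_media(doc_path, out_dir, prefix="word/media/"):
    """Extract only the media files into out_dir and return their names."""
    members = list_media(doc_path, prefix)
    with zipfile.ZipFile(doc_path) as archive:
        # extractall keeps the archive's folder structure below out_dir.
        archive.extractall(out_dir, members=members)
    return members
```

For example, extract_media("report.docx", "images") would leave the pictures in images/word/media/, and passing prefix="ppt/media/" should do the same for a presentation.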

Michael Frank

Posted 2013-10-17T16:55:55.923

Reputation: 7 412

0

You may try saving the ppt file in pdf format, then use pdftotext on it to retrieve existing text. Of course, all text embedded in figures is lost forever.

Alternatively, you may try the command

 strings filename

on either the ppt or the pdf.

MariusMatutiae

Posted 2013-10-17T16:55:55.923

Reputation: 41 321