Batch convert .doc files to .txt (plain ASCII text) and/or .html recursively in folders and subfolders, on Windows and Mac?



Is there a tool to do this? I've seen some Python/Java tools that automate OpenOffice, but has anyone reliably scripted it to handle more than one file and recurse through a folder/directory tree containing .doc files, placing the converted .txt and .html files next to each original file?


Posted 2011-03-02T16:12:10.593

Reputation: 3 596



@slhck your solution almost works, but the output goes to the display (STDOUT) with all the files concatenated together. I need individual .txt files as output. The reason is that the command doesn't account for the filename in the output.

To work around having to traverse a folder hierarchy, I use Windows search for *.doc and copy the results into a single folder, which flattens them all into one place. Then I can boot into Ubuntu and run the script below.

(I have a piece of file/folder recursion code somewhere, which I will dig out and add later if I have time.) For now, just flattening the file hierarchy as above is good enough.

By the way, catdoc works better than antiword for me: antiword complains that some files aren't Word documents. These tend to be .doc files with formatting and blocks of text organised as frames within the document. catdoc seems to convert all of my docs.

#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;

my $okFiles    = "";
my $couldntGet = "";

# Look at every entry in the current (flattened) directory.
my @files = <*>;
foreach my $file (@files) {
    # Only handle names ending in .doc (case-insensitive).
    next unless $file =~ m/\.doc$/i;

    my ( $filenameOnly, $dir, $ext ) = fileparse( $file, qr/\.[^.]*/ );
    if ( defined $filenameOnly && defined $ext ) {
        $okFiles .= "file: $file filename only: $filenameOnly extension: $ext\n";

        # catdoc prints to STDOUT, so redirect it into a per-file .txt
        system( "catdoc \"$file\" > \"${filenameOnly}.txt\"" );
    }
    else {
        $couldntGet .= "*file: $file - couldn't get filename only and extension\n";
    }
}

print $okFiles;
print $couldntGet;


Posted 2011-03-02T16:12:10.593

Reputation: 3 596



There are two Unix tools I know of:

  • catdoc
  • antiword

You could just use find to go through the folder recursively:

find . -name "*.doc" -exec <command> {} \;

Where <command> is the appropriate action to convert the .doc file into a .txt file, using either catdoc or antiword.
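
One way to fill in <command> so that each document gets its own .txt file beside it, rather than everything going to STDOUT, is to wrap the converter in a small sh -c call. This is only a minimal sketch: convert_tree and its optional second argument are names made up for illustration, and it assumes catdoc (or antiword) is on the PATH and that the files end in lowercase .doc.

```shell
#!/bin/sh
# convert_tree DIR [CONVERTER]
# Hypothetical helper: writes NAME.txt next to every NAME.doc under DIR.
# CONVERTER defaults to catdoc; any tool that prints text to STDOUT should work.
convert_tree() {
    root="${1:-.}"
    converter="${2:-catdoc}"
    find "$root" -type f -name '*.doc' -exec sh -c '
        converter=$1
        shift
        for doc in "$@"; do
            # Strip the .doc suffix so that a/b/x.doc becomes a/b/x.txt
            "$converter" "$doc" > "${doc%.doc}.txt"
        done
    ' sh "$converter" {} +
}
```

Calling convert_tree . from the top of the tree should convert everything below it, and convert_tree . antiword swaps the converter. Passing the filenames as arguments to the inner sh, instead of pasting {} into the command string, keeps names with spaces intact.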

Mac OS X

You can use the same tools, but you'll have to install them using, for example, Homebrew. To do this, enter in the Terminal:

ruby -e "$(curl -fsSL"

And then:

brew install catdoc
brew install antiword


Posted 2011-03-02T16:12:10.593

Reputation: 182 472

+1 for the solution. As the question says, I'd prefer Windows or Mac, but I also have Ubuntu, so I hope to be able to use your solution. I'll look it up and try it, and if it works I'll accept your answer. Thanks. – therobyouknow – 2011-03-03T08:51:15.137

I added installation instructions for OS X in the post. I haven't tried the <command> part yet, but I can look into that if you have any trouble. – slhck – 2011-03-03T10:50:02.793


catdoc and antiword have very limited file format support: they read only the legacy binary .doc format (up to roughly Word 2000) and not the newer .docx.

I know you can script LibreOffice to convert any file it understands to text or PDF (this is what MediaGoblin does), but I don't know exactly how to do that.


Posted 2011-03-02T16:12:10.593

Reputation: 352