Using find, xargs, etc. to output similarly named files

3

1

I have a folder full of HTML files:

001.htm
002.htm
003.htm
…

I want to run Pandoc on them to convert them to similarly named Markdown files:

001.md
002.md
003.md

This command works on one of them:

pandoc -f html -t markdown 001.htm -o 001.md

And I want to use find and xargs to automatically run a similar command on every file in the folder.

I got as far as this:

find *.htm | xargs -I {} -n 1 pandoc -f html -t markdown -o {}

…which truncates every file in the directory, so now I'm asking before I really break something.

What is wrong with my command above, and/or what's a completely different / more efficient way to do this?

75th Trombone

Posted 2013-01-25T17:49:12.863

Reputation: 162

Answers

4

I managed to do this with the following one-liner, provided you are flexible about not using xargs and find:

for f in ./*.htm; do pandoc -f html -t markdown "$f" -o "${f%.htm}.md"; done

If you want to act recursively (so: all .htm files in the current directory, and all subdirectories), then (assuming bash 4+) you can use the globstar shell option:

shopt -s globstar
for f in ./**/*.htm; do pandoc -f html -t markdown "$f" -o "${f%.htm}.md"; done
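Both loops rely on the "${f%.htm}" parameter expansion, which removes the shortest trailing match of .htm from the value of f before .md is appended. A minimal sketch to try it without running pandoc (md_name is a hypothetical helper name used only for illustration):

```shell
#!/bin/bash
# md_name is a hypothetical helper, not part of pandoc or of the loops above.
md_name() {
    # ${1%.htm} strips a trailing ".htm" from $1; ".md" is then appended.
    printf '%s\n' "${1%.htm}.md"
}

md_name "./001.htm"            # prints ./001.md
md_name "./my notes/002.htm"   # prints ./my notes/002.md
md_name "./a.htm.d/003.htm"    # prints ./a.htm.d/003.md
```

Because only the end of the string is matched, a .htm that appears earlier in the path is left alone.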

Martín Canaval

Posted 2013-01-25T17:49:12.863

Reputation: 901

2+1. xargs doesn't allow you the same flexibility of filename modification. Do not replace *.htm with $(find...) -- filenames with spaces will be properly handled in the first case but not the second. – glenn jackman – 2013-01-25T19:35:33.510

1@glennjackman Unless you set the bash $IFS to $'\n' for that code section, in which case spaces aren't a problem -- newlines still are though. – Daniel Beck – 2013-01-25T20:05:21.940

Wow, there are two or three new things about the command line for me to learn from that snippet. Thanks! – 75th Trombone – 2013-01-25T23:44:29.057

3

Using {} isn't flexible enough for some situations. This appears to be one of those.

A possible workaround would be to -exec a script from find, like so:

find . -name '*.htm' -exec ./convert-to-md.sh {} \;

The script file should look similar to this, depending on the exact pandoc command line:

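#!/bin/bash
# Strip the trailing .htm from the input path and append .md for the output.
# (${1%.htm} only touches the end of the path, unlike ${1/.htm/.md}.)
pandoc -f html -t markdown -o "${1%.htm}.md" "${1}"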

If you don't want to create and save a script file for this, you can always inline the bash script code:

find . -name '*.htm' | xargs -n 1 bash -c 'pandoc -f html -t markdown -o "${1%.htm}.md" "${1}"' -

The additional - at the end fills $0 in bash, which normally holds the name of the shell script; the positional arguments proper start at $1, so each file name passed by xargs arrives as $1.
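To see how bash -c assigns the words that follow the script string, here is a small sketch; the file name 001.htm is only a sample value:

```shell
#!/bin/bash
# The first word after the script string becomes $0, the next one $1.
out=$(bash -c 'printf "%s|%s\n" "$0" "$1"' - 001.htm)
echo "$out"   # prints -|001.htm
```

Without the placeholder, the file name itself would land in $0 and $1 would be empty.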

This allows you to keep using find (even with -print0 and xargs -0 if you're handling weird file names), but doesn't require creation of a separate file.
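A sketch of the null-delimited variant, written as a dry run so it can be tried without pandoc. list_conversions is a hypothetical helper name, and the printf stands in for the real pandoc call:

```shell
#!/bin/bash
# list_conversions is a hypothetical dry-run helper: for every .htm file under
# directory $1 it prints "input -> output" instead of running pandoc.
list_conversions() {
    # -print0 / -0 separate names with NUL bytes, so spaces and newlines
    # inside file names cannot split an argument.
    find "$1" -name '*.htm' -print0 |
        xargs -0 -n 1 bash -c \
            '[ -z "$1" ] || printf "%s -> %s\n" "$1" "${1%.htm}.md"' -
}

list_conversions .
```

Once the printed pairs look right, the printf can be swapped for pandoc -f html -t markdown -o "${1%.htm}.md" "$1". The [ -z "$1" ] guard is there because some xargs implementations run the command once even when find produces no input.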

Daniel Beck

Posted 2013-01-25T17:49:12.863

Reputation: 98 421

Doesn't find handle weird file names on its own anyway? IIRC there's never a good reason to use find … -print0 | xargs -0 … – slhck – 2013-01-25T19:58:35.430

@slhck Newlines are valid file name components. The following demonstrates how this causes scripts to fail: touch "$( echo -e 'foo\nbar' )" ; find . -name 'foo*bar' | xargs -n 1 echo File: – Daniel Beck – 2013-01-25T20:00:46.993

1

You appear to be missing a {} in the pandoc command:

find . -name \*.htm | xargs -I {} -n 1 pandoc -f html -t markdown {} -o {}.md

But then you'll have files named 001.htm.md -- you'll have to decide if this is a problem.
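If the doubled extension is a problem, one option is to rename the results afterwards. A minimal sketch, assuming all the .htm.md files sit in a single directory (strip_htm_infix is a hypothetical helper name):

```shell
#!/bin/bash
# strip_htm_infix is a hypothetical helper: it renames every NAME.htm.md in
# directory $1 to NAME.md.
strip_htm_infix() {
    local f
    for f in "$1"/*.htm.md; do
        [ -e "$f" ] || continue          # the glob matched nothing
        mv -- "$f" "${f%.htm.md}.md"     # drop ".htm.md", append ".md"
    done
}
```

Calling strip_htm_infix . after the xargs run would turn 001.htm.md into 001.md.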

glenn jackman

Posted 2013-01-25T17:49:12.863

Reputation: 18 546