How do I run this `find` command, but only on non-binary files?

8

1

I want to remove trailing whitespace from all the files in a recursive directory hierarchy. I use this:

find * -type f -exec sed 's/[ \t]*$//' -i {} \;

This works, but will also remove trailing "whitespace" from binary files that are found, which is undesirable.

How do I tell find to avoid running this command on binary files?

John Feminella

Posted 2011-08-26T14:16:28.523

Reputation: 1 582

Unix filesystems make no distinction between "binary" and "non-binary" files; there's no way to tell what type of data is in the file without looking inside it. – Wooble – 2011-08-26T14:20:50.380

@Wooble: That is correct, but there are commands such as file which can inspect the data. – John Feminella – 2011-08-26T14:24:20.660

Answers

4

You could try to use the Unix file command to help identify the files you don't want, but I think it may be better if you explicitly specify what files you want to hit rather than those you don't.

find * -type f \( -name \*.java -o -name \*.c -o -name \*.sql \) -exec sed 's/[ \t]*$//' -i {} \;

To avoid traversing into source-control directories, you might want something like

find * \! \( -name .svn -prune \) -type f \( -name \*.java -o -name \*.c -o -name \*.sql \) -exec sed 's/[ \t]*$//' -i {} \;

You may or may not need some of the backslashes depending on your shell.
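A variation on the same whitelist idea, offered as a sketch rather than a drop-in replacement. It starts from a directory instead of * (so top-level dotfiles are not skipped and names beginning with - cannot be mistaken for options), prunes .svn and .git, and batches files with -exec ... {} +. The function name strip_ws_by_ext and the extension list are placeholders, and sed -i assumes GNU sed.

```shell
#!/bin/sh
# Sketch: strip trailing blanks only from files whose names match a whitelist.
# strip_ws_by_ext is a hypothetical helper name; -i assumes GNU sed.
strip_ws_by_ext() {
    dir=${1:-.}
    # Prune version-control directories, then act only on whitelisted names.
    find "$dir" \
        \( -name .svn -o -name .git \) -prune -o \
        -type f \( -name '*.java' -o -name '*.c' -o -name '*.sql' \) \
        -exec sed -i 's/[[:blank:]]*$//' {} +
}
```

Quoting the patterns ('*.java') does the same job as the backslashes in the commands above, and [[:blank:]] is the POSIX class for space and tab.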

Bert F

Posted 2011-08-26T14:16:28.523

Reputation: 157

I don’t know about you, but all of our Java source files are always in standard UTF-8, so that sed command won’t always do the right thing with all of those. I also have systems without a -i option to sed. It’s hard to write a portable shell command, isn’t it? – tchrist – 2011-08-26T15:21:56.037

4

It can be done on the command line.

$ find . -type f -print|xargs file|grep ASCII|cut -d: -f1|xargs sed 's/[ \t]*$//' -i
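That pipeline splits on whitespace and colons in filenames, and grep ASCII passes over UTF-8 text. A hedged alternative asks file(1) for only the encoding, one file at a time. It assumes a file(1) that supports -b and --mime-encoding (the widely shipped implementation does) and GNU sed for -i. The name strip_text_files is made up for this sketch.

```shell
#!/bin/sh
# Sketch: let file(1) classify each file, and edit only those it does not
# report as "binary". strip_text_files is a hypothetical helper name.
strip_text_files() {
    find "${1:-.}" -type f -exec sh -c '
        for f do
            case $(file -b --mime-encoding "$f") in
                binary) ;;                                  # leave binary files alone
                *) sed -i "s/[[:blank:]]*\$//" "$f" ;;      # assumes GNU sed -i
            esac
        done' sh {} +
}
```

Passing the names through -exec sh -c '...' sh {} + keeps every filename intact, whatever characters it contains.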

Vijay

Posted 2011-08-26T14:16:28.523

Reputation: 711

3

The simplest and most portable answer is to run this:

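#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;
# With no arguments, start from every non-hidden entry in the current directory.
my @dirs = (@ARGV == 0) ? <*> : @ARGV;
find sub {
    # Only plain files that look like text; "return", not "next", leaves a sub.
    return unless -f && -T;
    # find() has chdir'd into the file's directory, so use $_, not $File::Find::name.
    system('perl', '-i', '-pe', 's/[\t\xA0 ]+$//', $_);
} => @dirs;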

I explain why below, where I also show how to do it using just the command line, as well as how to deal with trans-ASCII textfiles like ISO-8859-1 (Latin-1) and UTF-8, which often have non-ASCII whitespace in them.


The Rest of the Story

The problem is that find(1) doesn’t support the -T filetest operator, and even if it did, it wouldn’t recognize encodings, which you absolutely need in order to detect UTF-8, the de facto standard Unicode encoding.

What you could do is run the filename list through a layer that throws out binary files. For example

$ find . -type f | perl -nle 'print if -T' | xargs sed -i 's/[ \t]*$//'

However, now you have trouble with whitespace in your filenames, so you need to delimit the names with null termination instead:

$ find . -type f -print0 | perl -0 -nle 'print if -T' | xargs -0 sed -i 's/[ \t]*$//'

Another thing you could do is use not find but find2perl, since Perl understands -T already:

$ find2perl * -type T -exec sed 's/[ \t]*$//' -i {} \; | perl

And if you want Perl to assume its files are in UTF-8, use

$ find2perl * -type T -exec sed 's/[ \t]*$//' -i {} \; | perl -CSD

Or you could save the resulting script in a file and edit it. You really really should not just run the -T filetest on any old file, but rather only on those that are plain files as first determined by -f. Otherwise you risk opening device specials, blocking on fifos, etc.

However, if you are going to do all that, you might as well skip sed(1) altogether. For one thing, doing so is more portable, since the POSIX version of sed(1) does not understand -i, whereas all versions of Perl do. Latter-day versions of sed lovingly appropriated the very useful -i option from Perl, where it first appeared.

This gives you the opportunity to fix your regex, too. You should really be using a pattern that matches one or more trailing horizontal whitespace characters, not zero or more of them, or you will run slower from unnecessary copying. That is, this:

 s/[ \t]*$//

should be

 s/[ \t]+$//

However, getting sed(1) to understand + has traditionally required a non-POSIX extension that switches on extended regular expressions: usually -r for GNU sed, as found on Linux, or -E for BSD ones like OpenBSD or macOS. I suspect it is impossible under AIX. It is, alas, easier to write a portable shell than a portable shell script, you know.
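One way to sidestep the flag question, offered as a sketch: POSIX basic regular expressions already allow an interval, so \{1,\} means "one or more" without any extra switch. Because \t inside a bracket expression is not portable in sed either, the example builds a literal tab with printf. The name strip_trailing is made up, and the function filters standard input to standard output, which leaves the non-POSIX -i out of the picture.

```shell
#!/bin/sh
# Sketch: "one or more trailing blanks" using only POSIX BRE syntax.
# strip_trailing is a hypothetical helper name.
strip_trailing() {
    tab=$(printf '\t')                 # literal TAB, since \t is not portable in sed
    sed "s/[ $tab]\{1,\}\$//"          # \{1,\} is the BRE spelling of +
}
```

Typical use would be strip_trailing < in.txt > out.txt, followed by a mv once the result has been checked.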

Warning on 0xA0

Although those are the only horizontal white space characters in ASCII, both ISO-8859-1 and consequently also Unicode have the NO-BREAK SPACE at code point U+00A0. This is one of the top two non-ASCII characters found in many Unicode corpora, and I have lately seen a lot of people’s regex code break because they forgot about it.

So why don’t you just do this:

$ find * -print0 | perl -0 -nle 'print if -f && -T' | xargs -0 perl -i -pe 's/[\t\xA0 ]+$//'

If you might have UTF-8 files to deal with, add -CSD, and if you are running Perl v5.10 or greater, you can use \h for horizontal whitespace and \R for a generic linebreak, which includes \r, \n, \r\n, \f, \cK, \x{2028}, and \x{2029}:

$ find * -print0 | perl -0 -nle 'print if -f && -T' | xargs -0 perl -CSD -i -pe 's/\h+(?=\R*$)//'

That will work on all UTF-8 files no matter their linebreaks, getting rid of trailing horizontal whitespace (Unicode character property HorizSpace), including the pesky NO-BREAK SPACE, that occurs before a Unicode linebreak (including CRLF combos) at the end of each line.

It is also a lot more portable than the sed(1) version, because there is only one perl(1) implementation, but many of sed(1).

The main problem I see remaining there is with find(1), since on some truly recalcitrant systems (you know who you are, AIX and Solaris), it won’t understand the supercritical -print0 directive. If that is your situation, then you should just use the File::Find module from Perl directly, and use no other Unix utilities. Here is a pure Perl version of your code that doesn’t rely on anything else:

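#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;
# With no arguments, start from every non-hidden entry in the current directory.
my @dirs = (@ARGV == 0) ? <*> : @ARGV;
find sub {
    # Only plain files that look like text; "return", not "next", leaves a sub.
    return unless -f && -T;
    # find() has chdir'd into the file's directory, so use $_, not $File::Find::name.
    system('perl', '-i', '-pe', 's/[\t\xA0 ]+$//', $_);
} => @dirs;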

If you are running on just ASCII or ISO-8859-1 textfiles, that’s fine, but if you are running with ASCII or UTF-8 files, add -CSD to the switches in the interior call to Perl.

If you have mixed encodings of all three of ASCII, ISO-8859-1, and UTF-8, then I fear you have another problem. :( You will have to figure out the encoding on a per-file basis, and there is never a good way to guess that.

Unicode Whitespace

For the record, Unicode has 26 different whitespace characters. You can use the unichars utility to sniff these out. Only the first three horizontal whitespace characters are seen with any regularity:

$ unichars '\h'
 ---- U+0009 CHARACTER TABULATION
 ---- U+0020 SPACE
 ---- U+00A0 NO-BREAK SPACE
 ---- U+1680 OGHAM SPACE MARK
 ---- U+180E MONGOLIAN VOWEL SEPARATOR
 ---- U+2000 EN QUAD
 ---- U+2001 EM QUAD
 ---- U+2002 EN SPACE
 ---- U+2003 EM SPACE
 ---- U+2004 THREE-PER-EM SPACE
 ---- U+2005 FOUR-PER-EM SPACE
 ---- U+2006 SIX-PER-EM SPACE
 ---- U+2007 FIGURE SPACE
 ---- U+2008 PUNCTUATION SPACE
 ---- U+2009 THIN SPACE
 ---- U+200A HAIR SPACE
 ---- U+202F NARROW NO-BREAK SPACE
 ---- U+205F MEDIUM MATHEMATICAL SPACE
 ---- U+3000 IDEOGRAPHIC SPACE

$ unichars '\v'
 ---- U+000A LINE FEED (LF)
 ---- U+000B LINE TABULATION
 ---- U+000C FORM FEED (FF)
 ---- U+000D CARRIAGE RETURN (CR)
 ---- U+0085 NEXT LINE (NEL)
 ---- U+2028 LINE SEPARATOR
 ---- U+2029 PARAGRAPH SEPARATOR

tchrist

Posted 2011-08-26T14:16:28.523

Reputation: 218

0

GNU grep is pretty good at identifying whether a file is binary or not. I'm sure there are other platforms besides Solaris that don't come with GNU grep installed by default, but as with Solaris, I'm sure you can get it installed.

perl -pi -e 's{[ \t]+$}{}g' `grep -lRIP '[ \t]+$' .`

If you're in Solaris, you'd replace grep with /opt/csw/bin/ggrep.

The grep flags do the following: l only lists filenames for matching files, R is recursive, I matches only text files (ignores binary files), and P is for perl-compatible regular expression syntax.

The perl portion modifies the file in-place, deleting all trailing spaces/tabs.
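The backtick form splits filenames on whitespace, and when grep finds nothing, perl is left reading standard input. A hedged variant, assuming GNU grep (-Z writes a NUL after each name) and GNU xargs (-r skips the command when there is no input). It uses -E rather than -P, since [[:blank:]] needs no PCRE support. The name strip_with_grep is made up.

```shell
#!/bin/sh
# Sketch: NUL-delimited hand-off from GNU grep to perl.
# strip_with_grep is a hypothetical helper name.
strip_with_grep() {
    # -l names only, -R recursive, -I skip binary files, -Z NUL after each name
    grep -lRIZE '[[:blank:]]+$' "${1:-.}" |
        xargs -0 -r perl -pi -e 's/[ \t]+$//'
}
```

Only files that actually contain trailing blanks are handed to perl, so untouched files keep their timestamps.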

Lastly: if UTF8 is an issue, tchrist's answer coupled with mine should be sufficient, provided the build of grep you have was built with UTF8 support (usually package maintainers try to provide that kind of functionality, though).

Brian Vandenberg

Posted 2011-08-26T14:16:28.523

Reputation: 504