How can I find non-ASCII characters in text files?

Is there a tool that can scan a small text file and look for any character not in the simple ASCII character set?

A simple Java or Groovy script would also do.

Marcus Leon

Posted 2011-08-31T00:35:06.527

Reputation: 2 323

It can be moved there, though I would think this is of direct interest to programmers in the middle of certain programming tasks (such as the one I am on right now). – Marcus Leon – 2011-08-31T00:47:23.810

It's not a programming question, and therefore is off-topic. You've been here long enough to know that, but if not please read the FAQ for info on what questions are on-topic here. :)

– Ken White – 2011-08-31T00:49:51.783

You could of course use grep with a negated character class. – Tom Zych – 2011-08-31T00:59:14.190

Anything that isn’t going to go the route of grep '[^\x00-\xFF]' or its moral equivalent, using existing tools rather than writing a new program, is nothing but insane overkill. – tchrist – 2011-08-31T02:17:23.327

@tchrist, good point. Though I'm having an issue with that - http://stackoverflow.com/questions/7258299/grep-regex-doesnt-work-with-cygwin-on-windows

– Marcus Leon – 2011-08-31T14:47:42.667

Use grep -P '[^\x00-\xFF]' or perl -ne 'print if /[^\x00-\xFF]/'. Note that grep’s -P option doesn’t actually accept real Perl regexes. – tchrist – 2011-08-31T18:45:20.017

@tchrist: Doesn't ASCII run from 00 to 7F? – Tom Zych – 2011-09-02T00:49:31.217

@Tom: Yup. I was just mimicking what the OP did, which I later realized didn't make sense. – tchrist – 2011-09-02T01:31:57.320

Answers

Well, it's still here after an hour, so I may as well answer it. Here's a simple filter that prints only non-ASCII characters from its input, and gives exit code 0 if there weren't any and 1 if there were. Reads from standard input only.

#include <stdio.h>
#include <ctype.h>

int main(void)
{
    int c, flag = 0;

    while ((c = getchar()) != EOF)
        if (!isascii(c)) {
            putchar(c);
            flag = 1;
        }

    return flag;
}

Tom Zych

Posted 2011-08-31T00:35:06.527

Reputation: 921

Thanks, happen to have a Java version? :) – Marcus Leon – 2011-08-31T01:56:13.587

Nope, don't do Java, sorry. – Tom Zych – 2011-08-31T01:58:35.293

@Marcus: Monolingualism is about as environmentally healthy as any other monoculture. – tchrist – 2011-08-31T02:21:19.390

Just run $JDK_HOME/bin/native2ascii on the text file and search for "\u" in the output file. I'm assuming you want to find it so you can escape it anyway and this will save you a step. ;)
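As far as I know, native2ascii was dropped from the JDK in version 9, so on a newer JDK the same effect takes a few lines of code. Here is a rough sketch of the idea, not a faithful reimplementation of the tool: the class name AsciiEscaper and the method escape are made up for illustration, and it escapes one UTF-16 char at a time.

```java
class AsciiEscaper {
    // Hypothetical helper: replaces every char above 0x7F with a
    // backslash-u escape of four hex digits, roughly like native2ascii.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch > 0x7F) {
                // Non-ASCII: emit the escaped form instead of the raw char.
                sb.append(String.format("\\u%04x", (int) ch));
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumes the text to escape is passed as the first argument.
        System.out.println(escape(args[0]));
    }
}
```

If the output differs from the input at all, the input contained non-ASCII characters.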

jonathan.cone

Posted 2011-08-31T00:35:06.527

Reputation: 208

I have no idea if this is the best way: take each character's code point and print the ones above 127, the top of the ASCII range. I'm also too lazy to write this in Java, so have some Groovy:

def chars = ['Ã', 'a', 'Â', 'ç', 'x', 'o', 'Ð']

chars.each {
    // ASCII covers code points 0-127, so anything higher is non-ASCII
    if (it.codePointAt(0) > 127) { print it + " " }
}

==> Ã Â ç Ð

awfulHack

Posted 2011-08-31T00:35:06.527

Reputation: 101

In Java (assuming the string is specified as the first command-line argument):

public class Main
{
    public static void main(String[] args)
    {
        String stringToSearch = args[0];
        int len = stringToSearch.length();
        for (int i = 0; i < len; i++)
        {
            char ch = stringToSearch.charAt(i);
            if (ch >= 128) // non-ascii
            {
                System.out.print(ch + " ");
            }
        }
        System.out.println();
    }
}

To make this your own, replace stringToSearch with whatever you need.
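Since the question is about files, the same check can be pointed at a file instead of an argument. This is only a sketch under some assumptions: the file is taken to be UTF-8 (readAllLines throws on malformed input), the path is the first command-line argument, and the names FindNonAscii and scan are invented for the example. It reports the line and column of each hit and exits with 1 if there were any.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class FindNonAscii {
    // Returns one "line:column: character (U+XXXX)" entry per non-ASCII code point.
    // Columns are 1-based and counted in UTF-16 chars.
    static List<String> scan(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (int row = 0; row < lines.size(); row++) {
            String line = lines.get(row);
            for (int col = 0; col < line.length(); ) {
                int cp = line.codePointAt(col);
                if (cp > 127) { // outside the ASCII range 0-127
                    hits.add(String.format("%d:%d: %s (U+%04X)", row + 1, col + 1,
                            new String(Character.toChars(cp)), cp));
                }
                // Step over both halves of a surrogate pair when needed.
                col += Character.charCount(cp);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the input is UTF-8; pass another Charset for other encodings.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> hits = scan(lines);
        hits.forEach(System.out::println);
        System.exit(hits.isEmpty() ? 0 : 1);
    }
}
```

Reading all lines at once is fine for the small files the question mentions; a larger file would be better served by a BufferedReader loop.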

Nathan Moos

Posted 2011-08-31T00:35:06.527

Reputation: 101

A simple Groovy example:

def str = [ "this doesn't have any unicode", "this one does ±ÁÎ˜Â·€ÔÅ" ]

str.each {
    if( it ==~ /[\x00-\x7F]*/ ) {
        println "all ascii: $it"
    } else {
        println "NOT ASCII: $it"
    }
}

It's as simple as this bit here: it ==~ /[\x00-\x7F]*/

Edit: I forgot to include a version for files. Oops:

def text = new File(args[0]).text
if( text ==~ /[\x00-\x7F]*/ ) {
    println "${args[0]} is only ASCII"
    System.exit(0)
} else {
    println "${args[0]} contains non-ASCII characters"
    System.exit(-1)
}

That version can be used as a command line script, and includes an exit status so it can be chained.

OverZealous

Posted 2011-08-31T00:35:06.527

Reputation: 109

It doesn’t make any sense to read the whole file into memory. Note that EVERY SINGLE STRING EVER CREATED matches something like /[\x00-\xFF]*/, just as every single string also matches /a*/, even "xxx". Zero or more means you’re content with 0. And /[\x80-\xFF]/ is not ASCII! You need to match /^[\x00-\x7F]+$/ to be all ASCII. A normal regex engine with the very most basic Unicode support would simply use \p{ASCII} vs \P{ASCII}. – tchrist – 2011-08-31T18:49:09.310

@tchrist I appreciate the feedback. Of course, it would be more efficient to stream the file. However, the original question specifically asked about scanning a small file. Your comment about the regex is incorrect, simply due to the fact that I actually tested my code before I posted it. Sorry if my range is incorrect - that might be a valid comment, but your comment is unnecessarily aggressive and rude. I was simply providing a working Groovy-based example, since the question mentioned it. – OverZealous – 2011-09-01T03:36:12.470

Also, you have to match the empty string, or empty files will show up as non-ASCII. I think that is incorrect behavior. – OverZealous – 2011-09-01T03:38:44.590

Nope, ASCII is code points 0 through 127. Your pattern matches 0 through 255. It is therefore wrong. – tchrist – 2011-09-01T11:48:40.043

I shouldn't bother responding, but I need to point out two things: First, you could have simply pointed that out, and suggested a fix, and I would have updated my suggestion. That's how StackExchange works - answers can be edited and cleaned up. Second, it's funny you are making such a big deal about the range, since that's the exact same range you suggested above! It's OK though, I understand that you would rather knock someone down than be helpful. – OverZealous – 2011-09-01T17:54:35.007
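For reference, the \p{ASCII} / \P{ASCII} classes mentioned in the comments above are supported by java.util.regex, so the range question can be sidestepped entirely. A minimal sketch, with AsciiRegexCheck and its method names being illustrative rather than anything from the thread:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AsciiRegexCheck {
    // \P{ASCII} matches any single character outside U+0000-U+007F.
    private static final Pattern NON_ASCII = Pattern.compile("\\P{ASCII}");

    // find() searches for a match anywhere in the text, so an empty
    // string (or empty file) counts as all ASCII.
    static boolean isAllAscii(String text) {
        return !NON_ASCII.matcher(text).find();
    }

    // Index of the first non-ASCII character, or -1 if there is none.
    static int firstNonAsciiIndex(String text) {
        Matcher m = NON_ASCII.matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Assumes each command-line argument is a string to check.
        for (String s : args) {
            System.out.println((isAllAscii(s) ? "all ASCII: " : "NOT ASCII: ") + s);
        }
    }
}
```

Searching for a single offending character with find() also avoids the zero-or-more pitfall discussed above, because nothing depends on whether the pattern is anchored.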