awk character differences when using if

1

I have an input file containing this line (user data and other columns stripped out) and several thousand more. The xCE is an unconverted hex value from the client's file.

412640  xCE

When I run it through this awk command:

awk -F'\t' '{if ($1 == "412640" ) print $1 "\t" $2}' TEST.txt > test1.txt

the output in test1.txt has converted xCE to Î, which is what I want to happen.

When I run the entire file without the if, using this command:

awk -F'\t' '{print $1 "\t" $2}' TEST.txt > test2.txt

the output in test2.txt still has xCE in it, and when I tried:

awk -F'\t' '{if ($1 == $1 )print $1 "\t" $2}' TEST.txt > test2.txt

the output in test2 still has xCE in it.

Any advice on how to always get the converted output?

I'm using GNU Awk 3.1.7 on Red Hat 6.7; my codepage is UTF-8.

EDIT: After a lot more unit testing of both the 'good' and 'bad' awk commands, I can't always reproduce the 'bad' output. The larger the total row count, the less likely the hex values are to be converted, but it doesn't happen 100% of the time. I'm now looking into controlling the size of awk's buffer, on the assumption that the difference comes from writing straight from the buffer to the output versus writing to internal temp files when the buffer is needed for other things.

mike ray

Posted 2015-10-13T17:29:10.570

Reputation: 31

I ran your if ($1 == "412640" ) command on the line provided. It outputs nothing. Please add a link to a test file with a few lines in it, and state the system you are running on and your version of Awk. (Mine is GNU Awk 4.0.1.) – Hastur – 2015-10-13T17:39:10.373

Hastur, I'm guessing you have spaces instead of a tab between the two columns. Is there a way to upload files to Super User? – mike ray – 2015-10-13T17:41:42.810

How is print $1 "\t" $2 supposed to convert xCE to Î? – Steven – 2015-10-13T17:44:16.337

I've updated the question to include the awk version, Linux release and codepage, and to explain that the xCE is an unconverted character from the client file. – mike ray – 2015-10-13T17:50:02.390

I tried uploading a sample through Google Docs, but it kept being 'helpful' and converting the bad character for me... – mike ray – 2015-10-13T20:05:58.520

Answers

1

Try something along these lines:

awk '{ printf("%c \n", strtonum("0x" substr($2,2))) }' TEST.txt

Of course, modify the printf expression to suit your needs, adding the if on $2 ...

Hastur

Posted 2015-10-13T17:29:10.570

Reputation: 15 043

The sample above is just an example with all client data removed; I can't predict where the bad hex values will show up, and client data can run to millions of rows. If I understand your suggestion correctly, I'd have to strtonum every single character. – mike ray – 2015-10-13T20:04:59.837

@mikeray: Sorry, I was really in a hurry; my suggestion was only a hint for that specific situation (since you asked "Any advice on how to always get the converted output?"). BTW, if the input doesn't always look like that, it is a different case. IMHO you should try to get the input fixed where the clients create it, so that you can repair the existing data once and then continue with a standard workflow. Since fairies are in short supply in my neighborhood, I'm afraid you will keep having to deal with non-uniform input. :-) Hence the need to scan it all... use if to skip unnecessary operations... – Hastur – 2015-10-14T08:28:31.703
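
A minimal sketch of that scan-and-skip idea, not a tested fix for the client data. It assumes the file really contains the literal text xHH (the letter x followed by two hex digits) and that the token fills a whole tab-separated field, as in the sample line; if the file actually holds a raw byte that an editor merely displays as xCE, this will not match anything. The names convert_hex_fields and hexval are made up for illustration, and the hex digits are decoded by hand instead of with strtonum, so the script does not depend on GNU Awk.

```shell
# Sketch only: converts fields that consist solely of xHH into the character
# with that code and leaves everything else untouched.
convert_hex_fields() {
    awk -F'\t' -v OFS='\t' '
        # Value (0-15) of one hex digit; callers pass only validated digits.
        function hexval(c) { return index("0123456789ABCDEF", toupper(c)) - 1 }

        # Cheap line-level test first, so clean lines skip the field loop.
        /x[0-9A-Fa-f][0-9A-Fa-f]/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^x[0-9A-Fa-f][0-9A-Fa-f]$/)
                    $i = sprintf("%c", 16 * hexval(substr($i, 2, 1)) + hexval(substr($i, 3, 1)))
        }
        { print }
    ' "$@"
}
```

It could be run as convert_hex_fields TEST.txt > converted.txt. Matching whole fields only is a deliberate choice: a word such as "boxed" contains x followed by two hex digits and would be corrupted by a looser pattern. For codes above 127, the bytes that %c writes may depend on the awk implementation and the locale, so check the output on a small sample first.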