Input field separator in awk

I have many large data files. The delimiter between the fields is a semicolon. However, I have found that there are semicolons in some of the fields, so I cannot simply use the semicolon as a field separator.

The following example has 4 fields, but awk sees only 3, because the '1' that makes up field 3 is consumed as part of the separator match (the regex includes a '-' because some of the numerical data are negative):

echo '"This";"is";1;"line of; data"' | awk -F'[0-9"-];[0-9"-]' '{print "No. of fields:\t"NF; print "Field 3:\t" $3}'
No. of fields:  3
Field 3:        ;"line of; data"
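To see where the '1' goes, the separator matches can be printed one at a time. This is only a diagnostic sketch: the function name show_separator_matches is made up for illustration, and it uses nothing beyond awk's standard match(), RSTART and RLENGTH.

```shell
# Hypothetical helper: print every substring that the separator regex consumes.
show_separator_matches() {
    awk '{
        s = $0
        # Repeatedly find the next separator match and cut it off the front.
        while (match(s, /[0-9"-];[0-9"-]/)) {
            print "separator match: " substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)
        }
    }'
}

echo '"This";"is";1;"line of; data"' | show_separator_matches
# separator match: ";"
# separator match: ";1
```

The second match swallows the '1' along with its neighbouring semicolon, so what remains for field 3 is ;"line of; data".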

Of course,

echo '"This";"is";1;"line of; data"' | awk -F';' '{print "No. of fields:\t"NF}'
No. of fields:  5

solves that problem, but counts the last field as two separate fields.

Does anyone know a solution to this?

Thanks!

Matthijs

Matthijs

Posted 2012-10-05T20:43:04.850

Reputation: 157

Answers

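You will need GNU awk (gawk 4.0 or later) and its FPAT variable, which describes what a field looks like instead of what separates the fields: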

awk 'BEGIN { FPAT = "([^;]+)|(\"[^\"]+\")" } { for (i=1; i<=NF; i++) print $i }'

Testing:

echo '"This";"is";1;"line of; data"' | awk 'BEGIN { FPAT = "([^;]+)|(\"[^\"]+\")" } { for (i=1; i<=NF; i++) print $i }'

Results:

"This"
"is"
1
"line of; data"
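If GNU awk is not available, the same idea can be sketched in portable awk by walking each line character by character and splitting only on semicolons that are outside double quotes. This is a minimal sketch, not a full CSV parser: the function name split_semicolon is hypothetical, and it assumes quotes are balanced and never escaped inside a field.

```shell
# Hypothetical helper: print each field of every input line on its own line,
# splitting on ';' only when it is outside double quotes.
split_semicolon() {
    awk '{
        n = 0; field = ""; inquote = 0
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (c == "\"") inquote = !inquote   # toggle quote state; the quote is kept
            if (c == ";" && !inquote) {         # unquoted semicolon ends the field
                f[++n] = field; field = ""
            } else {
                field = field c
            }
        }
        f[++n] = field                          # last field has no trailing semicolon
        for (i = 1; i <= n; i++) print f[i]
    }'
}

echo '"This";"is";1;"line of; data"' | split_semicolon
# "This"
# "is"
# 1
# "line of; data"
```

Unlike the FPAT pattern above, this sketch also keeps empty fields (for example a;;b gives three fields), at the cost of being slower on large files.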

Steve

Posted 2012-10-05T20:43:04.850

Reputation: 511

+1, nice answer. I was just about to suggest using a different language with a CSV module. – glenn jackman – 2012-10-06T13:48:38.627