When your field-terminating char appears within field values

1

I've had a very colorful morning learning the innerparts of Linux's sort command, and have come across yet another issue that I can't seem to find an answer for in the documentation.

I'm currently using -t, to indicate that my fields are split by the comma character, but I'm finding that in some of my files, the comma is used (between double-quotes) within values:

Jonathan Sampson,,foo@bar.com,0987654321
"Foobar CEO,","CEO,",ceo@foobar.com,,

How can I use a comma to terminate my fields, but ignore the occurences of it within values? Is this fairly simple, or do I need to re-export all of my data using a more-foreign field-terminator? (Unfortunately, I do not have any control over declaring a different terminator with this particular project).

Sampson

Posted 2010-05-18T15:28:24.270

Reputation: 7 244

Answers

0

I think sort is completely simple-minded about field delimiters. A good candidate might be a tab.

Paused until further notice.

Posted 2010-05-18T15:28:24.270

Reputation: 86 075

Unfortunately I don't have the ability to control the field-terminator. Perhaps I need to replace all commas appearing between double-quotes via a regular expression (would need to ask a new question), and then do my sort. – Sampson – 2010-05-18T15:54:11.800

@Jonathan: I would recommend using a Perl module ( Text::CSV , Text::CSV_XS ) or Python lib ( csv ) that understands CSV files and can convert to tab-delimited rather than try to create a regex. There are too many gotchas.

– Paused until further notice. – 2010-05-18T16:33:28.007

I'm opened to just about any suggestion at this point. Please note that I'm a juvenile linux user, but very comfortable with regex syntax. I'd appreciate any solid direction you can offer. – Sampson – 2010-05-18T16:35:23.603

@Jonathan: sed might do the job, but it would take some complex gyrations if it's possible at all, but it can't do lookarounds. I don't do Perl but there's probably "an app for that" somewhere among the shifted numeral keys on your keyboard. Do you do Perl? Python? Take a look on SO for questions about converting or reading CSV files. A quickie hack might be to just import into Excel and export back out to tab-delimited (I'm pretty sure it can do that). – Paused until further notice. – 2010-05-18T16:47:56.897

0

CSV format allows for field to contain commas, provided that the field is quoted:

field1, "field2,with,commas"  ,  field3  ,  "field4,foo"

We have a mixture of quoted and unquoted fields here, which cannot parsed directly by any value of FS (that I know of, at least). However, we can still get the fields using match() in a loop (and cheating a bit):

$0=$0",";                                  # yes, cheating

while($0) {

  match($0,/[^,]*,| *"[^"]*" *,/);            

  sf=f=substr($0,RSTART,RLENGTH);          # save what matched in sf

  gsub(/^ *"?|"? *,$/,"",f);               # remove extra stuff

  print "Field " ++c " is " f;

  sub(sf,"");                              # "consume" what matched

}

As the complexity of the format increases (for example when escaped quotes are allowed in fields), awk solutions become more fragile. Although I should not say this here, for anything more complex than the last example, I suggest using other tools (eg, Perl just to name one). Btw, it looks like there is an awk CSV parsing library here: http://lorance.freeshell.org/csv/ (I have not tried it).

rch1231

Posted 2010-05-18T15:28:24.270

Reputation: 21