I've had a very colorful morning learning the inner workings of Linux's sort command, and have come across yet another issue that I can't seem to find an answer for in the documentation.
I'm currently using -t, to indicate that my fields are split by the comma character, but I'm finding that in some of my files, the comma is used (between double-quotes) within values:
Jonathan Sampson,,foo@bar.com,0987654321
"Foobar CEO,","CEO,",ceo@foobar.com,,
How can I use a comma to terminate my fields, but ignore the occurrences of it within quoted values? Is this fairly simple, or do I need to re-export all of my data using a less common field terminator? (Unfortunately, I do not have any control over declaring a different terminator with this particular project).
Unfortunately I don't have the ability to control the field-terminator. Perhaps I need to replace all commas appearing between double-quotes via a regular expression (would need to ask a new question), and then do my sort. – Sampson – 2010-05-18T15:54:11.800
@Jonathan: I would recommend using a Perl module ( Text::CSV , Text::CSV_XS ) or Python lib ( csv ) that understands CSV files and can convert to tab-delimited rather than try to create a regex. There are too many gotchas.
– Paused until further notice. – 2010-05-18T16:33:28.007
I'm open to just about any suggestion at this point. Please note that I'm a novice Linux user, but very comfortable with regex syntax. I'd appreciate any solid direction you can offer. – Sampson – 2010-05-18T16:35:23.603
@Jonathan: sed might do the job, but since it can't do lookarounds it would take some complex gyrations, if it's possible at all. I don't do Perl but there's probably "an app for that" somewhere among the shifted numeral keys on your keyboard. Do you do Perl? Python? Take a look on SO for questions about converting or reading CSV files. A quickie hack might be to just import into Excel and export back out to tab-delimited (I'm pretty sure it can do that). – Paused until further notice. – 2010-05-18T16:47:56.897
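For anyone who wants to try the Python route suggested in the comments, here is a minimal sketch using the standard csv module. It parses the two sample lines from the question (so commas inside double quotes stay part of the value), sorts on one column, and writes tab-delimited output. The function names sort_csv_rows and to_tsv and the key_index parameter are made up for illustration, and the sketch assumes the values themselves contain no tab characters.

```python
import csv
import io


def sort_csv_rows(text, key_index=0):
    """Parse CSV text, honoring quoted commas, and return rows sorted by one column."""
    rows = list(csv.reader(io.StringIO(text)))
    # Rows that are too short sort as if the field were empty instead of raising IndexError.
    return sorted(rows, key=lambda row: row[key_index] if key_index < len(row) else "")


def to_tsv(rows):
    """Serialize rows as tab-delimited text (assumes no tabs inside the values)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


# The two sample lines from the question.
sample = (
    'Jonathan Sampson,,foo@bar.com,0987654321\n'
    '"Foobar CEO,","CEO,",ceo@foobar.com,,\n'
)

rows = sort_csv_rows(sample, key_index=0)
print(to_tsv(rows), end="")
```

Once the data is tab-delimited, the commas inside values no longer collide with the field separator, so it should also be possible to hand the converted file to sort itself with a tab as the -t argument (in bash, something like sort -t$'\t' -k1,1) rather than sorting in Python.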