0
I have a large number of large semicolon-separated data files. All string fields are surrounded by double quotes. In some of the files, there are extra quotes in the string fields, which messes up the subsequent importing of the data for analysis (I'm importing to Stata).
This code allows me to see the problematic quotes using gnu-awk:
echo '"This";"is";1;"line" of" data";""with";"extra quotes""' | awk 'BEGIN { FPAT = "([^;]+)|(\"[^\"]+\")"}; {for ( i=1 ; i<=NF ; i++ ) if ($i ~ /^"(.*".*)+"$/) {print NR, $i}}'
1 "line" of" data"
1 ""with"
1 "extra quotes""
but I do not know how to replace them.
I was thinking of doing the replace manually, but it turns out that there are several hundred matches in some of the files. I know about awk's -sub-, -gsub-, and -match- functions, but I am not sure how to design a search and replace for this specific problem.
In the example above, the respective fields should be "This"
, "is"
, 1
, "line of data"
, "with"
, "extra quotes"
, that is: all semicolons are separators, and all quotes except for the outermost quotes should be removed.
Should I may be use -sed-, or is -awk- the right tool? Hope you can help me out!
Thanks,
Matthijs
The case
""with";"extra quotes""
is not clear: your awk script prints out it as two separate lines, while your explanation tells that you would like finally get it as "with;extra quotes". Am I correct? – Serge – 2012-10-09T07:48:20.690@Serge No, the field delimiting is correct. I want all quotes except for the ones at the beginning or end of the field removed. It should be two fields:
"with"
and"extra quotes"
. – Matthijs – 2012-10-09T09:15:00.613The -gensub- in -awk- works for me, except for some pathetic cases.
echo '"This";"is";1;"line" of" data";""with";"extra quotes""' | awk 'BEGIN { FPAT = "([^;]+)|(\"[^\"]+\")"}; {print gensub(/([^;])["]+([^;])/, "\\1\\2", "g")};'
. An example of a pathetic case would be"ab""c"d"
. For now, though, this solution is fine for me. – Matthijs – 2012-10-09T09:27:01.003then could you please clarify. 1) Is this correct that wherever you encounter a semicolon it is a field separator? Do you want all quotes to be removed and only 'outbound' quotes to remain for any field satisfying the rule 1) and that has any quotes? – Serge – 2012-10-09T11:02:19.247
@Serge Edited the question, hopefully more clear now. – Matthijs – 2012-10-09T12:13:15.163