Replacing quotes in a file

0

I have a large number of large semicolon-separated data files. All string fields are surrounded by double quotes. In some of the files, there are extra quotes in the string fields, which messes up the subsequent importing of the data for analysis (I'm importing to Stata).

This code allows me to see the problematic quotes using gnu-awk:

echo '"This";"is";1;"line" of" data";""with";"extra quotes""' | awk 'BEGIN { FPAT = "([^;]+)|(\"[^\"]+\")"}; {for ( i=1 ; i<=NF ; i++ ) if ($i ~ /^"(.*".*)+"$/) {print NR, $i}}'
1 "line" of" data"
1 ""with"
1 "extra quotes""

but I do not know how to replace them.

I was thinking of doing the replace manually, but it turns out that there are several hundred matches in some of the files. I know about awk's -sub-, -gsub-, and -match- functions, but I am not sure how to design a search and replace for this specific problem.

In the example above, the respective fields should be "This", "is", 1, "line of data", "with", "extra quotes", that is: all semicolons are separators, and all quotes except for the outermost quotes should be removed.

Should I may be use -sed-, or is -awk- the right tool? Hope you can help me out!

Thanks,

Matthijs

Matthijs

Posted 2012-10-09T07:10:47.373

Reputation: 157

The case ""with";"extra quotes"" is not clear: your awk script prints out it as two separate lines, while your explanation tells that you would like finally get it as "with;extra quotes". Am I correct? – Serge – 2012-10-09T07:48:20.690

@Serge No, the field delimiting is correct. I want all quotes except for the ones at the beginning or end of the field removed. It should be two fields: "with" and "extra quotes" . – Matthijs – 2012-10-09T09:15:00.613

The -gensub- in -awk- works for me, except for some pathetic cases. echo '"This";"is";1;"line" of" data";""with";"extra quotes""' | awk 'BEGIN { FPAT = "([^;]+)|(\"[^\"]+\")"}; {print gensub(/([^;])["]+([^;])/, "\\1\\2", "g")};'. An example of a pathetic case would be "ab""c"d". For now, though, this solution is fine for me. – Matthijs – 2012-10-09T09:27:01.003

then could you please clarify. 1) Is this correct that wherever you encounter a semicolon it is a field separator? Do you want all quotes to be removed and only 'outbound' quotes to remain for any field satisfying the rule 1) and that has any quotes? – Serge – 2012-10-09T11:02:19.247

@Serge Edited the question, hopefully more clear now. – Matthijs – 2012-10-09T12:13:15.163

Answers

1

My own solution, using only sed, stripping away all semicolons that are not juxtaposed with delimiters or numerical fields (the awk command is only there for expositional clarity):

echo '"This";"is";1;"line" of" data";""without";"extra quotes""' | sed -E 's/([^;])"+([^;])/\1\2/g' | awk 'BEGIN { FPAT = "([^;]+)|(\"[^\"]+\")"}; {for ( i=1 ; i<=NF ; i++ ) print $i}'
"This"
"is"
1
"line of data"
"without"
"extra quotes"

I think it's faster since it works on the full lines rather than splitting the line up in fields.

Matthijs

Posted 2012-10-09T07:10:47.373

Reputation: 157

Yes a good simple solution; why didn't I think of that? You can use time to compare solutions and see what's faster. You may be surprised, sometimes these global substitutions can take a while. But if I were you, I'd probably still accept this answer based on it's simplicity. – Steve – 2012-10-10T09:53:29.017

One issue is an empty text field at the end of the line (i.e., ";"""). My solution strips the last quote. I use sed -E 's/;"([^"]*)$/;"\1"/g' to fix that; takes another run though. – Matthijs – 2012-10-10T13:15:06.093

1

One way using GNU awk and FPAT:

awk 'BEGIN { FPAT = "([^;]+)|(\"[^\"]+\")" } { for (i=1; i<=NF; i++) if (substr($i,0,1) == "\"" && substr($i,length($i),1) == "\"") { gsub(/"/, "", $i); printf "\"%s\"\n", $i } else { gsub(/"/, "", $i); print $i } }'

Testing:

echo '"This";"is";1;"line" of" data";""with";"extra quotes""' | awk 'BEGIN { FPAT = "([^;]+)|(\"[^\"]+\")" } { for (i=1; i<=NF; i++) if (substr($i,0,1) == "\"" && substr($i,length($i),1) == "\"") { gsub(/"/, "", $i); printf "\"%s\"\n", $i } else { gsub(/"/, "", $i); print $i } }'

Results:

"This"
"is"
1
"line of data"
"with"
"extra quotes"

One way using GNU awk, FPAT and GNU sed:

sed -e '/^".*"$/ { s/"//g; s/.*/"&"/ }' -e '/^".*"$/!s/"//g'

Testing:

echo '"This";"is";1;"line" of" data";""with";"extra quotes""' | awk 'BEGIN { FPAT = "([^;]+)|(\"[^\"]+\")" } { for (i=1; i<=NF; i++) print $i }' | sed -e '/^".*"$/ { s/"//g; s/.*/"&"/ }' -e '/^".*"$/!s/"//g'

Results:

"This"
"is"
1
"line of data"
"with"
"extra quotes"

Steve

Posted 2012-10-09T07:10:47.373

Reputation: 511

Personally, I like the sed pipe. It's surprisingly easier to read IMO. Both solutions work the same; If the line contains beginning and end quotes, remove them all. And add them back in at print time. Otherwise just remove them all. HTH. – Steve – 2012-10-09T13:38:49.113

Thanks, nice solutions. Going through all the fields of all the lines with awk takes some time though, this being one of many data files between 100-1000MB in size. – Matthijs – 2012-10-10T09:10:42.970

1

I would rather go with coreutils and sed (GNU versions):

<<< '"This";"is";1;"line" of" data";""with";"extra quotes""' \
| tr ';' '\n' | sed -r 's/(.)"(.)/\1\2/g' | tr '\n' ';'

Output:

"This";"is";1;"line of data";"with";"extra quotes";

It leaves an extra semicolon, and drops the newline, insert head -c -1 before the second tr and append ; echo to fix:

tr ';' '\n' | sed -r 's/(.)"(.)/\1\2/g' | head -c -1 | tr '\n' ';'; echo

Output:

"This";"is";1;"line of data";"with";"extra quotes"

Thor

Posted 2012-10-09T07:10:47.373

Reputation: 5 178