Read CSV using pandas with values enclosed with double quotes and values have comma in column

5

I need to read a file in python pandas of the following type

"column1","column2","column3","column4"
"value1","value,1","value2","value3"
"value5","value6","value7","value8"
"value32","value21","value,31","value,44"

I tried using

file1 = pd.read_csv('sample.txt',sep=',\s+',skipinitialspace=True,quoting=csv.QUOTE_ALL,engine=python)

it says something like ValueErro(Expected some lines got something else ) not exactly

I need to read a large CSV file of this type and load it to dataframe. what changes should i make to read it correctly.

Ajay K S

Posted 2017-03-03T10:20:22.937

Reputation: 51

I think you need to use sep=',\s*' instead of sep=',\s+'. As about comma inside quoted value (as it is a case for "value,31") it comply with rfc4180 and shouldn't be an issue – Alex – 2017-03-03T11:01:31.270

Earlier it was showing**ValueError('Expected 1 fields in line 328, saw4',) **and after changing it to * it shows ValueError('Expected 1 fields in line 328, saw6',) – Ajay K S – 2017-03-03T12:00:26.787

It looks like the issue with source data. Check that line 328 in source data file – Alex – 2017-03-03T12:27:50.210

I am sorry i haven't mentioned about that, I have checked it and found that there is an extra comma inside double quotes. i removed it manually and the code works fine. But i cannot do this all time, how can I change code to handle the situation. There is another problem that inside the double quotes for one value there was another " " it also make the program to exit. – Ajay K S – 2017-03-03T12:43:35.267

comma inside double quotes is Ok. As about " " - you need to clean up source file before processing. If double quotes stay together as "" it shouldn't be an issue too because it comply with CSV standard, it calls escaped double quotes. If there is a space between double quotes then run sed -r 's/\"\s+\"/\"\"/g' src.csv >cleared.csv before you feeding CSV to pandas. It will remove space between quotes or run sed -r 's/\"\s+\"//g' src.csv >cleared.csv to remove internal quotes completely – Alex – 2017-03-03T13:09:13.557

Thanks @Alex, i cleaned the data data and now it works well. Thanks for the sed. – Ajay K S – 2017-03-03T14:12:50.370

No problem, glad I was able to help you. I summarized everything in answer, so may be someone will find it helpful too – Alex – 2017-03-03T14:41:19.907

single quote is missing pd.read_csv( ...... engine='python') – Dipankar Nalui – 2018-11-23T10:01:13.597

Answers

2

Use in python pandas sep=',\s*' instead of sep=',\s+', it will make space(s) optional after each comma:

file1 = pd.read_csv('sample.txt',sep=',\s*',skipinitialspace=True,quoting=csv.QUOTE_ALL,engine='python')

Comma inside double quotes is Ok, it's allowed by rfc4180 standard.
As about " " inside of data values (such as "value" "13") - you will need to clean up source file before processing. If double quotes stay together as "" it shouldn't be an issue because it comply with CSV standard, it calls escaped double quotes, but if there is a space between double quotes then you need to clean it up

Use:

sed -r 's/\"\s+\"/\"\"/g' src.csv >cleared.csv 

before you feeding CSV to pandas. It will remove space between quotes or run

sed -r 's/\"\s+\"//g' src.csv >cleared.csv 

to remove internal quotes completely.

Alex

Posted 2017-03-03T10:20:22.937

Reputation: 5 606

Alex, where do you run sed -r 's/"\s+"//g' src.csv >cleared.csv? Is this a command line piece of code? – DataGirl – 2018-04-20T20:29:26.977

@DataGirl Yes it should be run from command line, in terminal on Unix base system or if you on Windows then you need to install cygwin to use "sed" program – Alex – 2018-07-18T14:46:10.013

sep=',\s*' seems to break using quotechar='"', quoting=csv.QUOTE_ALL – Frobbit – 2018-09-24T21:30:33.457

single quote is missing pd.read_csv( ...... engine='python') – Dipankar Nalui – 2018-11-23T09:59:43.460

Fixed. Tnx Dipankar – Alex – 2018-12-31T11:27:20.227

2

No need to preprocess csv file, just use engine type python :

dataset = pd.read_csv('sample.csv', sep=',', engine='python')

Deepak

Posted 2017-03-03T10:20:22.937

Reputation: 121

That didn't work for me, I'm still seeing single quotes in my dataframe. – DataGirl – 2018-04-20T20:32:03.720

1

@DataGirl try adding skipinitials = True to your pd.read_csv(...). As you can see here: https://stackoverflow.com/a/37076344/3286178

– rwenz3l – 2018-06-15T09:16:15.220

As pointed out by @rwenz3l, but the syntax is dataset = pd.read_csv('sample.csv', sep=',', skipinitialspace=True) – yoonghm – 2018-09-18T15:28:40.143

0

I "think" these two commands, with single quotes and double quotes between the file name, should work the same way (or not?):

pd.read_csv('filename.csv')

or this

pd.read_csv("filename.csv")

Marco

Posted 2017-03-03T10:20:22.937

Reputation: 1

yes, but this isn't related to the question – Andrew Louw – 2020-01-08T09:41:47.447