Unable to get accurate record count of a csv file

I have a .csv file in which some field values are multi-line paragraphs, sometimes containing bullet points.

"STAT","ID","DESC"
"UPD", "1", "Updated"
"CHG", "2", "Changed"
"UPD", "3", "Updated.
Might have to update again"
"UPD", "4", "Updated.

 - once 
 - twice
 - thrice
"
"DEL", "5", "unknown"
"DEL", "6", "Deleted
Need to restore"

I need to count the number of records. Since I know the 2nd column is a unique id, I used awk as below, but it returns more records than the file actually has. The text above is of course dummy data, since I'm not allowed to share the original, but I've tried to reflect it as closely as I can.

 awk  '{print $2}' FS=","  sample.csv | wc -l 

I even printed out the first column using awk '{print $1}' to check just the first column's values, but the output also includes the beginnings of the wrapped lines inside the paragraphs.
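
To be concrete, against the dummy sample.csv above, a line-oriented count comes out roughly like this:

    # counts physical lines, including the continuation lines inside quoted fields
    awk -F, '{print $1}' sample.csv | wc -l
    # prints 14 for the sample above, not the 7 CSV records (header plus 6 data rows)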

Please let me know if any additional info is needed and I'll update the question.

vchollati

Posted 2015-01-12T23:04:17.593

Reputation: 3

What are you doing with the result? Must you use awk? If you just want the info, try nirsoft csvfileviewer – yosh m – 2015-01-12T23:07:39.410

I need to count the number of records through the unix command line, as part of automating a testing task (counting the records every time we get an updated csv from a vendor). I can open it in Excel and see the correct number once, but since I need to automate this process on the unix command line I'm trying awk. I don't have access to any other tools/packages except the ones that are part of the command line, like sed, grep and awk. – vchollati – 2015-01-12T23:11:49.147

Answers

A way to do this with awk is

awk -v RS=$'"\n"' 'END {print NR}' sample.csv
  • RS=$'"\n"' sets the Record Separator (which is newline by default) to the three-character sequence ", newline, ".  The $'...' quoting is bash (and ksh/zsh) syntax, and a Record Separator longer than one character requires an awk, such as gawk, that treats a multi-character RS as a regular expression; POSIX awk uses only the first character of RS.  This will cause your file to be decomposed into the following records (a sketch that prints this decomposition follows the list):

    1: "STAT","ID","DESC

    2: UPD", "1", "Updated

    3: CHG", "2", "Changed

    4: UPD", "3", "Updated.
        Might have to update again

    5: UPD", "4", "Updated.
       
        - once
        - twice
        - thrice
       

    6: DEL", "5", "unknown

    7: DEL", "6", "Deleted
        Need to restore"

    This assumes that there are no trailing blanks in the file. 

  • 'END {print NR}' reads the file until the end and then prints the record number – in other words, the number of records.
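
To see that decomposition for yourself, here is a rough variant of the same command that prints each record with its number.  It is written with a BEGIN block instead of -v, so it does not depend on bash's $'...' quoting, but it still assumes an awk (such as gawk) that accepts a multi-character RS:

    # print each record preceded by its record number (sketch only)
    awk 'BEGIN { RS = "\"\n\"" } { print NR ": " $0 }' sample.csv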

Text files are generally considered to consist of a sequence of lines, delimited by newline characters or character sequences.  And, generally, a “record” in a text file is considered to be one line.  But awk lets you specify a record separator other than newline.  Since the quote-newline-quote string appears between each pair of consecutive records in your file, specifying it as your record separator breaks the file into (very nearly) the records that you want.

But the record separator is like the wall between two rooms – it isn’t part of either one.  In normal awk processing, you see records that are lines without newline characters – they are removed.  Similarly, in my answer, the quote-newline-quote sequences are removed.  But, since there isn’t a record separator before the first record or after the last one, the very first and very last quote characters are not removed.

If you want to process the file one record at a time, this solution might not be good enough, because the first and last records are treated differently.  I agree (somewhat) with Glenn’s recommendation that, for any serious work, you should use a “proper CSV parser”.
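
If you do need to look at the records individually, one rough way to paper over that difference is to also strip the stray leading quote from the first record and the stray trailing quote (plus any trailing newline) from the last one.  This is only a sketch, again assuming gawk:

    # the sub() calls only match the stray quotes, so the other records are untouched
    awk 'BEGIN { RS = "\"\n\"" } { sub(/^"/, ""); sub(/"[ \t\n]*$/, ""); print NR ": " $0 }' sample.csv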

G-Man Says 'Reinstate Monica'

Posted 2015-01-12T23:04:17.593

Reputation: 6 509

I don't quite understand how eliminating the end quote and start quote for each record (except the 1st and last records) got the count right, but the solution nevertheless worked. – vchollati – 2015-01-14T18:35:35.973

Eliminating the end quote and start quote for each record (except the 1st and last records) isn’t the answer; it’s a side-effect of the answer. “I don’t understand how it works ...” ... See edits. – G-Man Says 'Reinstate Monica' – 2015-01-14T23:18:44.273

Explanation clarified it, much appreciated. – vchollati – 2015-01-17T20:58:21.377

I highly recommend picking a language with a proper CSV parser. I like ruby for this; it's very concise:

ruby -rcsv -e 'a = CSV.read(ARGV[0], :col_sep => ", "); puts a.length' file
7

I had to modify the column separators in the header line to add spaces.
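
If editing the file isn't an option, an untested alternative is to normalize the separators on the fly before parsing; this assumes none of the field values themselves contain the sequence ", ":

    # squeeze ", " down to "," so the header and the data rows use the same separator (sketch only)
    sed 's/", "/","/g' sample.csv | ruby -rcsv -e 'puts CSV.new($stdin).read.length'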

glenn jackman

Posted 2015-01-12T23:04:17.593

Reputation: 18 546

Okay, just checked online and Ruby might be installed by default on most linux distributions. Will check tomorrow and update the post. Thank you. – vchollati – 2015-01-13T03:17:26.807