Sorting is not consistent using the Unix command 'sort'

1

0

I'm running the command:

zcat [File] | sed "1d" | sort -t $'\xE7' -k [field to be sorted] > [file].sorted

When I run this on File A, sorting on field 1, I get the following result:

11622400 , abe, def
11622401 , abe, def
11622402 , bbabe, def
11622403 , ddabe, def
11622404 , acdc, dere
11622405 , ddabe, bere
11622406 , abe, fgh
11622407 , adbed, ddee
11622408 , adbe, def
11622409 , abdde, def
1162240 , abe, deed
11622410, def,dede

But when I run the same command on file 2, sorting on field 2, I get this:

1162303, 116224
1162420, 1162240
11623062, 11622400
11623063, 11622401
11623064, 11622402
11623065, 11622403
11623066, 11622404
11623067, 11622405
11623068, 11622406
11623069, 11622407
11623070, 11622408
11623071, 11622409
1162421, 1162241
11623072, 1162410

Why is it not sorting in the same way? The first example looks wrong, the second line from the bottom should be at the top.

I'm trying to join these files with the Unix join command, but because the two files are not ordered in the same way, join misses lots of records.

What is the reason for this problem?

mattm123

Posted 2010-10-25T12:58:19.403

Reputation: 113

Note too that your 'join' requirement is buried too deep in the question: your stated issue is that "sort doesn't work", which is incorrect. In truth, it's just not working the way you would like it to work. – KevinDTimm – 2010-10-22T16:15:18.240

File A? file 2? The question is hopelessly underspecified, not about programming, and ill-stated: your title concerns only sort, but at the very end of the question text you mention join. And you've told us what you assume the problem to be without telling us what you are trying to achieve. – dmckee --- ex-moderator kitten – 2010-10-23T19:06:29.263

Answers

2

The reason you're getting these results is that your sort is not numeric: it compares the columns as strings, character by character.

There is a command line switch to sort (-n) that will sort numerically. This is what you want (type 'man sort' in your google bar).
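To see the difference between the two orderings, here is a small sketch. The three values are picked from the question's data for illustration only, and LC_ALL=C is set so the lexical result does not depend on the local collation rules.

```shell
# Illustrative sample values (assumption: any similar numbers would do).
values='11622409
1162241
11622410'

# Lexical order: strings are compared byte by byte, so "11622409"
# sorts before "1162241" because '0' < '1' at the eighth character.
lexical=$(printf '%s\n' "$values" | LC_ALL=C sort)

# Numeric order: -n compares the values as numbers.
numeric=$(printf '%s\n' "$values" | LC_ALL=C sort -n)

printf 'lexical:\n%s\n\nnumeric:\n%s\n' "$lexical" "$numeric"
```

Keep in mind that join expects its input in the lexical order, not the numeric one.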

KevinDTimm

Posted 2010-10-25T12:58:19.403

Reputation: 220

Thanks for that Kevin, but this is unfortunately not the answer, as the Unix join command requires the sort to be performed lexically. – None – 2010-10-22T14:58:10.683

I wonder if it has anything to do with the locale settings – None – 2010-10-22T14:58:31.587

post a subset of your input file (pastebin maybe?) and then migrate the question to superuser.com (for example) – KevinDTimm – 2010-10-22T16:10:45.803

2

There's something wrong with your question: you claim to use $'\xE7' as the field separator, but that byte doesn't appear in the data you posted. If this is really the command you ran and these are really your outputs, then both files were sorted based on the whole line: without an E7 byte each line is a single field, so every field 2 in file B is empty, all keys compare equal, and sort falls back to comparing entire lines. However, since your output from file B does look sorted on the second “, ”-separated field, I guess this is a bug in your question: either your code used a space or comma as separator, or your data contains the byte E7 where the data shown here has a comma and a space.

If you do pass a -t option to set a separator for sort, you must pass the same separator to join. In any case, you need to tell join which columns to join. For example:

<a.input sort -t $'\xE7' -k1 >a.sorted
<b.input sort -t $'\xE7' -k2 >b.sorted
join -1 1 -2 2 -t $'\xE7' a.sorted b.sorted >joined
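Here is a self-contained version of the same idea that can be pasted into a shell. It is only a sketch under some assumptions: the sample rows are made up from the question's data, a comma stands in for the E7 byte so the example is readable, and the keys are written as -k 1,1 and -k 2,2 so that sort compares only the join field instead of everything from that field to the end of the line.

```shell
# Use the same byte-wise collation for both sort and join.
export LC_ALL=C

tmp=$(mktemp -d)

# Made-up sample data; a comma replaces the E7 separator.
printf '%s\n' '11622409,abdde' '1162241,abe' '11622410,def' > "$tmp/a.input"
printf '%s\n' '11623071,11622409' '1162421,1162241' '11623072,11622410' > "$tmp/b.input"

# Sort each file on the field that will be used for the join.
sort -t , -k 1,1 "$tmp/a.input" > "$tmp/a.sorted"
sort -t , -k 2,2 "$tmp/b.input" > "$tmp/b.sorted"

# Join field 1 of the first file with field 2 of the second,
# passing the same separator that was given to sort.
joined=$(join -t , -1 1 -2 2 "$tmp/a.sorted" "$tmp/b.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

Each output line starts with the join field, followed by the remaining fields of the first file and then those of the second.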

Furthermore, given that “11622409 ,” appears before “1162240 ,” in your output from file A, it appears that you're running sort in a locale that produces results approaching human sorting rules (only approaching, because sort is not refined enough to match the fairly complicated rules used in serious typography). You will get less surprising results if you change your locale to one that produces results suitable for computer consumption. In practice, that means your LC_COLLATE setting should be C (or its synonym POSIX). (Any other locale tends to break scripts that use sort, though yours should in fact be ok.) Example:

$ cat a
11622409 , abdde, def
1162241 , abe, deed
11622410, def,dede
$ LC_COLLATE=en_US sort <a
11622409 , abdde, def
11622410, def,dede
1162241 , abe, deed
$ LC_COLLATE=C sort <a
11622409 , abdde, def
1162241 , abe, deed
11622410, def,dede

If you're running join in the same locale as sort, you should be ok. Note that sort produces lexically sorted output, not numerically sorted; but that is what you want as the input to join.
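If you are unsure whether an existing file is in the order join expects, sort -c can check it without rewriting anything. The sketch below is one possible way to do that; the helper name sorted_in_c_locale is hypothetical, the comma separator is an assumption based on the data as shown here, and LC_ALL is used instead of LC_COLLATE because LC_ALL, when set, overrides it.

```shell
# Hypothetical helper: succeeds if FILE is ordered on FIELD in the C locale.
# Usage: sorted_in_c_locale SEPARATOR FIELD FILE
sorted_in_c_locale() {
    LC_ALL=C sort -c -t "$1" -k "$2,$2" "$3" 2>/dev/null
}

tmp=$(mktemp -d)

# The line order produced by the en_US example above.
printf '%s\n' '11622409 , abdde, def' '11622410, def,dede' \
    '1162241 , abe, deed' > "$tmp/a.en_US"

# The line order produced by the C example above.
printf '%s\n' '11622409 , abdde, def' '1162241 , abe, deed' \
    '11622410, def,dede' > "$tmp/a.C"

if sorted_in_c_locale , 1 "$tmp/a.en_US"; then en_result=sorted; else en_result=unsorted; fi
if sorted_in_c_locale , 1 "$tmp/a.C"; then c_result=sorted; else c_result=unsorted; fi

printf 'en_US order: %s\nC order: %s\n' "$en_result" "$c_result"

rm -rf "$tmp"
```

A file that fails this check needs to be re-sorted in the locale that join will run in.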

Gilles 'SO- stop being evil'

Posted 2010-10-25T12:58:19.403

Reputation: 58 319

0

Try:

zcat [File] | sed "1d" | sort -n -t $'\xE7' -k [field to be sorted] > [file].sorted

jet

Posted 2010-10-25T12:58:19.403

Reputation: 1