In UTF-8 collation, why is 11- less than 1-?

7

I found that the sort result differs between an ASCII locale and a UTF-8 locale.

Source file test:

1-
11-
1-a
11-a

Sort using ASCII:

$ LANG=en_US.ascii sort test
1-
1-a
11-
11-a

And using UTF-8:

$ LANG=en_US.utf8 sort test
1-
11-
11-a
1-a

I find this counter-intuitive; it is not dictionary order.

Isn't the character '-' (U+002D) always less than [0-9] (U+0030 to U+0039)? What is the general rule in UTF-8 collation?

And how can I bypass it on Linux, so that - sorts before [0-9] while other characters keep their UTF-8 collation order? (It should affect the results of ls --sort, sort, etc.)

Xiè Jìléi

Posted 2011-01-01T13:32:38.110

Reputation: 14 766

@grawity I see this on gmail when I open zip files. I see this in Win7 with images: 11, 12, 13, ..., 19, 1. – Wolfpack'08 – 2014-07-23T17:23:03.187

Where precisely are you seeing this? With sort 8.5 from GNU coreutils, "1-" always comes before "11-", with any locale. – user1686 – 2011-01-01T15:17:59.330

It's my mistake. I had truncated the strings. I have changed the example; please try again. – Xiè Jìléi – 2011-01-01T18:39:12.703

Answers

6

The minus sign is ignored in the first pass, so the first pass compares the strings as 1, 11, 1a, and 11a. Since 1 < a, you get 11a < 1a and thus 11-a < 1-a.

- is a variable collation element, meaning that you/the implementor can choose to ignore it. The glibc implementation apparently does so. In practice, most punctuation is affected by this behavior.

You can read up on the gory details in the Unicode Collation Algorithm, modulo how glibc implements it.
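To make the two-pass idea concrete, here is a rough Python sketch. It is not glibc's actual algorithm; it only assumes that punctuation is dropped when building the first-pass key and that the raw string breaks ties. The helper name two_pass_key is made up for this illustration.

```python
def two_pass_key(s):
    # Hypothetical helper, not a glibc or Unicode API.
    # Pass 1: compare with punctuation (anything non-alphanumeric) ignored.
    primary = "".join(ch for ch in s if ch.isalnum())
    # Pass 2: fall back to the raw string only when pass 1 is a tie.
    return (primary, s)

lines = ["1-", "11-", "1-a", "11-a"]
result = sorted(lines, key=two_pass_key)
print(result)  # ['1-', '11-', '11-a', '1-a']
```

The first-pass keys are 1, 11, 1a and 11a, which order as 1 < 11 < 11a < 1a, so the output matches the en_US.utf8 result in the question.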

Peter Eisentraut

Posted 2011-01-01T13:32:38.110

Reputation: 6 330

Then, is there any glibc configuration to disable this ignoring? – Xiè Jìléi – 2011-01-02T17:53:45.613

Not that I'm aware of. – Peter Eisentraut – 2011-01-02T19:05:17.083

0

As explained by Peter Eisentraut, this is because the Unicode sorting algorithm, as glibc implements it, ignores - in the first comparison pass.

The only way around this is to define your own locale, with different collation (sorting rules). This is however rather non-trivial. Also, it would give you a system with unusual sorting rules, which may cause problems with other software.

So realistically, you'll either have to switch your locale to ASCII (if you don't need Unicode characters), or sort using a program where you can configure the sorting rules directly.
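As one example of the second option, a small Python script can sort the same lines without consulting the locale at all. This is only a sketch: Python's default string comparison goes by code point, which for these ASCII-only lines gives the same order as the ASCII-locale sort in the question. The list literal stands in for the contents of the test file.

```python
# Stand-in for the lines of the "test" file from the question.
lines = ["1-", "11-", "1-a", "11-a"]

# '-' is U+002D and '0' is U+0030, so '-' sorts before every digit.
assert ord("-") < ord("0")

# sorted() on str compares code points; no locale collation is involved.
by_code_point = sorted(lines)
print(by_code_point)  # ['1-', '1-a', '11-', '11-a']
```

This leaves the system locale untouched, so other software keeps its usual collation.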

sleske

Posted 2011-01-01T13:32:38.110

Reputation: 19 887