Why does 'sort' ignore special characters, like the asterisk?

28

2

I thought that sort would sort common prefixes together but that doesn't always happen. Take this input for example:

AT0S*eightieths
AT0S*eyetooth's
AT*ad
AT*Ad
AT*AD
AT*Eydie
AT*eyed
ATF*adv
ATF*ATV
ATF*edify
ATF*Ediva
ATFKT*advocate
ATFKTNK*advocating
ATFKT*outfought
ATFKTS*advocates
ATHT*whitehead
ATHT*Whitehead
AT*id
AT*I'd
AT*Ito
AT*IUD
ATJ*adage
ATNXNS*attention's
ATNXNS*attenuation's
ATNXNS*autoignition's
AT*oat
AT*OD
AT*outweigh
AT*owed
ATP0K*idiopathic
ATP*adobe
ATT*wighted
ATT*witted
ATT*wooded
AT*UT
AT*Uta
AT*wowed
AT*Wyatt
ATX*atishoo

After sort, I'd expect all the AT* to end up in one chunk but when you run this data through sort, the output == input. Why is that? I'm not specifying any option to ignore non-alphabetic characters or anything. Just sort dict > out.

My version of sort comes from coreutils 8.5-1ubuntu3.

Aaron Digulla

Posted 2010-12-28T11:21:10.933

Reputation: 6 035

I can confirm I'm having the exact same problem under Debian, but with commas, and it's driving me crazy. How can you sort CSVs when it behaves like this by default? – Owl – 2019-12-25T18:24:25.233

@Owl Use the proper tool for the job: xsv or csvkit. – Aaron Digulla – 2020-01-07T14:32:52.200

@aaron digulla sort is the proper tool for the job; it's just that its default behaviour is non-standard on some distributions – Owl – 2020-01-07T17:34:35.957

Works for me. Maybe an alias somewhere? – Matthieu Cartier – 2010-12-28T12:18:54.430

Answers

18

sort --version-sort filename 

This preserves the natural order of numbers.
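A small sketch of this on a few lines from the question (the sample subset and the variable names are illustrative only). Version sort does not appear to apply the locale's collation rules to non-digit text, which would explain why the asterisk is no longer skipped and the AT* lines end up next to each other:

```shell
# Illustrative subset of the question's input.
sample='AT0S*eightieths
AT*ad
ATF*adv
AT*AD
AT*id'

# -V is the short form of --version-sort (GNU sort).
out=$(printf '%s\n' "$sample" | sort -V)
printf '%s\n' "$out"
```

The exact position of the AT* group relative to the other prefixes may differ from a plain byte-order sort, so check the output if that matters for your data.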

Reuben L.

Posted 2010-12-28T11:21:10.933

Reputation: 942

2 Works without needing to change the environment, +1 – Meredith – 2016-10-19T18:54:47.087

@AaronDigulla: I suspect that version sort compares strings in nearly the dumbest way possible, so it ignores the locale and only handles numbers in a special way. – JohnEye – 2017-05-11T14:03:10.077

Thank you that fixed my csv data. – Owl – 2019-12-25T18:27:25.833

4 +1 That works but why? There are only a few single digits in the text. – Aaron Digulla – 2010-12-28T14:56:20.697

23

Setting LC_ALL=C restored the traditional sorting order in my case. Package: coreutils Version: 8.5-1ubuntu3

export LC_ALL=C 
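A minimal sketch of the effect, using a few lines from the question (the subset and the variable names are illustrative). In the C locale, sort compares raw bytes, and since `*` (0x2A) comes before the digits and letters in ASCII, the AT* lines group together at the top:

```shell
# Illustrative subset of the question's input.
sample='AT0S*eightieths
AT*ad
ATF*adv
AT*AD
AT*id'

# Prefixing the command sets LC_ALL for this one sort process only.
sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

With this input the result is AT*AD, AT*ad, AT*id, AT0S*eightieths, ATF*adv, which also shows that uppercase sorts before lowercase in byte order.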

rahul_jk

Posted 2010-12-28T11:21:10.933

Reputation: 331

Works for me in Raspbian/Pixel... the sort 'annoyance' of ignoring the special chars was killing me... thanks. – ZEE – 2017-10-12T01:22:42.683

2 No need to export or even set the locale and possibly mess with something else. Just set it in the call to sort: LC_ALL=C sort. E.g. echo -e 'a\n*\n*b\nc' | LC_ALL=C sort. LC_ALL will not be changed outside the call to sort – Hashbrown – 2018-09-25T02:22:33.263

LANG=C also works. What puzzles me: LANG is set to en_US.UTF-8; why is * still treated specially? – Aaron Digulla – 2010-12-28T14:55:55.910

2 LC_COLLATE is the setting that's specific to sort, etc. – Paused until further notice. – 2010-12-28T15:49:09.230

1

Version: sort (GNU coreutils) 8.26

I do it inline:

LANG=C sort FILE

Or by function (changes the original file):

dosort() { local file="$1"; LANG=C sort -o "${file}.swp" -- "$file" && mv -- "${file}.swp" "$file" && cat -- "$file"; }

Regis Barbosa

Posted 2010-12-28T11:21:10.933

Reputation: 29

1

To provide a simple answer based on others' comments, that doesn't change your environment:

input_program | LC_COLLATE=C sort | output_program

or

LC_COLLATE=C sort < input_file > output_file

or combinations thereof.
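As a hedged sketch of the "doesn't change your environment" point: the assignment prefix is visible only to that one sort process. One caveat is that a non-empty LC_ALL in the environment takes precedence over LC_COLLATE, so this example also empties LC_ALL for the same command. The variable names and sample input are illustrative.

```shell
# Remember the shell's current collation setting (or note that it is unset).
before=${LC_COLLATE-unset}

# Both assignments apply to this sort invocation only.
# An empty LC_ALL is treated as unset, so LC_COLLATE=C takes effect.
result=$(printf 'a\n*\n*b\nc\n' | LC_ALL= LC_COLLATE=C sort)
printf '%s\n' "$result"

# The shell's own setting is untouched afterwards.
after=${LC_COLLATE-unset}
```

Here the lines beginning with `*` come first, followed by a and c, and `$before` and `$after` hold the same value.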

Walf

Posted 2010-12-28T11:21:10.933

Reputation: 254

1

It works as expected for me (on cygwin).

sort input > output results in

AT*AD
AT*Ad
AT*Eydie
AT*I'd
AT*IUD
AT*Ito
AT*OD
AT*UT
AT*Uta
AT*Wyatt
AT*ad
AT*eyed
AT*id
AT*oat
AT*outweigh
AT*owed
AT*wowed
AT0S*eightieths
AT0S*eyetooth's
ATF*ATV
ATF*Ediva
ATF*adv
ATF*edify
ATFKT*advocate
ATFKT*outfought
ATFKTNK*advocating
ATFKTS*advocates
ATHT*Whitehead
ATHT*whitehead
ATJ*adage
ATNXNS*attention's
ATNXNS*attenuation's
ATNXNS*autoignition's
ATP*adobe
ATP0K*idiopathic
ATT*wighted
ATT*witted
ATT*wooded
ATX*atishoo

Is sort aliased to something? Try \sort to bypass any alias.

Also

The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values

Nifle

Posted 2010-12-28T11:21:10.933

Reputation: 31 337

1 No alias. Must be some Ubuntu/Debian-specific feature. – Aaron Digulla – 2010-12-28T16:42:37.673

0

With GNU sort you can use --dictionary-order:

NAME
       sort - sort lines of text files

SYNOPSIS
       sort [OPTION]... [FILE]...
       sort [OPTION]... --files0-from=F

DESCRIPTION
       Write sorted concatenation of all FILE(s) to standard output.

       With no FILE, or when FILE is -, read standard input.

       Mandatory arguments to long options are mandatory for short options too.  Ordering options:

       -b, --ignore-leading-blanks
              ignore leading blanks

       -d, --dictionary-order
              consider only blanks and alphanumeric characters
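A short sketch of what -d does to input like the question's (the sample subset and variable names are illustrative, and LC_ALL=C is used here only to make the result reproducible). Because only blanks and alphanumeric characters are considered, the asterisk is skipped during comparison:

```shell
# Illustrative subset of the question's input.
sample='AT0S*eightieths
AT*ad
ATF*adv
AT*AD
AT*id'

# -d is the short form of --dictionary-order; '*' is not compared at all.
dict_sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort -d)
printf '%s\n' "$dict_sorted"
```

Note that ATF*adv lands between AT*AD and AT*ad here, so with -d the AT* lines are not kept in one chunk.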

user187214

Posted 2010-12-28T11:21:10.933

Reputation: