120
43
I want to count the A's T's C's G's N's and "-" characters in a file, or every letter if needed, is there a quick Unix command to do this?
136
If you want some real speed:
echo 'int cache[256],x,y;char buf[4096],letters[]="tacgn-"; int main(){while((x=read(0,buf,sizeof buf))>0)for(y=0;y<x;y++)cache[(unsigned char)buf[y]]++;for(x=0;x<sizeof letters-1;x++)printf("%c: %d\n",letters[x],cache[letters[x]]);}' | gcc -w -xc -; ./a.out < file; rm a.out;
Is an incredibly fast pseudo-one-liner.
A simple test shows that on my Core i7 CPU 870 @ 2.93GHz it counts at just over 600MB/s:
$ du -h bigdna
1.1G bigdna
time ./a.out < bigdna
t: 178977308
a: 178958411
c: 178958823
g: 178947772
n: 178959673
-: 178939837
real 0m1.718s
user 0m1.539s
sys 0m0.171s
Unlike solutions involving sorting, this one runs in constant (4K) memory, which is very useful if your file is far larger than your RAM.
And, of course with a little bit of elbow grease, we can shave off 0.7 seconds:
echo 'int cache[256],x,buf[4096],*bp,*ep;char letters[]="tacgn-"; int main(){while((ep=buf+(read(0,buf,sizeof buf)/sizeof(int)))>buf)for(bp=buf;bp<ep;bp++){cache[(*bp)&0xff]++;cache[(*bp>>8)&0xff]++;cache[(*bp>>16)&0xff]++;cache[(*bp>>24)&0xff]++;}for(x=0;x<sizeof letters-1;x++)printf("%c: %d\n",letters[x],cache[letters[x]]);}' | gcc -O2 -xc -; ./a.out < file; rm a.out;
Nets just over 1.1GB/s finishing in:
real 0m0.943s
user 0m0.798s
sys 0m0.134s
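The constant-memory idea behind the C programs above can be sketched in Python. This is only an illustration of the technique (a fixed-size read buffer plus a 256-entry count table), not a replacement for the compiled version, and it will be far slower. The names `count_bytes` and `report` are made up for this sketch. Python integers do not overflow, so the concern about `int` counters being too small does not apply here.

```python
import io

def count_bytes(stream, chunk_size=4096):
    """Return a 256-entry list of byte frequencies read from a binary stream.

    Memory use is bounded by chunk_size plus the 256 counters,
    no matter how large the input is.
    """
    counts = [0] * 256
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # empty read means end of input
            break
        for byte in chunk:     # iterating over bytes yields ints 0..255
            counts[byte] += 1
    return counts

def report(counts, letters="tacgn-"):
    """Pick out the counters for the characters of interest."""
    return {ch: counts[ord(ch)] for ch in letters}

if __name__ == "__main__":
    # In-memory sample; for a real file, open it with open(path, "rb").
    sample = io.BytesIO(b"gattaca-\n")
    for ch, n in report(count_bytes(sample)).items():
        print(f"{ch}: {n}")
```

Like the C version, this treats one byte as one character, which is fine for plain ASCII sequence data.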
For comparison, I tested some of the other solutions on this page which seemed to have some kind of speed promise.
The sed/awk solution made a valiant effort, but died after 30 seconds. With such a simple regex, I expect this to be a bug in sed (GNU sed version 4.2.1):
$ time sed 's/./&\n/g' bigdna | awk '!/^$/{a[$0]++}END{for (i in a)print i,a[i];}'
sed: couldn't re-allocate memory
real 0m31.326s
user 0m21.696s
sys 0m2.111s
The perl method seemed promising as well, but I gave up after running it for 7 minutes:
time perl -e 'while (<>) {$c{$&}++ while /./g} print "$c{$_} $_\n" for keys %c' < bigdna
^C
real 7m44.161s
user 4m53.941s
sys 2m35.593s
Firstly, there's an assumption here that character=byte. Probably OK. Secondly, would it be faster to avoid updating counts when characters are not in the target set? Would cost a lookup to see if that's the case, and a conditional on the result. Thirdly, mmap the input? – mc0e – 2015-02-19T17:21:59.227
1+1 For a sane solution when it's lots of data, and not just a handful of bytes. The files are in the disk cache though, aren't they? – Daniel Beck – 2012-10-10T18:24:13.797
Yes, they are. Those are 2nd run times. Still, even with my best eviction efforts, I only saw the times go up By 0.3s. – Dave – 2012-10-10T18:40:54.943
2The neat thing is that it has a complexity of O(N) in processing and O(1) in memory. The pipes usually have O(N log N) in processing (or even O(N^2)) and O(N) in memory. – Martin Ueding – 2012-10-10T19:54:04.873
73You are stretching the definition of "command line" quite a bit, though. – gerrit – 2012-10-10T20:42:01.527
Yes; Yes I am ;) This happens to be one of few instances where the data tends to be very large, and a tight inner loop yields huge bandwidth increases. – Dave – 2012-10-10T21:09:49.297
11Epic bending of the question's requirements - I approve ;p. http://superuser.com/a/486037/10165 <- someone ran benchmarks, and this is the fastest option. – Journeyman Geek – 2012-10-11T00:34:34.357
2+1 I appreciate me some good use of C in the right places. – Jeff Ferland – 2012-10-11T07:06:22.037
1This is very nice and shows that depending on the data investing a bit into the program is well worth the effort. – Alexander – 2012-10-11T12:06:23.570
1+1 That's a great answer: Creative, unexpected, humorous and efficient. – artistoex – 2012-10-11T15:54:50.273
This is not much faster than http://superuser.com/a/485938/102905 – adavid – 2012-10-12T10:15:27.470
@Dave your counter array is int, not long. May be too small. – Thorbjørn Ravn Andersen – 2012-10-12T10:52:55.773
1@adavid: 1.1GB/s versus 20MB/s is not much faster? I see a factor of 55 between the two solutions. – Alexander – 2012-10-12T11:08:40.683
Just a note: I think that the second solution needs int to be 32-bit. – Flow – 2012-10-14T11:24:45.150
119
grep -o foo.text -e A -e T -e C -e G -e N -e -|sort|uniq -c
Will do the trick as a one liner. A little explanation is needed though.
grep -o foo.text -e A -e T -e C -e G -e N -e -
greps the file foo.text for the letters A, T, C, G and N and the character -, with one -e for each character you want to search for. The -o option also makes it print each match on its own line, one character per line.
sort
sorts it in order. This sets the stage for the next tool
uniq -c
counts the duplicate consecutive occurrences of any line. In this case, since we have a sorted list of characters, we get a neat count of how many times each of the characters we grepped out in the first step occurs.
If foo.text contained the string GATTACA-, this is what I'd get from this set of commands:
[geek@atremis ~]$ grep -o foo.text -e A -e T -e C -e G -e N -e -|sort|uniq -c
1 -
3 A
1 C
1 G
2 T
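To make the role of sort in front of uniq -c concrete, here is a small Python sketch of the same two steps. It is only a model of the behaviour described above, not how the coreutils are implemented; `uniq_c` and `sort_uniq_c` are names invented for this example.

```python
from itertools import groupby

def uniq_c(lines):
    """Count runs of identical consecutive lines, like `uniq -c`."""
    return [(len(list(run)), key) for key, run in groupby(lines)]

def sort_uniq_c(lines):
    """Sort first so equal lines become adjacent, like `sort | uniq -c`."""
    return uniq_c(sorted(lines))

if __name__ == "__main__":
    chars = list("GATTACA-")        # one "line" per character, as grep -o prints them
    for count, ch in sort_uniq_c(chars):
        print(f"{count:7d} {ch}")
```

Without the sort step, `uniq_c(list("GATTACA-"))` reports A three separate times, because uniq only merges lines that are next to each other.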
8Bloody unix magic! :D – Pitto – 2012-10-10T14:30:46.397
XD. That it is. There is no better way to handle loads of plain text than good old unix command line tools IMO. There's a reason I have them installed on every box I can, even the windows ones – Journeyman Geek – 2012-10-10T14:49:02.210
27if there is only CTAG- characters in your files, the regexp itself becomes pointless, right ? grep -o . | sort | uniq -c would work equally well, afaik. – sylvainulg – 2012-10-10T14:55:21.597
You have a point there. – Journeyman Geek – 2012-10-10T15:08:41.420
You talk about cygwin, right? – Pitto – 2012-10-10T15:47:08.113
7+1 I've been using grep for 25 years and didn't know about -o. – LarsH – 2012-10-10T19:28:30.217
9@JourneymanGeek: The problem with this is that it generates a lot of data that is then forwarded to sort. It would be cheaper to let a program parse each character. See Dave's answer for a O(1) instead O(N) memory complexity answer. – Martin Ueding – 2012-10-10T19:52:22.940
2@Pitto Native Windows builds of coreutils are widely available - just ask Google or somesuch – OrangeDog – 2012-10-10T20:08:17.657
1@pitto: GOW / busybox-windows / unxutils /gnuwin32 - I use the first on most of my systems, but take your pick. – Journeyman Geek – 2012-10-11T00:02:49.747
Since they're all single characters, even if there were other characters in the file you could use grep -o '[ACGT-]' file | ... – Kevin – 2012-10-11T00:31:38.250
I adopted this from a similar set of pipes that I had used for finding certain strings from a large CSV file - the separation of -e values was likely because I was switching between various combinations of strings when the original script was written - I might be able to get away with leaving out the specific search criteria completely if I know the file only consists of characters I want to count. – Journeyman Geek – 2012-10-11T00:39:23.647
Minor point, but since you're just searching for literals fgrep is a little more appropriate; furthermore, it allows you to combine the -e params into a single string. fgrep -o $'A\nT\nC\nG\nN\n-' foo.txt | sort | uniq -c – kojiro – 2012-10-11T04:15:58.660
I was under the impression these days fgrep grep and egrep were the same command, aliased. – Journeyman Geek – 2012-10-11T06:19:37.140
1@LarsH don't fret too much. gnu grep has snuck new features in over the years. You've only missed it for 11 or so =) – Matt – 2012-10-11T17:08:07.547
@queueoverflow "it would be cheaper" - strongly depends on what the expensive part is here. If the bottleneck is how fast the original file can be read, then it is insignificant. – Thorbjørn Ravn Andersen – 2012-10-12T10:50:04.090
46
Try this one, inspired by @Journeyman's answer.
grep -o -E 'A|T|C|G|N|-' foo.txt | sort | uniq -c
The key is knowing about the -o option for grep. This splits the match up, so that each output line corresponds to a single instance of the pattern, rather than the entire line for any line that matches. Given this knowledge, all we need is a pattern to use, and a way to count the lines. Using a regex, we can create a disjunctive pattern that will match any of the characters you mention:
A|T|C|G|N|-
This means "match A or T or C or G or N or -". The manual describes various regular expression syntax you can use.
Now we have output that looks something like this:
$ grep -o -E 'A|T|C|G|N|-' foo.txt
A
T
C
G
N
-
-
A
A
N
N
N
Our last step is to merge and count all the similar lines, which can simply be accomplished with a sort | uniq -c, as in @Journeyman's answer. The sort gives us output like this:
$ grep -o -E 'A|T|C|G|N|-' foo.txt | sort
-
-
A
A
A
C
G
N
N
N
N
T
Which, when piped through uniq -c, finally resembles what we want:
$ grep -o -E 'A|T|C|G|N|-' foo.txt | sort | uniq -c
2 -
3 A
1 C
1 G
4 N
1 T
Addendum: If you want to total the number of A, C, G, N, T, and - characters in a file, you can pipe the grep output through wc -l instead of sort | uniq -c. There are lots of different things you can count with only slight modifications to this approach.
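For readers who want to experiment with the pattern outside the shell, the same alternation can be tried with Python's `re` module. This is a sketch under the assumption that Python's regex syntax is close enough to grep -E for this simple pattern; `matches`, `total` and `per_character` are illustrative names, not part of any tool mentioned here.

```python
import re
from collections import Counter

# Same disjunctive pattern as in the grep -E command above.
PATTERN = re.compile(r"A|T|C|G|N|-")

def matches(text):
    """Every match on its own list entry, like the lines printed by grep -o."""
    return PATTERN.findall(text)

def total(text):
    """Number of matches, like piping grep -o through wc -l."""
    return len(matches(text))

def per_character(text):
    """Per-character counts, like piping grep -o through sort | uniq -c."""
    return sorted(Counter(matches(text)).items())

if __name__ == "__main__":
    sample = "ATCGN--AANNN\n"
    print(total(sample))
    for ch, n in per_character(sample):
        print(f"{n:7d} {ch}")
```

Characters outside the pattern, including the trailing newline, are simply never matched, so they do not show up in either count.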
I really need to delve into the rabbitholes that are coreutils and regex. This is somewhat more elegant than mine for it ;p – Journeyman Geek – 2012-10-10T14:36:29.497
2@JourneymanGeek: Learning regex is well worth the trouble, since it's useful for so many things. Just understand its limitations, and don't abuse the power by attempting to do things outside the scope of regexes' capabilities, like trying to parse XHTML. – crazy2be – 2012-10-10T15:17:26.307
20grep -o '[ATCGN-]' could be a bit more readable here. – sylvainulg – 2012-10-10T15:45:08.263
14
One liner counting all letters using Python:
$ python -c "import collections, pprint; pprint.pprint(dict(collections.Counter(open('FILENAME_HERE', 'r').read())))"
...producing a YAML friendly output like this:
{'\n': 202,
' ': 2153,
'!': 4,
'"': 62,
'#': 12,
'%': 9,
"'": 10,
'(': 84,
')': 84,
'*': 1,
',': 39,
'-': 5,
'.': 121,
'/': 12,
'0': 5,
'1': 7,
'2': 1,
'3': 1,
':': 65,
';': 3,
'<': 1,
'=': 41,
'>': 12,
'@': 6,
'A': 3,
'B': 2,
'C': 1,
'D': 3,
'E': 25}
It's interesting to see how, most of the time, Python can easily beat even bash in terms of clarity of code.
11
Similar to Guru's awk method:
perl -e 'while (<>) {$c{$&}++ while /./g} print "$c{$_} $_\n" for keys %c'
10
After using UNIX for a couple of years, you get very proficient at linking together a number of small operations to accomplish various filtering and counting tasks. Everyone has their own style: some like awk and sed, some like cut and tr. Here's the way I would do it:
To process a particular filename:
od -a FILENAME_HERE | cut -b 9- | tr " " \\n | egrep -v "^$" | sort | uniq -c
or as a filter:
od -a | cut -b 9- | tr " " \\n | egrep -v "^$" | sort | uniq -c
It works like this:
od -a separates the file into ASCII characters.
cut -b 9- eliminates the prefix od puts.
tr " " \\n converts the spaces between characters to newlines so there's one character per line.
egrep -v "^$" gets rid of all the extra blank lines this creates.
sort gathers instances of each character together.
uniq -c counts the number of repeats of each line.
I fed it "Hello, world!" followed by a newline and got this:
1 ,
1 !
1 d
1 e
1 H
3 l
1 nl
2 o
1 r
1 sp
1 w
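The effect of this pipeline can be imitated in a few lines of Python, which may help when checking the output by hand. This is a rough sketch, not a reimplementation of od: `OD_NAMES` is a deliberately small, hand-picked subset of the names od -a prints for non-printing characters, and `named_counts` is a name invented for this example.

```python
from collections import Counter

# Small subset of the names od -a uses; every other byte is shown as itself.
OD_NAMES = {0x00: "nul", 0x09: "ht", 0x0A: "nl", 0x0D: "cr", 0x20: "sp"}

def named_counts(data: bytes) -> Counter:
    """Count bytes, labelling a few special ones the way od -a does."""
    return Counter(OD_NAMES.get(byte, chr(byte)) for byte in data)

if __name__ == "__main__":
    for name, count in sorted(named_counts(b"Hello, world!\n").items()):
        print(f"{count:7d} {name}")
```

The counts agree with the listing above; the order of the lines may differ, because sort orders by the current locale while Python's `sorted` compares code points.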
9
The sed part being based on @Guru’s answer, here’s another approach using uniq, similar to David Schwartz’ solution.
$ cat foo
aix
linux
bsd
foo
$ sed 's/\(.\)/\1\n/g' foo | sort | uniq -c
4
1 a
1 b
1 d
1 f
2 i
1 l
1 n
2 o
1 s
1 u
2 x
1Use [[:alpha:]] rather than . in sed to only match characters and not newlines. – Claudius – 2012-10-10T11:54:16.023
1[[:alpha:]] will fail if you're also trying to match stuff like -, which was mentioned in the question – Izkata – 2012-10-10T14:58:18.460
Correct. It might be nicer to add a second expression to sed to first filter out everything else and then explicitly match on the desired characters: sed -e 's/[^ATCGN-]//g' -e 's/\([ATCGN-]\)/\1\n/g' foo | sort | uniq -c. However, I don’t know how to get rid of the newlines there :\ – Claudius – 2012-10-10T15:09:19.730
7
You can combine grep and wc to do this:
grep -o 'character' file.txt | wc -w
grep searches the given file(s) for the specified text, and the -o option tells it to only print the actual matches (i.e. the characters you were looking for), rather than the default, which is to print each line in which the search text was found.
wc prints the byte, word and line counts for each file, or in this case, the output of the grep command. The -w option tells it to count words, with each word being an occurrence of your search character. Of course, the -l option (which counts lines) would work as well, since grep prints each occurrence of your search character on a separate line.
To do this for a number of characters at once, put the characters in an array and loop over it:
chars=(A T C G N -)
for c in "${chars[@]}"; do echo -n "$c " && grep -o -e "$c" file.txt | wc -w; done
Example: for a file containing the string TGC-GTCCNATGCGNNTCACANN-, the output would be:
A 3
T 4
C 6
G 4
N 5
- 2
For more information, see man grep and man wc.
The downside of this approach, as user Journeyman Geek notes below in a comment, is that grep has to be run once for each character. Depending on how large your files are, this can incur a noticeable performance hit. On the other hand, when done this way it's a bit easier to quickly see which characters are being searched for, and to add/remove them, as they're on a separate line from the rest of the code.
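The trade-off described here, one scan of the data per character versus a single scan that counts everything, can be illustrated with a short Python sketch. It is not a benchmark and says nothing about grep's actual speed; `count_per_pass` and `count_single_pass` are hypothetical names used only for this illustration.

```python
from collections import Counter

CHARS = ["A", "T", "C", "G", "N", "-"]

def count_per_pass(text, chars=CHARS):
    """One full scan of the text per character, like running grep once per character."""
    return {c: text.count(c) for c in chars}

def count_single_pass(text, chars=CHARS):
    """A single scan that tallies every character, then picks the wanted ones."""
    tally = Counter(text)
    return {c: tally[c] for c in chars}

if __name__ == "__main__":
    sample = "TGC-GTCCNATGCGNNTCACANN-"
    for c, n in count_per_pass(sample).items():
        print(c, n)
```

Both functions return the same counts; the difference is only in how many times the input is read, which starts to matter once the file is large.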
3they'd need to repeat it per character they want... I'd add. I could swear there's a more elegant solution but it needs more poking ;p – Journeyman Geek – 2012-10-10T11:27:18.747
@JourneymanGeek Good point. One approach that springs to mind is putting the characters in an array and looping through it. I've updated my post. – Indrek – 2012-10-10T11:55:52.950
too complex IMO. Just use grep -e a -e t and so on. If you put it in an array and loop through it, wouldn't you have to run through the grep cycle once per character? – Journeyman Geek – 2012-10-10T11:58:24.793
@JourneymanGeek You're probably right. uniq -c also seems like a better way of getting nicely formatted output. I'm no *nix guru, the above is just what I managed to put together from my limited knowledge and some man pages :) – Indrek – 2012-10-10T12:04:19.530
So did I ;p, and one of my assignments last term involved sorting through about 5000 address book entries, and uniq made it a LOT easier. – Journeyman Geek – 2012-10-10T12:06:27.763
7
Using the sequence lines from 22hgp10a.txt, the timing difference between grep and awk on my system makes awk the way to go...
[Edit]: After having seen Dave's compiled solution, forget awk too, as his completed in ~0.1 seconds on this file for full case-sensitive counting.
# A nice large sample file.
wget http://gutenberg.readingroo.ms/etext02/22hgp10a.txt
# Omit the regular text up to the start `>chr22` indicator.
sed -ie '1,/^>chr22/d' 22hgp10a.txt
sudo test # Just get sudo setup to not ask for password...
# ghostdog74 answered a question <linked below> about character frequency which
# gave me all case sensitive [ACGNTacgnt] counts in ~10 seconds.
sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" \
awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' 22hgp10a.txt
# The grep version given by Journeyman Geek took a whopping 3:41.47 minutes
# and yielded the case sensitive [ACGNT] counts.
sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" \
grep -o foo.text -e A -e T -e C -e G -e N -e -|sort|uniq -c
The case insensitive version of ghostdog's completed in ~ 14 seconds.
The sed is explained in the accepted answer to this question.
The benchmarking is as in the accepted answer to this question.
The accepted answer by ghostdog74 was to this question.
1You can s/cache[letters[x]]/cache[letters[x]]+cache[toupper(letters[x])] mine to make it case insensitive without affecting its speed. – Dave – 2012-10-10T19:25:15.413
6
I think any decent implementation avoids sort. But because it's also bad idea to read everything 4 times, I think one could somehow generate a stream that goes through 4 filters, one for each character, which is filtered out and where the stream lengths are also somehow calculated.
time cat /dev/random | tr -d -C 'AGCTN\-' | head -c16M >dna.txt
real 0m5.797s
user 0m6.816s
sys 0m1.371s
$ time tr -d -C 'AGCTN\-' <dna.txt | tee >(wc -c >tmp0.txt) | tr -d 'A' |
tee >(wc -c >tmp1.txt) | tr -d 'G' | tee >(wc -c >tmp2.txt) | tr -d 'C' |
tee >(wc -c >tmp3.txt) | tr -d 'T' | tee >(wc -c >tmp4.txt) | tr -d 'N' |
tee >(wc -c >tmp5.txt) | tr -d '\-' | wc -c >tmp6.txt && cat tmp[0-6].txt
real 0m0.742s
user 0m0.883s
sys 0m0.866s
16777216
13983005
11184107
8387205
5591177
2795114
0
The cumulative sums are then in tmp[0-6].txt .. so work is still in progress
There are merely 13 pipes in this approach, which converts to less than 1 Mb of memory.
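The remaining step, turning the cumulative byte counts in tmp[0-6].txt into one count per character, is a matter of subtracting neighbouring values. A minimal Python sketch of that arithmetic follows; `per_stage_counts` is a name made up here, and the letter order "AGCTN-" is taken from the order of the tr -d stages in the pipeline above.

```python
def per_stage_counts(remaining, letters):
    """Convert 'bytes left after each deletion stage' into per-letter counts.

    remaining[0] is the size before any letter is deleted and remaining[i + 1]
    is the size after deleting letters[i], so each count is a difference of
    two neighbouring values.
    """
    return {
        letter: before - after
        for letter, before, after in zip(letters, remaining, remaining[1:])
    }

if __name__ == "__main__":
    # Values printed by `cat tmp[0-6].txt` in the run above.
    remaining = [16777216, 13983005, 11184107, 8387205, 5591177, 2795114, 0]
    for letter, count in per_stage_counts(remaining, "AGCTN-").items():
        print(letter, count)
```

For the run shown above this gives 2794211 A, 2798898 G, 2796902 C, 2796028 T, 2796063 N and 2795114 - characters, which add up to the 16777216 bytes in dna.txt.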
Of course my favourite solution is:
time cat >f.c && gcc -O6 f.c && ./a.out
# then type your favourite c-program
real 0m42.130s
This is a very nice use of tr. – adavid – 2012-10-12T10:12:18.520
4
I didn't know about uniq nor about grep -o, but since my comments on @JourneymanGeek and @crazy2be had such support, maybe I should turn it into an answer of its own:
If you know there are only "good" characters (those you want to count) in your file, you can go for
grep . -o YourFile | sort | uniq -c
If only some characters must be counted and others not (i.e. separators):
grep -o '[ACTGN-]' YourFile | sort | uniq -c
The first one uses the regular expression wildcard ., which matches any single character. The second one uses a 'set of accepted characters', with no specific order, except that - must come last (A-C is interpreted as 'any character between A and C'). Quotes are required in that case so that your shell does not try to expand that to check single-character files if any (and produce a "no match" error if none).
Note that "sort" also has a -u (unique) flag so that it only reports things once, but no companion flag to count duplicates, so uniq is indeed mandatory.
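The point about where the - sits inside the brackets is easy to check interactively. The sketch below uses Python's `re` module as a stand-in, which is an assumption: its character classes treat a trailing - the same way, but escaping rules differ between regex flavours, so the result for grep itself should be verified with grep. `matched` is just a helper name for this example.

```python
import re

def matched(pattern, text):
    """Return every single-character match of pattern in text."""
    return re.findall(pattern, text)

if __name__ == "__main__":
    text = "ABC-"
    # '-' between two characters defines a range: A, B and C all match.
    print(matched(r"[A-C]", text))
    # '-' at the end of the set is an ordinary character: B no longer matches.
    print(matched(r"[AC-]", text))
```

Putting the dash last keeps the set readable and avoids accidentally matching characters that merely fall inside a range.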
- doesn't have to come last if you escape it with a backslash: '[A\-CTGN]' should work just fine. – Indrek – 2012-10-11T12:04:18.963
2
A silly one:
tr -cd ATCGN- | iconv -f ascii -t ucs2 | tr '\0' '\n' | sort | uniq -c
tr to delete (-d) all characters but (-c) ATCGN-,
iconv to convert to ucs2 (UTF16 limited to 2 bytes) to add a 0 byte after every byte,
tr to translate those NUL characters to NL. Now every character is on its own line,
sort | uniq -c to count each unique line.
That's an alternative to the non-standard (GNU) -o grep option.
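The widening trick is easier to see on a few bytes. The Python sketch below mirrors the first three stages of the pipeline under one explicit assumption: it uses the "utf-16-le" codec so that the zero byte lands after each ASCII byte, whereas the byte order iconv picks for ucs2 can depend on the platform. `one_char_per_line` is a name invented for this example.

```python
def one_char_per_line(data: bytes) -> bytes:
    """Keep only ATCGN-, widen to 2 bytes per character, turn the NULs into newlines."""
    kept = bytes(b for b in data if b in b"ATCGN-")      # tr -cd ATCGN-
    widened = kept.decode("ascii").encode("utf-16-le")   # iconv -f ascii -t ucs2
    return widened.replace(b"\x00", b"\n")               # tr '\0' '\n'

if __name__ == "__main__":
    print("ACGT".encode("utf-16-le"))            # shows the zero byte after each letter
    print(one_char_per_line(b"GAT xx TACA-\n"))  # one kept character per line
```

From there, sort | uniq -c (or any line counter) finishes the job exactly as in the other answers.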
Could you give a brief explanation of the commands and logic here? – Andrew Lambert – 2012-10-10T23:09:41.567
2
time $( { tr -cd ACGTD- < dna.txt | dd | tr -d A | dd | tr -d C | dd | tr -d G |
dd | tr -d T | dd | tr -d D | dd | tr -d - | dd >/dev/null; } 2>tmp ) &&
grep byte < tmp | sort -r -g | awk '{ if ((s-$0)>=0) { print s-$0} s=$0 }'
The output format is not the best...
real 0m0.176s
user 0m0.200s
sys 0m0.160s
2069046
2070218
2061086
2057418
2070062
2052266
Theory of operation:
Speed seems to be 60MBps +
Improvements: get rid of tmp? use 'paste' to print the letter involved? – Aki Suihkonen – 2012-10-11T09:09:48.333
1
Sample file:
$ cat file
aix
unix
linux
Command:
$ sed 's/./&\n/g' file | awk '!/^$/{a[$0]++}END{for (i in a)print i,a[i];}'
u 2
i 3
x 3
l 1
n 2
a 1
-1 for lack of clarity, and for posting a one-liner without explanation. AFAIK, this could be a fork bomb – PPC – 2012-10-10T20:52:36.100
1
Combining a few others
chars='abcdefghijklmnopqrstuvwxyz-'
grep -o -i "[$chars]" foo|sort | uniq -c
Add | sort -nr to see the results in order of frequency.
1
Short answer:
If circumstances permit, compare file sizes of low character sets to one with no characters to get an offset and just count bytes.
Ah, but the tangled details:
Those are all ASCII characters. One byte per. Files of course have extra metadata prepended for a variety of stuff used by the OS and the app that created it. In most cases I would expect these to take up the same amount of space regardless of metadata but I would try to maintain identical circumstances when you first test the approach and then verify that you have a constant offset before not worrying about it. The other gotcha is that line-breaks typically involve two ASCII white space characters and any tabs or spaces would be one each. If you can be certain these will be present and there's no way to know how many beforehand, I'd stop reading now.
It might seem like a lot of constraints but if you can easily establish them, this strikes me as the easiest/best performing approach if you have a ton of these to look at (which seems likely if that's DNA). Checking a ton of files for length and subtracting a constant would be gobs faster than running grep (or similar) on every one.
If:
And Two Things That Might Not Matter But I Would Test With First
Try Finding The Offset By Doing the Following:
Compare an empty file to one with a few easily-human-counted characters to one with a few more characters. If subtracting the empty file from both of the other two files gives you byte counts that match character count, you're done. Check file lengths and subtract that empty amount. If you want to try to figure out multi-line files, most editors attach two special one-byte characters for line breaks since one tends to be ignored by Microsoft but you'd have to at least grep for white-space chars in which case you might as well do it all with grep.
1
Haskell way:
import Data.Ord
import Data.List
import Control.Arrow
main :: IO ()
main = interact $
show . sortBy (comparing fst) . map (length &&& head) . group . sort
it works like this:
112123123412345
=> sort
111112222333445
=> group
11111 2222 333 44 5
=> map (length &&& head)
(5,'1') (4,'2') (3,'3') (2,'4') (1,'5')
=> sortBy (comparing fst)
(1,'5') (2,'4') (3,'3') (4,'2') (5,'1')
=> one can add some pretty-printing here
...
compiling and using:
$ ghc -O2 q.hs
[1 of 1] Compiling Main ( q.hs, q.o )
Linking q ...
$ echo 112123123412345 | ./q
[(1,'\n'),(1,'5'),(2,'4'),(3,'3'),(4,'2'),(5,'1')]%
$ cat path/to/file | ./q
...
not good for huge files maybe.
1
Quick perl hack:
perl -nle 'while(/[ATCGN]/g){$a{$&}+=1};END{for(keys(%a)){print "$_:$a{$_}"}}'
-n: Iterate over input lines but don't print anything for them
-l: Strip or add line breaks automatically
while: iterate over all occurrences of your requested symbols in the current line
END: At the end, print results
%a: Hash where the values are stored
Characters which don't occur at all won't be included in the result.
Why is every answer on this Q a "community wiki"? Haven't seen that before... – Xen2050 – 2016-04-03T18:03:01.263
56Counting bases in DNA strands? – Indrek – 2012-10-10T11:32:47.200
12I love this question, so many different approaches and tools used to solve the same problem. – Journeyman Geek – 2012-10-10T11:42:37.557
10Heh, this is borderline code-golf – Earlz – 2012-10-10T13:37:01.807
13if someone is interested in the windows powershell version: [System.IO.File]::ReadAllText("C:\yourfile.txt").ToCharArray() | Group-Object $_ | Sort Count -Descending – Guillaume86 – 2012-10-10T14:53:41.087
We were actually discussing a powershell way to do this on root access. It's simpler than I expected – Journeyman Geek – 2012-10-10T15:34:07.613
I'm more a .NET dev than a powershell specialist so a more idiomatic way probably exists (should use Get-Content I guess) – Guillaume86 – 2012-10-10T16:18:21.400
4Ok I think I found the pure PS way: Get-Content "C:\eula.3082.txt" | % { $_.ToCharArray() } | Group-Object | Sort Count -Descending – Guillaume86 – 2012-10-10T16:33:32.240
Sort the file, and then find the first occurrence of each letter, take away the offsets to determine the count. – Simon O'Doherty – 2012-10-10T18:30:49.390
@SimonO'Doherty By all means, type up a working example and post it as an answer. The more the merrier :) – Indrek – 2012-10-10T19:14:32.267
@Indrek Sadly my linux is too rusty. :/ – Simon O'Doherty – 2012-10-10T20:58:43.300
1This is a candidate for bucket sort – artistoex – 2012-10-11T15:50:59.683
3@Guillaume86 - in PowerShell you can also use [char[]] instead of ToCharArray() so it's even more idiomatic: gc .\file.txt |% { [char[]]$_ } | group | sort count -des – Danko Durbić – 2012-10-12T11:55:58.503
@DankoDurbić thanks, I didn't know how to cast :). Side note to people comparing to unix variants: the last part (sort count -des) is just a bonus to order results by frequency – Guillaume86 – 2012-10-12T12:06:56.757