What's the quickest way to count the number of each character in a file?

120

I want to count the A's T's C's G's N's and "-" characters in a file, or every letter if needed, is there a quick Unix command to do this?

Kirt

Posted 2012-10-10T11:10:07.800

Reputation: 5 119

Why is every answer on this Q a "community wiki"? Haven't seen that before... – Xen2050 – 2016-04-03T18:03:01.263

56Counting bases in DNA strands? – Indrek – 2012-10-10T11:32:47.200

12I love this question, so many different approaches and tools used to solve the same problem. – Journeyman Geek – 2012-10-10T11:42:37.557

10Heh, this is borderline code-golf – Earlz – 2012-10-10T13:37:01.807

13if somone is interested in the windows powershell version: [System.IO.File]::ReadAllText("C:\yourfile.txt").ToCharArray() | Group-Object $_ | Sort Count -Descending – Guillaume86 – 2012-10-10T14:53:41.087

We were actually discussing a powershell way to do this on root access. Its simpler than I expected – Journeyman Geek – 2012-10-10T15:34:07.613

I'm more a .NET dev than a PowerShell specialist, so a more idiomatic way probably exists (should use Get-Content I guess) – Guillaume86 – 2012-10-10T16:18:21.400

4Ok I think I found the pure PS way: Get-Content "C:\eula.3082.txt" | % { $_.ToCharArray() } | Group-Object | Sort Count -Descending – Guillaume86 – 2012-10-10T16:33:32.240

Sort the file, and then find the first occurrence of each letter, take away the offsets to determine the count. – Simon O'Doherty – 2012-10-10T18:30:49.390

@SimonO'Doherty By all means, type up a working example and post it as an answer. The more the merrier :) – Indrek – 2012-10-10T19:14:32.267

@Indrek Sadly my linux is too rusty. :/ – Simon O'Doherty – 2012-10-10T20:58:43.300

1

This is a candidate for bucket sort

– artistoex – 2012-10-11T15:50:59.683

3@Guillaume86 - in PowerShell you can also use [char[]] instead of ToCharArray() so it's even more idiomatic: gc .\file.txt |% { [char[]]$_ } | group | sort count -des – Danko Durbić – 2012-10-12T11:55:58.503

@DankoDurbić thanks, didn't know how to cast :). Side note for people comparing to the Unix variants: the last part (sort count -des) is just a bonus to order results by frequency – Guillaume86 – 2012-10-12T12:06:56.757

Answers

136

If you want some real speed:

echo 'int cache[256],x,y;char buf[4096],letters[]="tacgn-"; int main(){while((x=read(0,buf,sizeof buf))>0)for(y=0;y<x;y++)cache[(unsigned char)buf[y]]++;for(x=0;x<sizeof letters-1;x++)printf("%c: %d\n",letters[x],cache[letters[x]]);}' | gcc -w -xc -; ./a.out < file; rm a.out;

Is an incredibly fast pseudo-one-liner.
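
For readability, here is roughly the same technique written out as an ordinary C source file, with the headers spelled out and the counters widened to long (a sketch under my own naming, not the exact code the one-liner pipes to gcc):

#include <stdio.h>
#include <unistd.h>

/* Histogram of byte values seen on stdin, indexed by unsigned char.
   long is used here in case the counts exceed what int can hold. */
static long counts[256];

int main(void)
{
    const char letters[] = "tacgn-";   /* characters to report, as in the one-liner */
    char buf[4096];
    ssize_t n;

    /* Read stdin in 4K chunks and bump one counter per byte. */
    while ((n = read(0, buf, sizeof buf)) > 0)
        for (ssize_t i = 0; i < n; i++)
            counts[(unsigned char)buf[i]]++;

    /* Report only the characters of interest. */
    for (size_t i = 0; letters[i] != '\0'; i++)
        printf("%c: %ld\n", letters[i], counts[(unsigned char)letters[i]]);

    return 0;
}

Compile it with something like gcc -O2 count.c -o count and run ./count < file to get the same style of output as the one-liner.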

A simple test shows that on my Core i7 CPU 870 @ 2.93GHz it counts at just over 600MB/s:

$ du -h bigdna 
1.1G    bigdna

time ./a.out < bigdna 
t: 178977308
a: 178958411
c: 178958823
g: 178947772
n: 178959673
-: 178939837

real    0m1.718s
user    0m1.539s
sys     0m0.171s

Unlike solutions involving sorting, this one runs in constant (4K) memory, which is very useful if your file is far larger than your RAM.

And, of course with a little bit of elbow grease, we can shave off 0.7 seconds:

echo 'int cache[256],x,buf[4096],*bp,*ep;char letters[]="tacgn-"; int main(){while((ep=buf+(read(0,buf,sizeof buf)/sizeof(int)))>buf)for(bp=buf;bp<ep;bp++){cache[(*bp)&0xff]++;cache[(*bp>>8)&0xff]++;cache[(*bp>>16)&0xff]++;cache[(*bp>>24)&0xff]++;}for(x=0;x<sizeof letters-1;x++)printf("%c: %d\n",letters[x],cache[letters[x]]);}' | gcc -O2 -xc -; ./a.out < file; rm a.out;

Nets just over 1.1GB/s finishing in:

real    0m0.943s
user    0m0.798s
sys     0m0.134s
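
To make the trick in that version explicit: each 32-bit word read from the buffer is split back into its four bytes with shifts and masks, so the hot loop touches the buffer once per word instead of once per byte. Below is a minimal, self-contained illustration of just that extraction step (my own toy example, not the author's code); like the one-liner, it assumes int is 32 bits wide, which a comment below also points out:

#include <stdio.h>

int main(void)
{
    /* One 32-bit word holding the byte values of 'a', 'g', 'n', '-'
       (0x61, 0x67, 0x6E, 0x2D from lowest to highest byte). */
    unsigned int word = 0x2D6E6761u;
    unsigned long cache[256] = {0};

    cache[word & 0xffu]++;            /* lowest byte  -> 'a' */
    cache[(word >> 8) & 0xffu]++;     /* second byte  -> 'g' */
    cache[(word >> 16) & 0xffu]++;    /* third byte   -> 'n' */
    cache[(word >> 24) & 0xffu]++;    /* highest byte -> '-' */

    printf("a:%lu g:%lu n:%lu -:%lu\n",
           cache['a'], cache['g'], cache['n'], cache['-']);
    return 0;
}

One caveat with the full version above: because the read() length is divided by sizeof(int), a final chunk whose size is not a multiple of 4 silently drops its last few bytes, so the counts can be off by up to three characters at the end of the file.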

For comparison, I tested some of the other solutions on this page which seemed to have some kind of speed promise.

The sed/awk solution made a valiant effort, but died after 30 seconds. With such a simple regex, I expect this to be a bug in sed (GNU sed version 4.2.1):

$ time sed 's/./&\n/g' bigdna | awk '!/^$/{a[$0]++}END{for (i in a)print i,a[i];}' 
sed: couldn't re-allocate memory

real    0m31.326s
user    0m21.696s
sys     0m2.111s

The Perl method seemed promising as well, but I gave up after running it for 7 minutes.

time perl -e 'while (<>) {$c{$&}++ while /./g} print "$c{$_} $_\n" for keys %c' < bigdna 
^C

real    7m44.161s
user    4m53.941s
sys     2m35.593s

Dave

Posted 2012-10-10T11:10:07.800

Reputation: 361

Firstly, there's an assumption here that character=byte. Probably OK. Secondly, would it be faster to avoid updating counts when characters are not in the target set? Would cost a lookup to see if that's the case, and a conditional on the result. Thirdly, mmap the input? – mc0e – 2015-02-19T17:21:59.227

1+1 For a sane solution when it's lots of data, and not just a handful of bytes. The files are in the disk cache though, aren't they? – Daniel Beck – 2012-10-10T18:24:13.797

Yes, they are. Those are 2nd run times. Still, even with my best eviction efforts, I only saw the times go up By 0.3s. – Dave – 2012-10-10T18:40:54.943

2The neat thing is that it has a complexity of O(N) in processing and O(1) in memory. The pipes usually have O(N log N) in processing (or even O(N^2)) and O(N) in memory. – Martin Ueding – 2012-10-10T19:54:04.873

73You are stretching the definition of "command line" quite a bit, though. – gerrit – 2012-10-10T20:42:01.527

Yes; Yes I am ;) This happens to be one of few instances where the data tends to be very large, and a tight inner loop yields huge bandwidth increases. – Dave – 2012-10-10T21:09:49.297

11

Epic bending of the question's requirements - I approve ;p. http://superuser.com/a/486037/10165 <- someone ran benchmarks, and this is the fastest option.

– Journeyman Geek – 2012-10-11T00:34:34.357

2+1 I appreciate me some good use of C in the right places. – Jeff Ferland – 2012-10-11T07:06:22.037

1This is very nice and shows that depending on the data investing a bit into the program is well worth the effort. – Alexander – 2012-10-11T12:06:23.570

1+1 That's a great answer: Creative, unexpected, humorous and efficient. – artistoex – 2012-10-11T15:54:50.273

This is not much faster than http://superuser.com/a/485938/102905

– adavid – 2012-10-12T10:15:27.470

@Dave your counter array is int, not long. May be too small. – Thorbjørn Ravn Andersen – 2012-10-12T10:52:55.773

1@adavid: 1.1GB/s versus 20MB/s is not much faster? I see a factor of 55 between the two solutions. – Alexander – 2012-10-12T11:08:40.683

Just a note: I think that the second solution needs int to be 32-bit. – Flow – 2012-10-14T11:24:45.150

119

grep -o foo.text -e A -e T -e C -e G -e N -e -|sort|uniq -c

Will do the trick as a one-liner. A little explanation is needed though.

grep -o foo.text -e A -e T -e C -e G -e N -e - greps the file foo.text, with one -e pattern for each character you want to search for (A, T, C, G, N and the - character). It also prints each match on its own line.

sort sorts it in order. This sets the stage for the next tool.

uniq -c counts the duplicate consecutive occurrences of any line. In this case, since we have a sorted list of characters, we get a neat count of each of the characters we grepped out in the first step.

If foo.txt contained the string GATTACA-, this is what I'd get from this set of commands:

[geek@atremis ~]$ grep -o foo.text -e A -e T -e C -e G -e N -e -|sort|uniq -c
      1 -
      3 A
      1 C
      1 G
      2 T

Journeyman Geek

Posted 2012-10-10T11:10:07.800

Reputation: 119 122

8Bloody unix magic! :D – Pitto – 2012-10-10T14:30:46.397

XD. That it is. There is no better way to handle loads of plain text than good old unix command line tools IMO. There's a reason I have them installed on every box I can, even the windows ones – Journeyman Geek – 2012-10-10T14:49:02.210

27if there are only CTAG- characters in your files, the regexp itself becomes pointless, right? grep -o . | sort | uniq -c would work equally well, afaik. – sylvainulg – 2012-10-10T14:55:21.597

You have a point there. – Journeyman Geek – 2012-10-10T15:08:41.420

You talk about cygwin, right? – Pitto – 2012-10-10T15:47:08.113

7+1 I've been using grep for 25 years and didn't know about -o. – LarsH – 2012-10-10T19:28:30.217

9@JourneymanGeek: The problem with this is that it generates a lot of data that is then forwarded to sort. It would be cheaper to let a program parse each character. See Dave's answer for a O(1) instead O(N) memory complexity answer. – Martin Ueding – 2012-10-10T19:52:22.940

2@Pitto Native Windows builds of coreutils are widely available - just ask Google or somesuch – OrangeDog – 2012-10-10T20:08:17.657

1@pitto: GOW / busybox-windows / unxutils /gnuwin32 - I use the first on most of my systems, but take your pick. – Journeyman Geek – 2012-10-11T00:02:49.747

Since they're all single characters, even if there were other characters in the file you could use grep -o '[ACGT-]' file | ... – Kevin – 2012-10-11T00:31:38.250

I adopted this from a similar set of pipes that I had used for finding certain strings from a large CSV file - the separation of -e values was likely because I was switching between various combinations of strings when the original script was written - I might be able to get away with leaving out the specific search criteria completely if I know the file only consists of characters I want to count. – Journeyman Geek – 2012-10-11T00:39:23.647

Minor point, but since you're just searching for literals fgrep is a little more appropriate; furthermore, it allows you to combine the -e params into a single string. fgrep -o $'A\nT\nC\nG\nN\n-' foo.txt | sort | uniq -c – kojiro – 2012-10-11T04:15:58.660

I was under the impression that these days fgrep, grep and egrep were the same command, aliased. – Journeyman Geek – 2012-10-11T06:19:37.140

1

@LarsH don't fret too much. gnu grep has snuck new features in over the years. You've only missed it for 11 or so =)

– Matt – 2012-10-11T17:08:07.547

@queueoverflow "it would be cheaper" - strongly depends on what the expensive part is here. If the bottleneck is how fast the original file can be read, then it is insignificant. – Thorbjørn Ravn Andersen – 2012-10-12T10:50:04.090

46

Try this one, inspired by @Journeyman's answer.

grep -o -E 'A|T|C|G|N|-' foo.txt | sort | uniq -c

The key is knowing about the -o option for grep. This splits the match up, so that each output line corresponds to a single instance of the pattern, rather than the entire line for any line that matches. Given this knowledge, all we need is a pattern to use, and a way to count the lines. Using a regex, we can create a disjunctive pattern that will match any of the characters you mention:

A|T|C|G|N|-

This means "match A or T or C or G or N or -". The manual describes various regular expression syntax you can use.

Now we have output that looks something like this:

$ grep -o -E 'A|T|C|G|N|-' foo.txt 
A
T
C
G
N
-
-
A
A
N
N
N

Our last step is to merge and count all the similar lines, which can simply be accomplished with a sort | uniq -c, as in @Journeyman's answer. The sort gives us output like this:

$ grep -o -E 'A|T|C|G|N|-' foo.txt | sort
-
-
A
A
A
C
G
N
N
N
N
T

Which, when piped through uniq -c, finally resembles what we want:

$ grep -o -E 'A|T|C|G|N|-' foo.txt | sort | uniq -c
      2 -
      3 A
      1 C
      1 G
      4 N
      1 T

Addendum: If you want to total the number of A, C, G, N, T, and - characters in a file, you can pipe the grep output through wc -l instead of sort | uniq -c. There's lots of different things you can count with only slight modifications to this approach.

crazy2be

Posted 2012-10-10T11:10:07.800

Reputation: 371

I really need to delve into the rabbitholes that are coreutils and regex. This is somewhat more elegant than mine for it ;p – Journeyman Geek – 2012-10-10T14:36:29.497

2

@JourneymanGeek: Learning regex is well worth the trouble, since it's useful for so many things. Just understand its limitations, and don't abuse the power by attempting to do things outside the scope of regex's capabilities, like trying to parse XHTML.

– crazy2be – 2012-10-10T15:17:26.307

20grep -o '[ATCGN-]' could be a bit more readable here. – sylvainulg – 2012-10-10T15:45:08.263

14

One liner counting all letters using Python:

$ python -c "import collections, pprint; pprint.pprint(dict(collections.Counter(open('FILENAME_HERE', 'r').read())))"

...producing a YAML-friendly output like this:

{'\n': 202,
 ' ': 2153,
 '!': 4,
 '"': 62,
 '#': 12,
 '%': 9,
 "'": 10,
 '(': 84,
 ')': 84,
 '*': 1,
 ',': 39,
 '-': 5,
 '.': 121,
 '/': 12,
 '0': 5,
 '1': 7,
 '2': 1,
 '3': 1,
 ':': 65,
 ';': 3,
 '<': 1,
 '=': 41,
 '>': 12,
 '@': 6,
 'A': 3,
 'B': 2,
 'C': 1,
 'D': 3,
 'E': 25}

It's interesting to see how, most of the time, Python can easily beat even bash in terms of clarity of code.

Giampaolo Rodolà

Posted 2012-10-10T11:10:07.800

Reputation: 131

11

Similar to Guru's awk method:

perl -e 'while (<>) {$c{$&}++ while /./g} print "$c{$_} $_\n" for keys %c'

user1686

Posted 2012-10-10T11:10:07.800

Reputation: 283 655

10

After using UNIX for a couple of years, you get very proficient at linking together a number of small operations to accomplish various filtering and counting tasks. Everyone has their own style: some like awk and sed, some like cut and tr. Here's the way I would do it:

To process a particular filename:

 od -a FILENAME_HERE | cut -b 9- | tr " " \\n | egrep -v "^$" | sort | uniq -c

or as a filter:

 od -a | cut -b 9- | tr " " \\n | egrep -v "^$" | sort | uniq -c

It works like this:

  1. od -a separates the file into ASCII characters.
  2. cut -b 9- eliminates the numeric address prefix that od puts at the start of each line.
  3. tr " " \\n converts the spaces between characters to newlines so there's one character per line.
  4. egrep -v "^$" gets rid of all the extra blank lines this creates.
  5. sort gathers instances of each character together.
  6. uniq -c counts the number of repeats of each line.

I fed it "Hello, world!" followed by a newline and got this:

  1 ,
  1 !
  1 d
  1 e
  1 H
  3 l
  1 nl
  2 o
  1 r
  1 sp
  1 w

David Schwartz

Posted 2012-10-10T11:10:07.800

Reputation: 58 310

9

The sed part being based on @Guru’s answer, here’s another approach using uniq, similar to David Schwartz’ solution.

$ cat foo
aix
linux
bsd
foo
$ sed 's/\(.\)/\1\n/g' foo | sort | uniq -c
4 
1 a
1 b
1 d
1 f
2 i
1 l
1 n
2 o
1 s
1 u
2 x

Claudius

Posted 2012-10-10T11:10:07.800

Reputation: 6 330

1Use [[:alpha:]] rather than . in sed to only match characters and not newlines. – Claudius – 2012-10-10T11:54:16.023

1[[:alpha:]] will fail if you're also trying to match stuff like -, which was mentioned in the question – Izkata – 2012-10-10T14:58:18.460

Correct. It might be nicer to add a second expression to sed to first filter out everything else and then explicitly match on the desired characters: sed -e 's/[^ATCGN-]//g' -e 's/\([ATCGN-]\)/\1\n/g' foo | sort | uniq -c. However, I don’t know how to get rid of the newlines there :\ – Claudius – 2012-10-10T15:09:19.730

7

You can combine grep and wc to do this:

grep -o 'character' file.txt | wc -w

grep searches the given file(s) for the specified text, and the -o option tells it to only print the actual matches (i.e. the characters you were looking for), rather than the default, which is to print each line the search text was found on.

wc prints the byte, word and line counts for each file, or in this case, the output of the grep command. The -w option tells it to count words, with each word being an occurrence of your search character. Of course, the -l option (which counts lines) would work as well, since grep prints each occurrence of your search character on a separate line.

To do this for a number of characters at once, put the characters in an array and loop over it:

chars=(A T C G N -)
for c in "${chars[@]}"; do echo -n $c ' ' && grep -o $c file.txt | wc -w; done

Example: for a file containing the string TGC-GTCCNATGCGNNTCACANN-, the output would be:

A  3
T  4
C  6
G  4
N  5
-  2

For more information, see man grep and man wc.


The downside of this approach, as user Journeyman Geek notes below in a comment, is that grep has to be run once for each character. Depending on how large your files are, this can incur a noticeable performance hit. On the other hand, when done this way it's a bit easier to quickly see which characters are being searched for, and to add/remove them, as they're on a separate line from the rest of the code.

Indrek

Posted 2012-10-10T11:10:07.800

Reputation: 21 756

3they'd need to repeat it per character they want... I'd add. I could swear there's a more elegant solution but it needs more poking ;p – Journeyman Geek – 2012-10-10T11:27:18.747

@JourneymanGeek Good point. One approach that springs to mind is putting the characters in an array and looping through it. I've updated my post. – Indrek – 2012-10-10T11:55:52.950

too complex IMO. Just use grep -e a -e t and so on. If you put it in an array and loop through it, wouldn't you have to run through the grep cycle once per character? – Journeyman Geek – 2012-10-10T11:58:24.793

@JourneymanGeek You're probably right. uniq -c also seems like a better way of getting nicely formatted output. I'm no *nix guru, the above is just what I managed to put together from my limited knowledge and some man pages :) – Indrek – 2012-10-10T12:04:19.530

So did I ;p, and one of my assignments last term involved sorting through about 5000 address book entries, and uniq made it a LOT easier. – Journeyman Geek – 2012-10-10T12:06:27.763

7

Using the sequence lines from 22hgp10a.txt, the timing difference between grep and awk on my system makes awk the way to go...

[Edit]: After having seen Dave's compiled solution, forget awk too, as his completed in ~0.1 seconds on this file for full case-sensitive counting.

# A nice large sample file.
wget http://gutenberg.readingroo.ms/etext02/22hgp10a.txt

# Omit the regular text up to the start `>chr22` indicator.
sed -ie '1,/^>chr22/d' 22hgp10a.txt

sudo test # Just get sudo set up so it doesn't ask for a password...

# ghostdog74 answered a question <linked below> about character frequency which
# gave me all case sensitive [ACGNTacgnt] counts in ~10 seconds.
sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" \
awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' 22hgp10a.txt

# The grep version given by Journeyman Geek took a whopping 3:41.47 minutes
# and yielded the case sensitive [ACGNT] counts.
sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" \
grep -o foo.text -e A -e T -e C -e G -e N -e -|sort|uniq -c

The case-insensitive version of ghostdog's completed in ~14 seconds.

The sed is explained in the accepted answer to this question.
The benchmarking is as in the accepted answer to this question.
The accepted answer by ghostdog74 was to this question.

Thell

Posted 2012-10-10T11:10:07.800

Reputation: 422

1You can s/cache[letters[x]]/cache[letters[x]]+cache[toupper(letters[x])] mine to make it case insensitive without affecting its speed. – Dave – 2012-10-10T19:25:15.413

6

I think any decent implementation avoids sort. But because it's also a bad idea to read everything several times, I think one could somehow generate a single stream that goes through a series of filters, one for each character, where that character is filtered out and the stream lengths are also calculated along the way.

time cat /dev/random | tr -d -C 'AGCTN\-' | head -c16M >dna.txt
real    0m5.797s
user    0m6.816s
sys     0m1.371s

$ time tr -d -C 'AGCTN\-' <dna.txt | tee >(wc -c >tmp0.txt) | tr -d 'A' | 
tee >(wc -c >tmp1.txt) | tr -d 'G' | tee >(wc -c >tmp2.txt) | tr -d 'C' | 
tee >(wc -c >tmp3.txt) | tr -d 'T' | tee >(wc -c >tmp4.txt) | tr -d 'N' | 
tee >(wc -c >tmp5.txt) | tr -d '\-' | wc -c >tmp6.txt && cat tmp[0-6].txt

real    0m0.742s
user    0m0.883s
sys     0m0.866s

16777216
13983005
11184107
8387205
5591177
2795114
0

The cumulative sums are then in tmp[0-6].txt, so work is still in progress.

There are merely 13 pipes in this approach, which amounts to less than 1 MB of memory.
Of course my favourite solution is:

time cat >f.c && gcc -O6 f.c && ./a.out
# then type your favourite c-program
real    0m42.130s

Aki Suihkonen

Posted 2012-10-10T11:10:07.800

Reputation: 111

This is a very nice use of tr. – adavid – 2012-10-12T10:12:18.520

4

I didn't know about uniq or about grep -o, but since my comments on @JourneymanGeek and @crazy2be had such support, maybe I should turn them into an answer of their own:

If you know there are only "good" characters (those you want to count) in your file, you can go for

grep . -o YourFile | sort | uniq -c

If only some characters must be counted and others not (i.e. separators)

grep -o '[ACTGN-]' YourFile | sort | uniq -c

The first one uses the regular expression wildcard ., which matches any single character. The second one uses a 'set of accepted characters', with no specific order, except that - must come last (A-C would be interpreted as 'any character between A and C'). Quotes are required in that case so that your shell does not try to expand the pattern to match single-character file names, if any exist (and produce a "no match" error if none do).

Note that sort also has a -u (--unique) flag so that it only reports things once, but no companion flag to count duplicates, so uniq is indeed mandatory.

sylvainulg

Posted 2012-10-10T11:10:07.800

Reputation: 573

- doesn't have to come last if you escape it with a backslash: '[A\-CTGN]' should work just fine. – Indrek – 2012-10-11T12:04:18.963

2

A silly one:

tr -cd ATCGN- | iconv -f ascii -t ucs2 | tr '\0' '\n' | sort | uniq -c
  • tr to delete (-d) all characters but (-c) ATCGN-
  • iconv to convert to UCS-2 (UTF-16 limited to 2 bytes per character), which pads every ASCII byte with a 0 byte,
  • another tr to translate those NUL characters to NL, so that every character ends up on its own line,
  • sort | uniq -c to count each unique line

That's an alternative to the non-standard (GNU) -o grep option.

sch

Posted 2012-10-10T11:10:07.800

Reputation: 287

Could you give a brief explanation of the commands and logic here? – Andrew Lambert – 2012-10-10T23:09:41.567

2

time $( { tr -cd ACGTD- < dna.txt | dd | tr -d A | dd | tr -d C | dd | tr -d G |
dd | tr -d T | dd | tr -d D | dd | tr -d - | dd >/dev/null; } 2>tmp ) &&
grep byte < tmp | sort -r -g | awk '{ if ((s-$0)>=0) { print s-$0} s=$0 }'

The output format is not the best...

real    0m0.176s
user    0m0.200s
sys     0m0.160s
2069046
2070218
2061086
2057418
2070062
2052266

Theory of operation:

  • $( { command | command } 2> tmp ) redirects the stderr of the stream to a temporary file.
  • dd copies stdin to stdout and reports the number of bytes passed through on stderr
  • tr -d filters out one character at a time
  • grep and sort pull the byte counts out of dd's output and put them in descending order
  • awk calculates the difference between consecutive counts, which is the number of occurrences of each character
  • sort is used only in the post-processing stage, to handle the uncertainty in the exit order of the dd instances

Speed seems to be 60 MB/s or more.

Aki Suihkonen

Posted 2012-10-10T11:10:07.800

Reputation: 111

Improvements: get rid of tmp? use 'paste' to print the letter involved? – Aki Suihkonen – 2012-10-11T09:09:48.333

1

Sample file:

$ cat file
aix
unix
linux

Command:

$ sed 's/./&\n/g' file | awk '!/^$/{a[$0]++}END{for (i in a)print i,a[i];}'
u 2
i 3
x 3
l 1
n 2
a 1

Guru

Posted 2012-10-10T11:10:07.800

Reputation: 19

-1 for lack of clarity, and for posting a one-liner without explanation. AFAIK, this could be a fork bomb – PPC – 2012-10-10T20:52:36.100

1

Combining a few others

chars='abcdefghijklmnopqrstuvwxyz-'
grep -o -i "[$chars]" foo | sort | uniq -c

Add | sort -nr to see the results in order of frequency.

Keith Wolters

Posted 2012-10-10T11:10:07.800

Reputation: 131

1

Short answer:

If circumstances permit, compare the sizes of files with small, known character counts to the size of one with no characters to get an offset, and then just count bytes.

Ah, but the tangled details:

Those are all ASCII characters, one byte each. Files of course have extra metadata prepended for a variety of stuff used by the OS and the app that created it. In most cases I would expect this to take up the same amount of space in every file, but I would try to maintain identical circumstances when you first test the approach, and then verify that you have a constant offset before not worrying about it. The other gotcha is that line breaks typically involve two ASCII whitespace characters, and any tabs or spaces would be one each. If you can be certain these will be present and there's no way to know how many beforehand, I'd stop reading now.

It might seem like a lot of constraints but if you can easily establish them, this strikes me as the easiest/best performing approach if you have a ton of these to look at (which seems likely if that's DNA). Checking a ton of files for length and subtracting a constant would be gobs faster than running grep (or similar) on every one.

If:

  • These are simple unbroken strings in pure text files
  • They are in identical file types created by the same vanilla non-formatting text-editor like Scite (pasting is okay as long as you check for spaces/returns) or some basic program somebody wrote

And Two Things That Might Not Matter But I Would Test With First

  • The file names are of equal length
  • The files are in the same directory

Try Finding The Offset By Doing the Following:

Compare an empty file, to one with a few easily human-counted characters, to one with a few more characters. If subtracting the empty file's size from each of the other two files gives you byte counts that match the character counts, you're done. Just check file lengths and subtract that empty amount. If you want to try to figure out multi-line files, most editors attach two special one-byte characters for line breaks, since one tends to be ignored by Microsoft, but in that case you'd have to at least grep for whitespace characters, in which case you might as well do it all with grep.
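
To make the final subtraction step concrete, here is a minimal sketch assuming the constant offset has already been established as described above; the METADATA_OFFSET value, the file arguments and the stat()-based approach are illustrative assumptions of mine, not something given in the answer:

#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical offset, measured once by comparing an empty file against
   files with a known, hand-counted number of characters. */
#define METADATA_OFFSET 0L

int main(int argc, char **argv)
{
    struct stat st;

    /* For each file named on the command line, report size minus offset as
       the character count. Only valid under the constraints listed above
       (single-byte ASCII, no line breaks, constant per-file overhead). */
    for (int i = 1; i < argc; i++) {
        if (stat(argv[i], &st) != 0) {
            perror(argv[i]);
            continue;
        }
        printf("%s: %ld characters\n", argv[i],
               (long)st.st_size - METADATA_OFFSET);
    }
    return 0;
}

Checking thousands of files this way costs one stat() call each, which is the point of the approach: no file contents are read at all.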

Erik Reppen

Posted 2012-10-10T11:10:07.800

Reputation: 111

1

Haskell way:

import Data.Ord
import Data.List
import Control.Arrow

main :: IO ()
main = interact $
  show . sortBy (comparing fst) . map (length &&& head) . group . sort

it works like this:

112123123412345
=> sort
111112222333445
=> group
11111 2222 333 44 5
=> map (length &&& head)
(5,'1') (4,'2') (3,'3') (2,'4') (1,'5')
=> sortBy (comparing fst)
(1,'5') (2,'4') (3,'3') (4,'2') (5,'1')
=> one can add some pretty-printing here
...

compiling and using:

$ ghc -O2 q.hs
[1 of 1] Compiling Main             ( q.hs, q.o )
Linking q ...
$ echo 112123123412345 | ./q
[(1,'\n'),(1,'5'),(2,'4'),(3,'3'),(4,'2'),(5,'1')]%       
$ cat path/to/file | ./q
...

not good for huge files maybe.

ht.

Posted 2012-10-10T11:10:07.800

Reputation: 101

1

Quick perl hack:

perl -nle 'while(/[ATCGN]/g){$a{$&}+=1};END{for(keys(%a)){print "$_:$a{$_}"}}'
  • -n: Iterate over input lines but don't print anything for them
  • -l: Strip or add line breaks automatically
  • while: iterate over all occurrences of your requested symbols in the current line
  • END: At the end, print results
  • %a: Hash where the values are stored

Characters which don't occur at all won't be included in the result.

MvG

Posted 2012-10-10T11:10:07.800

Reputation: 1 259