All the k-mers/n-grams

21

1

Intro

We have had histograms and counting, but not listing all of them.

Every year, Dyalog Ltd. holds a student competition. The challenge there is to write good APL code. This is a language agnostic edition of this year's sixth problem.

I have explicit permission to post this challenge here from the original author of the competition. Feel free to verify by following the provided link and contacting the author.

Problem

The term k-mer typically refers to all the possible substrings of length k that are contained in a string. In computational genomics, k-mers refer to all the possible subsequences (of length k) from a read obtained through DNA Sequencing. Write a function/program that takes a string and k (the substring length) and returns/outputs a vector of the k-mers of the original string.

Examples

[4,"ATCGAAGGTCGT"]["ATCG","TCGA","CGAA","GAAG","AAGG","AGGT","GGTC","GTCG","TCGT"]

k > string length? Return nothing/any empty result:
[4,"AC"][] or "" or [""]

Adám

Posted 2017-05-01T06:46:36.740

Reputation: 37 779

4Does the order of the output matter? When a substring occurs multiple times, should it be repeated in the output? – feersum – 2017-05-01T08:02:17.310

1

Can I return a string of the required substrings separated by newlines instead of an array of strings, like this?

– Leaky Nun – 2017-05-01T08:39:35.683

May we also input and output the string as an array of characters (like ['A', 'T', 'C', 'G'] instead of "ATCG"? – Adnan – 2017-05-01T08:39:56.723

Are Dyalog APL answers allowed in this PPCG challenge (because the challenge is also hosted by Dyalog)? – user41805 – 2017-05-01T09:14:49.773

Is it acceptable to return an empty string for the empty result while returning an array of strings for the non-empty result? – Jonathan Allan – 2017-05-01T09:28:26.040

@JonathanAllan Yes. – Adám – 2017-05-01T12:22:41.883

1@feersum Order matters, and repetitions should be repeated. This is just like a sliding window. – Adám – 2017-05-01T12:29:31.687

@LeakyNun Yes, if that would be normal for your language. – Adám – 2017-05-01T12:30:05.743

@Adnan Yes, if that would be normal for your language. – Adám – 2017-05-01T12:30:26.593

@KritixiLithos Yes, because this is [tag:code-golf] while that challenge is for quality code. – Adám – 2017-05-01T12:30:54.787

Related (sliding window on a list) – FlipTack – 2017-12-27T18:16:51.327

Answers

15

Jelly, 1 byte

Jelly has a single byte dyadic atom for this very operation

Try it online! (the footer splits the resulting list with newlines, to avoid a mushed representation being printed.)

Jonathan Allan

Posted 2017-05-01T06:46:36.740

Reputation: 67 804

1Somehow the OP must have known... – Leaky Nun – 2017-05-01T07:35:40.227

1@LeakyNun I actually didn't. – Adám – 2017-05-01T12:26:16.193

8

Octave, 28 bytes

@(N,s)s((1:N)+(0:nnz(s)-N)')

Try it online!

For k > string length works in Octave 4.2.1-windows but in tio (Octave 4.0.3) doesn't work.

Creates numeric indexes of consecutive elements and index the string by it.

s= "ATCGAAGGTCGT"
N = 4
idx = (1:N)+(0:nnz(s)-N)'
 =
    1    2    3    4
    2    3    4    5
    3    4    5    6
    4    5    6    7
    5    6    7    8
    6    7    8    9
    7    8    9   10
    8    9   10   11
    9   10   11   12

s(idx) =

ATCG
TCGA
CGAA
GAAG
AAGG
AGGT
GGTC
GTCG
TCGT

rahnema1

Posted 2017-05-01T06:46:36.740

Reputation: 5 435

7

C (GCC on POSIX), 67 66 63 bytes

-3 bytes thanks to @LeakyNun!

f(i,s,j)char*s;{for(;j+i<=strlen(s);puts(""))write(1,s+j++,i);}

Try it online!

betseg

Posted 2017-05-01T06:46:36.740

Reputation: 8 493

I don't think you need j=0. – Leaky Nun – 2017-05-01T10:08:39.620

@LeakyNun the function should be reusable. Try it online! vs Try it online!

– betseg – 2017-05-01T11:43:17.003

Though if i do this...

– betseg – 2017-05-01T11:46:51.567

1You can replace j+i<=strlen(s) with just s[j+i] – user41805 – 2017-05-01T16:01:15.910

7

05AB1E, 2 bytes

Code:

Ν

Explanation:

Π     # Get all substrings of the input
 ù     # Only keep the substrings of length the second input

Uses the 05AB1E encoding.

Try it online!

Adnan

Posted 2017-05-01T06:46:36.740

Reputation: 41 965

5

Python 3,  47 45 42 bytes

-3 bytes thanks to ovs (use Python 3's unpacking to reuse a[n-1:] at the tail.)

f=lambda a,n:a[n-1:]and[a[:n],*f(a[1:],n)]

A recursive function taking the string, a, and the slice length, n, and returning a list of the slices or an empty string.

a[n-1:] takes a slice of the current string from the n-1th (0-indexed) element onward to test whether there are enough elements remaining (an empty string is falsey in Python) - this is shorter than the equivalent len(a)>=n.

  • If there are enough elements a list is constructed, [...], with the first n elements of the string, a[:n], and the unpacked result of calling the function again, *f(...), with a dequeued version of the current input (without the first element), a[1:].

  • If there are not enough elements the tail of the recursion is reached when a[n-1:] is returned (in this case an empty string).

Try it online!


45 for Python 2 or 3 with:

f=lambda a,n:a[n-1:]and[a[:n]]+f(a[1:],n)or[]

Jonathan Allan

Posted 2017-05-01T06:46:36.740

Reputation: 67 804

f=lambda a,n:a[n-1:]and[a[:n],*f(a[1:],n)] for 42 bytes (Python 3) TIO – ovs – 2017-05-01T08:32:21.753

@ovs, very nice, I've asked if this is acceptable (since they empty result is a string, while the non-empty result is a list). – Jonathan Allan – 2017-05-01T09:29:35.060

5

Brachylog, 3 bytes

s₎ᶠ

Try it online!

Specs:

  • Input: ["ATCGAAGGTCGT",4]
  • Argument: Z
  • Output: Z = ["ATCG","TCGA","CGAA","GAAG","AAGG","AGGT","GGTC","GTCG","TCGT"]

How it works

s₎ᶠ
s    Output is a substring of first element of input,
 ₎   with length specified by second element of input.
  ᶠ  Find all solutions.

Leaky Nun

Posted 2017-05-01T06:46:36.740

Reputation: 45 011

4

J, 2 bytes

,\

It is not a complete program, but a function with an operator.

Call it as such:

echo 4 ,\ 'ATCGAAGGTCGT'

Try it online!

How it works

The operator (called "conjunction") \ (named "infix") is used as such:

(x u\ y) applies verb u to successive parts of list y (called infixes).

The function (called "verb") u in this case is the function , which is a simple "append" function:

Creates an array containing the items of x followed by the items of y.

Leaky Nun

Posted 2017-05-01T06:46:36.740

Reputation: 45 011

3

Mathematica, 21 bytes

##~StringPartition~1&

Anonymous function. Takes a string and a number (in that order) as input and returns a list of strings as output.

LegionMammal978

Posted 2017-05-01T06:46:36.740

Reputation: 15 731

3

R, 65 61 bytes

-2 bytes thanks to MickyT

-2 bytes by changing the indexing

returns an anonymous function.

function(s,n,x=nchar(s))`if`(n>x,'',substring(s,x:n-n+1,n:x))

substring cycles through the indices (as opposed to substr which does not), and if the starting index is less than 1, it defaults to 1 instead, so it checks and returns the empty string.

x:n-n+1 is equivalent to 1:(x-n+1) since : takes precedence over sums/differences

Try it online!

Giuseppe

Posted 2017-05-01T06:46:36.740

Reputation: 21 077

you can save a couple of bytes with function(s,n,x=nchar(s))if(n>x,'',substring(s,1:(x-n+1),n:x)) – MickyT – 2017-05-01T19:43:05.593

@MickyT, thanks! I also noticed I could drop some bytes by changing the index computation – Giuseppe – 2017-05-01T19:51:53.963

2

Pyth, 2 bytes

.:

It is not a complete program, but a built-in function.

Call it as such:

.:"ATCGAAGGTCGT"4

Try it online!

Full program:

.:.*

Try it online!

(The .* is splat.)

Leaky Nun

Posted 2017-05-01T06:46:36.740

Reputation: 45 011

While it doesn't really matter, .:F is a byte shorter for the full program. – FryAmTheEggman – 2017-05-01T17:21:51.153

2

Jellyfish, 7 bytes

p
_I
\i

Try it online!

How it works

In linear: p(\(I,i)), where p is print and \ gets the required substrings.

I is the raw first input while i is the evaluated second input.

In Jellyfish, every function and operator gets two arguments, one from the right, and one from the bottom. Here, the function p gets the argument from the output of _, which is required if we are to use the operator \ to get substrings.

Leaky Nun

Posted 2017-05-01T06:46:36.740

Reputation: 45 011

2

Java (OpenJDK 8), 92 bytes

void f(String s,int n){for(int i=n;i<=s.length();)System.out.println(s.substring(i-n,i++));}

Try it online!

Leaky Nun

Posted 2017-05-01T06:46:36.740

Reputation: 45 011

You can golf it by 3 bytes like this: String[]f(String s,int n){int i=0,t=s.length()-n+1;String[]r=new String[t];for(;i<t;r[i]=s.substring(i,n+i++));return r;}

– Kevin Cruijssen – 2017-05-01T11:47:34.800

1@KevinCruijssen I changed my answer. – Leaky Nun – 2017-05-01T11:50:12.807

2

Retina, 41 38 bytes

.*$
$*
!&`(.)+(?=.*¶(?<-1>.)+(?(1)¶)$)

Try it online!

Takes the string and count on separate lines. The first two lines are used to convert the count from decimal to unary, so if unary input is acceptable then the byte count would be reduced to 34 31. Edit: Saved 3 bytes thanks to @FryAmTheEggman. Or, if you prefer, a 48-byte version that handles newlines in the string, although that does produce confusing output:

.*$
$*
!&`(\S|\s)+(?=[\S\s]*¶(?<-1>.)+(?(1)$.)$)

Neil

Posted 2017-05-01T06:46:36.740

Reputation: 95 035

@KritixiLithos I don't see how your solution takes the count into account... – Neil – 2017-05-01T11:52:06.437

Oh, sorry I just had a brain fart >_< – user41805 – 2017-05-01T12:07:24.010

This doesn't necessarily work if the string can contain a newline, so I think you can change (?!) to . – FryAmTheEggman – 2017-05-01T17:34:35.453

2

Octave with Image Package, 29 bytes

@(s,n)[im2col(+s, [1 n])' '']

Try it online!

Explanation

The function im2col(m,b) takes a matrix m, extracts blocks of size b from it, and arranges them as columns. By default blocks are sliding (as opposed to distinct). Here the matrix m is a row vector of the ASCII codes of the input string s (this is done as +s, which is shorter than the standard double(s)), and the size b is [1 n] to obtain horizontally sliding blocks of n elements.

The result is transposed (using complex-conjugate transpose ', which is shorter than transpose .') to turn the columns into rows, and then it is converted back to char ([... ''], which is shorter than the standard char(...)).

Luis Mendo

Posted 2017-05-01T06:46:36.740

Reputation: 87 464

2

Python 2, 54 bytes

lambda x,n:map(''.join,zip(*[x[b:]for b in range(n)]))

Try it online!

Adnan

Posted 2017-05-01T06:46:36.740

Reputation: 41 965

2

Clojure, 19 bytes

Well this is handy:

#(partition % 1 %2)

Examples:

(def f #(partition % 1 %2))
(println [(f 4 "ATCGAAGGTCGT")
          (f 4 "abc")])

[((A T C G) (T C G A) (C G A A) (G A A G) (A A G G) (A G G T) (G G T C) (G T C G) (T C G T))
 ()]

NikoNyrh

Posted 2017-05-01T06:46:36.740

Reputation: 2 361

2

CJam, 4 bytes

{ew}

Anonymous block that expects the arguments on the stack and leaves the result on the stack after.

Try it online!

ew is a built-in that does exactly what is required.

Business Cat

Posted 2017-05-01T06:46:36.740

Reputation: 8 927

5I have only one thing to say: ew – Adám – 2017-05-01T14:30:05.527

2

oK, 2 bytes

':

oK has a sliding window operator!

zgrep

Posted 2017-05-01T06:46:36.740

Reputation: 1 291

1

Python 3, 49 bytes

f=lambda a,n:[a[i:i+n]for i in range(len(a)-n+1)]

Try it online!

A non-recursive solution, albeit not shorter.

Compatible with Python 2.

Leaky Nun

Posted 2017-05-01T06:46:36.740

Reputation: 45 011

You can drop f=, saving two bytes, because you do not use f anywhere else. By default, functions that are just declared and not used can be left unnamed. – Mr. Xcoder – 2017-05-02T17:30:56.497

1

JavaScript (Firefox 30-57), 51 bytes

(s,n,t='')=>[for(c of s)if((t+=c)[n-1])t.slice(-n)]

64 bytes in ES6:

(s,n,t=s.slice(0,--n))=>[...s.slice(n)].map(c=>(t+=c).slice(~n))

Neil

Posted 2017-05-01T06:46:36.740

Reputation: 95 035

1

PHP, 75 Bytes

Online Version

for([,$n,$s]=$argv;$i+$n-1<strlen($s);)$r[]=substr($s,$i++,$n);print_r($r);

80 Bytes without double values

for([,$n,$s]=$argv;$i+$n-1<strlen($s);)$r[$p=substr($s,$i++,$n)]=$p;print_r($r);

Jörg Hülsermann

Posted 2017-05-01T06:46:36.740

Reputation: 13 026

1

Haskell, 39 bytes

n#s|length s<n=[]|1<2=take n s:n#tail s

Usage example: 4 # "ABCDEF" -> ["ABCD","BCDE","CDEF"]. Try it online!

A simple recursion that keeps the first n chars of the input string an continues with the tail of the string as long as its length is not less than n.

nimi

Posted 2017-05-01T06:46:36.740

Reputation: 34 639

1

Microsoft Sql Server, 199 bytes

create function dbo.f(@s nvarchar(max),@ int)returns table as return
with v as(select 2 p,left(@s,@)g where len(@s)>=@ union all
select p+1,substring(@s,p,@)from v where len(@s)>p-2+@)select g from v

Check it.

Andrei Odegov

Posted 2017-05-01T06:46:36.740

Reputation: 939

1

PowerShell, 70 bytes

$b={$c,$s=$args;[regex]::matches($s,"(?=(.{$c}))")|%{''+$_.groups[1]}}

Try it online!

Andrei Odegov

Posted 2017-05-01T06:46:36.740

Reputation: 939

1

Stacked, 7 bytes

infixes

Try it online!

Pretty standard. Without this builtin, it becomes 20 bytes:

{%x#'y-#+:>y#-#>x\#}

Which is:

{%x#'y-#+:>y#-#>x\#}
{%                 }   dyad; first arg: x, second arg: y
  x#'                  length of x (the array)
     y-                minus y (the skew)
       #+              plus 1
         :>            range [x, y]
           y#-         y minus 1
              #>       range from [[x, y], [x, y] + y]
                x\#    get indices from x

Conor O'Brien

Posted 2017-05-01T06:46:36.740

Reputation: 36 228

1

MATL, 3 bytes

YC!

Try it online!

Explanation

YC   % Sliding blocks of input string with input size, arranged as columns
!    % Transpose

Luis Mendo

Posted 2017-05-01T06:46:36.740

Reputation: 87 464

1

C# 89 bytes

void a(string s,int n){for(int i=n;i<=s.Length;)Console.WriteLine(s.Substring(i++-n,n));}

Try it online!

Best method I could find in C# is basically the same as Java

Skidsdev

Posted 2017-05-01T06:46:36.740

Reputation: 9 656

1

Ruby, 48 46 bytes

->(s,k){0.upto(s.size-k).map{|i|s[i..i+k-1]}}

No particular tricks, just a stabby-lambda defining a function that pulls the required substring from each valid starting point.

Saved two bytes, since there doesn't seem to be a need to store the lambda.

Chowlett

Posted 2017-05-01T06:46:36.740

Reputation: 221

1

V, 16 bytes

òÀ|ly0Ïp
"_xòkVp

Not terribly well golfed I'm afraid, struggling with "delete the string if k > len(str)". Input is in the file, k is an argument. Golfing before explanation

Try it online!

nmjcman101

Posted 2017-05-01T06:46:36.740

Reputation: 3 274

What happens if you do not try to handle the k > len(str) case? – Adám – 2017-05-02T14:34:43.767

Depending on the method I use (in this one in particular) it just leaves the string as is – nmjcman101 – 2017-05-02T14:37:52.503

1

Dyalog APL, 13 11 bytes

⊢,/⍨⊣⌊1+∘⍴⊢

It does exactly as intended now. Stupid errors. Thanks to Adám for giving me the hint that there is an 11 byte solution, even though this isn't the one he was looking for.

Zacharý

Posted 2017-05-01T06:46:36.740

Reputation: 5 710

… except when it doesn't.

– Adám – 2017-05-08T11:13:15.460

There, it's fixed. – Zacharý – 2017-05-08T11:20:09.930

Good. Now golf it! (Hint: I can do it in 11.) – Adám – 2017-05-08T11:21:04.347

Well, now it works. – Zacharý – 2017-05-08T11:23:32.757

The third time's the charm. Still waiting for -2 bytes… – Adám – 2017-05-08T11:33:52.213

Would it be acceptable to switch the order of the operands? – Zacharý – 2017-05-08T12:17:59.877

Yes. I'm curious why you'd want to, though. – Adám – 2017-05-08T12:23:16.043

Never mind, I don't need it. – Zacharý – 2017-05-08T12:32:19.873

Does that work? – Zacharý – 2017-05-08T12:35:24.410

That is clever. Not the 11-byte solution I had in mind, but certainly more elegant. – Adám – 2017-05-08T12:45:05.457

1

Standard ML (mosml), 109 65 61 bytes

fun f(n,x)=if n>length(x)then[]else List.take(x,n)::f(n,tl x)

Takes a number and a character list (quite a common alternative to strings in the SML world). (Really works on all lists of course.)

Usage:

- f(3,explode("ABCDEFGH"));
> val it =
    [[#"A", #"B", #"C"], [#"B", #"C", #"D"], [#"C", #"D", #"E"],
     [#"D", #"E", #"F"], [#"E", #"F", #"G"], [#"F", #"G", #"H"]] :
  char list list
- f(7, explode("ABCD"));
> val it = [] : char list list

Changelog:

  • Right, there is a standard library.. (-44 bytes)
  • Change comparison and nil to [] as suggested (-4 bytes)

L3viathan

Posted 2017-05-01T06:46:36.740

Reputation: 3 151

1You can save a byte by changing if length(x)<n then to if n>length(x)then. However, as it is perfectly possible for SML to handle strings, I'm not sure it is allowed to require explode being already applied to the input string. – Laikoni – 2017-05-13T18:17:18.590

Also then nil else can be shortened to then[]else. – Laikoni – 2017-05-13T18:20:26.183

@Laikoni not sure either, but  ¯\(ツ) – L3viathan – 2017-05-13T18:23:35.623

Using some other library functions I got a 68 byte version which deals with strings instead of char lists. Also your approach can be shortened to 54 bytes: fun f$n=if n>length$then[]else List.take($,n)::f(tl$)n.

– Laikoni – 2017-05-15T07:52:53.330

0

Standard ML, 68 bytes

fun f$n=List.tabulate(size$ -n+1,fn m=>substring($,m,n))handle _=>[]

Try it online! Example usage: f "abcdef" 3 returns ["abc", "bcd", "cde", "def"].

Explanation:

For two bytes more we can replace the identifier $ by s to get better readable code:

fun f s n=List.tabulate(size s-n+1,fn m=>substring(s,m,n))handle _=>[]

The function takes a string s and an integer n as arguments. List.tabulate(m,g) takes an integer m and some function g and builds the list [g 0, g 1, ..., g(m-1)]. In this case, we give for m the size of the string s minus the n-gram length n (+1 to avoid off-by-one-errors). For each such m we return the substring of s at position m with length n.

In case of n > size s, an exception is thrown. handle _=>[] catches the exception and returns the empty list.

Laikoni

Posted 2017-05-01T06:46:36.740

Reputation: 23 676

0

brainfuck, 52 bytes

[[>],+[<]>-]>[->]<[[<]>[.>]<[<]++++++++++.>[-]>[>],]

Try it online!

Takes K as the value on the starting cell and the string as input.

Jo King

Posted 2017-05-01T06:46:36.740

Reputation: 38 234

0

q/kdb+, 19 bytes

Solution:

{(0-x)_x#'next\[y]}

Example:

q){(0-x)_x#'next\[y]}[4;"ABCDEFGHIJ"]
"ABCD"
"BCDE"
"CDEF"
"DEFG"
"EFGH"
"FGHI"
"GHIJ"
q){(0-x)_x#'next\[y]}[4;"AC"]
q)

Explanation:

Use next and scan over the input string, then take the first 'x' (sliding window size), then drop off the extras from the end:

{(0-x)_x#'next\[y]} / the solution
{                 } / anonymous lambda with implicit args x and y
          next\[y]  / take 'next' element of y until exhausted
       x#'          / x take-each (take first x chars of each)
 (0-x)              / negate x
      _             / drop (negative drops from the end rather than front)

streetster

Posted 2017-05-01T06:46:36.740

Reputation: 3 635

0

Japt, 2 bytes

Straightforward, built-in solution.

ãV

Try it

Shaggy

Posted 2017-05-01T06:46:36.740

Reputation: 24 623