How many ways can given subsequence occur in given sequence?

2

Let me know if this task has already been posed. I haven't found it when I looked.


Input

  • master sequence \$\ X = x_1\dots x_n\$: sequence of characters, eg. \$\rm international\$
  • subsequence \$\ Y = y_1\dots y_m\$: sequence of characters, eg. \$\rm intl\$


Output

Number of possible mappings \$\mu: [1\dots m] \rightarrow [1\dots n]\$ of positions in subsequence to positions in the master sequence, such that for all \$j\in[1\ldots m]\$:

  1. \$y_j = x_{\mu(j)}\$, viz. an occurrence of a character in the subsequence maps to an occurrence of the same character in the master sequence

  2. \$\forall\ i \lt j:\ \mu(i) \lt \mu(j)\$, viz. the mapping preserves mutual position of character occurrences.

In the example above, there are 3 possible mappings:

  • \$\rm\underline{int}ernationa\underline{l}\$
  • \$\rm\underline{in}terna\underline{t}iona\underline{l}\$
  • \$\rm\underline{i}nter\underline{n}a\underline{t}iona\underline{l}\$


Rules

Lowest average asymptotic time complexity with respect to \$n+m\$ wins.

If it helps you, you can assume the input is a sequence of bytes (values 0-255).


Test cases

master sequence, subsequence -> number of possible mappings
international, intl -> 3
aaabbb, ab -> 9
aaabbb, aabb -> 9
only one, onl -> 1
none, en -> 0

kyrill

Posted 2019-12-10T00:05:44.567

Reputation: 575

@Don Thousand please stop rewriting x to X. – kyrill – 2019-12-10T00:51:59.780

Note that in your own question, $X$ refers to the master sequence. – Don Thousand – 2019-12-10T01:44:44.067

Could a better measure be O(m + alphabetSize)? – Jonathan Allan – 2019-12-10T01:45:04.470

Some test cases: "aaabbb", "ab" and "aaabbb", "aabb" should both return 9, maybe add some for 0 and 1 mappings. – Matthew Jensen – 2019-12-10T02:15:23.983

@DonThousand note that in my question I state $X = x_1\dots x_n$. – kyrill – 2019-12-10T02:20:36.903

@JonathanAllan I don't know. If you think it's better then maybe it is; I haven't really thought about it thoroughly. – kyrill – 2019-12-10T02:22:43.260

Done, remove the last one if you think it's unnecessary. – Matthew Jensen – 2019-12-10T02:29:45.137

Answers

4

Ruby, \$\mathcal{O}(nm)\$

I think I got the analysis for this right. Should be \$\mathcal{O}(nm)\$ because for each of the \$n\$ characters in the master sequence, it goes through at most \$m\$ steps to check each character of the subsequence (if they are all the same character). It's also optimized to try to be \$\Omega(n+m)\$ best case complexity, which occurs if all the characters in the subsequence are unique.

def submap master, sub
  subIndex = Hash.new{ |hash, key| hash[key] = [] }

  sub.chars.map.with_index do |c, i|
    subIndex[c] << i
  end

  subCount = [0] * sub.length

  master.chars do |c|
    indices = subIndex[c]
    indices.reverse_each do |i|
      subCount[i] += i > 0 ? subCount[i-1] : 1
    end
  end
  return subCount[-1]
end

Try it online!

Explanation

Define the function \$f(X_i,Y_j)\$ as the number of occurrences of the subsequence \$Y_j=y_1...y_j\$ in the sequence \$X_i=x_1...x_i\$. In this situation, it can be shown that

$$f(X_i,Y_j)=\begin{cases} f(X_{i-1},Y_j) & \text{if } x_i\ne y_j,i > 1\\ 0 & \text{if } x_i\ne y_j, i=1\\ f(X_{i-1},Y_j)+f(X_{i-1},Y_{j-1}) & \text{if } x_i=y_j,j > 1\\ \sum_{k=1}^i [x_k = y_j] & \text{if }j=1 \end{cases}$$

That is, if the subsequence is 1 character long, count how many there are in the master sequence. If \$x_i=y_j\$, add the number of times the subsequence \$Y_{j-1}\$ occurred previously in \$X_{i-1}\$, as that is the number of times you can find this subsequence using this specific position \$x_i\$. Otherwise, ignore the character and only look at the sequence before \$x_i\$, that is, \$X_{i-1}\$.

With that, we write our code.

def submap master, sub
  # Minor optimization: Map the characters to indices. This way, we don't have
  # to go through the subsequence every time, which gives the best case scenario
  # a complexity of O(n+m) if all characters in Y are unique.
  subIndex = Hash.new{ |hash, key| hash[key] = [] } 
                                        # Initialize hash with default value []
  sub.chars.map.with_index do |c, i|    # For each character in Y
    subIndex[c] << i                    # Add the index to the hash for that char.
  end

  subCount = [0] * sub.length           # Initialize the previous results f(X, Y_j).

  master.chars do |c|                   # For each character in X
    indices = subIndex[c]               # Look up the indices for that character
    indices.reverse_each do |i|         # For each index (in reverse order)
      subCount[i] += i > 0 ? subCount[i-1] : 1
                                        # Increment f(X, Y_j) by f(X, Y_{j-1}).
    end
  end
  return subCount[-1]                   # Return f(X, Y_m).
end

Value Ink

Posted 2019-12-10T00:05:44.567

Reputation: 10 608

I'd suggest subIndex[c].reverse_each {|index| ...}. – kyrill – 2019-12-10T05:15:53.333

@kyrill Made the change. Obviously this doesn't affect complexity but it is nice for readability :) – Value Ink – 2019-12-13T04:41:32.530

1

Python 3, \$O(n^m)\$

def submappings(master, sub):
  if len(sub) == 0:
    return 1
  count = 0
  for i, c in enumerate(master[:len(master)-len(sub)+1]):
    if c == sub[0]:
      count += submappings(master[i+1:], sub[1:])
  return count

Try it online!

For each character in the subsequence, it looks through (what's left of) the master sequence for it, and moves onto the next one, counting how many combinations there are.

I've made a version that also counts the number of comparisons.

def submappings(master, sub):
  if len(sub) == 0:
    return 1, 0
  count, ops = 0, 0
  for i, c in enumerate(master[:len(master)-len(sub)+1]):
    ops += 1
    if c == sub[0]:
      rec = submappings(master[i+1:], sub[1:])
      count += rec[0]
      ops += rec[1]
  return count, ops

Try it online!

Matthew Jensen

Posted 2019-12-10T00:05:44.567

Reputation: 713

As a small optimization, you can only enumerate master[:-len(sub)+1]. As a slightly more advanced optimization, you can go up to the last occurence of sub[0] in master (cf. str.rfind). Even that is not optimal, but it's a pretty cheap and efficient heuristic in most cases I'd say. – kyrill – 2019-12-10T01:34:35.007

I don't think using str.rfind will improve it on average, it will just look from behind instead of from the front, still searching the whole string. Good idea to only enumerate while the substring fits, though. I don't think it will change the asymptotic complexity though. – Matthew Jensen – 2019-12-10T02:05:52.460

No, it's just a heuristic so the asymptotic complexity remains unchanged. With str.rfind I had in mind the efficient implementation of search for a character in string. I would guess it's implemented in C using strchr or something similar, which uses SIMD instructions, so it would be much faster than doing the same thing in Python. Probably wouldn't make a huge difference though. – kyrill – 2019-12-10T02:19:09.483

@OP What's the best time-complexity you've found? It may be possible that this is close to the best possible. – Don Thousand – 2019-12-10T02:23:00.300

Your algorithm is the only one I've thought about, with the mentioned heuristics. Essentially $\Pi_{x=n-m}^{n} x$, so $O(n^m)$. – kyrill – 2019-12-10T02:24:44.950

It should be possible to improve on this. It often compares the same letters more than once. – Matthew Jensen – 2019-12-10T02:41:05.467

Sure, one thing I thought about is creating for each character of substring a list of its admissible positions in the master string. Then create some sort of index which for each admissible position $i$ of a substring character in the master string points into the list of admissible positions of the next character, such that the position it points to is the nearest one after $i$. Although I'm not sure how expensive would it be to build such an index. – kyrill – 2019-12-10T03:02:36.583

It seems like it would be something like $O(nm^2)$ – Matthew Jensen – 2019-12-10T03:05:57.543

I'm convinced it can be done in $O(mn)$ with some memoization / bottom-up approach (basically starting from the end). This is pretty much a textbook dynamic programming problem. – kyrill – 2019-12-10T03:11:41.933