Compute the height of a radix tree

8

Introduction

A radix tree, also known as compressed trie or compressed prefix tree, is a tree-like data structure for storing a set of strings. The edges of the tree are labeled by nonempty strings, and each node is either terminal or nonterminal. The strings that the tree contains are exactly the labels of all paths from the root to a terminal node. The tree must be in the normal form defined by the following conditions:

  • All nonterminal non-root nodes have at least two children.
  • For every node, all outgoing edges have different first characters.

For example, here is the radix tree containing the set ["test", "testing", "tested", "team", "teams", "technical", "sick", "silly"], with (N) representing a nonterminal node and (T) a terminal node:

-(N)-te-(N)-st-(T)-ing-(T)
  |      |      | 
  |      |      +-ed-(T)
  |      |
  |      +-am-(T)-s-(T)
  |      |
  |      +-chnical-(T)
  |
  +-si-(N)-ck-(T)
        |
        +-lly-(T)

In this challenge, your task is to compute the height of the tree, given the strings as input.

Input

Your input is a non-empty list of strings of lowercase ASCII letters. It will not contain duplicates, but it may be in any order. It may contain the empty string. You may take the input in any reasonable format.

Output

Your output shall be the length of the longest root-to-leaf path in the corresponding radix tree, measured in the number of nodes it contains.

In the above example, the correct output is 4, corresponding to the paths

(N)-te-(N)-st-(T)-ing-(T)
(N)-te-(N)-st-(T)-ed-(T)
(N)-te-(N)-am-(T)-s-(T)

which contain 4 nodes.

Further examples

Here are a couple more examples of radix trees:

[""]
-(T)

["","fuller"]
-(T)-fuller-(T)

["","full","fuller"]
-(T)-full-(T)-er-(T)

["full","fuller"]
-(N)-full-(T)-er-(T)

["full","filler"]
-(N)-f-(N)-ull-(T)
        |
        +-iller-(T)

Rules and scoring

You can write a full program or a function. This is code golf, so the lowest byte count wins.

You can use any built-ins or libraries you want, but remember to include all imports etc. in your byte count. Third-party libraries – those not included in the standard install of your language – are allowed, but must be listed in the header separately, e.g. Python + pytrie0.2, 60 bytes.

Test cases

[""] -> 1
["fuller"] -> 2
["","fuller"] -> 2
["","full","fuller"] -> 3
["full","fuller"] -> 3
["full","filler"] -> 3
["full","filler","filter"] -> 4
["full","filler","fi","filter"] -> 5
["test","testing","tested","team","teams","technical","sick","silly"] -> 4
["a","aaaa","aabbaa","aabbaab","abaaa","aab","aabbb","aabba"] -> 8
["dbdbaca","ab","a","caaabaaa","adbbabdb","dbdbdbaca","dbbadbacaba","db"] -> 4
["db","dbbdbb","dbaa","cabcacaa","","acaabcacab","b","abaaaca","bacaaaaa"] -> 3
["aabaabdbb","bacaabadbbdb","abb","aabaa","ab","bcadbb","adbbcaaadbbb","caaa","bbdbcacadbab","dbbdbdb"] -> 4
["bbcaaabbbabbcadbbacadbbdbdb","b","bbbbaaaaaababa","ca","bb","bdbbacadbbdbbdbbababaacaca","abbaabbabcabaaa","bbbacacacabcacacabaaabb","bbcaaaab","bbbbcaacaadbcaaa","babbabcadbdbacacabbcacab","abcabbbaacadbcadb","bbcabbcadbcacaacaadbadbcaadb","dbbbdbbdbacaabbacabcadbdbacaca","bbaabdbdb","cabcadbbbadbadbbaadbcaca","adbadbadbdbcacadbdbbcaadbcaca","abaabbcab","aaabcaabcaab","bacacabcacaacadbadbb"] -> 6

Zgarb

Posted 2016-12-02T08:19:04.343

Reputation: 39 083

In your first example, I feel like "am" should be followed by a node with two children, one with -""-(T) and the other with -"s"-(T). (Compare to "slow"/"slowly" in the second example on the page you linked to.) This won't affect the length of the longest root-to-leaf path, though. – Greg Martin – 2016-12-02T08:35:50.060

@GregMartin The Wikipedia page has a slightly different implementation. I feel that this one is more natural, and as you said, it doesn't affect the height. – Zgarb – 2016-12-02T08:37:59.667

Answers

2

JavaScript (Firefox 30-57), 137 bytes

f=(a,b=["",...a],s=new Set(b.map(w=>w[0])))=>b.length>1?(s.size>1)+Math.max(...(for(c of s)f(a,[for(w of b)if(w&&w[0]==c)w.slice(1)]))):1

There must be something wrong with my algorithm as I seem to have to special-case an empty string parameter.

Neil

Posted 2016-12-02T08:19:04.343

Reputation: 95 035

In my reference solution, the empty string is a special case too. It's because the root node has its own rules. – Zgarb – 2016-12-02T10:55:35.370

I sort of had to special-case it as well, but that special case is only a single character in my solution (the . at the end of the first line). – Martin Ender – 2016-12-02T11:04:16.910

1

Retina, 69 55 bytes

The trailing linefeed is significant.

m`.(?<=(?=\D*^\1(?!\2))\D*^(.*)(.))|^.
:$&
T`w
O`
A-2`

Input is a linefeed-separated list.

Performance can be increased significantly by inserting a \A before the last \D on the first line.

Explanation

A node is inserted into a word in three places:

  • The beginning.
  • After any longest prefix shared with another word. That is, a shared prefix after which the two words differ.
  • The end.

In Retina, it tends to be shorter to produce N+1 when you've counted N things, so we're ignoring the one at the end. However, for the case of the empty input, we'll then also have to ignore the start, because beginning and end are the same.

To do the actual counting we insert : into every place where there is a node. Then we simply find the maximum number of : in a word and return that plus 1. Here is how the code does it:

m`.(?<=(?=\D*^\1(?!\2))\D*^(.*)(.))|^.
:$&

This detects all the nodes. It matches a character. Then it enters a lookbehind which captures that character into group 2 and the prefix before it into group 1. Then it goes all the way to the beginning of the string to start a lookahead, which checks that it can find the prefix somewhere where it's not followed by the captured character. If we find this match we insert a : in front of the character. We also match the first character of the current word to insert a : at the beginning of non-empty words.

T`w

This removes all the letters and leaves only the :.

O`

This sorts the lines, bringing the maximum depth to the end.

A-2`

This discards all but the last line.

And the empty line at the end counts the number of empty matches in that line, which is one more than the number of colons in it.

Try it online! (The first line enables a linefeed-separated test suite, where each test case uses comma separation instead.)

Martin Ender

Posted 2016-12-02T08:19:04.343

Reputation: 184 808