Count the number of cyclic words in an input

9

1

Cyclic Words

Problem Statement

We can think of a cyclic word as a word written in a circle. To represent a cyclic word, we choose an arbitrary starting position and read the characters in clockwise order. So, "picture" and "turepic" are representations for the same cyclic word.

You are given a String[] words, each element of which is a representation of a cyclic word. Return the number of different cyclic words that are represented.

Fastest wins (Big O, where n = number of chars in a string)

eggonlegs

Posted 2013-01-17T07:49:06.660

Reputation: 191

3If you're looking for criticism of your code then the place to go is codereview.stackexchange.com . – Peter Taylor – 2013-01-17T08:21:05.460

Cool. I'll edit for emphasis on the challenge and move the criticism part to code review. Thanks Peter. – eggonlegs – 2013-01-17T08:31:23.887

1What's the winning criteria? Shortest code (Code Golf) or anything else? Are there any limitataions on the form of input and output? Do we need to write a function or a complete program? Does it have to be in Java? – ugoren – 2013-01-17T12:46:11.803

Well the ideal criteria I had in mind was big o time complexity, but we can do length if you like. Sorry, I am new to this. :/ – eggonlegs – 2013-01-17T13:42:21.103

@eggonlegs We are asking, not telling you about the criteria. It's just that this is a site where we pose and answer challanges, and we like them to have a metric that allows us to judge the "winner". Tags like code-golf, code-challenge, king-of-the-hill and details in the text are used to indicate the rules. [Code-challenge] is most general.

– dmckee --- ex-moderator kitten – 2013-01-17T14:00:11.803

Understood. Okay, 'fastest-code' it is. :) – eggonlegs – 2013-01-17T14:05:33.213

1@eggonlegs You specified big-O - but with respect to which parameter? Number of strings in array? Is string comparison then O(1)? Or number of chars in string or total number of chars? Or anything else? – Howard – 2013-01-17T14:13:22.703

Sorry Howard, this was never really meant to be a code golf question, but I have updated it anyway. I want to take into account the speed of the comparison itself. – eggonlegs – 2013-01-17T14:41:29.950

@eggonlegs What is the count of the list {"teaks", "words", "spot", "pots", "sword", "steak", "hand"}? 3 or 6? – DavidC – 2013-01-17T20:52:30.477

1@dude, surely it's 4? – Peter Taylor – 2013-01-17T23:12:33.877

So. You group words together that are rotations of each other and then count the groups. Fair enough. – DavidC – 2013-01-17T23:50:13.240

@dude yes, it is 4 – eggonlegs – 2013-01-18T00:04:47.360

I'll try to do a better job posting a question next time ^_^ – eggonlegs – 2013-01-19T11:22:55.437

Answers

4

Python

Here's my solution. I think it might still be O(n2), but I think the average case is much better than that.

Basically it works by normalizing each string so that any rotation will have the same form. For example:

'amazing' -> 'mazinga'
'mazinga' -> 'mazinga'
'azingam' -> 'mazinga'
'zingama' -> 'mazinga'
'ingamaz' -> 'mazinga'
'ngamazi' -> 'mazinga'
'gamazin' -> 'mazinga'

The normalization is done by looking for the minimum character (by char code), and rotating the string so that character is in the last position. If that character occurs more than once, then the characters after each occurrence are used. This gives each cyclic word a canonical representation, that can be used as a key in a map.

The normalization is n2 in the worst case (where every character in the string is the same, e.g. aaaaaa), but most of the time there's only going to be a few occurrences, and the running time will be closer to n.

On my laptop (dual core Intel Atom @ 1.66GHz and 1GB of ram), running this on /usr/share/dict/words (234,937 words with an average length of 9.5 characters) takes about 7.6 seconds.

#!/usr/bin/python

import sys

def normalize(string):
   # the minimum character in the string
   c = min(string) # O(n) operation
   indices = [] # here we will store all the indices where c occurs
   i = -1       # initialize the search index
   while True: # finding all indexes where c occurs is again O(n)
      i = string.find(c, i+1)
      if i == -1:
         break
      else:
         indices.append(i)
   if len(indices) == 1: # if it only occurs once, then we're done
      i = indices[0]
      return string[i:] + string[:i]
   else:
      i = map(lambda x:(x,x), indices)
      for _ in range(len(string)):                       # go over the whole string O(n)
         i = map(lambda x:((x[0]+1)%len(string), x[1]), i)  # increment the indexes that walk along  O(m)
         c = min(map(lambda x: string[x[0]], i))    # get min character from current indexes         O(m)
         i = filter(lambda x: string[x[0]] == c, i) # keep only the indexes that have that character O(m)
         # if there's only one index left after filtering, we're done
         if len(i) == 1:
            break
      # either there are multiple identical runs, or
      # we found the unique best run, in either case, we start the string from that
      # index
      i = i[0][0]
      return string[i:] + string[:i]

def main(filename):
   cyclic_words = set()
   with open(filename) as words:
      for word in words.readlines():
         cyclic_words.add(normalize(word[:-1])) # normalize without the trailing newline
   print len(cyclic_words)

if __name__ == '__main__':
   if len(sys.argv) > 1:
      main(sys.argv[1])
   else:
      main("/dev/stdin")

Gordon Bailey

Posted 2013-01-17T07:49:06.660

Reputation: 708

3

Python (3) again

The method I used was to calculate a rolling hash of each word starting at each character in the string; since it's a rolling hash, it takes O(n) (where n is the word length) time to compute all the n hashes. The string is treated as a base-1114112 number, which ensures the hashes are unique. (This is similar to the Haskell solution, but more efficient since it only goes through the string twice.)

Then, for each input word, the algorithm checks its lowest hash to see if it's already in the set of hashes seen (a Python set, thus lookup is O(1) in the set's size); if it is, then the word or one of its rotations has already been seen. Otherwise, it adds that hash to the set.

The command-line argument should be the name of a file that contains one word per line (like /usr/share/dict/words).

import sys

def rollinghashes(string):
    base = 1114112
    curhash = 0
    for c in string:
        curhash = curhash * base + ord(c)
    yield curhash
    top = base ** len(string)
    for i in range(len(string) - 1):
        curhash = curhash * base % top + ord(string[i])
        yield curhash

def cycles(words, keepuniques=False):
    hashes = set()
    uniques = set()
    n = 0
    for word in words:
        h = min(rollinghashes(word))
        if h in hashes:
            continue
        else:
            n += 1
            if keepuniques:
                uniques.add(word)
            hashes.add(h)
    return n, uniques

if __name__ == "__main__":
    with open(sys.argv[1]) as words_file:
        print(cycles(line.strip() for line in words_file)[0])

Lowjacker

Posted 2013-01-17T07:49:06.660

Reputation: 4 466

1

Mathematica

Decided to start again, now that I understand the rules of the game (I think).

A 10000 word dictionary of unique randomly composed "words" (lower case only) of length 3. In similar fashion other dictionaries were created consisting of strings of length 4, 5, 6, 7, and 8.

ClearAll[dictionary]      
dictionary[chars_,nWords_]:=DeleteDuplicates[Table[FromCharacterCode@RandomInteger[{97,122},
chars],{nWords}]];
n=16000;
d3=Take[dictionary[3,n],10^4];
d4=Take[dictionary[4,n],10^4];
d5=Take[dictionary[5,n],10^4];
d6=Take[dictionary[6,n],10^4];
d7=Take[dictionary[7,n],10^4];
d8=Take[dictionary[8,n],10^4];

gtakes the current version of dictionary to check. The top word is joined with cyclic variants (if any exist). The word and its matches are appended to the output list, out, of processed words. The output words are removed from the dictionary.

g[{wds_,out_}] := 
   If[wds=={},{wds,out},
   Module[{s=wds[[1]],t,c},
   t=Table[StringRotateLeft[s, k], {k, StringLength[s]}];
   c=Intersection[wds,t];
   {Complement[wds,t],Append[out,c]}]]

f runs through all words dictionary.

f[dict_]:=FixedPoint[g,{dict,{}}][[2]]

Example 1: actual words

r = f[{"teaks", "words", "spot", "pots", "sword", "steak", "hand"}]
Length[r]

{{"steak", "teaks"}, {"hand"}, {"pots", "spot"}, {"sword", "words"}}
4


Example 2: Artificial words. Dictionary of strings of length 3. First, timing. Then the number of cycle words.

f[d3]//AbsoluteTiming
Length[%[[2]]]

d3

5402


Timings as a function of word length. 10000 words in each dictionary.

timings

I don't particularly know how to interpret the findings in terms of O. In simple terms, the timing roughly doubles from the three character dictionary to the four character dictionary. The timing increases almost negligibly from 4 through 8 characters.

DavidC

Posted 2013-01-17T07:49:06.660

Reputation: 24 524

Can you possibly post a link to the dictionary you used so I can compare against yours? – eggonlegs – 2013-01-19T11:21:25.377

The following link to dictionary.txt should work: http://bitshare.com/files/oy62qgro/dictionary.txt.html (Sorry about the minute you'll have to wait for the download to begin.) BTW, the file has the 3char, 4char...8char dictionaries all together, 10000 words in each. You'll want to separate them.

– DavidC – 2013-01-19T14:49:00.183

Awesome. Thanks very much :) – eggonlegs – 2013-01-20T00:41:05.860

1

Haskell

Not sure about the efficiency of this, most likely rather bad. The idea is to first create all possible rotations of all the words, count the values that uniquely represent the strings and select the minimum. That way we get a number that is unique to a cyclic group.
We can group by this number and check the number of these groups.

If n is the number of words in the list and m is the length of a word then calculating the 'cyclic group number' for all the words is O(n*m), sorting O(n log n) and grouping O(n).

import Data.List
import Data.Char
import Data.Ord
import Data.Function

groupUnsortedOn f = groupBy ((==) `on` f) . sortBy(compare `on` f)
allCycles w = init $ zipWith (++) (tails w)(inits w)
wordval = foldl (\a b -> a*256 + (fromIntegral $ ord b)) 0
uniqcycle = minimumBy (comparing wordval) . allCycles
cyclicGroupCount = length . groupUnsortedOn uniqcycle

shiona

Posted 2013-01-17T07:49:06.660

Reputation: 2 889

1

This can be done in O(n) avoiding quadratic time. The idea is to construct the full circle traversing the base string twice. So we construct "amazingamazin" as the full circle string to check all cyclic strings corresponding to "amazing".

Below is the Java solution:

public static void main(String[] args){
    //args[0] is the base string and following strings are assumed to be
    //cyclic strings to check 
    int arrLen = args.length;
    int cyclicWordCount = 0;
    if(arrLen<1){
        System.out.println("Invalid usage. Supply argument strings...");
        return;
    }else if(arrLen==1){
        System.out.println("Cyclic word count=0");
        return;         
    }//if

    String baseString = args[0];
    StringBuilder sb = new StringBuilder();
    // Traverse base string twice appending characters
    // Eg: construct 'amazingamazin' from 'amazing'
    for(int i=0;i<2*baseString.length()-1;i++)
        sb.append(args[0].charAt(i%baseString.length()));

    // All cyclic strings are now in the 'full circle' string
    String fullCircle = sb.toString();
    System.out.println("Constructed string= "+fullCircle);

    for(int i=1;i<arrLen;i++)
    //Do a length check in addition to contains
     if(baseString.length()==args[i].length()&&fullCircle.contains(args[i])){
        System.out.println("Found cyclic word: "+args[i]);
        cyclicWordCount++;
    }

    System.out.println("Cyclic word count= "+cyclicWordCount);
}//main

Azee

Posted 2013-01-17T07:49:06.660

Reputation: 11

0

I don't know if this is very efficient, but this is my first crack.

private static int countCyclicWords(String[] input) {
    HashSet<String> hashSet = new HashSet<String>();
    String permutation;
    int count = 0;

    for (String s : input) {
        if (hashSet.contains(s)) {
            continue;
        } else {
            count++;
            for (int i = 0; i < s.length(); i++) {
                permutation = s.substring(1) + s.substring(0, 1);
                s = permutation;
                hashSet.add(s);
            }
        }
    }

    return count;
}

eggonlegs

Posted 2013-01-17T07:49:06.660

Reputation: 191

0

Perl

not sure i understand the problem, but this matches the example @dude posted in the comments at least. please correct my surely incorrect analysis.

for each word W in the given N words of the string list, you have to step through all characters of W in the worst case. i have to assume the hash operations are done in constant time.

use strict;
use warnings;

my @words = ( "teaks", "words", "spot", "pots", "sword", "steak", "hand" );

sub count
{
  my %h = ();

  foreach my $w (@_)
  {
    my $n = length($w);

    # concatenate the word with itself. then all substrings the
    # same length as word are rotations of word.
    my $s = $w . $w;

    # examine each rotation of word. add word to the hash if
    # no rotation already exists in the hash
    $h{$w} = undef unless
      grep { exists $h{substr $s, $_, $n} } 0 .. $n - 1;
  }

  return keys %h;
}

print scalar count(@words), $/;

ardnew

Posted 2013-01-17T07:49:06.660

Reputation: 2 177