Generate unique non-sequential string of length n without memory of previously generated strings

4

3

Scenario:

Imagine a situation where you wish to share a unique alphanumeric code. Perhaps it is a one-time-use coupon code you hand write on business cards, or some other scenario where it is desirable to have a code where mis-keying the entry is unlikely to produce valid result and the another valid code is not readily guessed.

However, when you are issuing these codes you do not have access to the list of codes that have already been issued. Rather, you know only how many have been issued before.

Challenge:

Write a function that generates a unique string of length n, subject to the following constraints:

  • Available Characters: Each character in resultant string shall be from a configurable list of allowed values, which may be "hard-coded" into the algorithm.
    • Sample List: A, B, C, D, E, F, G, H, J, K, L, M, N, P, Q, R, S, T, U, V, W, X, Z, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 (A-Z and 0-9, omitting "I" and "J")
  • Configurable Length Output: The length n of the generated string shall be configurable within the algorithm, for n<=8.
  • Input: A seed is available (unique sequential positive integer >=1 and up to 10 digits)
  • Stateless: The algorithm has no memory of previously generated strings (no database calls)
  • Repeatable: The same seed shall yield the same result
  • Unordered Output: Sequential seed values shall not generate strings in any obvious order (that is to say, results should superficially "appear" random)
    • For example, 1:"AAAA", 2:"AAAB", 3:"AAAC", ... would NOT qualify, while 1:"RR8L", 2:"1D9R", 3:"TXP3", ... is on the right track.
    • Note: While this criterion is admittedly subjective, "degree of randomness" is not intended to be the core aspect of this function. Given the Popularity Contest format, I'll leave it up to the community to decide whether a pattern is too obvious.
  • Unique Output: String shall be unique, at least within limitations of available characters and requested string length. Therefore, each possible permutation is to be used exactly once before any combination may be reused.
    • For example, if configured to generate a string of length n=5 and the available characters include A-Z and 0-9 except for "O", "I" (34 possible characters), then the string may not repeat between seed>=1 and seed<=45,435,424.
    • However, if length n=4 with 34 available characters, then the same string need only be unique on seed>=1 and seed<=1,336,336

Winning Criteria:

Popularity Contest - Submission with the most upvotes will be selected as the winner no sooner than two weeks after the first valid solution. Please upvote any and all submissions you feel adequately meet the above criteria.

Cragmonkey

Posted 2015-02-13T21:30:37.863

Reputation: 151

1Welcome to PPCG. I'm afraid your very dry spec makes me think this is homework! If it is a genuine challenge, some pointers: 1. Why an algorithm and not a program/function? 2."Code challenge" means you must specify an objective criterion for the winning answer. Other tags eg "code golf" (shortest code) have implied winning criterions. 3. why not specify the allowable characters in the question? 4. what is the maximum length (and by extension what is the maximum seed value? do all possible output strings have to be generatable?) 5. It's v. similar to some other questions, it may be a duplicate – Level River St – 2015-02-13T21:48:44.903

Forgive me, while I have visited PPCG before, I'm not yet familiar with all of the guidelines and traditions here so I certainly appreciate your feedback. No, it is not homework. The idea came up when pondering how one might generate a unique record "lookup key" where other record locators such as barcodes, serial numbers, etc. While this could easily be achieved by querying a database of previously generated lookup keys, the idea of eliminating that option posed a unique challenge I thought others may find intriguing. I didn't find other questions quite like this one, so I posted it here. – Cragmonkey – 2015-02-14T01:04:07.107

how is it supposed to be stateless but avoid using previously used strings? – feersum – 2015-02-14T01:20:33.797

@feersum what he means by stateless is that it doesn't store previously used strings. I see no problem with that bit. I have a minor issue with "appear to be random" as this is highly open to interpretation. A simple substitution cipher might suffice, (though a more diligent entry might enhance it by having the key depend on the previous character.) – Level River St – 2015-02-14T01:31:32.370

That's the main point of the challenge - to devise a way to generate a unique string without knowing what came before. The results need not be random, but simply have that appearance by not occurring in any obvious order. – Cragmonkey – 2015-02-14T01:31:49.900

@cragmonkey the reason your question was put on hold is that it lacks a winning criterion. You can edit to add one (such as code golf) and it will go to the review queue. I would suggest you ask for a program/function, not an algorithm, and nail down the other particulars such as maximum seed value and word length. My general feeling was that encryption and random number generators have been done to death, but its true I can't find your exact question and a variable base (ie number of allowed characters) does rule out many seeded rng algorithms. I dont know how this question will be received. – Level River St – 2015-02-14T01:47:26.513

2

Either way, stick around and answer a few questions to find out what we're about. Writing a good challenge is hard. Feel free to post your next challenge in the sandbox for review http://meta.codegolf.stackexchange.com/q/2140/15599

– Level River St – 2015-02-14T01:50:17.217

This now has a winning criterion, and I think it should be reopened. – KSFT – 2015-02-14T17:42:00.337

Is the seed the same as the number of strings issued before, or are they 2 separate inputs? – glebm – 2015-02-15T14:06:37.387

@glebm They're the same input. – Cragmonkey – 2015-02-15T16:37:21.290

Answers

5

C(++) - we don't need no friggin' primes

The problem is to generate a different permutation of n characters taken from an a size alphabet for each seed value.

There are an such permutations, and they can be represented as values of a n digits number in base a.

The number of bits necessary to represent all these values is log2(an), rounded up.

We have at most 8 characters per code, and the alphabet contains at most 256 characters, so a 64 bits word is just what we need to cover all possible codes with all possible alphabets.

Now we just need something that maps all possible 232 seed values to the possible range of code values (and looks random enough).

I settled for a basic 32 bits reversal, which gives highest variability for low seed values.

If the possible codes range is smaller than the seed's, there will be duplicates anyway, and the codes will be like the n first digits of a greater number.

If the 232 seed values are not enough to cover all code values, the scrambled seed is extended to the smallest power of two inferior to code max value, by injecting 0s in the lower bits.

This should spread the range of code values while guaranteeing no duplicates.
However, the last digit will have a slightly lower frequency due to the rounding to a power of 2.
If alphabet size is much greater than 2, the difference will be hardly noticeable.

The scrambled seed is then converted as a number in base a, the digits being the symbols of the alphabet.

A few (pseudo) random thoughts

The big weakness of this method is that it won't work well with alphabets whose size is a power of two. The apparent randomness is due to the non-integer number of bits needed to encode a digit.

On second thought maybe we do need some friggin primes after all...

At any rate, even the most magnificent pseudo-random repartition is useless as far as data protection is concerned, since the encryption method is stateless and could be easily cracked by brute-force.

The code

I used the excellent bit shuffling generator of Mr. Jasper Neumann to produce the bit reverser, though other solutions are possible.

I compiled with g++4.8 under MinGW/Win7, but the code is nothing but glorified C and should work with any C++ compiler.

#include <cstdlib> // atoi
#include <cstdio>  // printf
#include <cmath>   // pow

#define Alphabet "ABCDEFGHKLMNOPQRSTUVWXYZ0123456789"
#define Length 8

unsigned bit_permute_step_simple(unsigned x, unsigned m, unsigned shift)
{
  return ((x & m) << shift) | ((x >> shift) & m);
}

unsigned shuffle (unsigned x)
{
    x = bit_permute_step_simple(x, 0x55555555, 1);  // Bit index complement 0
    x = bit_permute_step_simple(x, 0x33333333, 2);  // Bit index complement 1
    x = bit_permute_step_simple(x, 0x0f0f0f0f, 4);  // Bit index complement 2
    x = bit_permute_step_simple(x, 0x00ff00ff, 8);  // Bit index complement 3
    x = bit_permute_step_simple(x, 0x0000ffff, 16);  // Bit index complement 4
    return x;
}

#define Base (sizeof(Alphabet)-1)
typedef char Code[Length+1];

void gen_code (unsigned seed, Code code)
{
    // shuffle seed bits (the inversion is to avoid zero value for seed = 0)
    unsigned long long scrambled = shuffle(~seed);

    // if maximal code size is less than seed max value, there will be duplicates anyway
    int extra_bits = floor(Length * log(Base)/log(2)) - 32;
    if (extra_bits > 0) scrambled <<= extra_bits;

    // basic base conversion.
    // The last code digit might have a slightly lower frequency than others
    int i;
    for (i = 0 ; i != Length ; i++)
    {
        int digit = scrambled % Base;
        code[i] = Alphabet[digit];
        scrambled /= Base;
    }
    code[i] = '\0';
}

int main (int argc, char * argv[])
{
    if (argc != 2) exit(-1);
    unsigned seedm = atoi(argv[1]);
    Code code;

    for (unsigned seed = 0 ; seed != seedm ; seed++)
    {
        gen_code (seed, code);
        printf ("code (%u) = %s\n", seed, code);
    }

    return 0;
}

The main() prints a specified number of codes for consecutive seeds.

Sample output:

n=8, alphabet = ABCDEFGHKLMNOPQRSTUVWXYZ0123456789

code (0) = 215GVK7B
code (1) = MPPT46CX
code (2) = 4AF2E28Y
code (3) = GLGGAS6W
code (4) = U3MSPN2Y
code (5) = CR42Y97H
code (6) = WCWB943L
code (7) = CZW2XAAB
code (8) = UBHMSV6C
code (9) = CZ0W1HCY
code (10) = WMS5BD8Z
code (11) = 8UTL725X
code (12) = MDYVMY1Z
code (13) = 40F6VM7K
code (14) = OO7E6F3M
code (15) = U5L28BA5
code (16) = 08GMUH56
code (17) = KW0W35AS
code (18) = 2HS5D16T
code (19) = ESTL9Q4R

n=4, alphabet = ABCDE

code (0) = AEBD
code (1) = CEAE
code (2) = BEDD
code (3) = DECE
code (4) = DBAB
code (5) = ACEB
code (6) = EBCB
code (7) = BCBC
code (8) = ECDE
code (9) = BDCA
code (10) = ADAA
code (11) = CDEA
code (12) = CACC
code (13) = EABD
code (14) = DAEC
code (15) = ABDD
code (16) = CDCB
code (17) = EDBC
code (18) = DDEB
code (19) = AEDC

user16991

Posted 2015-02-13T21:30:37.863

Reputation:

4

Python 3:

The main task of this challenge is to find a bijection between [0, 1, ..., #alphabet^length-1] to itself, which appears to be random.

I stumbled about linear congruential generators, which generate a permutations of the numbers, when the parameters of it satisfy some properties. But interestingly the resulting strings didn't looked random at all. So I decided to do some own stuff.

One obvious bijection is the function f(x) = (a * x + c) % m, where m = #alphabet^length, a and m are random numbers with gcd(a,m) = 1. Since the function has to be repeatable, I used the fixed values a = 0.4 * m and c = 0.3 * m. Of course this is nowhere near random.

The second bijection is based on primitive roots. When p is a prime (or 4, or (prime>2)^alpha, or 2*(prime>2)^alpha), then there exists values g, where g^i generates a permutation of all numbers. Therefore I search for next primes smaller than m and g is approximately m/3. Since p is smaller than m, not all numbers are effected by this approach, I have to use it twice, once for the first p numbers and once for the last p numbers.

My program first applies the first and than the second bijection. And the result looks quite random to me. E.g. the output for the alphabet ABC and length 3 is: BCB, ACA, BCC, BBC, CCA, ABC, CAA, ABB, CAB, AAC, AAA, BAC, BBB, CBC, BAB, ACB, BBA, BCA, CBA, CCC, AAB, ABA, CAC, BAA, ACC, CCB, CBB.

from fractions import gcd
from numbthy import is_prime, is_primitive_root

def random_linear_congruence(length, seed):
    multiplier = int(length * 0.4)
    while gcd(multiplier, length) > 1:
        multiplier += 1

    start_value = int(length * 0.3)
    return (start_value + seed*multiplier) % length

def random_primitive_root(length, seed):
    prime = length
    while not is_prime(prime):
        prime -= 1

    prim_root = prime // 3
    while not is_primitive_root(prim_root, prime):
        prim_root = (prim_root - 1 ) % prime

    random_number = seed
    if 0 < seed < prime:
        random_number = (prim_root ** seed) % prime

    t = length - prime
    if random_number > t:
        random_number = ((prim_root ** (random_number - t)) % prime) + t

    return random_number

def unique_nonsenquential_string(alphabet, length, seed):
    alphabet = sorted(alphabet)
    modulus = len(alphabet)**length

    random_number = seed
    random_number = random_linear_congruence(modulus, random_number)
    random_number = random_primitive_root(modulus, random_number)

    # convert to base len(alphabet)
    word = []
    for i in range(length):
        word.append(random_number % len(alphabet))
        random_number //= len(alphabet)

    return ''.join(map(lambda x: alphabet[x], word))

outputs = set()
alphabet = "ABC"
length = 3
for seed in range(len(alphabet)**length):
    random_string = unique_nonsenquential_string(alphabet, length, seed)
    outputs.add(random_string)
    print("Seed {:2d}: {}".format(seed, random_string))
if len(outputs) == len(alphabet)**length:
    print("Unique output")

You can find numbthy.py here: Basic Number Theory functions implemented in Python

Jakube

Posted 2015-02-13T21:30:37.863

Reputation: 21 462

-1

Simplicity (python 2)

This may not be in the spirit of the challenge, but it works, and is far simpler than the other algorithms.

import random
from string import uppercase

def unique_code(n, seed, alphabet):
    pre_state = random.getstate() #so we don't interfere outside scope
    random.seed(seed)
    code = ''.join([random.choice(alphabet) for _ in xrange(n)])
    random.setstate(pre_state)
    return code

length = 4
for i in range(20):
    print '{} [seed {:d}]'.format(unique_code(length, i, uppercase), i)

Running this code gives:

VTKG [seed 0]
DWTG [seed 1]
YYBC [seed 2]
GOJP [seed 3]
GCKE [seed 4]
QTUY [seed 5]
UVMG [seed 6]
IDQB [seed 7]
FZDS [seed 8]
MJDW [seed 9]
OLPF [seed 10]
LOYM [seed 11]
MRRD [seed 12]
GRRW [seed 13]
CSQY [seed 14]
ZATE [seed 15]
JMKL [seed 16]
NUYH [seed 17]
ERIF [seed 18]
RUNN [seed 19]

sirpercival

Posted 2015-02-13T21:30:37.863

Reputation: 1 824

3Nice idea, but it doesn't satisfy the unique output constraint. – Jakube – 2015-04-19T13:15:05.527

Can confirm. Try it with length 3 and alphabet abc, bbb occurs twice (at least on my system) – CalculatorFeline – 2017-06-19T19:40:52.953